
Spanish FAQ for Stanford CoreNLP, parser, POS tagger, and NER

Questions

  1. How do I use the Spanish CoreNLP pipeline?
  2. What corpus was used to train the CoreNLP Spanish models?
  3. How did you modify the AnCora corpus?
  4. How does CoreNLP tokenize Spanish text?
  5. What character encoding do you assume?
  6. What POS tag set does the parser use?
  7. What phrasal category set does the parser use?
  8. How well do the parsers work?
  9. Is there a Spanish dependency parser?
  10. Where does the Spanish-specific source code live?

Questions with answers

  1. How do I use the Spanish pipeline?

    You can activate the Spanish pipeline with the properties file StanfordCoreNLP-spanish.properties, which ships in the Spanish models jar. The following runs the default pipeline on a file test.txt:

    $ java -cp stanford-corenlp-3.5.2.jar:stanford-spanish-corenlp-2015-01-08-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-spanish.properties -file test.txt
    Adding annotator tokenize
    Adding annotator ssplit
    Adding annotator pos
    Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger ... done [0.8 sec].
    Adding annotator ner
    Loading classifier from edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz ... done [2.4 sec].
    Adding annotator parse
    Loading parser from serialized file edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz ...
    Initializing lexicon scores ... The 19 open class tags are: [ nc0s000 np00000 vmis000 rg nc0p000 vmsp000 vmn0000 vmip000 aq0000 aq0000-part vmp0000 vmg0000 z0 w vmif000 nc00000 vmii000 vmic000 vmsi000 ]
    done [0.6 sec].
    
    Ready to process: 1 files, skipped 0, total 1
    Processing file test.txt ... writing to test.txt.xml {
          Annotating file test.txt [0.305 seconds]
    } [0.351 seconds]
    Processed 1 documents
    Skipped 0 documents, error annotating 0 documents
    Annotation pipeline timing information:
    TokenizerAnnotator: 0.2 sec.
    WordsToSentencesAnnotator: 0.0 sec.
    POSTaggerAnnotator: 0.0 sec.
    NERCombinerAnnotator: 0.0 sec.
    ParserAnnotator: 0.1 sec.
    TOTAL: 0.3 sec. for 8 tokens at 26.5 tokens/sec.
    Pipeline setup: 0.0 sec.
    Total time for StanfordCoreNLP pipeline: 0.4 sec.
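
The same settings can also be assembled in code. The sketch below builds a java.util.Properties object whose values mirror the annotators and model paths visible in the log above. It is an illustration only: the class name SpanishPipelineProps is made up, the shipped StanfordCoreNLP-spanish.properties file may contain other or additional settings, and the tokenize.language entry is an assumption about how the Spanish tokenizer is selected. Handing the result to the StanfordCoreNLP constructor needs the CoreNLP and Spanish model jars on the classpath, so that step appears only as a comment.

```java
import java.util.Properties;

/** Illustrative helper (not part of CoreNLP): Spanish pipeline settings as Properties. */
final class SpanishPipelineProps {

    /** Builds properties that mirror the annotators and model paths shown in the log. */
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, ner, parse");
        // Assumption: the Spanish tokenizer is selected through tokenize.language.
        props.setProperty("tokenize.language", "es");
        props.setProperty("pos.model",
            "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
        props.setProperty("ner.model",
            "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");
        props.setProperty("parse.model",
            "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With the CoreNLP jars available, one would typically continue with
        // something like: new StanfordCoreNLP(props)
        props.stringPropertyNames().stream().sorted()
            .forEach(key -> System.out.println(key + " = " + props.getProperty(key)));
    }
}
```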
    
  2. What corpus was used to train the CoreNLP Spanish models?

    We trained and evaluated our Spanish models on a combination of two corpora, after heavy modifications (described further below):

    1. AnCora Spanish 3.0 corpus. This corpus consists of about 17,000 sentences, drawn from Spanish (Spain) newswire and from an older balanced Castilian Spanish corpus (3LB).
    2. The DEFT Spanish Treebank V2 (LDC2015E66). This corpus contains the full International Spanish Newswire Treebank and the full Latin American Spanish Discussion Forum Treebank (roughly 5,000 sentences in total).
      This data was added to train the new Spanish models released in CoreNLP 3.7.0.

  3. How did you modify the combined corpus?

    (For those interested in as much detail as possible on these modifications, here is a 5-page document we wrote about these changes, with lots of pretty trees.)

    We made three significant changes to the treebanks:

    • Part-of-speech tagset simplification. The default AnCora tagset has hundreds of extremely precise tags. This may be useful for some linguistic applications, but it is hard to learn even for a state-of-the-art part-of-speech tagger. We reduced the tagset to 85 tags, a more manageable size that still allows for a useful amount of precision. The tags are listed in a later answer.
    • Multi-word expression expansion. The AnCora corpus is full of multi-word expressions: tokens which actually contain multiple words, linked by an underscore character. Some examples follow:
      • no_obstante (idiomatic expression)
      • Chris_Woodruff (person name)
      • stock_options (foreign phrases)
      • 31_de_julio (dates)
      • 2,74_por_ciento (quantifier expressions)

      These multi-word expressions have some fairly valid uses (e.g., for proper nouns and foreign words, where we can't be certain about the syntax of the constituent words), but also seem to be used as a crutch in places where the annotation rules were not sufficient (e.g., quantifier expressions, idiomatic expressions).

    • Word splitting. Spanish sometimes contracts meaningful free morphemes into a single word. We expand these joined forms (namely clitic pronouns and prepositional contractions) to match our tokenization practices. See the answer on tokenization for more information.

      There are some significant consequences in our tree representations. See an example of the effect of a contraction split below:

      Before expansion. The word del is marked as a preposition, although it contains a form -l which acts as an article.

      After expansion. The word del has been split into its constituent morphemes de and el. Note that el moves to a different tree constituent, into the proper position for an article.
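
As a rough illustration of the multi-word expression expansion described above, the sketch below splits an AnCora-style token on its underscore separators. It is not the conversion code used to build the models: the class MweSplitter is hypothetical, and the real procedure also has to assign tags and tree structure to the new tokens, which this ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits underscore-joined multi-word tokens into words. */
final class MweSplitter {

    /** Returns the words of a token such as "31_de_julio", or the token itself. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        for (String part : token.split("_")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        // A token with no word parts (for example a lone underscore) is kept unchanged.
        if (parts.isEmpty()) {
            parts.add(token);
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"no_obstante", "31_de_julio", "2,74_por_ciento"}) {
            System.out.println(token + " -> " + String.join(" ", split(token)));
        }
    }
}
```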


  4. How does CoreNLP tokenize Spanish text?

    We tokenize Spanish the same way as English, with a few exceptions. The notable features of our Spanish tokenization are:

    • Clitic pronouns are separated from their verbs. Spanish verbs can carry clitic pronouns in some cases (specifically, when the verb is a gerund, non-negative imperative, or infinitive form). The tokenizer separates the pronouns attached to the end of these verb types into separate tokens.

      For example, negarse is tokenized as two separate tokens: negar se.

    • Contractions are expanded into multiple words. Spanish contractions include al, del, conmigo, and consigo. The tokenizer expands these into their dictionary forms: a el, de el, con mí, con sí, and so on. The expansions are grammatically incorrect at the surface level, but are very useful for all tasks that follow tokenization (tagging and parsing, for example).
    • Parentheses are rendered =LRB= and =RRB=.
    • All single quote forms (single curly quotes, single guillemets) are normalized to '; all double quote forms (double curly quotes, double guillemets) are normalized to ".
    • Em dashes (—) and en dashes (–) are both normalized to --.
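
The contraction and punctuation rules above can be sketched in a few lines. The code below is a hypothetical illustration, not the CoreNLP tokenizer: the class SpanishTokenNormalizer and its methods are invented for this example, it handles only the four contractions named above, it assumes the text has already been split into tokens, and it does not attempt clitic pronoun separation (which needs knowledge of verb forms).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of the normalization rules listed above; not CoreNLP code. */
final class SpanishTokenNormalizer {

    // Only the four contractions named in the text.
    private static final Map<String, List<String>> CONTRACTIONS = Map.of(
        "al", List.of("a", "el"),
        "del", List.of("de", "el"),
        "conmigo", List.of("con", "m\u00ED"),   // con + mi (accented i)
        "consigo", List.of("con", "s\u00ED"));  // con + si (accented i)

    /** Expands a contraction into its two parts, keeping an initial capital. */
    static List<String> expand(String token) {
        List<String> parts = CONTRACTIONS.get(token.toLowerCase(Locale.ROOT));
        if (parts == null) {
            return List.of(token);
        }
        String first = parts.get(0);
        if (Character.isUpperCase(token.charAt(0))) {
            first = Character.toUpperCase(first.charAt(0)) + first.substring(1);
        }
        return List.of(first, parts.get(1));
    }

    /** Maps parentheses, quote variants, and dashes to the forms described above. */
    static String normalizePunctuation(String token) {
        switch (token) {
            case "(": return "=LRB=";
            case ")": return "=RRB=";
            // Single curly quotes and single guillemets.
            case "\u2018": case "\u2019": case "\u2039": case "\u203A": return "'";
            // Double curly quotes and double guillemets.
            case "\u201C": case "\u201D": case "\u00AB": case "\u00BB": return "\"";
            // Em dash and en dash.
            case "\u2014": case "\u2013": return "--";
            default: return token;
        }
    }

    /** Applies contraction expansion, then punctuation normalization, to each token. */
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : expand(token)) {
                out.add(normalizePunctuation(part));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", normalize(List.of("Al", "final", "(", "del", "mes", ")"))));
    }
}
```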
  5. What character encoding do you assume?

    We expect input in UTF-8 encoding. Behavior is otherwise undefined! (This mainly matters for the several accented letters and special punctuation characters used in Spanish.)
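
Because behavior on other encodings is undefined, it can be worth checking input before it reaches the pipeline. The following sketch uses only the standard JDK charset API to decode bytes strictly as UTF-8 and reject malformed input instead of silently substituting replacement characters. The class name Utf8Check is an invention for this example, and nothing here is part of CoreNLP.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical pre-check: verifies that raw bytes are well-formed UTF-8. */
final class Utf8Check {

    /** Decodes bytes as UTF-8, throwing on malformed or unmappable sequences. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    /** Returns true when the bytes decode cleanly as UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "se\u00F1or"; // contains n with tilde
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A Latin-1 encoded file containing accented letters fails this check, which is the situation the warning above is mostly about.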
  6. What POS tag set does the parser use?

    Currently, the only Spanish tagger model available is the Universal Dependencies model. This uses standard UPOS tags.

    Prior to CoreNLP 4.0, we used a simplified version of the tagset used in the AnCora 3.0 corpus / DEFT Spanish Treebank. The default AnCora tagset has hundreds of extremely precise tags. This may be useful for some linguistic applications, but it is hard to learn even for a state-of-the-art part-of-speech tagger. We reduced the tagset to 85 tags, a more manageable size that still allows for a useful amount of precision.

    The tags are designed to remain compatible with the EAGLES standard. In our tags, we simply null out most of the fields (using a label 0) that are not relevant for our purposes. The resulting compressed tagset is listed below.

    Tag       Description                                    Example(s)

    Adjectives
    ao0000    Adjective (ordinal)                            primera, segundo, últimos
    aq0000    Adjective (descriptive)                        populares, elegido, emocionada, andaluz

    Conjunctions
    cc        Conjunction (coordinating)                     y, o, pero
    cs        Conjunction (subordinating)                    que, como, mientras

    Determiners
    da0000    Article (definite)                             el, la, los, las
    dd0000    Demonstrative                                  este, esta, esos
    de0000    "Exclamative" (TODO)                           qué (¡Qué pobre!)
    di0000    Article (indefinite)                           un, muchos, todos, otros
    dn0000    Numeral                                        tres, doscientas
    do0000    Numeral (ordinal)                              el 65 aniversario
    dp0000    Possessive                                     sus, mi
    dt0000    Interrogative                                  cuántos, qué, cuál

    Punctuation
    f0        Other                                          &, @
    faa       Inverted exclamation mark                      ¡
    fat       Exclamation mark                               !
    fc        Comma                                          ,
    fca       Left bracket                                   [
    fct       Right bracket                                  ]
    fd        Colon                                          :
    fe        Double quote                                   "
    fg        Hyphen                                         -
    fh        Forward slash                                  /
    fia       Inverted question mark                         ¿
    fit       Question mark                                  ?
    fp        Period / full-stop                             .
    fpa       Left parenthesis                               (
    fpt       Right parenthesis                              )
    fra       Left guillemet / angle quote                   «
    frc       Right guillemet / angle quote                  »
    fs        Ellipsis                                       ..., etcétera
    ft        Percent sign                                   %
    fx        Semicolon                                      ;
    fz        Single quote                                   '

    Interjections
    i         Interjection                                   ay, ojalá, hola

    Nouns
    nc00000   Unknown common noun (neologism, loanword)      minidisc, hooligans, re-flotamiento
    nc0n000   Common noun (invariant number)                 hipótesis, campus, golf
    nc0p000   Common noun (plural)                           años, elecciones
    nc0s000   Common noun (singular)                         lista, hotel, partido
    np00000   Proper noun                                    Málaga, Parlamento, UFINSA

    Pronouns
    p0000000  Impersonal se                                  se
    pd000000  Demonstrative pronoun                          éste, eso, aquellas
    pe000000  "Exclamative" pronoun                          qué
    pi000000  Indefinite pronoun                             muchos, uno, tanto, nadie
    pn000000  Numeral pronoun                                dos miles, ambos
    pp000000  Personal pronoun                               ellos, lo, la, nos
    pr000000  Relative pronoun                               que, quien, donde, cuales
    pt000000  Interrogative pronoun                          cómo, cuánto, qué
    px000000  Possessive pronoun                             tuyo, nuestra

    Adverbs
    rg        Adverb (general)                               siempre, más, personalmente
    rn        Adverb (negating)                              no

    Prepositions
    sp000     Preposition                                    en, de, entre

    Verbs
    va00000   Verb (unknown)                                 should
    vag0000   Verb (auxiliary, gerund)                       habiendo
    vaic000   Verb (auxiliary, indicative, conditional)      habría, habríamos
    vaif000   Verb (auxiliary, indicative, future)           habrá, habremos
    vaii000   Verb (auxiliary, indicative, imperfect)        había, habíamos
    vaip000   Verb (auxiliary, indicative, present)          ha, hemos
    vais000   Verb (auxiliary, indicative, preterite)        hubo, hubimos
    vam0000   Verb (auxiliary, imperative)                   haya
    van0000   Verb (auxiliary, infinitive)                   haber
    vap0000   Verb (auxiliary, participle)                   habido
    vasi000   Verb (auxiliary, subjunctive, imperfect)       hubiera, hubiéramos, hubiese
    vasp000   Verb (auxiliary, subjunctive, present)         haya, hayamos
    vmg0000   Verb (main, gerund)                            dando, trabajando
    vmic000   Verb (main, indicative, conditional)           daría, trabajaríamos
    vmif000   Verb (main, indicative, future)                dará, trabajaremos
    vmii000   Verb (main, indicative, imperfect)             daba, trabajábamos
    vmip000   Verb (main, indicative, present)               da, trabajamos
    vmis000   Verb (main, indicative, preterite)             dio, trabajamos
    vmm0000   Verb (main, imperative)                        da, trabaja, trabajes, trabajemos
    vmn0000   Verb (main, infinitive)                        dar, trabajar
    vmp0000   Verb (main, participle)                        dado, trabajado
    vmsi000   Verb (main, subjunctive, imperfect)            diera, diese, trabajáramos, trabajésemos
    vmsp000   Verb (main, subjunctive, present)              trabajemos
    vsg0000   Verb (semiauxiliary, gerund)                   siendo
    vsic000   Verb (semiauxiliary, indicative, conditional)  sería, serían
    vsif000   Verb (semiauxiliary, indicative, future)       será, seremos
    vsii000   Verb (semiauxiliary, indicative, imperfect)    era, éramos
    vsip000   Verb (semiauxiliary, indicative, present)      es, son
    vsis000   Verb (semiauxiliary, indicative, preterite)    fue, fuiste
    vsm0000   Verb (semiauxiliary, imperative)               sea
    vsn0000   Verb (semiauxiliary, infinitive)               ser
    vsp0000   Verb (semiauxiliary, participle)               sido
    vssf000   Verb (semiauxiliary, subjunctive, future)      fuere
    vssi000   Verb (semiauxiliary, subjunctive, imperfect)   fuera, fuese, fuéramos
    vssp000   Verb (semiauxiliary, subjunctive, present)     sea, seamos

    Dates
    w         Date                                           octubre, jueves, 2002

    Numerals
    z0        Numeral                                        547.000, 04, 52,52
    zm        Numeral qualifier (currency)                   dólares, euros
    zu        Numeral qualifier (other units)                km, cc

    Other
    word      Emoticon or other symbol                       :), ®
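
The verb tags in this table follow a regular positional pattern: the second character gives the verb type, the third the mood or non-finite form, and the fourth the tense, with 0 marking a nulled-out field. The sketch below decodes verb tags on that basis. The field meanings are read off the table rather than taken from the EAGLES specification or from CoreNLP code, and the class SpanishVerbTag is hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical decoder for the 7-character verb tags listed in the table. */
final class SpanishVerbTag {

    private static final Map<Character, String> TYPE = Map.of(
        'a', "auxiliary", 'm', "main", 's', "semiauxiliary");
    private static final Map<Character, String> MOOD = Map.of(
        'i', "indicative", 's', "subjunctive", 'm', "imperative",
        'n', "infinitive", 'g', "gerund", 'p', "participle");
    private static final Map<Character, String> TENSE = Map.of(
        'c', "conditional", 'f', "future", 'i', "imperfect",
        'p', "present", 's', "preterite");

    /** Turns a tag such as "vmip000" into a description in the style of the table. */
    static String describe(String tag) {
        if (tag.length() != 7 || tag.charAt(0) != 'v') {
            throw new IllegalArgumentException("Not a 7-character verb tag: " + tag);
        }
        // The table lists this tag as "unknown" rather than as an auxiliary.
        if (tag.equals("va00000")) {
            return "Verb (unknown)";
        }
        StringJoiner fields = new StringJoiner(", ", "Verb (", ")");
        addIfKnown(fields, TYPE, tag.charAt(1));
        addIfKnown(fields, MOOD, tag.charAt(2));
        addIfKnown(fields, TENSE, tag.charAt(3));
        return fields.toString();
    }

    // Fields set to '0' (or any unlisted code) are skipped.
    private static void addIfKnown(StringJoiner out, Map<Character, String> names, char code) {
        String name = names.get(code);
        if (name != null) {
            out.add(name);
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"vmip000", "vsg0000", "vasi000"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```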
  7. What phrasal category set does the parser use?

    The phrasal category set is a superset of the categories used in the AnCora corpus (see the first column of their syntactic annotation documentation). We have added several grup.* constituents in order to group together newly split tokens after multi-word expression expansion (detailed above). The 29 total phrasal categories are listed below. We give examples (with spans of the phrasal constituents bolded) in the third column.

    Name                Description                        Examples
    conj                Conjunction                        No obstante, en el mercado …
                                                           la fauna y la flora
    gerundi             Gerund                             responder atacando
    grup.a              Adjective group
    grup.adv            Adverb group
    grup.cc *           Coordinating conjunction group     sino que lo pongo
    grup.cs *           Subordinating conjunction group    mientras que el Ibex …
    grup.nom            Noun group                         un foro en Internet recogerá …
                                                           permitieron a los mercados bursátiles europeos dar …
    grup.prep *         Preposition group                  a partir de hoy
                                                           están por encima de los demás
    grup.pron *         Pronoun group                      El mío está …
    grup.verb           Verb group                         Francfort perdió el 0,05%
                                                           las voces empiezan a subir de tono
    grup.w *            Date group                         en el pleno del pasado viernes
                                                           desde el 19 de mayo por …
    grup.z *            Numeral group                      el 2,85 %
    inc                 Inserted element                   El Gobierno , apuntó Pimentel, se …
                                                           Testimoni silenciós (foto derecha)
    infinitiu           Infinitive                         permitiría regular actividades …
                                                           para no empezar a regalar un tiempo
    interjeccio         Interjection                       Ojalá Aznar
                                                           Bueno, hay una …
    morfema.pronominal  Pronominal morpheme                él se reafirmó
                                                           a oírse [1] con más frecuencia
    morfema.verbal      Verbal morpheme                    se habla español
    neg                 Negation                           no sabía exactamente
    participi           Participle                         espléndidamente decorada
    prep                Preposition                        la sostuvo con cuidado
    relatiu             Relative pronoun (complementizer)  lo que iba a hacer
    ROOT                Sentence root
    S                   Clause                             lo que iba a hacer
    s.a                 Adjective phrase                   un hermoso trabajo
    sadv                Adverbial phrase                   Al [2] final se realizó …
                                                           se comprometió anoche
    sentence            Main sentence
    sn                  Noun phrase                        En este sentido
                                                           por lo que aseguró
    sp                  Prepositional phrase               … decisión tomada por las autoridades
    spec                Specifier                          el término
                                                           todos los objetivos

    Footnotes

    * These constituents do not appear in the original AnCora corpus.

    [1] Recall that clitic pronouns are treated as separate tokens in our modified corpus.

    [2] This is tokenized as A el in our modified corpus.

  8. How well do the parsers work?

  9. Is there a Spanish dependency parser?

    Yes.
  10. Where does the Spanish-specific source code live?

    The majority of the code for dealing with Spanish in particular lives in the package edu.stanford.nlp.international.spanish. Note that this package is a mix of (a) code which works with Spanish in general and (b) code which works specifically with the AnCora corpus. The purpose of a particular class is stated in the source code comments.

    Code for reading, transforming and creating new Spanish parse trees is in a separate package edu.stanford.nlp.trees.international.spanish.

For general questions, see also the Parser FAQ. Please send any other questions, feedback, extensions, or bug fixes to parser-user@lists.stanford.edu or parser-support@lists.stanford.edu.