How to use the corpus


Search for exact words
Search by lemma
Search by part-of-speech
Search for synonyms
Search with wildcards
Combine search methods
Delimit by sociodemographic information
Control the display of results
Retrieve frequencies of ngrams
Save results

Search for exact words

To search for exact words simply enter those words in the search box. For example, the search string hablo will return hablo while the search string libro will return libro.


Search by lemma

To search by lemma use an opening square bracket [ before a word. For example, the search term [hablar will search for all forms of the verbal lemma HABLAR, such as hablo, hablábamos, hablarán, etc. The search string [libro will search for the forms of the nominal lemma LIBRO, and will match libro and libros.


Search by part-of-speech

To search by part-of-speech (noun, verb, adjective, etc.), use the dropdown box in the options section. If you'd like more specificity than that dropdown box offers, you can manually enter the part-of-speech tags listed on TreeTagger's website: ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-tagset.txt. (TreeTagger is the part-of-speech and lemma tagger that was used to tag the interviews in this corpus.) Those tags must be preceded by an asterisk. For example, to search for a finite form of ESTAR enter *vefin, and to search for a gerund of a lexical verb enter *vlger. (Tags are not case sensitive, so there is no need to capitalize them.)


Search for synonyms

To search for synonyms use an equals sign = before a word. For example, the search term =escuela will search for escuela and its synonyms, such as colegio and universidad, while =hablar will return hablar and synonyms such as decir, explicar, and platicar.


Search with wildcards

Three wildcards are available: a percentage sign % , a question mark ? , and an underscore _ .

The percentage sign wildcard % represents either optional characters or exactly one word. It represents optional characters when it is immediately next to other characters. For example, the search term %hacer will search for the characters "hacer" preceded by optional characters and will return hacer, rehacer, quehacer, etc. The search term %hac% will search for words with the characters "hac" optionally preceded and followed by additional characters and will match words such as hacer, rehacer, and quehacer, as well as hace, haciendo, hacía, and hacia (accented vowels are considered different from non-accented vowels), and even muchacho.

The percentage sign represents exactly one word when it is surrounded by spaces, or when it is at the beginning or the end of the search string with a space on its other side. For example, the search term el % es will search for "el" followed by exactly one word followed by "es", such as el chiste es, el habla es, el mar es, etc. Similarly, the search string el % will search for "el" followed by a word, and will return el chiste, el habla, el mar, el domingo, el campo, el día, etc. Likewise, the search string % domingo will return el domingo, y domingo, etc.

The question mark wildcard ? represents an optional word. For example, the search string antes ? que will search for "antes" followed by an optional word followed by "que", and will return antes de que and antes que. This wildcard only works between two spaces; unlike the percentage sign, it cannot be used at the beginning or the end of the search string.
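The word-level behavior of % and ? can be pictured with a small Python sketch. This is only an illustration of the matching rules described above, not the corpus's actual implementation. The function names are hypothetical, and the sketch assumes text that is split on single spaces.

```python
import re
from typing import List

def phrase_to_regex(query: str) -> "re.Pattern[str]":
    """Translate a space-separated query into a regex (illustrative only).

    A lone % stands for exactly one word; a lone ? stands for an optional word.
    """
    pieces = []
    for i, item in enumerate(query.split()):
        sep = " " if i > 0 else ""
        if item == "%":
            pieces.append(sep + r"\S+")               # exactly one word
        elif item == "?":
            pieces.append("(?:" + sep + r"\S+)?")     # zero or one word
        else:
            pieces.append(sep + re.escape(item))      # a literal word
    # The lookarounds keep the match aligned on whole words.
    return re.compile(r"(?<!\S)" + "".join(pieces) + r"(?!\S)", re.IGNORECASE)

def find_phrases(query: str, text: str) -> List[str]:
    """Return every stretch of text that the query matches."""
    return [m.group(0) for m in phrase_to_regex(query).finditer(text)]

print(find_phrases("antes ? que", "antes de que llegue y antes que salga"))
print(find_phrases("el % es", "el chiste es bueno y el mar es azul"))
```

The first call finds both antes de que and antes que, because the optional word may be present or absent. The second finds el chiste es and el mar es.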

The underscore wildcard _ represents exactly one character. For example, the search string mexican_ will search for the letters "mexican" followed by exactly one character, such as mexicano and mexicana, while mexican_s will return mexicanos and mexicanas. More than one underscore can be used in a single word and even immediately next to each other. For example, the search string habl__ (two underscores next to each other) will search for hablar, hablan, hablas, hablen, hables, etc., but not hablo, hablaste, hablemos, hablaremos, etc.

The percentage sign and underscore can be used within the same word. For example, the search string m_xic% will search for the letter "m" followed by exactly one character followed by the letters "xic" followed by optional characters and will return matches such as México, mexicano, mexicana, mexicanos, and mexicanas (searches are not case sensitive). Similarly, the search string %_ste will search for optional characters followed by exactly one character followed by the letters "ste", and will return hablaste, hiciste, existe, chiste, este, éste, etc.
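For readers who think in regular expressions, the character-level wildcards can be approximated as follows. This is a minimal sketch, not the code behind the search box. It assumes that % within a word corresponds to "zero or more word characters" and _ to "exactly one word character". The names wildcard_to_regex and matches are hypothetical.

```python
import re

def wildcard_to_regex(term: str) -> "re.Pattern[str]":
    """Translate a one-word search term into a regex (illustrative only)."""
    parts = []
    for ch in term:
        if ch == "%":
            parts.append(r"\w*")          # optional characters
        elif ch == "_":
            parts.append(r"\w")           # exactly one character
        else:
            parts.append(re.escape(ch))   # a literal character
    # Searches are not case sensitive, but accents are significant.
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term: str, word: str) -> bool:
    """True if the whole word fits the search term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

words = ["hablar", "hablan", "hablo", "hablaste", "México", "mexicanos", "muchacho"]
print([w for w in words if matches("habl__", w)])   # ['hablar', 'hablan']
print([w for w in words if matches("m_xic%", w)])   # ['México', 'mexicanos']
print([w for w in words if matches("%hac%", w)])    # ['muchacho']
```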


Combine search methods

The various manners of searching the corpus described above can be combined in a single search string. For example, the search string [ir a *v_inf will search for forms of the lemma IR followed by the word "a" followed by a verb in the infinitive, and will return matches such as fue a ver, iba a trabajar, vamos a graduar. The search string [querer que ? *v% will search for forms of the lemma QUERER followed by the word "que" followed by an optional word followed by a verb, and will return matches such as queremos que las hagan, querían que me viniera, quiero que digan.
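The way the search methods combine can be sketched over a toy list of tagged words. Everything here is an assumption made for illustration: the Token type, the helper names, and the sample lemmas and tags (which only imitate TreeTagger-style labels) are hypothetical, and the real search engine may work quite differently.

```python
import re
from typing import List, NamedTuple, Optional

class Token(NamedTuple):
    word: str
    lemma: str
    tag: str

def glob(term: str) -> "re.Pattern[str]":
    """% = optional characters, _ = exactly one character, case-insensitive."""
    body = "".join(r"\w*" if c == "%" else r"\w" if c == "_" else re.escape(c)
                   for c in term)
    return re.compile(body, re.IGNORECASE)

def item_matches(item: str, tok: Token) -> bool:
    if item.startswith("["):                      # lemma search
        return glob(item[1:]).fullmatch(tok.lemma) is not None
    if item.startswith("*"):                      # part-of-speech search
        return glob(item[1:]).fullmatch(tok.tag) is not None
    return glob(item).fullmatch(tok.word) is not None   # exact word or wildcard

def match_at(items: List[str], toks: List[Token], start: int) -> Optional[int]:
    """Return the end index of a match beginning at start, or None."""
    if not items:
        return start
    head, rest = items[0], items[1:]
    if head == "?":                               # optional word: try one, then none
        for skip in (1, 0):
            if start + skip <= len(toks):
                end = match_at(rest, toks, start + skip)
                if end is not None:
                    return end
        return None
    if start < len(toks) and item_matches(head, toks[start]):
        return match_at(rest, toks, start + 1)
    return None

def search(query: str, toks: List[Token]) -> List[str]:
    items = query.split()
    hits = []
    for i in range(len(toks)):
        end = match_at(items, toks, i)
        if end is not None:
            hits.append(" ".join(t.word for t in toks[i:end]))
    return hits

# Toy annotations, invented for this example.
toks = [
    Token("vamos", "ir", "VLfin"), Token("a", "a", "PREP"),
    Token("trabajar", "trabajar", "VLinf"), Token("y", "y", "CC"),
    Token("quiero", "querer", "VLfin"), Token("que", "que", "CQUE"),
    Token("digan", "decir", "VLfin"), Token("y", "y", "CC"),
    Token("queremos", "querer", "VLfin"), Token("que", "que", "CQUE"),
    Token("las", "lo", "PPC"), Token("hagan", "hacer", "VLfin"),
]
print(search("[ir a *v_inf", toks))
print(search("[querer que ? *v%", toks))
```

With these toy data the first query finds vamos a trabajar, and the second finds quiero que digan and queremos que las hagan.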


Delimit by sociodemographic information

Searches can be delimited by the sociodemographic information of the speakers: gender, state of origin, highest level of education, age at recording, and age upon arrival in the United States. These options can be specified from within the options section of the search area. To select multiple education levels or age ranges, hold down the "Control" (or "Command") key and click on each of the levels or ranges desired.


Control the display of results

The display of search results can be altered in several ways according to the needs of the user. The number of words surrounding the matches can be controlled with the dropdown box labeled "Words around match" while the number of results per page can be controlled with the dropdown box labeled "Results per page". Note that because punctuation is stored in the database the same way words are, the number of words actually displayed will be fewer than the number specified in the dropdown box when there is punctuation in the surrounding context. For convenience, the interviewer's questions and comments are displayed within the surrounding context / BETWEEN SLASHES AND IN ALL CAPS /. The table of results can be ordered by any of the columns by simply clicking on the column heading. To reverse the order, click on the same column heading again.


Retrieve frequencies of ngrams

To retrieve the frequencies of ngrams, namely monograms, bigrams, and trigrams, in the corpus use the Frequency List page. Monograms, bigrams, and trigrams can be entered in the same search, one ngram per line. A maximum of 1,000 ngrams can be entered at a time. The frequency numbers of ngrams can be ordered in one of three ways: as entered by the user, by alphabetical order, and in ascending order of frequency.
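As a rough picture of what a frequency list contains, the sketch below counts monograms, bigrams, and trigrams in a short token list and then orders the requested ngrams in the three ways mentioned above. It is an illustration only: the function names and sample sentence are invented, and Python's default sort orders by character code, which may differ from the alphabetical order the Frequency List page uses.

```python
from collections import Counter
from typing import List, Tuple

def ngram_counts(tokens: List[str], max_n: int = 3) -> Counter:
    """Count every ngram of length 1..max_n, keyed by its space-joined text."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def frequency_table(queries: List[str], tokens: List[str]) -> List[Tuple[str, int]]:
    """Look up each requested ngram, one per entry, in the order entered."""
    counts = ngram_counts(tokens)
    return [(q, counts[q]) for q in queries]

tokens = "el día que el día llegó".split()
as_entered = frequency_table(["el día", "día", "que el día", "llegó"], tokens)
alphabetical = sorted(as_entered, key=lambda row: row[0])
by_frequency = sorted(as_entered, key=lambda row: row[1])   # ascending

print(as_entered)
print(alphabetical)
print(by_frequency)
```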


Save results

The results of searches, whether in the Search page or the Frequency List page, can easily be saved to a user's local hard drive or a flash drive by clicking on the export button located below the results table. Users can then open this file in spreadsheet software (such as OpenOffice Calc, Google Sheets, or Microsoft Excel) to continue working with the results, for example to code linguistic variables of interest. When importing the file into the spreadsheet software, the field delimiter should be specified as a comma.
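An exported file can also be read with a script instead of a spreadsheet. The sketch below uses Python's csv module with a comma as the field delimiter. The column names (before, match, after) and the sample row are assumptions made up for this example; check the header of your own export for the real ones.

```python
import csv
import io

# Stand-in for an exported file; with a real export, use
# open("results.csv", newline="", encoding="utf-8") instead of io.StringIO.
sample_export = (
    "before,match,after\n"
    '"yo creo que",vamos a trabajar,"mañana, si puedo"\n'
)

reader = csv.DictReader(io.StringIO(sample_export), delimiter=",")
rows = list(reader)

# Add an empty column in which to code a linguistic variable (hypothetical name).
for row in rows:
    row["tense"] = ""

print(rows[0]["match"])
print(rows[0]["after"])   # the quoted field keeps its internal comma
```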