Word frequency lists in Spanish
We provide high-quality word frequency lists in Spanish (and many other languages). The lists are generated from enormous databases of authentic text (text corpora) produced by real users of Spanish. Our largest Spanish corpus contains texts with a total length of 17,000,000,000 words.
A relatively small corpus is sufficient to generate a list of the 2,000, 3,000 or 5,000 most frequent Spanish words, because such words appear frequently enough in any text.
However, an enormous text database (corpus) is required to ensure reliable word frequency information even for rare and infrequently used words. Such a large corpus cannot be built manually; the only viable option is to download content from the web automatically. Lexical Computing developed a sophisticated approach to collecting only linguistically valuable content from the web. A series of tools is used to gather the right content, which is then deduplicated and cleaned. This ensures that the statistics reflect the use of the language in real life. This blog post gives more details.
We are able to generate frequency lists of millions of unique words in Spanish. The size depends on the exact specifications provided by the customer. Typically, we do not include words which appear five times or fewer in the corpus, because such words are usually not linguistically valuable. Additional criteria can be specified to receive the required word database.
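To illustrate the frequency threshold described above, here is a minimal Python sketch. It is not our production pipeline: the function name `frequency_list`, the `min_count` parameter and the simple regular-expression tokenizer are assumptions made for this example only.

```python
import re
from collections import Counter


def frequency_list(texts, min_count=6):
    """Count word forms in `texts` and keep those seen at least `min_count` times.

    min_count=6 mirrors the typical rule of dropping words that appear
    five times or fewer. Tokenization here is a naive assumption:
    lowercase the text and take runs of word characters.
    """
    counts = Counter()
    for text in texts:
        # \w matches Unicode letters in Python 3, so "niño" stays one token.
        counts.update(re.findall(r"\w+", text.lower()))

    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

A real corpus would need proper tokenization and far more text than this toy function is meant to handle, but the filtering step works the same way.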
Enriched frequency wordlists
We are also able to provide additional information such as POS tags, lemmas, probabilities of the next word, or any other statistics or morphological information. We can also supply example sentences and other related linguistic data.
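As a rough illustration of what "probabilities of the next word" means, the Python sketch below estimates them from adjacent word pairs in a tokenized text. The function name `next_word_probabilities` is hypothetical, and simple relative frequency is only one possible way to compute such figures; it is an assumption for this example, not a description of how our data is produced.

```python
from collections import Counter, defaultdict


def next_word_probabilities(tokens):
    """Estimate P(next word | current word) by relative frequency of word pairs."""
    pair_counts = defaultdict(Counter)
    # Pair each token with the token that follows it.
    for current, following in zip(tokens, tokens[1:]):
        pair_counts[current][following] += 1

    probabilities = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            word: count / total for word, count in followers.items()
        }
    return probabilities
```

For the tokens of "el perro come el gato duerme el perro ladra", the word "el" is followed by "perro" twice and by "gato" once, so the estimate for "perro" after "el" is two thirds.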
The frequency list can also be generated from specific parts of the corpus rather than the whole corpus. Each document in the corpus contains information about the top-level domain (TLD) of its origin, for example .es, .ar or .mx. This information can be used to generate frequency lists of regional varieties of Spanish.
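The idea of regional frequency lists can be sketched in Python as follows. This is a simplified example under stated assumptions: documents are given as hypothetical `(url, text)` pairs, the TLD is taken as the last label of the host name, and tokenization is a naive regular expression. The names `tld_of` and `frequency_by_tld` are invented for this illustration.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def tld_of(url):
    """Return the last label of the URL's host name, e.g. 'es' or 'mx'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def frequency_by_tld(documents):
    """Build one word counter per top-level domain.

    `documents` is an iterable of (url, text) pairs (an assumed input format).
    """
    by_tld = defaultdict(Counter)
    for url, text in documents:
        by_tld[tld_of(url)].update(re.findall(r"\w+", text.lower()))
    return by_tld
```

Comparing the counters for, say, "es" and "mx" would then show which words are more typical of one regional variety than another.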
The easiest way to obtain a frequency list is to register a free trial account in Sketch Engine and use the wordlist tool to generate it. The advanced tab of the wordlist tool allows detailed criteria to be specified.
We will provide a quotation based on the exact specifications and the intended use of the wordlist. Please contact us.
The data will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex wordlists can require extra computing power and take longer to produce.