Spanish bigram database and n-gram models
Lexical Computing supplies high-quality bigram and n-gram databases in Spanish (and many other languages). The n-gram model is generated from an enormous database of authentic text (text corpora) produced by real users of Spanish. Our largest Spanish corpus contains texts with a total length of 17,000,000,000 words.
N-gram database quality
Even a relatively small amount of text is sufficient to generate a database of the 10,000 most frequent Spanish n-grams because such n-grams appear frequently enough in any text.
The situation is very different with medium-frequency or low-frequency n-grams. An enormous text database (corpus) is required to obtain reliable frequency information for rare and infrequently used bigrams or n-grams. Databases of this size cannot be built manually. The only viable option is an automatic procedure that downloads content from the web. Lexical Computing has developed a sophisticated procedure for gathering only linguistically valuable texts from the web. A series of tools is applied to retrieve the right content and to perform deduplication and cleaning. This ensures that the statistics are not skewed by content which is over-represented on the web. This blog post gives more details.
N-gram database size
Lexical Computing can generate n-gram models containing millions of unique bigrams or n-grams in Spanish. The actual size depends on the specifications. By default, we do not include any n-gram which appears fewer than 5 times in the corpus. Such n-grams are typically noise without any linguistic value. The client can specify any other filtering options.
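As a rough illustration of how a minimum-frequency threshold works, the Python sketch below counts bigrams in a tokenized text and drops those that fall below the threshold. The function names, the whitespace tokenization, and the toy sentence are assumptions made for this example only; they do not describe the actual production pipeline.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def filter_by_min_freq(counts, min_freq=5):
    """Keep only n-grams that occur at least min_freq times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

# Toy input; a real corpus would be tokenized properly, not split on spaces.
text = "la casa es grande y la casa es blanca y la casa es nueva"
tokens = text.split()

bigrams = count_ngrams(tokens, n=2)
# A threshold of 3 is used here only because the toy text is tiny.
frequent = filter_by_min_freq(bigrams, min_freq=3)

for ngram, count in sorted(frequent.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), count)
```

With a corpus of billions of words, the same idea applies with the default threshold of 5, although counting at that scale would need tooling built for the purpose rather than an in-memory counter.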
Enriched n-gram databases
The n-gram database can be enriched with additional information such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological information.
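To make "probabilities of the next word" concrete, here is a minimal Python sketch that derives them from bigram counts as plain relative frequencies (a maximum-likelihood estimate with no smoothing). The function name and the toy sentence are assumptions for illustration; a delivered database may use a different estimation method.

```python
from collections import Counter, defaultdict

def next_word_probabilities(tokens):
    """Estimate P(next word | word) as count(word, next) / count(word)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it has a successor, so rows sum to 1.
    first_counts = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (w1, w2), c in bigram_counts.items():
        probs[w1][w2] = c / first_counts[w1]
    return dict(probs)

# Toy input, split on spaces for simplicity.
tokens = "el gato come y el perro come y el gato duerme".split()
probs = next_word_probabilities(tokens)

for nxt, p in sorted(probs["el"].items()):
    print(f"P({nxt} | el) = {p:.2f}")
```

In this toy text, "el" is followed by "gato" twice and by "perro" once, so the estimate for "gato" is two thirds. Rare n-grams make such estimates unreliable, which is why corpus size matters.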
Sample n-gram model
The easiest way to obtain a sample is to register a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams. The advanced tab of the wordlist tool allows detailed specifications to be set.
We will provide a quotation based on the exact specifications and the intended use of the wordlist.
N-gram database download
The database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex databases can be computationally demanding and can take longer to produce.