Spanish bigram database and n-gram models

Lexical Computing supplies high-quality bigram and n-gram databases in Spanish (and many other languages). The n-gram model is generated from an enormous database of authentic text (text corpora) produced by real users of Spanish. Our largest Spanish corpus contains texts with a total length of 17,000,000,000 words.

N-gram database quality

Even a relatively small amount of text is sufficient to generate a database of the 10,000 most frequent Spanish n-grams because such n-grams appear frequently enough in any text.

The situation is very different with medium-frequency or low-frequency n-grams. An enormous text database (corpus) is required to ensure reliable frequency information even for rare and infrequently used bigrams or n-grams. Databases of this size cannot be built manually. The only viable option is an automated procedure for downloading content from the web. Lexical Computing has developed a sophisticated procedure for gathering only linguistically valuable texts from the web. A series of tools is applied to retrieve the right content and to perform deduplication and cleaning. This ensures that the statistics are not affected by content which is over-represented on the web. This blog post gives more details.

N-gram database size

Lexical Computing can generate n-gram models containing millions of unique bigrams or n-grams in Spanish. The actual size depends on the specifications. By default, we do not include any n-gram which appears fewer than 5 times in the corpus. Such n-grams are typically noise without any linguistic value. The client can specify different filtering options.
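As a rough illustration of the default frequency cut-off, the sketch below counts bigrams in a tokenized text and discards those seen fewer than 5 times. The function names (count_bigrams, filter_by_min_freq), the toy text and the whitespace tokenization are assumptions made for this example; they are not a description of the tools Lexical Computing actually uses.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent token pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

def filter_by_min_freq(counts, min_freq=5):
    """Keep only n-grams that occur at least min_freq times."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}

# Toy corpus with naive whitespace tokenization, for illustration only.
text = "de la casa " * 6 + "en el mar " * 2
tokens = text.split()

counts = count_bigrams(tokens)
kept = filter_by_min_freq(counts, min_freq=5)

for (first, second), freq in sorted(kept.items(), key=lambda item: -item[1]):
    print(first, second, freq)
```

Raising or lowering min_freq corresponds to the kind of filtering option a client might request. A real corpus would also need proper tokenization and handling of sentence boundaries.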

Enriched n-gram databases

The n-gram database can be enriched with additional information such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological information.
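One of the enrichments mentioned above, the probability of the next word, can be estimated from bigram counts alone. The following is a minimal sketch that uses plain relative frequencies without smoothing. The function name next_word_probabilities and the toy counts are hypothetical, and the estimation method used in a delivered database may differ.

```python
from collections import Counter, defaultdict

def next_word_probabilities(bigram_counts):
    """Estimate P(second word | first word) as a relative frequency."""
    # Total number of counted bigrams that start with each first word.
    first_totals = defaultdict(int)
    for (first, _second), freq in bigram_counts.items():
        first_totals[first] += freq
    return {
        (first, second): freq / first_totals[first]
        for (first, second), freq in bigram_counts.items()
    }

# Toy counts, invented for illustration only.
bigram_counts = Counter({("la", "casa"): 3, ("la", "mesa"): 1, ("el", "mar"): 2})
probs = next_word_probabilities(bigram_counts)

for (first, second), p in probs.items():
    print(f"P({second} | {first}) = {p:.2f}")
```

Note that the probabilities are relative to the bigrams present in the counts. If low-frequency bigrams were filtered out beforehand, the totals should be taken from the unfiltered data.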

Sample n-gram model

The easiest way to obtain a sample is to register a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams. The advanced tab of the wordlist tool allows detailed specifications to be set.

Pricing

We will provide a quotation based on the exact specifications and the intended use of the n-gram database.

N-gram database download

The database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex databases can be computationally demanding and can take longer to produce.

Spanish n-gram database

A random sample of Spanish bigrams made up of word forms with part-of-speech tags. The list can be enriched with statistical, morphological and other linguistic information and delivered in a format specified by the customer.

Spanish bigrams

Spanish bigram model sample

Download a spreadsheet with a sample of the last 100 bigrams in each thousand between 1,000 and 100,000. The list is case-sensitive. Lists with specific criteria and filtering options can be generated to your requirements.
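To make the composition of the sample more concrete, the sketch below lists the frequency ranks that such a selection would cover under one possible reading: the ranked bigram list is split into blocks of 1,000, and the last 100 ranks of each block ending at 2,000, 3,000 and so on up to 100,000 are kept. This reading, the function name sampled_ranks and its parameters are assumptions; the actual spreadsheet may be built differently.

```python
def sampled_ranks(first_rank=1_000, last_rank=100_000, block=1_000, tail=100):
    """Return 1-based ranks in the last `tail` positions of each block.

    Blocks end at first_rank + block, first_rank + 2 * block, ...,
    up to and including last_rank.
    """
    ranks = []
    for block_end in range(first_rank + block, last_rank + 1, block):
        ranks.extend(range(block_end - tail + 1, block_end + 1))
    return ranks

ranks = sampled_ranks()
print(len(ranks), ranks[0], ranks[-1])
```

Given a list of bigrams sorted by descending frequency, the sample would then be the items at these ranks, for example ranked_bigrams[r - 1] for each r in ranks.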