Russian n-gram databases and n-gram models

We supply n-gram databases for Russian as well as for many other languages. Each n-gram model is generated from an enormous database of authentic text (a text corpus) produced by real users of Russian. Our largest Russian corpus contains texts totalling 14,000,000,000 words.

Data quality

Corpus size is not an issue for n-gram models limited to the 10,000 most frequent n-grams in Russian. Such a database has very limited use, though. Any serious application needs a much larger database, typically of millions of n-grams.

An enormous text database (corpus) is required to ensure reliable frequencies even for rare n-grams. The only viable way to build corpora of billions of words is to download content from the web automatically. Lexical Computing has developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools focuses on the right content and performs deduplication and cleaning, which ensures that the statistics are not skewed. This blog post gives more details.

Russian n-gram model size

We are able to generate n-gram databases of millions of unique n-grams in Russian. The actual size depends on the specifications. By default, we exclude any n-gram that appears fewer than 5 times, since such n-grams are typically noise without any linguistic value. We can accommodate any specific requirements.
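To illustrate the default frequency cut-off, here is a minimal Python sketch. It is not the production pipeline: the function names `count_ngrams` and `apply_min_frequency`, the whitespace tokenization, and the toy token list are all assumptions made for the example.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def apply_min_frequency(counts, min_freq=5):
    """Keep only n-grams seen at least min_freq times (hypothetical helper)."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


# Toy data: a real corpus would hold billions of properly tokenized words.
tokens = ("в том числе " * 6 + "и так далее " * 2).split()

bigrams = count_ngrams(tokens, 2)
kept = apply_min_frequency(bigrams, min_freq=5)

for ngram, freq in sorted(kept.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), freq)
```

Raising or lowering `min_freq` trades database size against noise, which is why the threshold is part of the specification rather than fixed.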

Enriched n-gram databases

We can also supplement the data with additional information such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological data.
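As an example of one such enrichment, the sketch below derives next-word probabilities from bigram frequencies. It assumes a plain maximum-likelihood estimate, P(w2 | w1) = count(w1 w2) / count(w1 followed by any word); the method used for delivered data is not stated here and may differ. The function name `next_word_probabilities` and the counts are invented for illustration.

```python
from collections import defaultdict


def next_word_probabilities(bigram_counts):
    """Maximum-likelihood P(second | first) from raw bigram counts.

    Hypothetical helper: an assumption for illustration, with no smoothing.
    """
    totals = defaultdict(int)
    for (first, _second), freq in bigram_counts.items():
        totals[first] += freq
    return {
        (first, second): freq / totals[first]
        for (first, second), freq in bigram_counts.items()
    }


# Invented counts, not taken from any real corpus.
bigram_counts = {
    ("в", "том"): 30,
    ("в", "этом"): 10,
    ("том", "числе"): 20,
}

probs = next_word_probabilities(bigram_counts)
for (first, second), p in probs.items():
    print(f"P({second} | {first}) = {p:.2f}")
```

Unsmoothed estimates like these assign zero probability to unseen continuations, so a practical language model would normally add smoothing or back-off.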

N-gram model sample

The easiest way to get a sample is to register for a free trial account in Sketch Engine and use the n-gram tool to generate a list of n-grams. Use the advanced tab to specify detailed criteria and the format of the output.

Prices

We will provide a quotation based on the exact specifications and the intended use of the database.

N-gram database download

The database will be available for download via a dedicated link within the agreed period of time. It usually takes a couple of weeks to generate the data. Very complex databases can be computationally demanding and can take longer to produce.

Russian bigram database

Random items from a Russian bigram database with part-of-speech tags and frequencies. Various delivery formats are available. The data can be enriched with statistical, morphological and other linguistic information.

Russian bigram model sample

Download a spreadsheet with a sample of the last 100 bigrams in each thousand between 1,000 and 100,000. The list is case sensitive. Lists with specific criteria and filtering options can be generated to your requirements.
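The sampling scheme can be read as taking frequency ranks 901 to 1,000, then 1,901 to 2,000, and so on up to 99,901 to 100,000. The Python sketch below reproduces that reading; it assumes the bigrams are ranked by descending frequency starting at rank 1, and the function name `sample_ranks` and its parameters are hypothetical.

```python
def sample_ranks(start=1_000, stop=100_000, block=1_000, tail=100):
    """Return the last `tail` ranks of every `block`-sized slice.

    Assumes 1-based frequency ranks; this is one reading of the sample
    description, not a confirmed specification.
    """
    ranks = []
    for block_end in range(start, stop + 1, block):
        ranks.extend(range(block_end - tail + 1, block_end + 1))
    return ranks


ranks = sample_ranks()
print(len(ranks), ranks[0], ranks[-1])
```

Under this reading the sample covers both very frequent and fairly rare bigrams while staying small enough for a spreadsheet.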