Russian n-gram databases and n-gram models
We supply n-gram databases for Russian as well as for many other languages. Each n-gram model is generated from an enormous database of authentic text (a text corpus) produced by real users of Russian. Our largest Russian corpus contains texts with a total length of 14,000,000,000 words.
Corpus size matters little for a model limited to the 10,000 most frequent n-grams in Russian, because such common n-grams are well attested even in a modest corpus. Such a database has very limited use, though. Any serious application needs a much larger database, typically of millions of n-grams.
An enormous text database (corpus) is required to ensure reliable frequency counts even for rare n-grams. The only viable way to build corpora of billions of words is to download content from the web with an automated procedure. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning, which keeps the statistics from being skewed. This blog post gives more details.
Russian n-gram model size
We can generate n-gram databases containing millions of unique Russian n-grams. The actual size depends on the specifications. By default, we exclude any n-gram that appears fewer than 5 times, because such n-grams are typically noise without any linguistic value. We can accommodate any specific requirements.
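The default frequency cut-off can be illustrated with a minimal Python sketch. It is not the production pipeline: the function names count_ngrams and apply_min_frequency and the toy token list are assumptions made for this example only.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def apply_min_frequency(counts, min_count=5):
    """Keep only the n-grams that occur at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input (hypothetical): one phrase repeated 5 times, another repeated twice.
tokens = ("я не знаю " * 5 + "он не знает " * 2).split()

bigrams = count_ngrams(tokens, 2)
kept = apply_min_frequency(bigrams, min_count=5)

for ngram, count in sorted(kept.items(), key=lambda item: -item[1]):
    print(" ".join(ngram), count)
```

With a threshold of 5, only the bigrams from the phrase repeated five times survive; the rarer ones are dropped in the same way low-frequency noise is dropped from a full database.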
Enriched n-gram databases
We can also supplement the data with additional information such as POS tags, lemmas, probabilities of the next word, or any other statistical or morphological data.
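As an illustration of one such enrichment, the sketch below derives next-word probabilities from bigram counts using a plain maximum likelihood estimate (the count of the bigram divided by the total count of all bigrams starting with the same word). The estimation method, the function name next_word_probabilities, and the counts are assumptions for the example; a delivered database may use a different method, such as smoothing.

```python
from collections import Counter, defaultdict


def next_word_probabilities(bigram_counts):
    """Convert bigram counts into P(next word | previous word)."""
    # Total occurrences of each first word across all its bigrams.
    totals = Counter()
    for (prev, _next), count in bigram_counts.items():
        totals[prev] += count

    probs = defaultdict(dict)
    for (prev, nxt), count in bigram_counts.items():
        probs[prev][nxt] = count / totals[prev]
    return dict(probs)


# Made-up counts, for illustration only.
counts = {("не", "знаю"): 6, ("не", "могу"): 3, ("не", "было"): 1}
probs = next_word_probabilities(counts)

for nxt, p in probs["не"].items():
    print(f"P({nxt} | не) = {p:.2f}")
```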
N-gram model sample
We will provide a quotation based on the exact specifications and the intended use of the database.
N-gram database download
The database will be available for download via a dedicated link within the agreed period of time. It usually takes a couple of weeks to generate the data. Very complex databases can be computationally demanding and can take longer to produce.
Russian bigram model sample
Download a spreadsheet with a sample of the last 100 bigrams in each thousand between 1,000 and 100,000. The list is case sensitive. Lists with specific criteria and filtering options can be generated to your requirements.
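One possible reading of this sampling scheme, assuming the bigrams are ranked by frequency starting at rank 1, is that the sample covers ranks 901 to 1,000, then 1,901 to 2,000, and so on up to 99,901 to 100,000. The sketch below computes those ranks; the function name sample_ranks and the ranking assumption are illustrative, not a description of how the spreadsheet was actually produced.

```python
def sample_ranks(block=1000, tail=100, max_rank=100_000):
    """Return the 1-based ranks of the last `tail` entries in each block of `block` ranks."""
    ranks = []
    for block_end in range(block, max_rank + 1, block):
        ranks.extend(range(block_end - tail + 1, block_end + 1))
    return ranks


ranks = sample_ranks()
print(len(ranks), ranks[0], ranks[-1])
```

Given a frequency-sorted list of bigrams, the sample would then be the entries at these ranks (index rank - 1 in a zero-based list).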