Word lists, n-gram models, lexicons and language databases

We use our multi-billion-word samples of authentic language, called text corpora, to generate tailor-made language databases and lexical data according to the customer’s requirements.

The size of the corpora is key to a truly representative sample of language. While it is easy to download databases of the top few thousand most frequent words in many languages, we can provide lists of millions of items. Such a list can be regarded as a complete lexicon of a language: for many languages, our corpora are large enough to generate a list of all the words in that language.

Even a small corpus will contain enough examples of the most frequent words, but size really matters with less frequent words or subject-specific vocabulary. A huge corpus is a must when information about such words is needed, so that a reasonable number of examples can be found and the analysis produces meaningful language data. Corpus size is also critical for the quality of collocations and thesaurus entries.

The language databases we supply range from simple word frequency lists and bigram or n-gram models to complex lexical data combining any of the language data types we offer.
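In principle, a word frequency list and an n-gram list are both counts over a tokenised corpus. The following minimal Python sketch illustrates the idea with a toy whitespace tokeniser; a production pipeline would of course use proper tokenisation, lemmatisation and a corpus far larger than one sentence.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count n-grams over a naive lowercase, whitespace tokenisation.

    This toy tokeniser is for illustration only; real corpus
    processing handles punctuation, casing and lemmatisation.
    """
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

text = "the cat sat on the mat and the cat slept"
unigrams = ngram_counts(text, 1)   # word frequency list
bigrams = ngram_counts(text, 2)    # bigram list
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

With `n=1` this yields a word frequency list, with `n=2` a bigram list, and so on for higher-order n-grams.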

A complex lexical database can consist of base forms (lemmas) and

  • part of speech
  • grammar labels
  • all word forms
  • thesaurus (synonyms and similar words)
  • good dictionary examples

and any other statistical, morphological or linguistic information derived from the corpus.
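One entry in such a combined lexical database could be sketched as a simple record; the field names below are illustrative assumptions, not a fixed delivery schema.

```python
from dataclasses import dataclass

@dataclass
class LexicalEntry:
    # Illustrative fields only; actual deliveries follow the
    # customer's agreed schema and format.
    lemma: str            # base form
    pos: str              # part of speech
    grammar_labels: list  # e.g. transitivity, countability
    word_forms: list      # all inflected forms
    synonyms: list        # thesaurus data
    examples: list        # good dictionary examples

entry = LexicalEntry(
    lemma="run",
    pos="verb",
    grammar_labels=["intransitive", "transitive"],
    word_forms=["run", "runs", "ran", "running"],
    synonyms=["sprint", "jog"],
    examples=["She runs every morning."],
)
print(entry.lemma, entry.word_forms)
```

Any further statistical or morphological attributes derived from the corpus would simply be additional fields on such a record.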

Data samples

Word frequency lists: English, Spanish, French, Arabic, Russian, Portuguese, Hindi. Bigram databases: English, Spanish, German, Russian.

Synonym database

We can deliver a synonym database (thesaurus) automatically calculated from a large text corpus, for use in search solutions and other software.
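The core idea behind an automatically calculated thesaurus is distributional: words that occur in similar contexts tend to have similar meanings. The sketch below shows one simplified version of this, comparing co-occurrence vectors with cosine similarity; it is a toy illustration, not the actual production method, which relies on much richer collocation statistics.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a context co-occurrence vector for each word,
    counting neighbours within +/- window positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

sentences = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate fish",
    "the dog ate fish",
]
vecs = cooccurrence_vectors(sentences)
# "cat" and "dog" share identical contexts here, so they score
# as close synonym candidates.
print(round(cosine(vecs["cat"], vecs["dog"]), 2))
```

Run over a multi-billion-word corpus instead of four toy sentences, ranking every word pair this way yields the kind of "synonyms and similar words" data listed above.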

Good Dictionary Examples

We use the GDEX technology to automatically identify sentences useful as example sentences for dictionaries and language teaching materials.
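GDEX ranks candidate sentences by how suitable they are as dictionary examples. The toy scorer below mimics the flavour of that ranking with a few simple heuristics (mid-range length, frequent vocabulary, full-sentence form); the thresholds and weights are invented for illustration, and the real GDEX technology uses a much richer set of features.

```python
def gdex_score(sentence, common_words, min_len=6, max_len=20):
    """Toy GDEX-style score in [0, 1]: favour mid-length sentences
    built from frequent words that look like complete sentences.
    Weights and thresholds are illustrative assumptions."""
    tokens = sentence.rstrip(".").split()
    if not tokens:
        return 0.0
    score = 0.0
    if min_len <= len(tokens) <= max_len:
        score += 0.5  # comfortable length for a dictionary example
    frequent = sum(t.lower() in common_words for t in tokens)
    score += 0.4 * frequent / len(tokens)  # avoid rare vocabulary
    if sentence[:1].isupper() and sentence.endswith("."):
        score += 0.1  # looks like a complete sentence
    return score

common = {"the", "a", "cat", "sat", "on", "mat", "and",
          "dog", "was", "there"}
good = "The cat sat on the mat and the dog was there."
bad = "cat mat"
print(gdex_score(good, common) > gdex_score(bad, common))
```

The highest-scoring sentences become candidates for dictionaries and language teaching materials; in practice a lexicographer still makes the final selection.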