Word lists, n-gram databases, lexicons and other language data

We use our multi-billion-word text corpora, large samples of authentic language, to generate tailor-made language databases and lexical data according to the customer’s requirements.

Corpus size is key to a truly representative sample of language. While it is easy to download lists of the few thousand most frequent words in many languages, we can provide lists of millions of items. Such a list can be regarded as a complete lexicon of a language. Our corpora for many languages are large enough to generate a list of all words in a language.

Even a small corpus contains enough examples of the most frequent words, but size really matters for less frequent words and subject-specific vocabulary. A huge corpus is a must when information about such words is needed, so that a reasonable number of examples is found and the analysis produces meaningful language data. Corpus size is also critical for the quality of collocation and thesaurus data.

The language databases we supply range from a simple word database or a bigram, trigram or other n-gram list to complex lexical data combining any of the language data types we offer.

A complex lexical database can consist of the base form (lemma) and

  • part of speech
  • grammar labels
  • all word forms
  • thesaurus (synonyms and similar words)
  • good dictionary example candidates

or any other linguistic information or data the customer might need.
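As a rough illustration of how one record in such a database might be laid out, here is a minimal Python sketch. The class name `LexicalEntry`, its field names and the sample values for "run" are all hypothetical, invented for this example; the actual delivery format depends on the customer's requirements.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class LexicalEntry:
    """One hypothetical record: a lemma plus the data types listed above."""
    lemma: str
    part_of_speech: str
    grammar_labels: list[str] = field(default_factory=list)
    word_forms: list[str] = field(default_factory=list)
    thesaurus: list[str] = field(default_factory=list)
    example_candidates: list[str] = field(default_factory=list)


# Illustrative values only, not taken from any real corpus.
entry = LexicalEntry(
    lemma="run",
    part_of_speech="verb",
    grammar_labels=["intransitive", "transitive"],
    word_forms=["run", "runs", "ran", "running"],
    thesaurus=["sprint", "jog"],
    example_candidates=["She runs five kilometres every morning."],
)

# Serialise the record, e.g. as one line of a JSON Lines file.
record = json.dumps(asdict(entry), ensure_ascii=False)
print(record)
```

Keeping each entry as a flat record like this makes it easy to add or drop data types per customer without changing the rest of the structure.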

bigram, trigram & n-gram database

We have the expertise to generate bigram, trigram and other n-gram databases in 90+ languages for use in typing prediction and correction applications.
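To show what an n-gram database contains and how a typing-prediction application might use one, here is a minimal Python sketch. It assumes simple whitespace tokenisation and a three-sentence toy sample; the function names `count_ngrams` and `predict_next` are hypothetical, and a production database would be built from a full corpus with proper tokenisation.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenisation (assumption)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(bigram_counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = {
        pair[1]: count
        for pair, count in bigram_counts.items()
        if pair[0] == word
    }
    return max(followers, key=followers.get) if followers else None


# Toy sample standing in for a corpus.
sample = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]

bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)

print(bigrams.most_common(3))
print(predict_next(bigrams, "the"))
```

With only a handful of sentences, most n-grams occur once, which is why reliable prediction needs counts drawn from a very large corpus.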

synonym database
thesaurus

We can deliver a synonym database (thesaurus) automatically calculated from a large text corpus for use in search solutions and other software.
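The text does not describe how the thesaurus is calculated. As a generic illustration of one common approach, the Python sketch below treats words that occur in similar contexts as similar and compares their context counts with cosine similarity. The function names, the window size and the toy sentences are all assumptions made for this example, not a description of the method actually used.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenisation (assumption)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_words(vectors, word, top=3):
    """Rank all other words by context similarity to `word`."""
    target = vectors.get(word)
    if target is None:
        return []
    scores = {
        other: cosine(target, vec)
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy sample standing in for a corpus.
sample = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]

vectors = context_vectors(sample)
print(similar_words(vectors, "cat", top=1))
```

In this toy sample "cat" and "dog" share exactly the same neighbours, so they rank as most similar; on real data the quality of such rankings depends heavily on corpus size.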

Good Dictionary Examples
GDEX

We use the GDEX technology to automatically identify sentences suitable as examples for dictionaries and language teaching materials.