Word lists, n-gram models, lexicons and language databases
We use our multi-billion-word samples of authentic language, called text corpora, to generate tailor-made language databases and lexical data according to the customer’s requirements.
Corpus size is key to obtaining a truly representative sample of language. While databases of the few thousand most frequent words in many languages are easy to download, we can provide lists of millions of items. Such a list can be regarded as a complete database of the lexicon of a language: for many languages, our corpora are large enough to generate a list of all the words in the language.
Even a small corpus will contain enough examples of the most frequent words, but size really matters for less frequent words and subject-specific vocabulary. A huge corpus is a must when information about such words is needed, so that a reasonable number of examples can be found and the analysis produces meaningful language data. Corpus size is also critical for the quality of collocations and the thesaurus.
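As a minimal illustration of the idea (a toy sketch, not our production pipeline; the sample text and tokenisation are invented), a frequency word list is in essence a token count over the corpus:

```python
from collections import Counter

def frequency_list(tokens):
    """Rank words by corpus frequency, most frequent first."""
    return Counter(tokens).most_common()

# Toy corpus: real lists are computed over billions of tokens,
# which is what makes the counts for rare words reliable.
corpus = "the cat sat on the mat and the dog sat too".split()
top = frequency_list(corpus)
```

The counts for frequent words stabilise quickly even on small samples; it is the long tail of rare words that demands a very large corpus.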
A complex lexical database can combine base forms (lemmas) with:
- part of speech
- grammar labels
- all word forms
- thesaurus (synonyms and similar words)
- good dictionary examples
and any other statistical, morphological or linguistic information derived from the corpus.
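For illustration only (the field names and values below are invented, not our delivery format, which is tailored to each project), a single entry in such a database might look like:

```python
# Hypothetical entry for the English verb "run"; fields mirror the
# components listed above (lemma, part of speech, grammar labels,
# word forms, thesaurus, example sentences, corpus statistics).
entry = {
    "lemma": "run",
    "pos": "verb",
    "grammar": ["irregular"],
    "forms": ["run", "runs", "ran", "running"],
    "thesaurus": ["sprint", "jog", "dash"],
    "examples": ["She runs every morning before work."],
    "frequency_per_million": 312.4,  # invented statistic
}
```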
We provide word databases, lexicons, dictionaries and other language databases generated from large annotated text corpora.
Our multi-billion-word corpora make it possible to generate n-gram databases and n-gram models containing billions of items in 90+ languages.
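In essence, an n-gram database records how often every sequence of n consecutive tokens occurs in the corpus. A toy sketch (the sample sentence is invented):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "we generate n-gram databases from large text corpora".split()
bigrams = ngram_counts(tokens, 2)
# each adjacent pair, e.g. ('large', 'text'), is stored with its count
```

At corpus scale the same counting is done with streaming and sharding rather than in-memory counters, but the resulting database has the same shape: n-gram to frequency.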
We can deliver a synonym database (thesaurus) automatically computed from a large text corpus, for use in search solutions and other software.
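A standard way to compute such a thesaurus automatically is distributional similarity: words that occur in similar contexts receive similar context vectors and therefore high similarity scores. A toy sketch (the window-based contexts here are a simplification; production systems use much richer grammatical contexts and far more data):

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens, window=2):
    """Build one co-occurrence Counter per word over a +/-window span."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Words that share many contexts score close to 1; unrelated words score near 0, which is what lets the thesaurus rank "similar words" for every headword.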
We possess advanced technology for identifying typical word combinations (collocations) and can generate such a database for every word in a language.
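Collocation candidates are typically ranked by an association score rather than raw frequency. One widely used measure in corpus tools is logDice (Rychlý 2008); a minimal sketch, assuming the raw frequencies have already been counted (the example counts are invented):

```python
from math import log2

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2*f_xy / (f_x + f_y)).
    Capped at 14, reached when x and y only ever occur together;
    unlike raw counts, scores are comparable across corpus sizes."""
    return 14 + log2(2 * f_xy / (f_x + f_y))

# e.g. the pair occurs 50 times, word x 400 times, word y 600 times
score = log_dice(50, 400, 600)
```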
We use GDEX technology to automatically identify sentences suitable as example sentences for dictionaries and language teaching materials.
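GDEX itself is a configurable classifier; as a rough, invented approximation of the underlying idea, a scorer might prefer mid-length sentences built from frequent vocabulary:

```python
def example_score(sentence, common_words, min_len=7, max_len=25):
    """Toy stand-in for GDEX-style scoring: real configurations also
    penalise rare words, pronouns without antecedents, etc."""
    tokens = sentence.lower().strip(".").split()
    familiar = sum(t in common_words for t in tokens) / len(tokens)
    length_bonus = 1.0 if min_len <= len(tokens) <= max_len else 0.5
    return familiar * length_bonus
```

Sentences that are a comfortable length and contain only familiar words score highest, which is exactly the property a good dictionary example should have.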