We can generate word databases (sometimes referred to as dictionaries or lexicons) and frequency lists of the most frequent word forms or lemmas. The lists can be enriched with part of speech or any other information that can be retrieved from the corpus.

Word databases of all words in a language

Our corpora are large enough to generate a database of all words in a language. Such a database can contain millions of words. It can be filtered according to the customer's criteria and prepared for download in a number of formats.

We can meet any formatting requirements specified by the customer.
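As a rough illustration of filtering and format conversion, the Python sketch below filters a handful of entries by grammatical tag and minimum frequency, then serializes the result as CSV or JSON. The three rows are taken from the Estonian example on this page. The field names (`word_form`, `lemma`, `tag`, `frequency`) and the helper functions `filter_rows`, `to_csv` and `to_json` are assumptions made for this sketch and do not describe the format of a delivered database.

```python
import csv
import io
import json

# Field names are hypothetical; a delivered database may use different ones.
FIELDS = ["word_form", "lemma", "tag", "frequency"]

# Three rows copied from the Estonian example on this page.
ROWS = [
    {"word_form": "eestlased", "lemma": "eestlane", "tag": "S", "frequency": 12529},
    {"word_form": "edukalt", "lemma": "edukalt", "tag": "D", "frequency": 12419},
    {"word_form": "esineb", "lemma": "esinema", "tag": "V", "frequency": 12126},
]


def filter_rows(rows, tags=None, min_frequency=0):
    """Keep rows whose tag is in `tags` (if given) and whose
    frequency is at least `min_frequency`."""
    return [
        row
        for row in rows
        if (tags is None or row["tag"] in tags)
        and row["frequency"] >= min_frequency
    ]


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows as JSON text, keeping non-ASCII characters readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


selected = filter_rows(ROWS, tags={"S", "V"}, min_frequency=12200)
print(to_csv(selected))
print(to_json(selected))
```

The same filtered rows can be written in either format, so the choice of output format stays independent of the filtering criteria.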

An example of an Estonian frequency word list showing the word form, lemma, grammatical tag and frequency:

eestlased eestlane S 12529 
esindaja esindaja S 12471 
edukalt edukalt D 12419 
eestlaste eestlane S 12370 
esineb esinema V 12126 
esindajad esindaja S 11809 
ehitada ehitama V 11763
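A list in this four-column layout is straightforward to process. The Python sketch below parses the rows shown above and sums the frequencies of all word forms that share a lemma. It assumes whitespace-separated columns in the order word form, lemma, tag, frequency. The names `Entry`, `parse_list` and `lemma_frequencies` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    word_form: str
    lemma: str
    tag: str
    frequency: int


# The sample rows from the Estonian list above.
SAMPLE = """\
eestlased eestlane S 12529
esindaja esindaja S 12471
edukalt edukalt D 12419
eestlaste eestlane S 12370
esineb esinema V 12126
esindajad esindaja S 11809
ehitada ehitama V 11763
"""


def parse_list(text):
    """Parse whitespace-separated rows into Entry records.
    Assumes exactly four columns per non-empty line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word_form, lemma, tag, frequency = line.split()
        entries.append(Entry(word_form, lemma, tag, int(frequency)))
    return entries


def lemma_frequencies(entries):
    """Sum the frequencies of all word forms belonging to each lemma."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry.lemma] += entry.frequency
    return dict(totals)


entries = parse_list(SAMPLE)
totals = lemma_frequencies(entries)
for lemma, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(lemma, total)
```

For the seven rows shown, the two forms of eestlane add up to 24899 and the two forms of esindaja to 24280; these totals cover only the sample rows, not the full corpus.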

Word database, lexicon or dictionary available in these languages

Please contact us if you need a word database in a language that is not listed here.

Supported languages

Afrikaans
Albanian
Amharic
Arabic
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Bulgarian
Catalan
Chinese Simplified
Chinese Traditional
Croatian
Czech
Danish
Dutch
English
Estonian
Filipino
Finnish
French
Frisian
Georgian
German
Greek
Gujarati
Hausa (Boko)
Hebrew
Hindi
Hungarian
Icelandic
Igbo
Indonesian
Irish
Italian
Japanese
Kannada
Kazakh
Korean
Kyrgyz
Latin
Latvian
Lithuanian
Macedonian
Malayalam
Malay
Maltese
Maori
Mongolian
Nepali
N'Ko
Norwegian Bokmål
Norwegian
Norwegian Nynorsk
Oromo
Persian
Polish
Portuguese
Punjabi (Shahmukhi)
Romanian
Russian
Samoan
Scottish Gaelic
Serbian (Latin)
Serbian
Setswana
Slovak
Slovenian
Somali
Spanish
Swahili
Swedish
Tajik
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Turkish
Turkmen
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Yoruba