Collocations = typical word combinations

We supply high-quality collocation databases in many languages. The lists are generated automatically from enormous databases of authentic text (text corpora) produced by real users of the language. Our largest corpora in major languages contain over 10,000,000,000 words of text. Using authentic texts as the source ensures that the database reflects real-life use of the language.

Collocation databases are important for language learning products and also for digital typing assistants designed to suggest more natural and idiomatically correct words.

Collocations are identified by the Word Sketch technology in Sketch Engine and supplied as a database download in a number of formats.

Data quality

A relatively small corpus is generally sufficient to generate linguistic data for the 5,000 most frequent words because such words appear frequently enough in any text.

However, an enormous text database (corpus) is required to generate the same information for rare and infrequently used words. The only viable way to build corpora of billions of words is to download content from the web automatically. Lexical Computing has developed a sophisticated procedure for collecting only linguistically valuable content. A series of tools is used to focus on the right content and to deduplicate and clean the downloaded content. This ensures that the statistics are not skewed by repeated or low-quality text. This blog post gives more details.
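To illustrate why deduplication matters for frequency statistics, here is a minimal sketch of exact-duplicate removal at paragraph level. It is not Lexical Computing's procedure, which is not described here; the function names `normalize` and `deduplicate` and the sample paragraphs are hypothetical, and real web-corpus pipelines typically also handle near-duplicates.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate(paragraphs):
    """Drop repeated paragraphs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = normalize(paragraph)
        if not key:
            continue  # skip empty paragraphs
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(paragraph)
    return unique


# Hypothetical downloaded content: boilerplate and a reposted sentence repeat.
downloaded = [
    "We need to solve the problem.",
    "Click here to subscribe to our newsletter.",
    "we need to  solve the problem.",
    "Click here to subscribe to our newsletter.",
    "The problem was solved quickly.",
]
cleaned = deduplicate(downloaded)
```

Without this step, the repeated paragraphs would be counted several times and would inflate the frequency of the word combinations they contain.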

Collocation database size

We are able to generate collocation databases of millions of items for all major languages. For less widely used languages, the size of the database is limited by the size of the source corpus. The actual size depends on the specifications provided by the customer.

Enriched collocation database

We are also able to provide additional information such as POS tags, lemmas, frequency ranks, or any other statistics or morphological information.

Collocation database sample

The easiest way to obtain a sample is to register a free trial account in Sketch Engine and use the word sketch tool to generate a wordlist. The advanced tab of the wordlist tool allows detailed specifications to be set.

Prices

We will provide a quotation based on the exact specifications and the intended use of the database.

Collocation database download

The database will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex databases can be computationally demanding and can take longer to produce.

Our natural language processing API supports the retrieval of collocations in real time.

The first few collocations of the Spanish word problema, each with its frequency and collocation strength score.

object of     frequency   strength
----------------------------------
resolver      179825      11.38
solucionar    116224      10.99
tener         401288      9.09
enfrentar     29200       8.82
haber         143262      8.75
causar        22885       8.41
plantear      21728       8.33
evitar        26164       8.3
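As a rough illustration of how such rows might be processed, the sketch below parses the sample above into records and compares the two measures. The whitespace-separated layout is taken from the sample only; the actual download formats are not specified here, and the names `Collocation`, `parse_rows` and `SAMPLE` are hypothetical.

```python
from typing import NamedTuple


class Collocation(NamedTuple):
    collocate: str
    frequency: int
    score: float  # strength of collocation


# Rows copied from the "object of" sample for the Spanish word problema.
SAMPLE = """\
resolver 179825 11.38
solucionar 116224 10.99
tener 401288 9.09
enfrentar 29200 8.82
haber 143262 8.75
causar 22885 8.41
plantear 21728 8.33
evitar 26164 8.3
"""


def parse_rows(text):
    """Parse 'collocate frequency score' lines, skipping anything else."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore headers, rules, and blank lines
        word, freq, score = parts
        rows.append(Collocation(word, int(freq), float(score)))
    return rows


rows = parse_rows(SAMPLE)
most_frequent = max(rows, key=lambda r: r.frequency)
strongest = max(rows, key=lambda r: r.score)
print(most_frequent.collocate, strongest.collocate)
```

In this sample the two measures disagree: tener is the most frequent collocate, while resolver has the highest strength score. Ranking by strength rather than raw frequency keeps very common verbs from dominating the list.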

A small sample of collocations generated for the word issue in English.

Collocation database in these languages

Collocations for more languages can be made available or developed on request.

Supported languages

Afrikaans
Albanian
Amazigh
Amharic
Ancient Greek
Arabic
Armenian
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Breton
Bulgarian
Burmese
Cantonese
Catalan
Cebuano
Chinese Simplified
Chinese Traditional
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Filipino
Finnish
French
Frisian
Georgian
German
Greek
Gujarati
Hausa (Boko)
Hebrew
Hindi
Hungarian
Icelandic
Igbo
Indonesian
Irish
Italian
Japanese
Kalaamaya
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kurdish (Sorani)
Kuwarra
Kyrgyz
Lao
Latin
Latvian
Limburgish
Lithuanian
Macedonian
Maduwongga
Malay
Malayalam
Maldivian
Maltese
Mankulatjarra
Manyjiljar
Maori
Marathi
Marlpa
Mirning
Mongolian
Montenegrin
N'Ko
Ndebele
Nepali
Newspeak
Ngaanyatjarra
Ngaju
Ngalia
Nganta
Northern Sotho
Norwegian Bokmål
Norwegian
Norwegian Nynorsk
Nyakinyaki
Oromo
Pashto
Pintupi
Pitjantjatjara
Polish
Portuguese
Punjabi (Gurmukhi)
Punjabi (Shahmukhi)
Romanian
Russian
Samoan
Sanskrit (romanised)
Scottish Gaelic
Serbian
Serbian (Latin)
Setswana
Sinhalese
Slovak
Slovenian
Somali
Spanish
Swahili
Swazi
Swedish
Syriac
Tagalog
Tajik
Talysh
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Tjalkatjarra
Tjupan
Tsonga
Turkish
Turkmen
Ukrainian
Urdu
Uzbek
Vietnamese
Wangkatja
Warlpiri
Welsh
Wudjaarri
Xhosa
Yankunytjatjara
Yiddish
Yoruba
Zulu