Languages and corpora

Already available

Our data, tools and services are, in most cases, based on a large sample of language called a corpus. Word lists, n-grams, lexical databases and any other data we supply are generated from these corpora. We are constantly developing new corpora and increasing our coverage of languages. The languages and corpora currently available are listed below.
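To make the idea concrete, here is a minimal sketch of how a word list and n-gram counts can be derived from a corpus. It is an illustration only, not the pipeline used to build the data described here: the toy corpus, the naive regex tokenizer and the function names `tokenize`, `word_list` and `ngrams` are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def word_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields the n-token windows.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical toy corpus; a real corpus holds millions or billions of tokens.
corpus = "The cat sat on the mat. The cat slept."
tokens = tokenize(corpus)

print(word_list(tokens)[:2])             # the two most frequent words
print(ngrams(tokens, 2).most_common(1))  # the most frequent bigram
```

Real corpora need language-specific tokenization (and, for tagged corpora, part-of-speech tagging), and n-grams would normally not be counted across sentence boundaries as this sketch does.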

Language support development

We have ample experience in developing support for new languages and building new text corpora. If your language is not currently supported, or you need new data, please request that support be developed.

Languages and corpora already available

Language | Reference corpus | Part-of-speech tagging available for user corpora | Word sketches available for user corpora
Afrikaans | Afrikaans Wikipedia corpus 2018 (afwiki) (17,815,170 tokens; with word sketches) | no | no
Albanian | OPUS2 Albanian (55,099,328 tokens; tagged; with word sketches) | no | no
Amharic | Amharic Web 2013-17 (amWaC17) (30,525,876 tokens; tagged; with word sketches) | no | no
Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) (8,322,097,229 tokens; tagged; with word sketches) | yes | yes
Azerbaijani | Turkic web – Azerbaijani (115,280,755 tokens; with word sketches) | no | no
Basque | Basque Web (BasqueWaC v2) (123,856,183 tokens; tagged; with word sketches) | no | no
Belarusian | Belarusian Web 2016 (beTenTen16) (80,481,654 tokens; with word sketches) | no | no
Bengali | Bengali Web (bnWaC) (13,752,575 tokens; tagged; with word sketches) | no | no
Bosnian | Bosnian Web (bsWaC 1.2) (286,865,790 tokens; tagged; with word sketches) | no | no
Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) (843,328,184 tokens; tagged; with word sketches) | yes | yes
Cantonese | Cantonese Web (CantoneseWaC) (42,669,084 tokens) | no | no
Catalan | Catalan Web 2014 (caTenTen14 v2) (210,513,880 tokens; tagged; with word sketches) | yes | yes
Cebuano | none | no | no
Chinese Simplified | Chinese Web 2011 (zhTenTen11) (273,808,472 tokens; tagged; with word sketches) | yes | yes
Chinese Traditional | zhTenTen [2011] (2,106,661,021 tokens; tagged; with word sketches) | yes | yes
Croatian | none | yes | yes
Cundeelee Wangka | none | no | no
Czech | Czech Web 2012 (csTenTen12 v9) (5,069,447,935 tokens; tagged; with word sketches) | yes | yes
Danish | none | yes | yes
Dutch | none | yes | yes
English | English Web 2013 (enTenTen13) (22,728,686,012 tokens; tagged; with word sketches) | yes | yes
Estonian | Estonian Web 2013 (etTenTen13) (330,045,196 tokens; tagged; with word sketches) | yes | yes
Filipino | Filipino Web (FilipinoWaC) (31,845,404 tokens; tagged; with word sketches) | no | no
Finnish | Finnish Web 2014 (fiTenTen14, TreeTagger v2) (1,703,429,270 tokens; tagged; with word sketches) | yes | yes
French | French Web 2012 (frTenTen12, old word sketches) (11,444,973,582 tokens; tagged; with word sketches) | yes | yes
Frisian | Western Frisian Web 2013 (FrisianWaC) (3,738,968 tokens; with word sketches) | no | no
Georgian | Georgian Web (georgianWaC) (63,632,861 tokens) | no | no
German | none | yes | yes
Greek | Greek Web 2014 (elTenTen14) (1,959,880,741 tokens; tagged; with word sketches) | yes | yes
Gujarati | Gujarati Web (GujarathiWaC) (22,201,247 tokens; tagged; with word sketches) | no | no
Hausa (Boko) | Hausa Web 2015 (hausaWaC15) (6,913,007 tokens; with word sketches) | no | no
Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) (1,061,788,271 tokens) | yes | no
Hindi | Hindi Web (HindiWaC v. 4) (120,600,574 tokens; tagged; with word sketches) | no | no
Hungarian | none | yes | yes
Icelandic | Icelandic texts [sample] (9,968,822 tokens; with word sketches) | no | no
Igbo | Igbo Web 2015 (IgboWaC15) (396,276 tokens; with word sketches) | no | no
Indonesian | Indonesian Web (IndonesianWaC) (109,281,359 tokens; tagged; with word sketches) | no | no
Irish | New Corpus for Ireland (NCI Irish) (34,358,267 tokens; tagged; with word sketches) | no | no
Italian | Italian Web 2016 (itTenTen16) (5,864,495,700 tokens; tagged; with word sketches) | yes | yes
Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) (203,674,569 tokens; tagged; with word sketches) | yes | yes
Kannada | Kannada Web 2012 (kannadaWaC12) (16,031,481 tokens; tagged; with word sketches) | no | no
Kazakh | Turkic web – Kazakh (175,445,327 tokens; with word sketches) | no | no
Khmer | Khmer Web 2018 (kmTenTen18) (17,892,617 tokens; with word sketches) | no | no
Korean | Korean Web 2012 (koTenTen12) (258,038,328 tokens; tagged; with word sketches) | yes | yes
Kyrgyz | Turkic web – Kyrgyz (24,084,100 tokens; with word sketches) | no | no
Lao | Lao Web 2018 (loTenTen18) (17,425,528 tokens; with word sketches) | no | no
Latin | LatinISE historical corpus v2.2 (13,180,571 tokens; tagged; with word sketches) | no | no
Latvian | Latvian Web 2014 (lvTenTen14) (657,522,048 tokens; tagged; with word sketches) | yes | no
Lithuanian | Lithuanian Web 2014 (ltTenTen14) (981,517,649 tokens) | no | no
Macedonian | OPUS2 Macedonian (49,066,513 tokens; tagged; with word sketches) | no | no
Malayalam | Malayalam Web (malayalamWaC) (21,193,984 tokens; tagged; with word sketches) | no | no
Malay | Malaysian Web (MalaysianWaC) (230,509,568 tokens; tagged; with word sketches) | no | no
Maltese | Maltese MLRS Corpus (125,267,653 tokens; tagged; with word sketches) | no | no
Maori | Maori Web (MaoriWaC) (8,351,983 tokens; with word sketches) | no | no
Mongolian | none | no | no
Montenegrin | none | no | no
Nepali | Nepali National Corpus (15,137,459 tokens; tagged; with word sketches) | no | no
N'Ko | none | no | no
Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) (2,904,004,732 tokens; tagged; with word sketches) | yes | yes
Norwegian (Mixed) | Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) (1,953,892,201 tokens; tagged; with word sketches) | no | yes
Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) (208,670,022 tokens; tagged; with word sketches) | yes | yes
Oromo | Oromo Web 2016 (orWaC16) (5,091,696 tokens; tagged; with word sketches) | no | no
Persian | OPUS2 Persian (5,367,401 tokens; tagged; with word sketches) | no | no
Polish | Polish Web 2012 (plTenTen12, RFTagger) (239,929,517 tokens; tagged; with word sketches) | yes | yes
Portuguese | Portuguese Web 2011 (ptTenTen11) (241,092,685 tokens; tagged; with word sketches) | yes | yes
Punjabi (Shahmukhi) | none | no | no
Romanian | Romanian Web 2016 (roTenTen16) (3,142,636,172 tokens; tagged; with word sketches) | yes | yes
Russian | Russian Web 2011 (ruTenTen11) (18,280,486,876 tokens; tagged; with word sketches) | yes | yes
Samoan | Samoan Web (SamoanWac1) (3,583,362 tokens; with word sketches) | no | no
Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) (1,223,562 tokens; with word sketches) | no | no
Serbian (Latin) | none | yes | yes
Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) (562,309,740 tokens; tagged; with word sketches) | yes | yes
Setswana | Setswana/Tswana Web (SetswanaWaC v2) (13,511,692 tokens; tagged; with word sketches) | no | no
Slovak | Slovak Web 2011 (skTenTen11) (228,999,687 tokens; tagged; with word sketches) | yes | yes
Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) (233,326,266 tokens; tagged; with word sketches) | yes | yes
Somali | Somali Web 2016 (soWaC16) (79,741,231 tokens; tagged; with word sketches) | no | no
Spanish | Spanish Web 2011 (esTenTen11, Eu + Am, Freeling v4) (10,994,616,207 tokens; tagged; with word sketches) | yes | yes
Swahili | Swahili Web 2014 (SwahiliWaC) (21,359,529 tokens; tagged; with word sketches) | yes | yes
Swedish | none | yes | yes
Tagalog | none | no | no
Tajik | Tajik Web (TajikWaC) (109,805,133 tokens; tagged; with word sketches) | no | no
Tamil | Tamil Web 2015 (TamilWaC) (32,861,569 tokens; tagged; with word sketches) | no | no
Tatar | Tatar Mixed Corpus (131,269,704 tokens; tagged; with word sketches) | no | no
Telugu | Telugu Web (TeluguWaC) (4,697,932 tokens; tagged; with word sketches) | no | no
Thai | Thai Web (ThaiWaC) (108,013,897 tokens; tagged; with word sketches) | no | no
Tibetan | Tibetan Corpus 2 (91,107,466 tokens; tagged; with word sketches) | yes | yes
Tigrinya | Tigrinya Web 2016 (tiWaC16) (2,531,443 tokens; tagged; with word sketches) | no | no
Turkish | Turkish Web 2012 (trTenTen12) (4,124,133,118 tokens; tagged; with word sketches) | yes | no
Turkmen | Turkic web – Turkmen (2,536,935 tokens; with word sketches) | no | no
Ukrainian | Ukrainian Web 2014 (ukTenTen14) (2,734,851,744 tokens) | no | no
Urdu | Urdu Web (UrduWaC) (60,808,847 tokens) | no | no
Uzbek | Turkic web – Uzbek (24,570,516 tokens; with word sketches) | no | no
Vietnamese | Vietnamese Web (VietnameseWaC) (129,781,089 tokens; tagged; with word sketches) | no | no
Welsh | Welsh Web 2013 (WelshWaC) (14,786,791 tokens; tagged; with word sketches) | no | no
Yoruba | Yoruba Web 2015 (YorubaWaC15) (3,500,353 tokens; with word sketches) | no | no
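
The reference corpus entries above follow a loose pattern: a corpus name, then a final parenthesised group giving the token count and, optionally, the flags "tagged" and "with word sketches". For anyone who wants to sort or filter the list by size, the sketch below shows one way such an entry could be parsed. The function name `parse_corpus` and the regular expression are assumptions based only on the entries as printed here, not an official format or API.

```python
import re

# Matches the trailing "(N tokens; flag; flag)" group of a corpus entry.
SIZE_RE = re.compile(
    r"\((?P<tokens>[\d,]+) tokens(?P<flags>(?:; [^;)]+)*)\)\s*$"
)


def parse_corpus(entry):
    """Split a reference corpus entry into name, token count and flags.

    Returns None when the entry has no size group (for example "none").
    """
    match = SIZE_RE.search(entry)
    if match is None:
        return None
    flags = [f.strip() for f in match.group("flags").split(";") if f.strip()]
    return {
        "name": entry[:match.start()].strip(),
        "tokens": int(match.group("tokens").replace(",", "")),
        "tagged": "tagged" in flags,
        "word_sketches": "with word sketches" in flags,
    }


arabic = parse_corpus(
    "Arabic Web 2012 (arTenTen12, Stanford tagger) "
    "(8,322,097,229 tokens; tagged; with word sketches)"
)
print(arabic["name"], arabic["tokens"])
```

Anchoring the pattern at the end of the string keeps earlier parenthesised parts of the name, such as "(arTenTen12, Stanford tagger)", inside the `name` field.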