Languages and corpora

Already available

Most of our data, tools and services are based on a large sample of language called a corpus. Word lists, n-grams, lexical databases and any other data we supply are generated from these corpora. We are constantly developing new corpora and increasing our coverage of languages. The languages and corpora currently available are listed below.
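To illustrate how a word list and n-grams can be derived from a corpus, here is a minimal Python sketch. It is not the pipeline used to produce the data described here: the toy corpus, the whitespace tokenizer and the helper names `tokenize` and `ngrams` are assumptions made only for the example, and real corpora need language-specific tokenization.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy corpus; a real corpus holds millions or billions of tokens.
corpus = "the cat sat on the mat and the cat slept"

tokens = tokenize(corpus)
word_list = Counter(tokens)           # frequency-ranked word list
bigrams = Counter(ngrams(tokens, 2))  # frequency-ranked 2-grams

print(word_list.most_common(2))  # [('the', 3), ('cat', 2)]
print(bigrams.most_common(1))    # [(('the', 'cat'), 2)]
```

The same counting approach extends to longer n-grams by changing `n`. The larger the corpus, the more reliable the resulting frequencies, which is why the token counts in the table below matter.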

Language support development

We have ample experience in developing support for new languages and building new text corpora. If your language is not currently supported, or you need new data, please request that support be developed.

Languages and corpora already available

Language | Reference corpus | Part-of-speech tagging available for user corpora | Word sketches available for user corpora
Afrikaans | OPUS2 Afrikaans (743,954 tokens; tagged; with word sketches) | no | no
Albanian | OPUS2 Albanian (55,099,328 tokens; tagged; with word sketches) | no | no
Amharic | none | no | no
Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) (8,322,097,229 tokens; tagged; with word sketches) | yes | yes
Azerbaijani | Turkic web – Azerbaijani (115,280,755 tokens; with word sketches) | no | no
Basque | Basque Web (BasqueWaC v2) (123,856,183 tokens; tagged; with word sketches) | no | no
Belarusian | none | no | no
Bengali | Bengali Web (BengaliWaC) (13,752,575 tokens; tagged; with word sketches) | no | no
Bosnian | Bosnian Web (bsWaC 1.2) (286,865,790 tokens; tagged; with word sketches) | no | no
Bulgarian | Bulgarian Web 2012 (bgTenTen12, TreeTagger v2) (843,328,184 tokens; tagged; with word sketches) | yes | yes
Catalan | Catalan Web 2014 (caTenTen14 v2) (210,513,880 tokens; tagged; with word sketches) | yes | no
Chinese Simplified | Chinese Web 2011 (zhTenTen11, old sketch grammar) (2,106,661,021 tokens; tagged; with word sketches) | yes | yes
Chinese Traditional | zhTenTen [2011] (2,106,661,021 tokens; tagged; with word sketches) | yes | yes
Croatian | none | yes | no
Czech | Czech Web 2012 (czTenTen12 v8) (5,069,447,935 tokens; tagged; with word sketches) | yes | yes
Danish | none | yes | yes
Dutch | Dutch Web 2014 (nlTenTen14) (3,013,056,738 tokens; tagged; with word sketches) | yes | yes
English | English Web 2013 (enTenTen13) (22,728,686,012 tokens; tagged; with word sketches) | yes | yes
Estonian | Estonian Web 2013 (etTenTen13) (330,045,196 tokens; tagged; with word sketches) | yes | no
Filipino | Filipino Web (FilipinoWaC) (31,845,404 tokens; tagged; with word sketches) | no | no
Finnish | Finnish Web 2014 (fiTenTen14, TreeTagger v2) (1,703,429,270 tokens; tagged; with word sketches) | yes | yes
French | French Web 2012 (frTenTen12) (11,444,973,582 tokens; tagged; with word sketches) | yes | yes
Frisian | Western Frisian Web 2013 (FrisianWaC) (3,738,968 tokens; with word sketches) | no | no
Georgian | Georgian Web (georgianWaC) (63,632,861 tokens) | no | no
German | none | yes | yes
Greek | Greek Web 2014 (old version) (1,958,348,129 tokens) | yes | yes
Gujarati | Gujarati Web (GujarathiWaC) (22,201,247 tokens; tagged; with word sketches) | no | no
Hausa (Boko) | Hausa Web 2015 (hausaWaC15) (6,913,007 tokens; with word sketches) | no | no
Hebrew | Hebrew Web 2014 (heTenTen14, no POS tagging) (1,061,788,271 tokens) | yes | no
Hindi | Hindi Web (HindiWaC v. 4) (120,600,574 tokens; tagged; with word sketches) | no | no
Hungarian | Araneum Hungaricum Maius [2014] (1,200,001,609 tokens; tagged; with word sketches) | yes | no
Icelandic | Icelandic texts [sample] (9,968,822 tokens; with word sketches) | no | no
Igbo | Igbo Web 2015 (IgboWaC15) (396,276 tokens; with word sketches) | no | no
Indonesian | Indonesian Web (IndonesianWaC) (109,281,359 tokens; tagged; with word sketches) | no | no
Irish | New Corpus for Ireland (NCI Irish) (34,358,267 tokens; tagged; with word sketches) | no | no
Italian | Italian Web 2010 (itTenTen) (3,076,908,415 tokens; tagged; with word sketches) | yes | yes
Japanese | Japanese Web 2011 sample (jpTenTen11, LUW) (203,674,569 tokens; tagged; with word sketches) | yes | yes
Kannada | Kannada Web 2012 (kannadaWaC12) (16,031,481 tokens; tagged; with word sketches) | no | no
Kazakh | Turkic web – Kazakh (175,445,327 tokens; with word sketches) | no | no
Korean | Korean Web 2012 (koTenTen12) (560,945,022 tokens; tagged; with word sketches) | yes | yes
Kyrgyz | Turkic web – Kyrgyz (24,084,100 tokens; with word sketches) | no | no
Latin | LatinISE historical corpus v2.2 (13,180,571 tokens; tagged; with word sketches) | no | no
Latvian | Latvian Web 2014 (lvTenTen14) (657,522,048 tokens; tagged; with word sketches) | yes | no
Lithuanian | Lithuanian Web 2014 (ltTenTen14) (981,517,649 tokens) | no | no
Macedonian | OPUS2 Macedonian (49,066,513 tokens; tagged; with word sketches) | no | no
Malayalam | Malayalam Web (malayalamWaC) (21,193,984 tokens; tagged; with word sketches) | no | no
Malay | Malaysian Web (MalaysianWaC) (230,509,568 tokens; tagged; with word sketches) | no | no
Maltese | Maltese MLRS Corpus (125,267,653 tokens; tagged; with word sketches) | no | no
Maori | Maori Web (MaoriWaC) (8,351,983 tokens; with word sketches) | no | no
Mongolian | none | no | no
Nepali | Nepali National Corpus (15,137,459 tokens; tagged; with word sketches) | no | no
N'Ko | none | no | no
Norwegian Bokmål | Norwegian Web 2015 (Bokmål) (1,364,503,936 tokens; tagged; with word sketches) | yes | yes
Norwegian | Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) (1,953,892,201 tokens; tagged; with word sketches) | no | no
Norwegian Nynorsk | Norwegian Web 2015 (Nynorsk) (50,014,992 tokens; tagged; with word sketches) | yes | yes
Oromo | none | no | no
Persian | OPUS2 Persian (5,367,401 tokens; tagged; with word sketches) | no | no
Polish | Polish Web 2012 (plTenTen12, WCRFT) (9,677,787,906 tokens; tagged; with word sketches) | yes | yes
Portuguese | Portuguese Web 2011 (ptTenTen11, Freeling v3) (4,626,584,246 tokens; tagged; with word sketches) | yes | yes
Punjabi (Shahmukhi) | none | no | no
Romanian | Romanian Web (roWaC) (53,457,522 tokens; tagged; with word sketches) | yes | yes
Russian | Russian Web 2011 (ruTenTen11) (18,280,486,876 tokens; tagged; with word sketches) | yes | yes
Samoan | Samoan Web (SamoanWac1) (3,583,362 tokens; with word sketches) | no | no
Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) (1,223,562 tokens; with word sketches) | no | no
Serbian (Latin) | none | yes | no
Serbian | none | yes | no
Setswana | Setswana/Tswana Web (SetswanaWaC v2) (13,511,692 tokens; tagged; with word sketches) | no | no
Slovak | Araneum Slovacum Maius [2013] (1,200,005,746 tokens; tagged; with word sketches) | yes | yes
Slovenian | Slovenian reference corpus (FidaPLUS v2) (738,503,145 tokens; tagged; with word sketches) | yes | yes
Somali | none | no | no
Spanish | Spanish Web 2011 (esTenTen11, Eu + Am, Freeling v4) (10,994,616,207 tokens; tagged; with word sketches) | yes | yes
Swahili | Swahili Web 2014 (SwahiliWaC) (21,359,529 tokens; tagged; with word sketches) | yes | yes
Swedish | none | yes | yes
Tajik | Tajik Web (TajikWaC) (109,805,133 tokens; tagged; with word sketches) | no | no
Tamil | Tamil Web 2015 (TamilWaC) (32,861,569 tokens; tagged; with word sketches) | no | no
Tatar | Tatar Web 2015 sample (290,351 tokens) | no | no
Telugu | Telugu Web (TeluguWaC) (4,697,932 tokens; tagged; with word sketches) | no | no
Thai | Thai Web (ThaiWaC) (108,013,897 tokens; tagged; with word sketches) | no | no
Tibetan | none | yes | no
Tigrinya | none | no | no
Turkish | Turkish Web 2012 (trTenTen12) (4,124,558,200 tokens) | no | no
Turkmen | Turkic web – Turkmen (2,536,935 tokens; with word sketches) | no | no
Ukrainian | Ukrainian Web 2014 (uaTenTen14) (2,734,851,744 tokens) | no | no
Urdu | Urdu Web (UrduWaC) (60,808,847 tokens) | no | no
Uzbek | Turkic web – Uzbek (24,570,516 tokens; with word sketches) | no | no
Vietnamese | Vietnamese Web (VietnameseWaC) (129,781,089 tokens; tagged; with word sketches) | no | no
Welsh | Welsh Web 2013 (WelshWaC) (14,786,791 tokens; tagged; with word sketches) | no | no
Yoruba | Yoruba Web 2015 (YorubaWaC15) (3,500,353 tokens; with word sketches) | no | no