
Languages and corpora

Already available

Our data, tools and services are, in most cases, based on a large sample of language called a corpus. Word lists, n-grams, lexical databases and any other data we supply are generated from these corpora. We are constantly developing new corpora and increasing our language coverage. The languages and corpora currently available are listed below.

Language support development

We have ample experience in developing support for new languages and building new text corpora. If your language is not yet supported or you need new data, please request that support be developed.

Languages and corpora already available

| Language | Reference corpus | Part-of-speech tagging available for user corpora | Word sketches available for user corpora |
|---|---|---|---|
| Afrikaans | OPUS2 Afrikaans (743,954 tokens; tagged; with word sketches) | no | yes |
| Albanian | OPUS2 Albanian (55,099,328 tokens; tagged; with word sketches) | no | no |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) (8,322,097,229 tokens; tagged; with word sketches) | yes | yes |
| Armenian | none | no | yes |
| Azerbaijani | Turkic web – Azerbaijani (115,280,755 tokens) | no | no |
| Basque | Basque Web (BasqueWaC) (123,856,183 tokens; tagged; with word sketches) | no | yes |
| Bengali | Bengali Web (BengaliWaC) (13,719,158 tokens; tagged; with word sketches) | no | no |
| Bosnian | Bosnian Web 2014 (BosnianWaC14) (290,176,507 tokens; tagged) | no | no |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12) (846,834,715 tokens; tagged; with word sketches) | yes | yes |
| Burmese | none | no | no |
| Catalan | Catalan Web 2014 (caTenTen14) (4,777,786,899 tokens; tagged) | no | yes |
| Chinese Simplified | Chinese Web 2011 (zhTenTen11) (2,106,661,021 tokens; tagged; with word sketches) | yes | yes |
| Chinese Traditional | zhTenTen [2011] (2,106,661,021 tokens; tagged; with word sketches) | yes | yes |
| Croatian | Croatian Web 2014 (hrWaC14) (1,404,262,704 tokens; tagged; with word sketches) | yes | yes |
| Czech | Czech Web 2012 (czTenTen12 v8) (5,069,447,935 tokens; tagged; with word sketches) | yes | yes |
| Danish | Danish Web 2014, old version (2,395,139,491 tokens; tagged; with word sketches) | no | yes |
| Dutch | Dutch Web 2014 (nlTenTen14) (3,013,056,738 tokens; tagged; with word sketches) | yes | yes |
| English | English Web 2012 (enTenTen12) (12,968,375,937 tokens; tagged; with word sketches) | yes | yes |
| Esperanto | none | no | no |
| Estonian | Estonian Web 2013 (etTenTen13) (330,045,196 tokens; tagged; with word sketches) | yes | yes |
| Filipino | Filipino Web Corpus (FilipinoWaC) (31,845,404 tokens; tagged; with word sketches) | no | no |
| Finnish | Finnish Web 2014 (fiTenTen14, TreeTagger v2) (1,703,429,270 tokens; tagged; with word sketches) | yes | no |
| French | French Web 2012 (frTenTen12) (11,444,973,582 tokens; tagged; with word sketches) | yes | yes |
| Frisian | Frisian web corpus (FrisianWaC) (3,738,968 tokens) | no | no |
| Galician | none | no | no |
| Georgian | Georgian Web Corpus (georgianWaC) (63,632,861 tokens) | no | no |
| German | German Web 2013 (deTenTen13) (19,918,263,493 tokens; tagged; with word sketches) | yes | yes |
| Greek | Greek Web 2014 (elTenTen14) (1,958,348,129 tokens) | no | yes |
| Gujarati | Gujarati Web Corpus (GujarathiWaC) (22,201,247 tokens; tagged; with word sketches) | no | no |
| Hebrew | Hebrew Web 2014 (heTenTen2014) (1,061,788,271 tokens) | yes | no |
| Hindi | Hindi Web Corpus (HindiWaC) (65,772,188 tokens; tagged; with word sketches) | no | no |
| Hungarian | Araneum Hungaricum Maius [2014] (1,200,001,609 tokens; tagged; with word sketches) | yes | no |
| Icelandic | Icelandic texts [sample] (9,968,822 tokens) | no | no |
| Igbo | Igbo Web corpus (IgboWaC15) (396,276 tokens) | no | no |
| Indonesian | Indonesian Web Corpus (IndonesianWaC) (109,281,359 tokens; tagged; with word sketches) | no | no |
| Irish | New Corpus for Ireland (NCI Irish) (34,358,267 tokens; tagged; with word sketches) | no | yes |
| Italian | Italian Web 2010 (itTenTen) (3,076,908,415 tokens; tagged; with word sketches) | yes | yes |
| Japanese | Japanese Web 2011 (jpTenTen11 [LUW, sample]) (203,674,569 tokens; tagged; with word sketches) | yes | yes |
| Kazakh | Turkic web – Kazakh (175,445,327 tokens) | no | no |
| Korean | Korean Web 2012 (koTenTen12) (560,945,022 tokens; tagged) | yes | yes |
| Kyrgyz | Turkic web – Kyrgyz (24,084,100 tokens) | no | no |
| Latin | LatinISE historical corpus v2 (12,995,824 tokens; tagged; with word sketches) | no | yes |
| Latvian | Latvian web [2014] (658,585,131 tokens; tagged) | yes | yes |
| Limburgish | none | no | no |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) (981,517,649 tokens) | no | yes |
| Macedonian | OPUS2 Macedonian (49,066,513 tokens; tagged; with word sketches) | no | no |
| Malayalam | Malayalam Web Corpus (malayalamWaC) (21,193,984 tokens; tagged; with word sketches) | no | no |
| Malay | Malayan Web Corpus (MalayWaC) (230,509,568 tokens; tagged; with word sketches) | no | no |
| Maldivian | none | no | no |
| Maltese | Maltese MLRS Corpus (125,267,653 tokens; tagged; with word sketches) | no | no |
| Maori | Maori Web Corpus (MaoriWaC) (8,351,983 tokens) | no | no |
| Mongolian | none | no | no |
| Nepali | Nepali National Corpus (15,137,459 tokens; tagged) | no | no |
| Norwegian | Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) (1,953,892,201 tokens; tagged; with word sketches) | no | yes |
| Persian | OPUS2 Persian (5,367,401 tokens; tagged; with word sketches) | no | yes |
| Polish | Polish Web 2012 (plTenTen12) (9,677,787,906 tokens; tagged; with word sketches) | yes | yes |
| Portuguese | ptTenTen11 (Freeling, v3) (4,637,901,353 tokens; tagged; with word sketches) | yes | yes |
| Romanian | Romanian Web (roWaC) (53,457,522 tokens; tagged; with word sketches) | yes | yes |
| Russian | Russian Web 2011 (ruTenTen11) (18,280,486,876 tokens; tagged; with word sketches) | yes | yes |
| Samoan | Samoan Web corpus (SamoanWac1) (3,583,362 tokens) | no | no |
| Sanskrit (romanised) | none | no | no |
| Scottish Gaelic | Scottish Gaelic Wiki corpus (gdWiki) (1,223,562 tokens) | no | no |
| Serbian | Serbian Web 2014 (srWaC14) (561,529,963 tokens; tagged) | yes | yes |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) (13,511,692 tokens; tagged; with word sketches) | no | no |
| Slovak | Araneum Slovacum Maius [2013] (1,200,005,746 tokens; tagged; with word sketches) | yes | yes |
| Slovenian | Slovenian reference corpus (FidaPLUS v2) (738,503,145 tokens; tagged; with word sketches) | yes | yes |
| Spanish | Spanish Web 2011 (esTenTen11, Eu + Am, Freeling v4) (10,994,616,207 tokens; tagged; with word sketches) | yes | yes |
| Swahili | Swahili Web (SwahiliWaC) (21,359,529 tokens; tagged; with word sketches) | yes | no |
| Swedish | Swedish Web 2014 (svTenTen14) (3,900,846,988 tokens; tagged; with word sketches) | yes | yes |
| Tajik | Tajik Web (TajikWaC) (109,805,133 tokens; tagged) | no | no |
| Talysh | none | no | no |
| Tamil | Tamil Web (TamilWaC) (32,861,569 tokens; tagged; with word sketches) | no | no |
| Tatar | Tatar Web Corpus sample (290,351 tokens) | no | no |
| Telugu | Telugu Web (TeluguWaC) (4,697,932 tokens; tagged; with word sketches) | no | no |
| Thai | Thai Web (ThaiWaC) (108,013,897 tokens; tagged; with word sketches) | no | no |
| Tibetan | none | no | yes |
| Turkish | Turkish Web 2012 (trTenTen12) (4,124,558,200 tokens) | no | no |
| Turkmen | Turkic web – Turkmen (2,536,935 tokens) | no | no |
| Ukrainian | Ukrainian Web 2014 (uaTenTen14) (2,734,851,744 tokens) | no | no |
| Urdu | Urdu Web Corpus (UrduWaC) (60,808,847 tokens) | no | no |
| Uzbek | Turkic web – Uzbek (24,570,516 tokens) | no | no |
| Vietnamese | Vietnamese Web Corpus (VietnameseWaC) (129,781,089 tokens; tagged; with word sketches) | no | yes |
| Welsh | WelshWaC (14,786,791 tokens; tagged; with word sketches) | no | no |
| Yiddish | none | no | no |
| Yoruba | Yoruba WaC [2015] (3,500,353 tokens) | no | no |