Open-source NLP tools

Lexical Computing has a long-standing cooperation with the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, Czech Republic. Together, we have developed a number of open-source NLP tools, which are available for download. They are also integrated into the Sketch Engine corpus query and management system, where they are applied to the data automatically, so even users without technical knowledge can benefit from them.

NLP tools for non-technical users

We have integrated all the necessary NLP tools into our flagship product, Sketch Engine. Users can build and analyse large text corpora without any technical knowledge and without installing or setting up any tools.

JusText

boilerplate removal

JusText is an HTML boilerplate removal tool. It produces clean text by stripping navigation links, headers, footers, etc. from HTML pages, leaving only the main text made up of complete sentences.
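As a rough illustration of the idea, the sketch below classifies each block of an HTML page using three simple signals: its length, the share of its text inside links, and the share of common function words. This is not JusText's actual algorithm or API. The names `BlockCollector`, `is_main_text` and `clean_text`, the tiny stop-word list and all thresholds are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list; a real tool would use
# a much larger list for each language.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that",
             "it", "for", "on", "with", "as", "was"}
# Tags treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td",
              "footer", "header", "nav"}


class BlockCollector(HTMLParser):
    """Collects (text, characters_inside_links) for each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Normalise whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)


def is_main_text(text, link_chars, min_length=40,
                 max_link_density=0.3, min_stopword_density=0.2):
    """Heuristic: long enough, few links, enough function words."""
    if len(text) < min_length:
        return False
    if link_chars / len(text) > max_link_density:
        return False
    words = text.lower().split()
    stop = sum(w.strip(".,;:!?") in STOPWORDS for w in words)
    return stop / len(words) >= min_stopword_density


def clean_text(html):
    """Return the blocks of `html` that look like main text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return [t for t, links in collector.blocks if is_main_text(t, links)]
```

Calling `clean_text(page_html)` on a page with a navigation bar, one full-sentence paragraph and a short footer would keep only the paragraph. The thresholds would need tuning on real data.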

Chared

encoding detection

Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
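To show why knowing the language helps, here is a minimal sketch that decodes the same bytes with several candidate encodings and keeps the one whose output looks most like the expected language. It is not Chared's method or interface. The function `detect_encoding`, the `CZECH_CHARS` set and the scoring rule are assumptions for illustration, and the character set stands in for a proper language model.

```python
# Hypothetical "model" of Czech: the letters of its alphabet plus the space.
CZECH_CHARS = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž ")


def detect_encoding(data, candidates, expected_chars):
    """Return the candidate encoding whose decoded text contains the
    highest share of characters expected in the language, or None."""
    best, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if not text:
            continue
        score = sum(ch.lower() in expected_chars for ch in text) / len(text)
        if score > best_score:
            best, best_score = encoding, score
    return best


# Usage: Czech text stored as Windows-1250 bytes.
raw = "Příliš žluťoučký kůň úpěl ďábelské ódy".encode("cp1250")
guess = detect_encoding(raw, ["utf-8", "iso-8859-2", "cp1250"], CZECH_CHARS)
```

In this example the bytes are not valid UTF-8, and decoding them as ISO-8859-2 turns several accented letters into control characters, so `cp1250` scores highest.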

Spiderling

web spider for linguistics

Spiderling is a web spider designed for linguistic applications. It crawls text-rich parts of the web and collects data that are suitable for inclusion in text corpora. It is a key tool for our corpus building projects.

Onion

text deduplicator

Onion (ONe Instance ONly) is designed for deduplicating large text collections (corpora) by measuring the similarity of paragraphs or whole documents. Duplicate texts are removed based on a threshold set by the user.
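The role of the threshold can be illustrated with a small sketch that compares the word n-grams of each paragraph with those already seen and drops the paragraph when too many of them repeat. This is a simplified stand-in, not Onion's implementation or command-line interface. The function names, the n-gram size of 5 and the default threshold of 0.5 are assumptions chosen for the example.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams
    already occurred in previously kept paragraphs."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        # Paragraphs shorter than n tokens have no n-grams and are kept.
        if grams and len(grams & seen) / len(grams) > threshold:
            continue
        kept.append(paragraph)
        seen |= grams
    return kept
```

A lower threshold removes more near-duplicates at the risk of discarding paragraphs that merely share common phrases. A higher threshold removes only paragraphs that are almost identical to earlier ones.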

Unitok

text tokeniser

Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format), while preserving metadata in XML-like tags.
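To make the vertical format concrete, the sketch below splits text into words and punctuation marks, one per line, while passing XML-like tags through unchanged. It is a language-independent approximation, not Unitok's rules or API. The regular expression and the function name `to_vertical` are assumptions for this example.

```python
import re

# One alternative per token type, tried in this order:
#   1. an XML-like tag, kept as a single token
#   2. a word, optionally with internal hyphens or apostrophes
#   3. any other single non-space character (punctuation, symbols)
TOKEN_RE = re.compile(r"<[^<>]+>|\w+(?:[-']\w+)*|[^\w\s]")


def to_vertical(text):
    """Return `text` as newline-separated tokens (vertical format)."""
    return "\n".join(TOKEN_RE.findall(text))


vertical = to_vertical('<doc id="1">Hello, world! It\'s fine.</doc>')
```

Here `vertical` holds nine lines: the opening tag, `Hello`, `,`, `world`, `!`, `It's`, `fine`, `.` and the closing tag. Language-specific settings would refine details such as abbreviations and clitics, which this sketch ignores.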

wiki2corpus

Wikipedia download

wiki2corpus is a script which downloads Wikipedia articles for a given language and outputs them in prevertical format, which can be further processed by other corpus tools.

NoSketch Engine

corpus query system

NoSketch Engine is an open-source corpus query system based on Sketch Engine. NoSketch Engine does not feature any of the automated corpus building tools integrated in Sketch Engine.