A text corpus can consist of a single large text file containing one very long line of text, but this is impractical for many reasons. Large corpora tend to be made up of many files, texts or documents, so users might want to know whether a word or phrase is evenly distributed across the corpus (suggesting the word is in general use) or appears in only a small number of documents (suggesting the word is used only in specific contexts). Users might also want to know whether a word tends to appear in long sentences (suggesting it might be a formal word) or in very short sentences (suggesting it tends to be used in informal spoken language).

To make this possible, the corpus has to contain marks or labels indicating where each such part begins and ends. The parts are called structures and the labels are called structure tags. Corpus management software, Sketch Engine included, generally does not prescribe which structures a corpus should contain or what the tags should look like. It is, however, advisable to include at least a basic set marking the beginning and end of each document, paragraph and sentence.

Corpus structures without values

<s>some text </s>
<s>some text </s>
<s>some text </s>
<s>some text </s>
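
The two questions raised in the introduction (how widely a word is spread across documents, and how long the sentences it appears in are) can be answered once structure tags are present. The following is a minimal sketch, not how Sketch Engine or any other corpus manager computes these figures. The <doc> tag name, the sample corpus and both function names are assumptions made for illustration; only <s> appears in the examples above.

```python
import re

# Hypothetical mini-corpus. The <doc> tag name is an assumption;
# <s> marks sentences as in the examples above.
CORPUS = """\
<doc>
<s>the cat sat on the mat </s>
<s>a dog barked </s>
</doc>
<doc>
<s>the weather is fine </s>
</doc>
<doc>
<s>no match here </s>
</doc>
"""


def document_frequency(corpus, word):
    """Return (number of documents containing word, total number of documents)."""
    docs = re.findall(r"<doc>(.*?)</doc>", corpus, flags=re.DOTALL)
    hits = 0
    for doc in docs:
        # Remove the remaining structure tags so that only tokens are left.
        tokens = re.sub(r"<[^>]+>", " ", doc).lower().split()
        if word.lower() in tokens:
            hits += 1
    return hits, len(docs)


def mean_sentence_length(corpus, word):
    """Return the mean token count of the sentences that contain word."""
    sentences = re.findall(r"<s>(.*?)</s>", corpus, flags=re.DOTALL)
    lengths = [
        len(sentence.split())
        for sentence in sentences
        if word.lower() in sentence.lower().split()
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0


print(document_frequency(CORPUS, "the"))
print(mean_sentence_length(CORPUS, "the"))
```

Without the document and sentence boundaries, neither figure could be computed: the text would be one undivided stream of tokens.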

Corpus structures with values



Rebecca has worked with a full range of clients including BMW and Airbus. some text

some text some text

some text some text

some text some text
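
A structure with values carries information inside its opening tag. As a rough sketch of how such a tag could be read, the function below splits an opening tag into the structure name and its attribute-value pairs. It assumes XML-like name="value" syntax; the attribute names author and year, their values and the function name are hypothetical and are not taken from this corpus.

```python
import re

# Matches one name="value" pair inside an opening tag.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

# Matches a whole opening tag such as <doc author="Jane Doe" year="2015">.
OPENING_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')


def parse_structure_tag(line):
    """Return (structure name, {attribute: value}) for an opening tag.

    Returns None for closing tags and for ordinary text.
    """
    match = OPENING_TAG.fullmatch(line.strip())
    if match is None:
        return None
    name, attributes = match.groups()
    return name, dict(ATTRIBUTE.findall(attributes))


# Hypothetical tag, used for illustration only.
print(parse_structure_tag('<doc author="Jane Doe" year="2015">'))
print(parse_structure_tag("<s>"))
```

A structure without values, such as <s>, yields an empty set of attributes, which matches the distinction drawn between the two headings above.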