Sometimes these categories overlap, notably in the case of topical categories, since a text can be relevant to more than one topic.
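As a minimal sketch of what overlapping categories look like as data, the mapping below assigns each document a list of topics and then inverts it, so that one document can appear under several topics. The document ids, topic labels, and the invert helper are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical document ids and topic labels, for illustration only.
doc_topics = {
    "doc1": ["barley", "corn"],
    "doc2": ["corn"],
    "doc3": ["barley", "wheat", "corn"],
}

def invert(doc_topics):
    """Map each topic to the list of documents labelled with it."""
    topic_docs = defaultdict(list)
    for doc, topics in doc_topics.items():
        for topic in topics:
            topic_docs[topic].append(doc)
    return dict(topic_docs)

topic_docs = invert(doc_topics)
print(topic_docs["corn"])    # documents that share the "corn" topic
print(topic_docs["barley"])  # some of the same documents appear here too
```

Because the categories overlap, the per-topic document lists are not disjoint: a document is counted once for every topic it belongs to.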
Occasionally, text collections have temporal structure, news collections being the most common example.
We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses.
This particular corpus actually contains dozens of individual texts, one per address, but for convenience we glued them end-to-end and treated them as a single text. Chapter 1 also used various pre-defined texts that we accessed by typing from nltk.book import *.
This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).
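The three statistics can be sketched in plain Python as follows. This is an illustrative sketch, not the program referred to above: the function name text_statistics and the toy text are assumptions, and the text is supplied already split into words and sentences. It also assumes that average word length is approximated as raw characters divided by word tokens, so whitespace is counted.

```python
def text_statistics(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity).

    raw   -- the text as a single string
    words -- list of word tokens
    sents -- list of sentences, each a list of word tokens
    """
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    # Vocabulary is case-folded so "The" and "the" count as one item.
    num_vocab = len(set(w.lower() for w in words))
    return (num_chars / num_words,   # includes whitespace characters
            num_words / num_sents,
            num_words / num_vocab)

# Toy text, pre-tokenized by hand for illustration.
raw = "The cat sat. The dog sat."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
words = [w for s in sents for w in s]

avg_word_len, avg_sent_len, diversity = text_statistics(raw, words, sents)
print(avg_word_len, avg_sent_len, diversity)
```

A lexical diversity score of 1.6 here means that each vocabulary item occurs 1.6 times on average in the toy text.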
It is important to consider less formal language as well.
Many corpora are designed to contain a careful balance of material in one or more genres.
Some languages have no established writing system, or are endangered.
(See Section 7 for suggestions on how to locate language resources.) We have seen a variety of corpus structures so far; these are summarized in Figure 1.3.
For the moment, you can ignore the details and just concentrate on the output.
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words.