Text Corpus Structure

We have seen a variety of corpus structures so far; these are summarized in Figure 2-3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example.
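The overlapping-category structure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the file names, texts, and category labels are invented, and the two helper functions merely imitate the style of lookup a corpus reader offers; they are not NLTK's implementation.

```python
# Hypothetical toy corpus: file IDs mapped to their text (invented data).
texts = {
    "news01.txt": "Grain exports rose sharply.",
    "news02.txt": "Corn and wheat prices fell.",
    "news03.txt": "Gold trading was quiet.",
}

# Overlapping categories: a single file may carry several topic labels.
file_categories = {
    "news01.txt": {"grain", "trade"},
    "news02.txt": {"corn", "grain", "wheat"},
    "news03.txt": {"gold", "trade"},
}

def categories(fileids=None):
    """Return the sorted categories of the given files (all files if None)."""
    selected = file_categories if fileids is None else fileids
    return sorted({c for f in selected for c in file_categories[f]})

def fileids(cats=None):
    """Return the sorted file IDs that belong to any of the given categories."""
    if cats is None:
        return sorted(texts)
    wanted = set(cats)
    return sorted(f for f, cs in file_categories.items() if cs & wanted)

print(categories(["news02.txt"]))  # ['corn', 'grain', 'wheat']
print(fileids(["grain"]))          # ['news01.txt', 'news02.txt']
```

Because a file can appear under several categories, the mapping works in both directions: from a file to its categories, and from a category to its files.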

NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. Table 2-3 lists functionality provided by the corpus readers.

Figure 2-3. Common structures for text corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories, such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).

Table 2-3. Basic corpus functionality defined in NLTK: More documentation can be found using help(nltk.corpus.reader) and by reading the online Corpus HOWTO at http://www.nltk.org/howto.

Example                       Description
fileids()                     The files of the corpus
fileids([categories])         The files of the corpus corresponding to these categories
categories()                  The categories of the corpus
categories([fileids])         The categories of the corpus corresponding to these files
raw()                         The raw content of the corpus
raw(fileids=[f1,f2,f3])       The raw content of the specified files
raw(categories=[c1,c2])       The raw content of the specified categories
words()                       The words of the whole corpus
words(fileids=[f1,f2,f3])     The words of the specified fileids
words(categories=[c1,c2])     The words of the specified categories
sents()                       The sentences of the whole corpus
sents(fileids=[f1,f2,f3])     The sentences of the specified fileids
sents(categories=[c1,c2])     The sentences of the specified categories
abspath(fileid)               The location of the given file on disk
encoding(fileid)              The encoding of the file (if known)
open(fileid)                  Open a stream for reading the given corpus file
root()                        The path to the root of the locally installed corpus
readme()                      The contents of the README file of the corpus

We illustrate the difference between some of the corpus access methods here:

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]
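The three access methods differ mainly in the shape of what they return: a single string, a flat list of tokens, or a list of token lists. The sketch below imitates those shapes with the standard library only. The sample text is invented, and the regular expressions are simple stand-ins chosen for illustration; NLTK's own tokenizers and sentence segmenters are more careful.

```python
import re

# Hypothetical sample text standing in for the contents of a corpus file.
sample = "Buster Bear yawned. He watched the sunbeams."

# Stand-in for raw(): one string, so slicing yields characters.
raw = sample

# Stand-in for words(): a flat list of tokens (words and punctuation).
token_pattern = r"\w+|[^\w\s]+"
words = re.findall(token_pattern, raw)

# Stand-in for sents(): a list of sentences, each itself a list of tokens.
# The split after ., ! or ? is naive; real sentence segmentation is harder.
sents = [
    re.findall(token_pattern, sentence)
    for sentence in re.split(r"(?<=[.!?])\s+", raw)
    if sentence
]

print(raw[:6])    # Buster
print(words[:3])  # ['Buster', 'Bear', 'yawned']
print(sents[0])   # ['Buster', 'Bear', 'yawned', '.']
```

The same slice therefore means different things in each case: characters for the raw string, tokens for the word list, and whole sentences for the sentence list.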
