Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).

Table 2-1. Example document for each section of the Brown Corpus

ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy

We can access the corpus as a list of words or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]
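The relationship between the word view and the sentence view can be illustrated without the corpus itself. The following is a minimal sketch on a hand-made sample: the `toy_sents` data and the `flatten` helper are hypothetical and not part of NLTK. It shows only that concatenating a list of tokenized sentences gives the corresponding flat list of words.

```python
from itertools import chain

# Hypothetical toy data: a list of sentences, each sentence a list of word tokens.
toy_sents = [
    ['The', 'jury', 'said', 'it', 'will', 'meet', '.'],
    ['It', 'could', 'not', 'agree', '.'],
]

def flatten(sents):
    """Concatenate a list of tokenized sentences into one flat list of words."""
    return list(chain.from_iterable(sents))

toy_words = flatten(toy_sents)
print(toy_words[:4])
print(len(toy_sents), 'sentences,', len(toy_words), 'words')
```

The sentence view keeps the boundaries that the word view discards, so it is the one to use whenever a per-sentence measure (such as sentence length) is needed.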

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
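A frequency distribution is essentially a mapping from each item to the number of times it occurs. As a rough illustration of the counting step, here is a minimal sketch that uses the standard library's `collections.Counter` as a stand-in for `nltk.FreqDist`. The `toy_tokens` list is a hypothetical sample, not corpus data. The sketch also shows why the call to `w.lower()` matters: a sentence-initial 'May' or 'Will' is a different token from 'may' or 'will' unless the text is case-folded first.

```python
from collections import Counter

# Hypothetical toy token list standing in for brown.words(categories='news').
toy_tokens = ['May', 'we', 'go', '?', 'You', 'may', ',', 'and', 'you', 'must',
              '.', 'Will', 'it', 'rain', '?', 'It', 'will', '.']

modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Case-fold before counting, mirroring the w.lower() step in the example above.
folded_counts = Counter(w.lower() for w in toy_tokens)
# Count the raw tokens as well, for comparison.
raw_counts = Counter(toy_tokens)

# A Counter returns 0 for items it has never seen, so absent modals are safe to look up.
for m in modals:
    print(m + ':', folded_counts[m], '(raw: ' + str(raw_counts[m]) + ')')
```

On this toy sample the folded counts for may and will are each one higher than the raw counts, because the capitalized forms are merged in.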

Your Turn: Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what, when, where, who and why.

Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2.2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.

>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))

>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                   can could   may might  must  will
           news     93    86    66    38    50   389
       religion     82    59    78    12    54    71
        hobbies    268    58   131    22    83   264
science_fiction     16    49     4    12     8    16
        romance     74   193    11    51    45    43
          humor     16    30     8     8     9    13

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in Chapter 6.
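To make the idea of a conditional frequency distribution more concrete, here is a minimal sketch built only from the standard library. It is not NLTK's implementation: the `toy_pairs` data and the `conditional_counts` and `format_table` helpers are hypothetical names introduced for illustration. The sketch groups (genre, word) pairs into one counter per genre and prints the counts as a small table.

```python
from collections import Counter, defaultdict

# Hypothetical (genre, word) pairs standing in for the pairs built from the corpus.
toy_pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

def conditional_counts(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

def format_table(table, conditions, samples):
    """Return a plain-text table: one row per condition, one column per sample."""
    width = max(len(c) for c in conditions)
    header = ' ' * width + ''.join(f'{s:>7}' for s in samples)
    rows = [f'{c:>{width}}' + ''.join(f'{table[c][s]:>7}' for s in samples)
            for c in conditions]
    return '\n'.join([header] + rows)

toy_cfd = conditional_counts(toy_pairs)
print(format_table(toy_cfd, ['news', 'romance'], ['can', 'could', 'will']))

# The most frequent sample for a condition is the top entry of its Counter.
print(toy_cfd['news'].most_common(1))
```

Each condition (here, a genre) owns an independent counter, which is why the same word can be looked up separately for each genre.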
