The corpus module defines the treebank corpus reader, which contains a 10% sample of the Penn Treebank Corpus.
>>> from nltk.corpus import treebank
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> print(t)
(S
  ...
        (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
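To make the bracketed notation concrete, here is a minimal sketch that reads such a string into nested (label, children) tuples using only the standard library. The names parse_bracketed and leaves are hypothetical helpers written for this illustration; they are not part of NLTK, whose corpus readers do this work for you and handle cases this sketch ignores.

```python
import re

def parse_bracketed(text):
    """Parse a Penn Treebank-style bracketing into (label, children) tuples.

    Illustrative helper only: assumes well-formed input with one root.
    """
    # Tokens are '(', ')', or any run of characters that is neither
    # whitespace nor a parenthesis (labels and words).
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read())       # a nested constituent
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return read()

def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

tree = parse_bracketed('(NP-TMP (NNP Nov.) (CD 29))')
print(tree)
print(leaves(tree))
```

Each constituent becomes a label paired with a list of children, and a child is either another constituent or a word. NLTK's Tree class has the same overall shape.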
We can use this data to help develop a grammar. For example, the program in Example 8-4 uses a simple filter to find verbs that take sentential complements. Assuming we already have a production of the form VP -> SV S, this information enables us to identify particular verbs that would be included in the expansion of SV.
Example 8-4. Searching a treebank to find sentential complements.
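import nltk

def filter(tree):
    # Collect the labels of children that are themselves trees (leaves are plain strings).
    # NLTK 3 uses label(); older releases exposed the same value as the .node attribute.
    child_nodes = [child.label() for child in tree
                   if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)

>>> from nltk.corpus import treebank
>>> [subtree for tree in treebank.parsed_sents()
...     for subtree in tree.subtrees(filter)]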
[Tree('VP', [Tree('VBN', ['named']), Tree('S', [Tree('NP-SBJ', ...]), ...]), ...]
The PP Attachment Corpus, nltk.corpus.ppattach, is another source of information about the valency of particular verbs. Here we illustrate a technique for mining this corpus. It finds pairs of prepositional phrases where the preposition and noun are fixed, but where the choice of verb determines whether the prepositional phrase is attached to the VP or to the NP.
>>> import nltk
>>> from collections import defaultdict
>>> entries = nltk.corpus.ppattach.attachments('training')
>>> table = defaultdict(lambda: defaultdict(set))
>>> for entry in entries:
...     key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2
...     table[key][entry.attachment].add(entry.verb)
...
>>> for key in sorted(table):
...     if len(table[key]) > 1:
...         print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))
Among the output lines of this program we find offer-from-group N: ['rejected'] V: ['received'], which indicates that received expects a separate PP complement attached to the VP, while rejected does not. As before, we can use this information to help construct the grammar.
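The grouping step can be tried without the corpus. The following sketch applies the same two-level table to a few hand-made entries. The Entry tuple is a stand-in whose field names mirror the attributes used above; the first two rows echo the offer-from-group line just quoted, and the third row is invented for illustration. The function name ambiguous_frames is also an assumption.

```python
from collections import defaultdict, namedtuple

# Stand-in for ppattach entries; field names mirror the attributes used above.
Entry = namedtuple('Entry', 'verb noun1 prep noun2 attachment')

# Hand-made sample, NOT corpus data. The third row is invented so that
# one key occurs with only a single attachment type.
SAMPLE = [
    Entry('rejected', 'offer', 'from', 'group', 'N'),
    Entry('received', 'offer', 'from', 'group', 'V'),
    Entry('ate', 'pizza', 'with', 'anchovies', 'N'),
]

def ambiguous_frames(entries):
    """Map each noun1-prep-noun2 key seen with more than one attachment
    type to the sorted verbs recorded under each type."""
    table = defaultdict(lambda: defaultdict(set))
    for e in entries:
        key = '-'.join((e.noun1, e.prep, e.noun2))
        table[key][e.attachment].add(e.verb)
    return {key: {att: sorted(verbs) for att, verbs in by_att.items()}
            for key, by_att in table.items() if len(by_att) > 1}

result = ambiguous_frames(SAMPLE)
print(result)
```

Only keys seen with both attachment types survive the final test, so the remaining keys are those where the choice of verb changes the attachment.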
The NLTK corpus collection includes data from the PE08 Cross-Framework and Cross Domain Parser Evaluation Shared Task. A collection of larger grammars has been prepared for comparing different parsers; it can be obtained by downloading the large_grammars package (e.g., python -m nltk.downloader large_grammars).
The NLTK corpus collection also includes a sample from the Sinica Treebank Corpus, consisting of 10,000 parsed sentences drawn from the Academia Sinica Balanced Corpus of Modern Chinese. Let's load and display one of the trees in this corpus. Note that parsed_sents() returns a list of trees, so we index into it before calling draw().
>>> nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()
[Figure: the drawn parse tree, with node labels including DM, VH11, VH16, and Nab.]