Fundamental Data Types

Despite its complexity, the TIMIT Corpus contains only two fundamental data types, namely lexicons and texts. As we saw in Chapter 2, most lexical resources can be represented using a record structure, i.e., a key plus one or more fields, as shown in Figure 11-3. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated. It could also be a phrasal lexicon, where the key field is a phrase rather than a single word. A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics. We can also construct special tabulations (known as paradigms) to illustrate contrasts and systematic variation, as shown in Figure 11-3 for three verbs. TIMIT's speaker table is also a kind of lexicon.
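The record structure described above can be sketched in a few lines of Python. This is only an illustration, not how TIMIT or NLTK stores its data: the `Entry` type, its field names, and the `lookup_by_field` helper are hypothetical names chosen for this example, and the two entries are taken from the dictionary example in Figure 11-3.

```python
from collections import namedtuple

# Hypothetical record type: one key (the headword) plus several fields.
Entry = namedtuple("Entry", ["headword", "pron", "pos", "gloss"])

entries = [
    Entry("wake", "weik", "v", "cease to sleep"),
    Entry("walk", "wo:k", "v", "progress by lifting and setting down each foot"),
]

# A conventional dictionary: look up a record via its key field.
lexicon = {entry.headword: entry for entry in entries}

# A comparative wordlist has the same shape; the fields are translations.
wordlist = {
    "wake": {"de": "aufwecken", "pt": "acordar"},
    "walk": {"de": "gehen", "pt": "andar"},
}


def lookup_by_field(records, field, value):
    """Thesaurus-style access: find keys via a non-key field."""
    return [r.headword for r in records if getattr(r, field) == value]


print(lexicon["walk"].pron)
print(lookup_by_field(entries, "pos", "v"))
```

The same key-plus-fields shape covers a phrasal lexicon (the key is a phrase) and a speaker table (the key is a speaker identifier), which is why one abstraction serves for all of them.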

Lexicon

Abstraction: fielded records (each record is a key followed by one or more fields)

Eg: dictionary
    wake: weik, [v], cease to sleep...
    walk: wo:k, [v], progress by lifting and setting down each foot...

Eg: comparative wordlist
    wake; aufwecken; acordar
    walk; gehen; andar
    write; schreiben; escrever

Eg: verb paradigm
    wake   woke   woken
    write  wrote  written
    wring  wrung  wrung

Text

Abstraction: time series (a sequence of tokens, each with attributes, ordered along a time axis)

Eg: written text
    A long time ago. Sun and Moon lived together. They were good brothers. ...

Eg: POS-tagged text
    A/DT long/JJ time/NN ago/RB ,/, Sun/NNP and/CC Moon/NNP lived/VBD together/RB ,/,

Eg: interlinear text
    Ragaipa irai vateri
    ragai -pa ira -i vate-ri

Figure 11-3. Basic linguistic data types—lexicons and texts: Amid their diversity, lexicons have a record structure, whereas annotated texts have a temporal organization.

At the most abstract level, a text is a representation of a real or fictional speech event, and the time-course of that event carries over into the text itself. A text could be a small unit, such as a word or sentence, or a complete narrative or dialogue. It may come with annotations such as part-of-speech tags, morphological analysis, discourse structure, and so forth. As we saw in the IOB tagging technique (Chapter 7), it is possible to represent higher-level constituents using tags on individual words. Thus the abstraction of text shown in Figure 11-3 is sufficient.
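To make the time-series view concrete, here is a minimal sketch in which a text is an ordered list of tokens, each carrying attributes, and higher-level constituents are recovered from IOB tags on the individual words. The tokens come from the POS-tagged example in Figure 11-3; the IOB chunk tags and the `iob_to_chunks` helper are assumptions added for illustration.

```python
# A text as a time series: tokens in order, each with attributes.
# Each item is (word, POS tag, IOB chunk tag); the IOB tags are hypothetical.
tokens = [
    ("Sun", "NNP", "B-NP"),
    ("and", "CC", "I-NP"),
    ("Moon", "NNP", "I-NP"),
    ("lived", "VBD", "O"),
    ("together", "RB", "O"),
]


def iob_to_chunks(tagged_tokens):
    """Group words into labeled chunks using their IOB tags.

    Simplification: an I- tag that does not continue an open chunk
    with the same label is treated like an O tag.
    """
    chunks = []
    current = None  # (label, list of words) for the chunk being built
    for word, _pos, iob in tagged_tokens:
        if iob.startswith("B-"):
            if current:
                chunks.append(current)
            current = (iob[2:], [word])
        elif iob.startswith("I-") and current and current[0] == iob[2:]:
            current[1].append(word)
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]


print(iob_to_chunks(tokens))
```

Because the chunk boundaries live entirely in per-token tags, the flat token sequence needs no extra structure to carry them.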

Despite the complexities and idiosyncrasies of individual corpora, at base they are collections of texts together with record-structured data. The contents of a corpus are often biased toward one or the other of these types. For example, the Brown Corpus contains 500 text files, but we still use a table to relate the files to 15 different genres. At the other end of the spectrum, WordNet contains 117,659 synset records, yet it incorporates many example sentences (mini-texts) to illustrate word usages. TIMIT is an interesting midpoint on this spectrum, containing substantial free-standing material of both the text and lexicon types.
