Obtaining Data from Word Processor Files

Word processing software is often used in the manual preparation of texts and lexicons in projects that have limited computational infrastructure. Such projects often provide templates for data entry, though the word processing software does not ensure that the data is correctly structured. For example, each text may be required to have a title and date. Similarly, each lexical entry may have certain obligatory fields. As the data grows in size and complexity, a larger proportion of time may be spent maintaining its consistency.

How can we extract the content of such files so that we can manipulate it in external programs? Moreover, how can we validate the content of these files to help authors create well-structured data, so that the quality of the data can be maximized in the context of the original authoring process?

Consider a dictionary in which each entry has a part-of-speech field, drawn from a set of 20 possibilities, displayed after the pronunciation field, and rendered in 11-point bold type. No conventional word processor has search or macro functions capable of verifying that all part-of-speech fields have been correctly entered and displayed. This task requires exhaustive manual checking. If the word processor permits the document to be saved in a non-proprietary format, such as text, HTML, or XML, we can sometimes write programs to do this checking automatically.

Consider the following fragment of a lexical entry: "sleep [sli:p] v.i. a condition of body and mind...". We can key in such text using MSWord, then "Save as Web Page," then inspect the resulting HTML file:

<p class=MsoNormal>sleep
  <span style='mso-spacerun:yes'> </span>
  [<span class=SpellE>sli:p</span>]
  <span style='mso-spacerun:yes'> </span>
  <b><span style='font-size:11.0pt'>v.i.</span></b>
  <span style='mso-spacerun:yes'> </span>
  <i>a condition of body and mind ...<o:p></o:p></i>
</p>

Observe that the entry is represented as an HTML paragraph, using the <p> element, and that the part of speech appears inside a <span style='font-size:11.0pt'> element. The following program defines the set of legal parts-of-speech, legal_pos. Then it extracts all 11-point content from the dict.htm file and stores it in the set used_pos. Observe that the search pattern contains a parenthesized subexpression; only the material that matches this subexpression is returned by re.findall. Finally, the program constructs the set of illegal parts-of-speech as the set difference between used_pos and legal_pos:

>>> import re
>>> legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
>>> pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
>>> document = open("dict.htm").read()
>>> used_pos = set(re.findall(pattern, document))
>>> illegal_pos = used_pos.difference(legal_pos)
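Any members of illegal_pos point to tags that need to be corrected in the source document. A quick way to see them, assuming the variables just defined, is simply to print them out:

>>> for pos in sorted(illegal_pos):
...     print "unexpected part-of-speech tag:", pos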

This simple program represents the tip of the iceberg. We can develop sophisticated tools to check the consistency of word processor files, and report errors so that the maintainer of the dictionary can correct the original file using the original word processor.
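As a sketch of what such a tool might look like (the function name check_pos and the amount of context reported are our own choices, not taken from the original program), we can capture a little of the surrounding text along with each suspect tag, so that the maintainer knows where to look in the original file:

import re

def check_pos(html_file, legal_pos, context=30):
    """Report 11-point tags that are not legal parts-of-speech,
    along with a snippet of surrounding text to help locate them."""
    document = open(html_file).read()
    pattern = re.compile(r"'font-size:11\.0pt'>([a-z.]+)<")
    problems = []
    for match in pattern.finditer(document):
        pos = match.group(1)
        if pos not in legal_pos:                     # unknown or misspelled tag
            start = max(0, match.start() - context)
            snippet = document[start:match.end() + context]
            problems.append((pos, snippet))
    return problems

Calling check_pos("dict.htm", legal_pos) returns a list of (tag, snippet) pairs that can be handed back to whoever maintains the dictionary.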

Once we know the data is correctly formatted, we can write other programs to convert the data into a different format. The program in Example 11-1 strips out the HTML markup using nltk.clean_html(), extracts the words and their pronunciations, and generates output in "comma-separated value" (CSV) format.

Example 11-1. Converting HTML created by Microsoft Word into comma-separated values.

import re, nltk, csv

def lexical_data(html_file):
    SEP = '_ENTRY'
    html = open(html_file).read()
    html = re.sub(r'<p', SEP + '<p', html)    # mark where each entry begins
    text = nltk.clean_html(html)              # strip out the HTML markup
    text = ' '.join(text.split())
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)

>>> writer = csv.writer(open("dict1.csv", "wb"))
>>> writer.writerows(lexical_data("dict.htm"))
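As a quick sanity check (not part of the original example), we can read the generated file straight back with the csv module and confirm that each row holds a headword, pronunciation, part-of-speech, and gloss:

>>> import csv
>>> for row in csv.reader(open("dict1.csv", "rb")):
...     print row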
