Python has some libraries that are useful for visualizing language data. The Matplotlib package supports sophisticated plotting functions with a MATLAB-style interface, and is available from

So far we have focused on textual presentation and the use of formatted print statements to get output lined up in columns. Displaying numerical data graphically is often more useful, since it makes patterns easier to detect. For example, in Example 3-5, we saw a table of numbers showing the frequency of particular modal verbs in the Brown Corpus, classified by genre. The program in Example 4-10 presents the same information in graphical format. The output is shown in Figure 4-4 (a color figure in the graphical display).

Example 4-10. Frequency of modals in different sections of the Brown Corpus.

colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arange(len(words))
    width = 1.0 / (len(categories) + 1)  # float division, so width is never 0
    bar_groups = []
    for c in range(len(categories)):
        # shift each category's bars by c*width within every word group
        bars = pylab.bar(ind+c*width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    pylab.xticks(ind+width, words)
    pylab.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pylab.ylabel('Frequency')
    pylab.title('Frequency of Six Modal Verbs by Genre')
    pylab.show()

>>> import nltk
>>> genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfdist = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in genres
...     for word in nltk.corpus.brown.words(categories=genre)
...     if word in modals)
>>> counts = {}
>>> for genre in genres:
...     counts[genre] = [cfdist[genre][word] for word in modals]
...
>>> bar_chart(genres, modals, counts)

From the bar chart it is immediately obvious that may and must have almost identical relative frequencies. The same goes for could and might.
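The grouping in such a chart comes down to simple arithmetic: each word occupies one unit on the x-axis, and each category's bar is shifted by a multiple of the bar width, leaving one bar-width of space between groups. The following is a minimal sketch of that calculation in plain Python, without any plotting; the function name grouped_bar_positions is hypothetical and is used here only for illustration.

```python
def grouped_bar_positions(num_words, num_categories):
    """Return (width, positions), where positions[c][i] is the x position
    of the bar for category c within the group for word i."""
    # One slot per category plus one empty slot as a gap between groups.
    width = 1.0 / (num_categories + 1)
    positions = [[i + c * width for i in range(num_words)]
                 for c in range(num_categories)]
    return width, positions


# Six modals and five genres, as in the example above.
width, positions = grouped_bar_positions(num_words=6, num_categories=5)
print(round(width, 3))                          # 0.167
print([round(x, 3) for x in positions[0][:3]])  # [0.0, 1.0, 2.0]
print([round(x, 3) for x in positions[1][:3]])  # [0.167, 1.167, 2.167]
```

With five categories the bars of one group span five sixths of a unit, so adjacent groups never overlap.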

It is also possible to generate such data visualizations on the fly. For example, a web page with form input could permit visitors to specify search parameters, submit the form, and see a dynamically generated visualization. To do this we have to specify the Agg backend for matplotlib, which is a library for producing raster (pixel) images. Next, we use all the same PyLab methods as before, but instead of displaying the result on a graphical terminal using pylab.show(), we save it to a file using pylab.savefig(). We specify the filename (and, optionally, the resolution via the dpi argument), then print HTML markup that directs the web browser to load the file.

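>>> import matplotlib
>>> matplotlib.use('Agg')   # select the backend before importing pylab
>>> import pylab
>>> # build the chart with the same pylab calls as before, omitting pylab.show()
>>> pylab.savefig('modals.png')
>>> print('Content-Type: text/html')
>>> print()
>>> print('<img src="modals.png"/>')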
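The order of the printed lines matters: in a CGI-style script the header comes first, then an empty line, then the HTML body. The sketch below assembles that text as a single string so the structure is easy to see. It is an illustration only; the helper name cgi_image_response is hypothetical, and the escaping of the filename is an added precaution rather than part of the original example.

```python
from html import escape


def cgi_image_response(image_url):
    """Build the text a CGI script would print: header, blank line, body."""
    header = 'Content-Type: text/html'
    # Escape the URL so quotes or angle brackets cannot break the attribute.
    body = '<img src="%s"/>' % escape(image_url, quote=True)
    # The empty string in the middle produces the blank line that
    # separates the header from the content.
    return '\n'.join([header, '', body])


print(cgi_image_response('modals.png'))
```

Printing the returned string produces the same three lines as the interactive session above.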

[Image: bar chart titled "Frequency of Six Modal Verbs by Genre"; y-axis: Frequency; x-axis groups: can, could, may, might, must, will.]

Figure 4-4. Bar chart showing frequency of modals in different sections of the Brown Corpus. This visualization was produced by the program in Example 4-10.
