## Histograms

Histograms are charts that show the frequency of occurrence of values. In matplotlib, the function hist() calculates and draws the histogram. At a minimum, you must supply an array of values. You can control the number of bins in a histogram by passing it as the second argument: hist(values, numcells). Alternatively, you can specify the bins themselves with hist(values, bins), where bins is a list holding the histogram bin values. The return value from hist() is a tuple of three items: the per-bin values (counts by default, or probabilities if you ask for a normalized histogram), the bins, and the patches. Patches are used to create the bars; I'll go into more detail in the "Patches" section later in the chapter.
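The two calling forms described above can be sketched as follows. This is a minimal illustration that assumes a recent matplotlib release and uses made-up sample values; it passes the bin argument by keyword and selects the off-screen Agg backend so it runs without a display. None of the names below come from the book's listings.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 7, 8]  # made-up sample data

# Form 1: ask for a number of bins; hist() picks equal-width bin edges
counts, edges, patches = plt.hist(values, bins=4)
print(len(counts), len(edges), len(patches))

# Form 2: supply the bin edges explicitly
plt.figure()
counts2, edges2, patches2 = plt.hist(values, bins=[0, 2, 4, 6, 8])
print(counts2)
```

With four bins there are four counts, five bin edges, and four bar patches. The counts are plain tallies unless a normalized histogram is requested.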

The function hist() has other customization options, including the histogram orientation (vertical or horizontal), the alignment of bars, and more. Again, refer to the interactive help: help(hist).
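As one illustration of these options, the sketch below draws a horizontal histogram. It assumes a recent matplotlib release, where the orientation keyword accepts 'horizontal'; the data is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 7, 8]  # made-up sample data

fig, ax = plt.subplots()
# With orientation='horizontal' the bars grow along the x-axis,
# so each bar's width (not its height) carries the bin count
counts, edges, patches = ax.hist(values, bins=[0, 2, 4, 6, 8],
                                 orientation='horizontal')
bar_lengths = [p.get_width() for p in patches]
print(bar_lengths)
```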

Example: GDP, Histogram

We turn again to the GDP ranks from the CIA World Factbook; this time we plot a histogram of the N largest economies. Again, we use the read_world_data() function implemented in the previous example (see Listing 6-9).

Listing 6-9. Plotting GDP Histogram

    # a script to plot GDP histogram
    from pylab import *

    # initialize variables; N is the number of countries, B is the bin size
    N, B = 50, 1000

    # gdp holds the GDP values of the N largest economies, obtained with
    # read_world_data() from the previous example (call not shown here)

    # plot the histogram
    prob, bins, patches = hist(gdp, arange(0, max(gdp)+B, B), align='center')

    # annotate with text
    for i, p in enumerate(prob):
        percent = int(float(p)/N*100)
        # only annotate non-zero values
        if percent:
            text(bins[i], p, str(percent)+'%',
                rotation=45, va='bottom', ha='center')

    ylabel('Number of countries')
    xlabel('Income, billions of dollars')
    title('GDP histogram, %d largest economies' % N)

    # some axis manipulations
    xlim(-B/2, xlim()[1]-B/2)

Figure 6-10 shows the resulting graph.

[Figure: "GDP histogram, 50 largest economies"; x-axis: Income, billions of dollars]

Figure 6-10. GDP histogram, N largest economies

Again, the script should prove quite readable. I'd like to draw your attention to what might appear to be an odd modification of the x-axis: the call to xlim(). Its purpose is to override the default x-axis range. Because I've chosen 'center' alignment for the histogram bins, the leftmost bar is centered at zero, so half of its width lies on the negative side of the x-axis and the automatic range extends into negative values. I didn't like this behavior and chose to set the limits manually. Instead of using fixed numbers, I first retrieve the current limits by calling xlim() with no arguments, then set the lower limit to -B/2 (the left edge of the leftmost bar) and subtract half the bin width, B/2, from the current upper limit.
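The axis adjustment can be tried in isolation with the sketch below. It rests on several assumptions: a recent matplotlib release, where bars centered on the left bin edges are requested with align='left' (to my knowledge the 'center' spelling used in the listing is not accepted there); the object-oriented calls get_xlim() and set_xlim() instead of pylab's xlim(); and synthetic income values that are not taken from the CIA World Factbook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np

B = 1000  # bin width, as in the listing
# synthetic values for illustration only, not Factbook data
gdp = [14000, 4900, 3300, 2800, 2100, 1500, 900, 700, 400, 300]

fig, ax = plt.subplots()
edges = np.arange(0, max(gdp) + B, B)

# align='left' centers each bar on its left bin edge, so the first
# bar spans -B/2 .. +B/2 (assumed equivalent of the listing's 'center')
counts, bins, patches = ax.hist(gdp, bins=edges, align='left')

lo, hi = ax.get_xlim()            # automatic limits chosen by matplotlib
ax.set_xlim(-B / 2, hi - B / 2)   # start at the first bar's left edge and
                                  # pull the upper limit in by half a bin
new_lo, new_hi = ax.get_xlim()
print(new_lo, new_hi)
```

Because both limits are derived from B and from the current axis range, changing the bin width or the data requires no further edits to the axis code.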

As a general rule, when you modify default behavior like this, try to use parameters as much as possible (in the preceding example, using the parameter B, not the value 1000, and retrieving current values with xlim()); this will allow for more flexible scripts that cater to a wider range of input values.