Since we are going to build a reporting system that produces statistical reports about the behavior of our system, let's look at some of the statistical functions that we will be using.
Quite possibly, the most commonly used function is for calculating the average value of a series of elements. The NumPy library provides two functions to calculate the average of all numbers in an array: mean() and average().
The mean() function calculates a simple mathematical mean of any given set of numbers.
>>> import numpy as np
>>> a = np.arange(10.)
>>> a
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> np.mean(a)
4.5
The average() function accepts an extra parameter, which allows you to provide weights that will be used to calculate the average value of an array. Keep in mind that the array of weights must be the same length as the primary array.
>>> a = np.array([5., 5., 5., 6., 6.])
>>> np.mean(a)
5.4
>>> np.average(a, weights=np.array([1, 1, 1, 5, 10]))
5.833333333333333
You may wonder why you would use a weighted average. One of the most popular use cases is making some elements more significant than others, especially when the elements form a time sequence. Using the preceding example, let's assume that the numbers (5, 5, 5, 6, 6) represent system load readings taken once a minute. We can calculate the average (the arithmetic mean) by adding all the numbers together and dividing the sum by the number of elements in the array; this is what the mean() function does, and in our example the result is 5.4. However, the most recent readings are usually of greater interest and importance. Therefore, we use weights in the calculation that tell the average() function which numbers matter more to us. With the weights 1, 1, 1, 5, and 10, the result is roughly 5.83: the last two values of 6 influence the end result far more heavily once we indicate their importance.
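To make the arithmetic behind the weighted average concrete, here is a minimal sketch that computes it by hand and compares it with np.average(). The variable names (readings, weights, manual, library) are illustrative only; the values are the ones from the example above.

```python
import numpy as np

# Five one-minute load readings, oldest first (values from the text).
readings = np.array([5.0, 5.0, 5.0, 6.0, 6.0])
# Weights from the example: the two most recent readings count the most.
weights = np.array([1, 1, 1, 5, 10])

# Weighted average by hand: sum(reading * weight) / sum(weights).
manual = (readings * weights).sum() / weights.sum()
# The same calculation done by NumPy.
library = np.average(readings, weights=weights)

print("plain mean:      ", readings.mean())
print("weighted (hand): ", manual)
print("weighted (numpy):", library)
```

The weights do not have to add up to 1, because the result is divided by their sum.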
Less well known, and less often used, are the variance and standard deviation functions. These two indicators are closely related, and both measure how spread out a distribution is. Simply stated, they measure the variability of a dataset. The variance is calculated as the average of the squared distance of each data point from the mean. In mathematical terms, the variance shows the statistical dispersion of the data.
As an example, let's assume we have a set of random data in an array: [1, 4, 3, 5, 6, 2]. The mean value of this array is 3.5. Now we need to calculate the squared distance from the mean for each element in the array. The squared distance is calculated as (value - mean)^2. So, for example, for the first value it is (1 - 3.5)^2 = (-2.5)^2 = 6.25. The full list of squared distances is [6.25, 0.25, 0.25, 2.25, 6.25, 2.25]. All we need to do now to get the variance of the original array is calculate the mean of these numbers, which is 17.5 / 6, or roughly 2.92. Here's how to perform all those calculations with a single NumPy function call:
>>> a = np.array([1., 4., 3., 5., 6., 2.])
>>> a
array([ 1., 4., 3., 5., 6., 2.])
>>> np.var(a)
2.9166666666666665
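The manual steps described above can be sketched as follows, mainly to show that np.var() with its default arguments gives the same number. The intermediate names (squared_distances, variance) are illustrative only.

```python
import numpy as np

a = np.array([1.0, 4.0, 3.0, 5.0, 6.0, 2.0])

mean = a.mean()                       # 3.5
squared_distances = (a - mean) ** 2   # one squared distance per element
variance = squared_distances.mean()   # average of the squared distances

print(squared_distances)
print(variance, np.var(a))
```

By default np.var() divides by the number of elements (the population variance); passing ddof=1 divides by one less than that, which gives the sample variance instead.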
We established that this figure is the average squared distance from the mean, but because the distances are squared, it is a bit misleading: it is not a typical distance, but an exaggerated version of it, expressed in squared units. To bring it back in line with the original values, we take its square root. The resulting value is the standard deviation of the dataset. The square root of 2.92 is roughly 1.7. This means that most elements in the array are no farther than 1.7 from the mean, which is 3.5 in our case. Any element outside this range is an exception to the normal expected value. Figure 11-1 illustrates this concept. In the diagram, four out of the six elements are within one standard deviation of the mean, and two readings are outside that range. Keep in mind that, because of the way the standard deviation is calculated, there will always be some values in a dataset whose distance from the mean is greater than the standard deviation of the set.
Figure 11-1. Mean and standard deviation of a dataset
The NumPy library provides a convenience function to calculate the standard deviation value for any array:
>>> a = np.array([1., 4., 3., 5., 6., 2.])
>>> a
array([ 1., 4., 3., 5., 6., 2.])
>>> np.std(a)
1.707825127659933
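As a quick check of the statements about Figure 11-1, the following sketch confirms that the standard deviation is the square root of the variance and counts how many elements lie within one standard deviation of the mean. The names within and outliers are illustrative only.

```python
import numpy as np

a = np.array([1.0, 4.0, 3.0, 5.0, 6.0, 2.0])

mean = a.mean()
std = a.std()                        # same value as np.sqrt(a.var())

# Boolean mask: True where the element is no farther than one
# standard deviation from the mean.
within = np.abs(a - mean) <= std
outliers = a[~within]

print("standard deviation:", std)
print("within one std:    ", int(within.sum()), "of", a.size)
print("outside the band:  ", outliers)
```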
The dataset in our examples so far is reasonably random and has far too few data points. Much real-world data, although seemingly random, follows (at least approximately) a distribution known as the normal distribution. For example, the average height of people in a nation might be, let's say, 5 feet 11 inches (which is roughly 1.80 meters). The majority of the population would have a height close to this value, but as we move farther away from it, we'll observe that fewer and fewer individuals fall in that range. The distribution peaks at the mean value and gradually diminishes on each side of it. The distribution pattern has a bell shape and is defined by two parameters: the mean value of the dataset (the midpoint of the distribution) and the standard deviation (which defines how wide or narrow the bell is). The bigger the standard deviation, the flatter the graph is going to be, which means that the values are scattered more widely across the range of possible values. Because the distribution is described by the standard deviation value, some interesting observations can be made:
• Approximately 68% of the data fall within one standard deviation distance from the mean.
• Approximately 95% of the data fall within two standard deviation distances from the mean.
• Nearly all (99.7%) of the data fall within three standard deviation distances from the mean.
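These three percentages can be checked empirically on simulated data. The sketch below is only an illustration: the seed, the sample size, and the distribution parameters (mean 10, standard deviation 2) are arbitrary assumptions, not values from the text.

```python
import numpy as np

# Arbitrary, illustrative parameters; a fixed seed keeps the run repeatable.
rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=100_000)

mean, std = data.mean(), data.std()

fractions = {}
for k in (1, 2, 3):
    # Fraction of samples no farther than k standard deviations from the mean.
    inside = np.abs(data - mean) <= k * std
    fractions[k] = inside.mean()
    print(f"within {k} standard deviation(s): {fractions[k]:.3f}")
```

With a sample this large, the printed fractions should land close to 0.68, 0.95, and 0.997; smaller samples will drift further from the theoretical values.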
To bring this into perspective, let's look at the analysis of a much larger dataset. I generated a set of random data that is normally distributed. The mean (in mathematical texts, usually denoted by μ, or mu) is 4, and the standard deviation (denoted by σ, or sigma) is 0.9. The dataset consists of 10,000 random numbers that follow the normal distribution pattern. I then put all these numbers into the appropriate buckets depending on their value, 28 buckets in total. The bucket value (the height of the bar on the graph) is the count of the numbers that fall into the bucket's range. To make it more meaningful, I then normalized the bucket values so that the sum of all buckets is equal to 1. As such, the bucket value now represents the fraction of the dataset that falls into that bucket or, put differently, the chance of a number from the dataset landing there.
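The bucketing procedure described above can be sketched roughly as follows. This is not the code that produced the figure; the seed and the use of default_rng() are assumptions made so the example is repeatable, and the names counts, edges, and share are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
data = rng.normal(loc=4.0, scale=0.9, size=10_000)

# Sort the numbers into 28 equal-width buckets between min and max.
counts, edges = np.histogram(data, bins=28)

# Normalize so that all the bucket values add up to 1.
share = counts / counts.sum()

print("buckets:", len(counts), "boundaries:", len(edges))
print("sum of normalized buckets:", share.sum())
print("largest bucket holds", round(float(share.max()) * 100, 1), "% of the data")
```

Dividing by the total count makes the bars sum to 1. That is a different normalization from a probability density, where it is the area under the bars that equals 1.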
You can see the resulting histogram of the number distribution in Figure 11-2. The bars are enclosed by the approximation function line, which simply helps you visualize the form of the normal distribution. The vertical line at the 4 mark on the horizontal axis indicates the mean value of all the numbers in the dataset. From that line, we have three standard deviation bands: one sigma, two sigmas, and three sigmas away from the mean. As you can see, the figure illustrates that nearly all of the data is contained within three standard deviation distances from the mean.
There are a few things to bear in mind. First, the graph shape very closely resembles the theoretical shape of the normal distribution pattern. This is because I've chosen a large dataset. With smaller datasets, the values are more random, and the data does not precisely follow the theoretical shape of the distribution. Therefore, it is important to operate on large datasets if you want to get meaningful results. Second, the normal distribution models processes that can take any value from -infinity to +infinity. Therefore, it may not be well suited for processes that have only positive results.
Let's say that you want to measure the average car speed on a highway. Obviously, the speed cannot be negative, but the normal distribution allows for that. That is to say that the theoretical model allows, albeit with extremely low probability, a negative speed. However, in practice, if the mean is further than four or five standard deviation distances from the 0 value, it is quite safe to use the normal distribution model.
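To get a feel for how small that theoretical probability is, the following sketch evaluates the normal cumulative distribution function at zero using only the standard library. The helper name prob_below_zero and the speed figures (mean 100, standard deviation 25 or 20) are made up for illustration.

```python
import math

def prob_below_zero(mean, std):
    """Probability that a normally distributed value falls below 0.

    Uses the identity Phi(z) = 0.5 * erfc(-z / sqrt(2)) with z = (0 - mean) / std.
    """
    z = (0.0 - mean) / std
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Hypothetical speeds: the mean is 4 and then 5 standard deviations above zero.
p4 = prob_below_zero(mean=100.0, std=25.0)
p5 = prob_below_zero(mean=100.0, std=20.0)

print(f"mean 4 std above zero: {p4:.2e}")
print(f"mean 5 std above zero: {p5:.2e}")
```

At four standard deviations the model assigns a negative value a chance of about 3 in 100,000, and at five about 3 in 10 million, which is why the mismatch can usually be ignored.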
We've spent a lot of time discussing and analyzing one scientific phenomenon, but how does that relate to system administration, the subject of this book? As I've mentioned, most of the natural processes are random events, but they all usually cluster around some values. Take the average speed of the cars on a highway. There is a speed limit, but that does not mean that all cars are going to travel at that speed—some will go faster, and some will go slower. But there is a good chance that the average speed will be at or below the speed limit. Also, most cars will be traveling at speeds close to the average. The further you go to each side of this average, the fewer cars will be traveling at those speeds. If you measure the speed of a reasonably big set of cars, you will get the speed distribution shape, which should resemble the ideal pattern of the normal distribution graph.
This model also applies to system usage. Your server or servers are going to perform work only when users request them to do something. Similar to the car speeds on a highway, the system load will average around some value.
I've chosen the distribution function parameters (the mean and standard deviation) so that they model a load pattern on an imaginary four-CPU server. As you can see in Figure 11-2, the load average peaks at 4, which is fairly normal for a busy, but not overloaded, system. Let's assume that the server is constantly busy and does not follow any day/night load-variation patterns. Although the load is pretty much constant, there will always be some variation, but the further you go from the mean, the less chance you have of hitting that reading. For example, it's rather unlikely (32% chance to be precise) that the next reading will be either less than (roughly) 3 or greater than (roughly) 5. Similarly, this rule applies to readings below and above 2 and 6, respectively—actually, the chances of hitting those readings are less than 5%.
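The 32% and 5% figures follow directly from the normal distribution. As a sketch, they can be reproduced with the standard library alone; the helper name prob_outside is illustrative, and the mean and standard deviation are the ones used for Figure 11-2.

```python
import math

def prob_outside(k):
    """Probability that a normal reading lies more than k standard
    deviations from the mean, on either side: P(|Z| > k) = erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2.0))

mean, std = 4.0, 0.9   # parameters of the imaginary four-CPU server

for k in (1, 2, 3):
    low, high = mean - k * std, mean + k * std
    print(f"outside {low:.1f}..{high:.1f}: {prob_outside(k) * 100:.2f}%")
```

The exact band edges are 3.1 to 4.9, 2.2 to 5.8, and 1.3 to 6.7, which the text rounds to whole numbers.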
What does this tell us? Well, knowing the distribution probabilities, we can dynamically set the alert thresholds. Obviously, we're not too concerned about the values going too low, as this wouldn't do any harm to the system (although indirectly, it might indicate some issues). Most interesting are the upper values in the set. We know that two out of every three readings will fall in the first band (one standard deviation distance from the mean to each side). A much higher percentage, more than 95% of the readings, falls within the second band. You may decide that all those readings are normal and the system is behaving normally. However, if you encounter a reading that theoretically happens only 5% of the time, you may want to get a warning message. Readings that occur only 0.3% of the time are of concern, as they are far from normal system behavior, so you should start investigating immediately.
In other words, we just learned how to define what is "normal" system behavior and how to measure the "abnormalities." This is a really powerful tool to determine the warning and error thresholds for any monitoring system (such as Nagios) that you may be using in your day-to-day job. We will use this mechanism in our application, which will update thresholds automatically.
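One possible way to turn this idea into code is sketched below. It is only an illustration of the principle, not the application built later: the function names compute_thresholds and classify, the two- and three-sigma multipliers as defaults, and the sample readings are all assumptions.

```python
import numpy as np

def compute_thresholds(readings, warn_k=2.0, crit_k=3.0):
    """Return (warning, critical) upper thresholds derived from past readings.

    Warning sits warn_k standard deviations above the mean and
    critical sits crit_k standard deviations above it.
    """
    readings = np.asarray(readings, dtype=float)
    mean, std = readings.mean(), readings.std()
    return mean + warn_k * std, mean + crit_k * std

def classify(value, warning, critical):
    """Map a new reading to a hypothetical alert level."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"

# Made-up load history, for illustration only.
history = [4.0, 3.0, 5.0, 4.0, 4.5, 3.5]
warning, critical = compute_thresholds(history)

for reading in (4.2, 5.5, 6.5):
    print(reading, classify(reading, warning, critical))
```

Only the upper side is checked here, in line with the observation that unusually low readings do no direct harm. Recomputing the thresholds as new readings arrive is what makes them adapt to the system's actual behavior.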
The complementary function to the standard deviation and variance functions is the histogram calculation function. It sorts the numbers into buckets (bins) according to their value. I used this function to calculate the size of the bars in the normal distribution pattern in Figure 11-2. The function accepts an array of the values to sort and, optionally, the number of bins (the default is 10) and whether the result should be normalized (the default is not to normalize). The result is a tuple of two arrays: one containing the value for each bin (the count, or the normalized value if normalization was requested) and the other containing the bin boundaries, which always has one more element than the first. Here is an example:
>>> h, b = np.histogram(a, bins=8, density=True)
>>> h
array([ 0.00238784, 0.02268444, 0.12416748, 0.30444912, 0.37966596, 0.26146807, 0.08834994, 0.01074526])
>>> b
array([-3.63950476, -2.80192639, -1.96434802, -1.12676964, -0.28919127, 0.5483871 , 1.38596547, 2.22354385, 3.06112222])
The function numpy.random.randn(<count>) generates the requested number of random samples drawn from the normal distribution with a mean of 0 and a standard deviation of 1.
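Because randn() always produces samples with a mean of 0 and a standard deviation of 1, a dataset with other parameters can be obtained by scaling and shifting the result. The following sketch assumes the mean of 4 and standard deviation of 0.9 used earlier; the seed and the variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(0)     # seeded generator for a repeatable run
standard = rng.randn(10000)        # mean 0, standard deviation 1

# Scale by the desired standard deviation, then shift by the desired mean.
load = 4.0 + 0.9 * standard

# A density-normalized histogram: the area under the bars equals 1.
h, b = np.histogram(load, bins=8, density=True)
area = (h * np.diff(b)).sum()

print("sample mean:", load.mean())
print("sample std: ", load.std())
print("area under the histogram:", area)
```

The sample mean and standard deviation will be close to, but not exactly, 4 and 0.9, since they are estimated from random data.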