## Underlying Probabilistic Model

Another way of understanding the naive Bayes classifier is that it chooses the most likely label for an input, under the assumption that every input value is generated by first choosing a class label for that input value, and then generating each feature, entirely independent of every other feature. Of course, this assumption is unrealistic; features are often highly dependent on one another. We'll return to some of the consequences of this assumption at the end of this section. This simplifying assumption, known as the naive Bayes assumption (or independence assumption), makes it much easier to combine the contributions of the different features, since we don't need to worry about how they should interact with one another.

Based on this assumption, we can calculate an expression for P(label|features), the probability that an input will have a particular label given that it has a particular set of features. To choose a label for a new input, we can then simply pick the label l that maximizes P(l|features).

To begin, we note that P(label|features) is equal to the probability that an input has a particular label and the specified set of features, divided by the probability that it has the specified set of features:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{features}, \text{label})}{P(\text{features})} \tag{2}$$

Next, we note that P(features) will be the same for every choice of label, so if we are simply interested in finding the most likely label, it suffices to calculate P(features, label), which we'll call the label likelihood.
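A minimal sketch of why the denominator can be dropped when only the best label is needed. The joint probabilities, the label names, and the helper `most_likely_label` are made up for illustration and do not come from any corpus or library.

```python
def most_likely_label(scores):
    """Return the label with the largest score."""
    return max(scores, key=scores.get)

# Hypothetical values of P(features, label) for one fixed input.
joint = {"sports": 0.012, "politics": 0.003}

# Assumed value of P(features); the point is only that it is the same
# positive constant for every label.
p_features = 0.015
posterior = {label: p / p_features for label, p in joint.items()}

# Dividing every score by the same positive constant preserves their order,
# so both calls select the same label.
print(most_likely_label(joint))
print(most_likely_label(posterior))
```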

If we want to generate a probability estimate for each label, rather than just choosing the most likely label, then the easiest way to compute P(features) is to simply calculate the sum over labels of P(features, label):

$$P(\text{features}) = \sum_{\text{label} \in \text{labels}} P(\text{features}, \text{label}) \tag{3}$$
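The normalization in equation (3) might be sketched as follows. The function name `label_probabilities` and the numbers are assumptions chosen for illustration only.

```python
def label_probabilities(joint):
    """Convert P(features, label) scores into P(label|features).

    `joint` maps each label to its (hypothetical) joint probability.
    """
    # Equation (3): P(features) is the sum of the joint over all labels.
    p_features = sum(joint.values())
    if p_features == 0:
        raise ValueError("all label likelihoods are zero; cannot normalize")
    # Equation (2): divide each joint probability by P(features).
    return {label: p / p_features for label, p in joint.items()}

# Made-up joint probabilities for a single input.
joint = {"sports": 0.012, "politics": 0.003}
probs = label_probabilities(joint)
print(probs)
```

With these made-up numbers the estimates come out at roughly 0.8 for `sports` and 0.2 for `politics`, and they sum to 1 by construction.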

The label likelihood can be expanded out as the probability of the label times the probability of the features given the label:

$$P(\text{features}, \text{label}) = P(\text{label}) \times P(\text{features} \mid \text{label}) \tag{4}$$

Furthermore, since the features are all independent of one another (given the label), we can separate out the probability of each individual feature:

$$P(\text{features}, \text{label}) = P(\text{label}) \times \prod_{f \in \text{features}} P(f \mid \text{label}) \tag{5}$$

This is exactly the equation we discussed earlier for calculating the label likelihood: P(label) is the prior probability for a given label, and each P(f|label) is the contribution of a single feature to the label likelihood.
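Equation (5) can be sketched directly in Python. The priors, the per-feature probabilities, the feature names, and the function `label_likelihood` below are all hypothetical; in practice the probabilities would be estimated from training data.

```python
import math

def label_likelihood(label, features, prior, cond):
    """Compute P(features, label) = P(label) * product of P(f|label)."""
    return prior[label] * math.prod(cond[label][f] for f in features)

# Hypothetical prior P(label).
prior = {"sports": 0.5, "politics": 0.5}

# Hypothetical per-feature probabilities P(f|label).
cond = {
    "sports":   {"contains(goal)": 0.2,  "contains(vote)": 0.02},
    "politics": {"contains(goal)": 0.05, "contains(vote)": 0.3},
}

# Features observed in the input to classify.
features = ["contains(goal)", "contains(vote)"]

# Score every label, then pick the one with the largest label likelihood.
scores = {label: label_likelihood(label, features, prior, cond) for label in prior}
best = max(scores, key=scores.get)
print(scores, best)
```

Here each feature contributes one factor to the product, which is exactly where the independence assumption enters: no term describes how `contains(goal)` and `contains(vote)` interact.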