Another instance where accuracy scores can be misleading is in "search" tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%.
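The arithmetic behind this point can be made concrete with made-up numbers (these counts are illustrative, not from the text): if only a handful of documents in a collection are relevant, a model that labels everything irrelevant is still right on all the irrelevant ones.

```python
# Suppose 5 of 1000 documents are relevant. A model that labels every
# document "irrelevant" is correct on the 995 irrelevant documents and
# wrong on the 5 relevant ones.
correct = 995
total = 1000
accuracy = correct / total
print(accuracy)  # 0.995 -- nearly 100%, despite finding nothing relevant
```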
It is therefore conventional to employ a different set of measures for search tasks, based on the number of items in each of the four categories shown in Figure 6-3:
• True positives are relevant items that we correctly identified as relevant.
• True negatives are irrelevant items that we correctly identified as irrelevant.
• False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
• False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
Figure 6-3. True and false positives and negatives.
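The four categories above can be tallied with a short sketch. This is not code from the text; it assumes the gold-standard labels and the model's predictions are parallel lists of the strings "relevant" and "irrelevant".

```python
def count_outcomes(gold, predicted):
    """Tally true/false positives and negatives for a search task.

    gold and predicted are parallel lists whose entries are either
    "relevant" or "irrelevant" (an illustrative encoding, not the
    book's own data structures).
    """
    tp = tn = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == "relevant" and g == "relevant":
            tp += 1  # relevant item correctly identified as relevant
        elif p == "irrelevant" and g == "irrelevant":
            tn += 1  # irrelevant item correctly identified as irrelevant
        elif p == "relevant" and g == "irrelevant":
            fp += 1  # irrelevant item incorrectly identified as relevant (Type I)
        else:
            fn += 1  # relevant item incorrectly identified as irrelevant (Type II)
    return tp, tn, fp, fn
```

For example, `count_outcomes(["relevant", "irrelevant"], ["relevant", "relevant"])` yields one true positive and one false positive.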
Given these four numbers, we can define the following metrics:
• Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
• Recall, which indicates how many of the relevant items we identified, is TP/(TP+FN).
• The F-Measure (or F-Score), which combines precision and recall into a single score, is defined as the harmonic mean of the two: (2 × Precision × Recall)/(Precision + Recall).
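The three metrics above translate directly into code. The following is a minimal sketch (function names are our own, not from the text), taking the four counts as plain integers:

```python
def precision(tp, fp):
    # Of the items we identified as relevant, what fraction actually were?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the truly relevant items, what fraction did we identify?
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p = precision(tp, fp)
    r = recall(tp, fn)
    return (2 * p * r) / (p + r)
```

Note that the harmonic mean is dominated by the smaller of the two values, so a model cannot earn a high F-Measure by doing well on only one of precision or recall.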