How to Evaluate Annotator Performance

  • The CSV output of the images and annotations can be used to evaluate the performance of the whole image classifier (WIC). By varying a threshold on the WIC Confidence, a precision-recall curve may be generated. The curve depends on the definition of ground truth. The most straightforward definition is that an image is a positive in the ground truth if the annotation CSV has at least one ground truth positive annotation in that image, and a negative if it has none. This definition could be varied to measure the performance of the WIC as a function of the number of animals in the image.
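
The threshold sweep described above can be sketched as follows. This is a minimal illustration, not part of Scout itself: the function name and the (confidence, is_positive) pair layout are assumptions, and each pair would be built from the CSV output by the reader. Reporting 0.0 when a ratio is undefined is an arbitrary convention chosen here.

```python
def precision_recall_curve(images, thresholds):
    """Sweep a threshold over WIC confidences.

    images: list of (wic_confidence, is_positive) pairs, one per image
            (hypothetical layout; is_positive is True when the image has
            at least one ground truth positive annotation).
    Returns a list of (threshold, precision, recall) tuples.
    """
    curve = []
    for t in thresholds:
        # An image is predicted positive when its confidence reaches t.
        tp = sum(1 for conf, pos in images if conf >= t and pos)
        fp = sum(1 for conf, pos in images if conf >= t and not pos)
        fn = sum(1 for conf, pos in images if conf < t and pos)
        # Convention assumed here: undefined ratios are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((t, precision, recall))
    return curve


# Made-up example data: five images with WIC confidences and labels.
example_images = [(0.9, True), (0.8, False), (0.6, True),
                  (0.3, False), (0.2, True)]
for t, p, r in precision_recall_curve(example_images, [0.1, 0.5, 0.85]):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Plotting recall against precision for a fine grid of thresholds gives the curve; raising the threshold generally trades recall for precision.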

  • The CSV output of the annotations provides the basis for evaluating the labeling accuracy of the ML detection model by comparing its detected annotations to those of the ground truth. The same comparison can be used to evaluate the labeling accuracy of an assignee. Further, the same measure (e.g. the F1 score) can be used to compare two assignees, two different ML models, or an assignee and an ML model. In that case, the measure reflects consistency rather than accuracy.

  • To illustrate how this might be done, consider the comparison between the ML model and the ground truth. Each row of the CSV where the assignee is the ML model of interest constitutes a detected annotation. Each row of the CSV whose Ground Truth field is true constitutes a ground truth annotation. For each detected annotation, if there is a ground truth annotation in the same image with the same Classification and whose Box has sufficient overlap with the Box of the detected annotation, then the detected annotation is a true positive. If there is no such ground truth annotation, the detected annotation is a false positive. Importantly, each ground truth annotation may be assigned to at most one detected annotation. Ground truth annotations for which there is no assigned detected annotation are considered to be false negatives. From these definitions of true positives, false positives, and false negatives, values for precision, recall, and F1 may be calculated.
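
Once the three counts are known, the summary measures follow from their standard definitions. The sketch below is illustrative only; the function name is hypothetical, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive,
    false positive, and false negative counts."""
    # Precision: fraction of detected annotations that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of ground truth annotations that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(8, 2, 4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

When the same calculation is used to compare two assignees or two models, one of them simply takes the place of the ground truth, and the resulting F1 is read as a consistency score.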

  • A key point here is the notion of sufficient overlap. The amount of overlap between Boxes is usually measured in terms of intersection over union (IOU). The intersection is the area shared by both Boxes. The union is the total area covered by either Box. IOU is the ratio of the intersection to the union, ranging from 0 (no overlap) to 1 (identical Boxes). Both areas are straightforward to compute because the sides of the Boxes are parallel to the horizontal and vertical axes of the images. If a single IOU threshold is chosen it is typically 0.5, but often a plot is given of the effect of using different IOU thresholds. Three final notes are important here:
    • When multiple possible ground truth annotations (same image, same label) are found for a detected annotation, the ground truth annotation having the largest IOU with the detected annotation should be chosen.
    • It is recommended that the exclusion line is not used in the performance evaluation of assignees and ML models.
    • Further experiments with the machine learning model and its output could fine-tune it for a variety of use cases. Typically these are based on the summary statistic of the mean average precision (mAP), which is widely discussed in the detection literature. This is beyond the scope of Scout 1.0.
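
The IOU calculation and the matching rule described above can be sketched as follows. This is a hedged illustration under stated assumptions: Boxes are taken to be (x_min, y_min, x_max, y_max) tuples, each annotation is an (image, classification, box) tuple, and all names and sample values are hypothetical rather than the actual CSV layout. Detected annotations are processed in the order given, which is a simplification.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes,
    each assumed to be (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_annotations(detected, ground_truth, iou_threshold=0.5):
    """Count true positives, false positives, and false negatives.

    detected, ground_truth: lists of (image, classification, box).
    Each ground truth annotation is assigned to at most one detection.
    """
    used = set()  # indices of ground truth annotations already assigned
    tp = fp = 0
    for d_img, d_cls, d_box in detected:
        best_iou, best_idx = 0.0, None
        for idx, (g_img, g_cls, g_box) in enumerate(ground_truth):
            if idx in used or g_img != d_img or g_cls != d_cls:
                continue
            score = iou(d_box, g_box)
            # Keep the candidate with the largest IOU above the threshold.
            if score >= iou_threshold and score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is None:
            fp += 1
        else:
            used.add(best_idx)
            tp += 1
    fn = len(ground_truth) - len(used)
    return tp, fp, fn


# Made-up annotations for illustration.
detected = [
    ("img1", "elephant", (0, 0, 10, 10)),
    ("img1", "elephant", (50, 50, 60, 60)),
    ("img2", "giraffe", (0, 0, 10, 10)),
]
ground_truth = [
    ("img1", "elephant", (1, 1, 10, 10)),
    ("img2", "zebra", (0, 0, 10, 10)),
]
print(match_annotations(detected, ground_truth))
```

Calling match_annotations with several values of iou_threshold gives the data for a plot of performance against the IOU threshold.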