Thoughts on MSCOCO AP evaluation metrics
AP (Average precision) is the standard metric for evaluating object detectors (https://cocodataset.org/#detection-eval) and in this post, I want to talk about some things to keep in mind when using it as your measuring stick.
Some intuition first
To understand what AP actually means one first needs to get acquainted with the Precision-Recall Curve.
Let's take an example
On the y-axis you have precision (how accurate your detections are), and on the x-axis, you have recall (how many instances you can correctly detect), with both ranging from 0 to 100%. The curve captures the quintessential tradeoff between detecting more objects and getting fewer false positives as you change your confidence threshold (aka operating point).
The most intuitive way to use the curve is to start by asking either one of two questions:
- What is the minimum Precision I want to achieve in the field, or to put it another way, what is the maximum false positive rate I am willing to accept? For instance, if you were willing to accept only one false positive in every 10 detections, i.e. a maximum of 10% False Positive Rate (used loosely in this post to mean 1 - Precision, the share of detections that are wrong), then you would need to have at least 90% precision.
- What is the minimum True Positive Rate I am willing to accept? For instance, if you wanted to ensure your model picked up at least 90% of the objects you want to detect, then you would be looking at a minimum recall value of 90%.
If you were willing to accept no more than 2 False Positives in every 10 detections (i.e. 20% FPR or 80% Precision), then a simple intersection check will tell you what Recall rate you should expect to get:
This is what it would look like for a minimum of 90% Precision:
If you want to figure out the expected Precision for a minimum Recall rate you would just start on the x-axis first, draw a vertical line, figure out where it would intersect with the curve, and from there it's just a matter of reading the corresponding value on the y-axis.
So, as you can see, the PR curve is a great tool to help you explore the precision vs recall tradeoff by enabling you to estimate the value of either metric by placing a minimum requirement on the other.
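The two lookups described above can be sketched in a few lines of Python. This is a minimal illustration, not the COCO reference code: the function names are hypothetical, and it assumes each detection has already been matched against the ground truth, so that you have a confidence score and a true-positive flag per detection plus the total number of ground-truth objects.

```python
import numpy as np

def pr_curve(scores, is_tp, num_gt):
    """Precision and recall at every operating point, highest confidence first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # sort detections by descending score
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(hits)                         # true positives kept so far
    fp = np.cumsum(~hits)                        # false positives kept so far
    precision = tp / (tp + fp)
    recall = tp / num_gt
    return precision, recall

def recall_at_min_precision(precision, recall, min_precision):
    """Best recall among operating points that meet the precision requirement."""
    ok = precision >= min_precision
    return float(recall[ok].max()) if ok.any() else 0.0

def precision_at_min_recall(precision, recall, min_recall):
    """Best precision among operating points that meet the recall requirement."""
    ok = recall >= min_recall
    return float(precision[ok].max()) if ok.any() else 0.0

# Toy data (made up for illustration): 5 detections, 4 ground-truth objects.
precision, recall = pr_curve(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5],
    is_tp=[True, True, False, True, False],
    num_gt=4,
)
print(recall_at_min_precision(precision, recall, 0.9))   # recall you get at >= 90% precision
print(precision_at_min_recall(precision, recall, 0.75))  # precision you get at >= 75% recall
```

Reading the value off the plot and running this lookup are the same operation; the code just makes the "draw a line and find the intersection" step explicit.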
Area Under the Curve
Now, the final AP score for a given class at a given IoU is just the area under the PR curve for that configuration (for an IoU range, it is the average of those areas over the IoU thresholds in the range), as depicted in the following image:
If your model was perfect (as in, able to detect all instances correctly without a single FP), this is what your curve would look like:
The Area Under the Curve (AUC) would be 1.0.
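As a rough sketch of the "area under the PR curve" idea, the snippet below integrates a PR curve after making precision monotonically non-increasing (the usual interpolation step). The function name is hypothetical and this is an all-point approximation, not the COCO reference implementation, which samples the interpolated curve at 101 evenly spaced recall values and can therefore give slightly different numbers.

```python
import numpy as np

def area_under_pr_curve(precision, recall):
    """All-point interpolated area under a PR curve.

    `precision` and `recall` are per-operating-point values ordered by
    increasing recall (i.e. decreasing confidence threshold).
    """
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Interpolation: precision at recall r becomes the best precision at any recall >= r.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles between consecutive recall values.
    return float(np.sum(np.diff(r) * p[1:]))

# A perfect detector: every detection is correct and every object is found.
print(area_under_pr_curve(precision=[1.0, 1.0, 1.0, 1.0],
                          recall=[0.25, 0.5, 0.75, 1.0]))

# An imperfect one (made-up numbers for illustration).
print(area_under_pr_curve(precision=[1.0, 1.0, 2 / 3, 0.75, 0.6],
                          recall=[0.25, 0.5, 0.5, 0.75, 0.75]))
```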
Even though a higher AUC is generally better, two models with the same AUC can exhibit very different behaviors.
In the graph above, both curves have the same AUC and would thus be considered equally good if one were to look only at the final AP score, but beneath the apparent similarity lies quite a significant difference in how the models would operate.
Model 1 is able to consistently maintain 100% Precision for the first 40% of the targets, while Model 2 would incur a 20% FPR just to detect 30% of the targets.
In a lot of inference systems in production, the model outputs might be acted upon directly without first being processed by a smoothing/tracking step. This is typical in situations where the NN inference pass is triggered by motion, and you have to make a decision with only a small set of frames, therefore having little time to build a sufficiently good model of the world for tracking purposes. In those situations, every false positive can translate directly into a wrong action, so how a model behaves in the high-precision part of the curve matters more than its overall AUC.
A different way of evaluating object detection models would be to use precision-bound AUC, where you discard everything below the minimum acceptable precision.
Here is an example of how the two models from above would compare if we were to enforce a minimum of 80% precision.
In this case, Model 1 would clearly come out on top, since it performs better in the desired operating range (at most 20% FPR).
If we were to use a stronger restriction of 90% minimum precision, the difference between the two would be even more pronounced.
I believe precision-bound AUC can be very useful for comparing different models/iterations, especially when a lower FPR is desirable.
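Here is one possible way to compute a precision-bound AUC, written as a hedged sketch. The function name and the exact definition are assumptions: this version keeps only the parts of the interpolated PR curve whose precision is at least the bound and sums the area under them. The two curves are made-up numbers chosen to mimic the situation described above (same full AUC, different behavior at high precision).

```python
import numpy as np

def precision_bound_auc(precision, recall, min_precision=0.0):
    """Area under the interpolated PR curve, counting only precision >= min_precision."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing in recall
    widths = np.diff(r)                        # recall covered by each step
    heights = p[1:]                            # interpolated precision of each step
    keep = heights >= min_precision            # discard everything below the bound
    return float(np.sum(widths[keep] * heights[keep]))

# Model 1: 100% precision up to 40% recall, then a sharp drop.
model_1 = dict(precision=[1.0, 0.5], recall=[0.4, 0.6])
# Model 2: 80% precision up to 30% recall, then 65% precision up to 70% recall.
model_2 = dict(precision=[0.8, 0.65], recall=[0.3, 0.7])

for bound in (0.0, 0.8, 0.9):
    auc_1 = precision_bound_auc(**model_1, min_precision=bound)
    auc_2 = precision_bound_auc(**model_2, min_precision=bound)
    print(f"min precision {bound:.0%}: model 1 = {auc_1:.2f}, model 2 = {auc_2:.2f}")
```

With no bound the two toy models tie, while any bound of 80% or higher separates them in favor of Model 1, which is the behavior the metric is meant to expose.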
Object size brackets
MSCOCO AP evaluation metrics produce AP values for three different size brackets: small, medium, and large.
- Small: area between 0 and 32 * 32 pixels
- Medium: area between 32 * 32 and 96 * 96 pixels
- Large: area above 96 * 96 pixels
One typically trains an object detector with a given input resolution size in mind. Let's say you are building a 300 * 300 MBV2-SSD; in that case, every image fed to the network would first be resized to 300 * 300. There are a lot of techniques nowadays that use multi-scale training and testing regimes, but generally speaking, you always resize the images to your desired input size.
It's important to notice that each bounding box in a dataset is placed in a different size bracket based on its image-space size, without considering what its final size would be after resizing its image to the desired network input size.
This has some interesting implications, especially if you are dealing with a heterogeneous test set made up of images with different resolutions and aspect ratios.
Let's look at an example of how two images with one bounding box each would be pre-processed in the forward pass if we were dealing with a 300x300 object detector.
The above image depicts two scenarios in which a bounding box of the same size in the original image (and therefore in the same size bracket in the final evaluation results) ends up looking very different from the network's perspective. The box that shrinks more is, understandably, considerably harder to detect given its reduced area, which leads to misleading per-bracket numbers. If we were to consider the bounding box size after input resizing, it would end up in a different bracket.
This is one of the reasons why it is important to distinguish between the target's original size versus its size on the network input, also known as the number of Pixels On Target (POT).
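To make the difference concrete, here is a small sketch that computes the size bracket of a box both in the original image and after resizing to the network input. The helper names and the example image sizes are assumptions made for illustration. The sketch uses box width * height as the area (the COCO reference evaluation uses the annotation's area field instead) and assumes a plain stretch-resize without letterboxing or cropping.

```python
def size_bracket(area):
    """Map an area in pixels to the small/medium/large brackets described above."""
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"

def brackets_before_and_after_resize(box_wh, image_wh, input_wh=(300, 300)):
    """Return (bracket in the original image, bracket on the network input)."""
    box_w, box_h = box_wh
    img_w, img_h = image_wh
    in_w, in_h = input_wh
    # A stretch-resize scales each axis independently.
    pot_w = box_w * in_w / img_w
    pot_h = box_h * in_h / img_h
    return size_bracket(box_w * box_h), size_bracket(pot_w * pot_h)

# The same 100 * 100 box in two images of different resolution.
print(brackets_before_and_after_resize((100, 100), (640, 480)))
print(brackets_before_and_after_resize((100, 100), (4000, 3000)))
```

Both boxes are counted in the same bracket by the evaluation, yet the second one covers only a handful of pixels on the network input.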
Consider truncation on your APs
Another thing to keep in mind when evaluating the accuracy of your model for the different size brackets is that, in many instances, truncated objects might make up the bulk of your test/validation set even if the truncated object itself is quite large.
I find it useful to be able to skip truncated GT (ground-truth) bounding boxes if detecting truncated objects is not an important requirement for your use case. This can easily be done even without truncation metadata, just by ignoring all detections and ground-truth bounding boxes that lie within a given margin of the image edge.
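A minimal sketch of that edge-margin filter could look like the following. The function names, the default margin, and the (x_min, y_min, x_max, y_max) pixel box format are all assumptions for illustration; COCO annotations store boxes as [x, y, width, height], so they would need converting first.

```python
def touches_edge(box, image_size, margin=2):
    """True if the box comes within `margin` pixels of any image border."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    return (
        x_min <= margin
        or y_min <= margin
        or x_max >= width - margin
        or y_max >= height - margin
    )

def drop_truncated(boxes, image_size, margin=2):
    """Keep only the boxes that stay clear of the image border."""
    return [box for box in boxes if not touches_edge(box, image_size, margin)]

# Apply the same filter to ground truth and detections before evaluating.
ground_truth = [(0, 100, 50, 200), (100, 100, 200, 200), (600, 400, 640, 480)]
print(drop_truncated(ground_truth, image_size=(640, 480)))
```

The margin is a judgment call: too small and slightly inset truncated boxes slip through, too large and you start discarding fully visible objects near the border.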
Let me know what you think! You can reach me on LinkedIn or Twitter.