Detecting small objects in high-resolution images is challenging. There is a simple reason for this: the original image is usually scaled down to fit the model's input tensor size. This makes small objects even smaller, to the point that they become unrecognizable.
The following figure shows this concept. The original Full HD image shows a ball, which we want our detector to find. We scale down the frame to 640x640, which is what our model accepts, and letterbox it in order to preserve the aspect ratio. As you can see, the ball is now so small that it covers barely a few pixels. The detector will have a really hard time nailing that prediction.
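To put numbers on that shrinkage, here is a minimal sketch of the letterbox arithmetic. The function names and the 24x24 pixel ball size are assumptions made up for illustration; only the Full HD frame and the 640x640 input come from the example above.

```python
def letterbox_scale(src_w, src_h, dst_w, dst_h):
    """Return the uniform scale factor used when letterboxing src into dst."""
    # The most constrained axis decides the scale, so the aspect ratio is kept.
    return min(dst_w / src_w, dst_h / src_h)


def scaled_size(obj_w, obj_h, scale):
    """Return the object's size in pixels after applying the scale factor."""
    return obj_w * scale, obj_h * scale


# Full HD frame letterboxed into a 640x640 model input.
scale = letterbox_scale(1920, 1080, 640, 640)

# Hypothetical 24x24 pixel ball in the original frame (assumed size).
ball_w, ball_h = scaled_size(24, 24, scale)
print(f"scale={scale:.3f}, ball={ball_w:.1f}x{ball_h:.1f} px")
```

With these assumed numbers the frame is scaled by one third, so the ball ends up at roughly 8x8 pixels in the model input.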
This post describes one possible method to improve detection performance of small objects using tiled predictions.
Improving Detection Performance with Tiled Predictions
The genius of tiled prediction lies in its simplicity. Instead of scaling down the image, we simply split the original frame into windows which are fed to the model. By doing so, the objects in the scene retain their original size. You'll hear this referred to as sliced, windowed or tiled detection.
The following figure exemplifies the concept above. Instead of scaling down the original image, we crop out a series of 640x640 windows and pass those to the model individually. By doing so, the player and ball (in the example below) retain their original size and are more easily detected. Of course, we eventually need to consolidate these predictions, but we'll address that later on.
Did you notice the potential problem? We had the bad luck that the ball ended up at the very boundary between two tiles. This, of course, will also affect the detection performance.
Again, the solution to this is rather simple: allow some overlap between the tiles. The following figure shows how this concept helps the boundary problem.
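As a rough sketch of how overlapping windows could be generated, the snippet below computes tile coordinates in pure Python. The helper names and the 128-pixel overlap are assumptions for illustration, not values taken from any particular library. The last tile on each axis is shifted back so that it ends exactly at the image border.

```python
def tile_starts(length, tile, overlap):
    """Return start offsets along one axis so the tiles cover [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    # Shift the last tile back so it ends exactly at the image border.
    starts.append(length - tile)
    return starts


def make_tiles(img_w, img_h, tile=640, overlap=128):
    """Return (x0, y0, x1, y1) windows that cover the image with overlap."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in the range [0, tile)")
    return [
        (x, y, min(x + tile, img_w), min(y + tile, img_h))
        for y in tile_starts(img_h, tile, overlap)
        for x in tile_starts(img_w, tile, overlap)
    ]


# Full HD frame split into 640x640 tiles with an assumed 128 px overlap.
tiles = make_tiles(1920, 1080)
for window in tiles:
    print(window)
```

An object sitting on a tile boundary now falls fully inside at least one window, as long as it is smaller than the overlap.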
Consolidating Predictions
The remaining part is to combine the predictions from the different windows into a single, unified result. In its simplest form, you can attempt to combine the tile position within the image with the prediction position within the tile. With simple arithmetic you can compute the global prediction. This may also provide a simple heuristic to match duplicate predictions from adjacent tiles.
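The arithmetic is just an offset. Here is a minimal sketch, assuming boxes are stored as (x0, y0, x1, y1) in pixels and that each tile is identified by its top-left corner in the full image; both conventions, and the sample values, are assumptions for illustration.

```python
def to_global(box, tile_origin):
    """Shift a tile-local (x0, y0, x1, y1) box into full-image coordinates."""
    x0, y0, x1, y1 = box
    origin_x, origin_y = tile_origin
    return (x0 + origin_x, y0 + origin_y, x1 + origin_x, y1 + origin_y)


# Hypothetical detection inside the tile whose top-left corner is (1280, 440).
tile_origin = (1280, 440)
local_box = (100, 50, 130, 80)

global_box = to_global(local_box, tile_origin)
print(global_box)
```

Once every prediction lives in the same coordinate system, boxes coming from adjacent tiles can be compared directly.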
If you ever embark on this coding challenge, you'll quickly realize that these simple heuristics are not enough in complex cases. The predictions in two different tiles may, even if they are of the same object, result in significantly different bounding boxes. Merging those cases together is not so trivial anymore.
To overcome this problem, more advanced heuristics are typically used. Among the most common are non-maximum suppression (NMS), non-maximum merging (NMM), or any of their variants, which are used to find equivalences between the tiles.
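To make the idea concrete, here is a minimal sketch of greedy, class-agnostic NMS over predictions that were already mapped to global coordinates. The (box, score) layout, the 0.5 IoU threshold and the sample detections are assumptions for illustration; real implementations usually also take the class label into account.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop the ones overlapping it."""
    kept = []
    # Visit detections from the highest to the lowest score.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections: the first two are the same ball seen from two tiles.
detections = [
    ((1380, 490, 1410, 520), 0.9),
    ((1382, 492, 1412, 522), 0.7),
    ((200, 300, 260, 360), 0.8),
]
merged = nms(detections)
print(merged)
```

NMS simply discards the lower-scoring duplicate. NMM would instead merge the overlapping boxes into one, which tends to help when each tile only sees part of an object.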
Existing solutions implementing sliced detection report an increase in average precision of up to 6.8% when applied to the VisDrone and xView aerial object detection datasets.
Performance Penalty
It's time to address the elephant in the room: the performance penalty of using tiled inference. We are no longer performing a single inference, but many of them for a single image. It is true that we save some time by avoiding the scaling and letterboxing process; however, we replace it with multiple crops. On top of all this, we need to account for the post-processing time that the consolidation takes.
Truth is, there's no way around this. You may be able to combine all tiles into a single batch and process them all at once. While they are still going to be predicted individually, feeding the GPU (if you have one) a single batch may speed things up by making better use of the available hardware.
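For a rough feel of the cost, the sketch below counts how many tiles, and therefore how many forward passes or batch entries, a frame produces. The function names and the 128-pixel overlap are assumptions for illustration, and the count says nothing about actual latency, which depends on your model and hardware.

```python
import math


def tiles_per_axis(length, tile, overlap):
    """Number of tiles needed to cover one axis with the given overlap."""
    if length <= tile:
        return 1
    return math.ceil((length - tile) / (tile - overlap)) + 1


def inference_count(img_w, img_h, tile=640, overlap=128):
    """Total number of tiles, i.e. model inputs, for a single frame."""
    return tiles_per_axis(img_w, tile, overlap) * tiles_per_axis(img_h, tile, overlap)


full_hd = inference_count(1920, 1080)
full_hd_no_overlap = inference_count(1920, 1080, overlap=0)
ultra_hd = inference_count(3840, 2160)
print(full_hd, full_hd_no_overlap, ultra_hd)
```

Under these assumptions a Full HD frame turns into 8 model inputs (6 without overlap) and a 4K frame into 32, so both the resolution and the overlap have a direct impact on throughput.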
Introducing: SAHI
SAHI stands for Slicing Aided Hyper Inference and was first published in 2022 at the IEEE International Conference on Image Processing. From the paper abstract:
In this work, an open-source framework called Slicing Aided Hyper Inference (SAHI) is proposed that provides a generic slicing aided inference and fine-tuning pipeline for small object detection.
SAHI implements the techniques discussed in the previous sections abstracted in a simple Python package. Make sure you visit their GitHub repository and check the resources in there as they are awesome.
Improving Detection with SAHI
It's finally time to play with SAHI and tiled predictions! The following section is meant to be an interactive notebook for you to run the commands, modify them and get familiar with the tool.
Make sure you press the 🚀 icon at the top right followed by Live Code to run the commands right within this same page!
Oh, and... be patient! The server hosting this notebook is not so powerful, so commands may take a while. Enjoy!
Closing Remarks
Detecting small objects can be challenging due to the initial image scaling
Tiled inference is a technique to process the image in windows to avoid scaling
SAHI is a simple Python package that implements tiled inference
SAHI is generic and can be used with any detector (with a little adaptation, of course)
You can easily improve your detection performance using SAHI
Use it when real-time performance is not your bottleneck