Michael Gruner

Improving Detection Performance with SAHI and Tiled Predictions

Updated: Apr 16


Detecting small objects in high-resolution images is challenging. There is a simple reason for this: the original image is usually scaled down to fit the model's input tensor size. As a result, small objects become even smaller, to the point that they are unrecognizable.


The following figure shows this concept. The original Full HD image shows a ball, which we want our detector to find. We scale down the frame to 640x640, which is what our model accepts, and letterbox it in order to preserve the aspect ratio. As you can see, the ball is now so small that it is barely a few pixels in area. The detector will have a really hard time nailing that prediction.


Two images of a soccer penalty, side by side. On both images a red arrow points to the ball. On the left, the original image that is measured 1920x1080. To the right, the same image but scaled down. The ball is almost unrecognizable.
Figure. The effects of scaling an image for a deep learning model. Left: the original image where the ball is barely visible. Right: the scaled image, the ball is a few pixels wide.
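
To put rough numbers on this, here is a minimal sketch that computes the letterbox scale factor for the Full HD frame and the 640x640 input mentioned above. The function name and the 30-pixel ball diameter are assumptions made for illustration; they are not measurements taken from the figure.

```python
def letterbox_scale(src_w, src_h, dst_w, dst_h):
    """Return the uniform scale factor and the scaled content size (before padding)."""
    # A single factor is used for both axes so the aspect ratio is preserved.
    scale = min(dst_w / src_w, dst_h / src_h)
    return scale, (round(src_w * scale), round(src_h * scale))


scale, content_size = letterbox_scale(1920, 1080, 640, 640)

ball_diameter = 30  # assumed size in the original frame, in pixels
scaled_diameter = ball_diameter * scale

print(f"scale factor: {scale:.3f}")
print(f"content size: {content_size}, the remaining rows are padding")
print(f"ball: {ball_diameter}px -> {scaled_diameter:.0f}px")
```

With these inputs the frame shrinks to one third of its size on each axis, so an object keeps only about one ninth of its original pixel area.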

This post describes one possible method to improve detection performance of small objects using tiled predictions.


Improving Detection Performance with Tiled Predictions


The genius of tiled prediction lies in its simplicity. Instead of scaling down the image, we simply split the original frame into windows which are fed to the model. By doing so, the objects in the scene retain their original size. You'll hear this referred to as sliced, windowed or tiled detection.


The following figure exemplifies the concept above. Instead of scaling down the original image, we crop out a series of 640x640 windows and pass them to the model individually. By doing so, the player and ball (in the example below) retain their original size and are more easily detected. Of course, we eventually need to consolidate these predictions, but we'll address that later on.


Two images of a soccer penalty side-by-side. The image on the left is larger and is measured as 1920x1080. It has a grid in it. The image on the right corresponds to the first tile of the image. It is measured as 640x640. The tile shows a player and a ball at the very edge.
Figure. Naive image tiling. The cropped out windows are passed to the model so objects retain their original size.

Did you notice the potential problem? We had the bad luck that the ball ended up at the very boundary between two tiles. This, of course, will also affect the detection performance.


Two images side by side. The images are a continuation of each other and show a player ready to kick a penalty in a soccer field. The player is almost centered in the left image. The ball is split in the boundary between the two images.
Figure. Two adjacent tiles that share a portion of the ball.

Again, the solution to this is rather simple: allow some overlap between the tiles. The following figure shows how this concept helps with the boundary problem.


Two images of a soccer penalty side-by-side. The image on the left is larger and is measured as 1920x1080. It has a series of overlapping boxes in it. The image on the right corresponds to the second tile of the image. It is measured as 640x640. The tile shows a player and a ball at the very edge.
Figure: Example of overlapping tiles. If an object lands on a boundary, the overlap allows it to be detected in full from another tile.
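
The overlapping layout can be sketched in a few lines of plain Python. This is only one possible way to place the windows, not the exact scheme used by any particular library: the helper names, the 20% overlap, and the choice to shift the last window back so it ends at the image border are all assumptions.

```python
def tile_origins(length, tile, overlap_ratio):
    """Start offsets of the tiles along one axis.

    The last tile is shifted back so it ends exactly at the image border.
    """
    if length <= tile:
        return [0]
    # Distance between consecutive tile starts; smaller stride means more overlap.
    stride = max(1, round(tile * (1 - overlap_ratio)))
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)
    return origins


def tile_grid(width, height, tile=640, overlap_ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) crop windows covering the image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap_ratio)
        for x in tile_origins(width, tile, overlap_ratio)
    ]


tiles = tile_grid(1920, 1080)
for window in tiles:
    print(window)
```

For a Full HD frame with these settings the result is a grid of 4 columns by 2 rows, that is, 8 windows of 640x640 pixels each.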

Consolidating Predictions


The remaining part is to combine the predictions from the different windows into a single, unified result. In its simplest form, you add the tile's position within the image to the prediction's position within the tile. With this simple arithmetic you can compute the global prediction. The global coordinates also provide a simple heuristic to match duplicate predictions from adjacent tiles.

Two adjacent tiles extracted from the original images. They share a common prediction which is the player. Both predictions are combined into a single one in the original image.
Figure: The concept of consolidating predictions from adjacent windows into the original image.
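
The arithmetic described above amounts to adding the tile's top-left corner to every box coordinate. The sketch below assumes boxes in (x_min, y_min, x_max, y_max) pixel format; the function name and the example values are hypothetical.

```python
def to_global(box, tile_origin):
    """Translate a box from tile coordinates to full-image coordinates."""
    x_min, y_min, x_max, y_max = box
    off_x, off_y = tile_origin
    return (x_min + off_x, y_min + off_y, x_max + off_x, y_max + off_y)


tile_origin = (512, 440)          # hypothetical top-left corner of the tile in the image
local_box = (100, 200, 130, 230)  # hypothetical detection inside that tile

global_box = to_global(local_box, tile_origin)
print(global_box)
```

Because the tiles are cropped rather than resized, no scaling is involved, only a translation.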

If you ever embark on this coding challenge, you'll quickly realize that these simple heuristics are not enough in complex cases. The predictions from two different tiles may, even if they are of the same object, result in significantly different bounding boxes. Merging those cases is not so trivial anymore.


To overcome this problem, more advanced heuristics are typically used. The most common are non-maximum suppression (NMS), non-maximum merging (NMM), or one of their variants, which find equivalent predictions across tiles.
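
As an illustration of the first of these, here is a minimal greedy NMS sketch over boxes that have already been translated to global coordinates. It is a textbook version, not the implementation of any specific package; the 0.5 IoU threshold and the example detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the highest-scoring box of every group of heavily overlapping boxes."""
    kept = []
    # Visit detections from most to least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Drop the box if it overlaps too much with one we already kept.
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical global detections: the first two come from adjacent tiles
# and cover the same object; the third is a different object.
detections = [
    ((600, 630, 640, 672), 0.91),
    ((604, 632, 642, 670), 0.78),
    ((100, 100, 160, 260), 0.88),
]

kept = nms(detections)
for box, score in kept:
    print(box, score)
```

Note that plain NMS discards the lower-scoring duplicate, whereas NMM would merge the matched boxes instead. When a tile only sees part of an object, the IoU between the partial and the full box can fall below the threshold, which is one reason the variants exist.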


Existing solutions implementing sliced detection report an increase in average precision of up to 6.8% when applied to the Visdrone and xView aerial object detection datasets.

Performance Penalty


It's time to address the elephant in the room: the performance penalty of using tiled inference. We are no longer performing a single inference per image, but many. It is true that we save some time by skipping the scaling and letterboxing step, but we replace it with multiple crops. On top of that, we need to account for the post-processing time that the consolidation takes.


Truth is, there's no way around this. You may be able to combine all tiles into a single batch and process them at once. Each tile is still predicted individually, but feeding the GPU (if one is available) a single batch may speed things up by utilizing more of the hardware.
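
To get a feel for the cost, the sketch below counts how many tiles a Full HD frame produces for a few overlap ratios, and how many model calls remain once tiles are grouped into batches. The helper names, the overlap values, and the assumption that the last tile on each axis is shifted back to the image border are illustrative only.

```python
import math


def tiles_per_axis(length, tile, overlap_ratio):
    """Number of tiles needed to cover one axis with the given overlap."""
    if length <= tile:
        return 1
    stride = max(1, round(tile * (1 - overlap_ratio)))
    return math.ceil((length - tile) / stride) + 1


def forward_passes(width, height, tile=640, overlap_ratio=0.2, batch_size=1):
    """Return (number of tiles, number of model calls) for one image."""
    n_tiles = (tiles_per_axis(width, tile, overlap_ratio)
               * tiles_per_axis(height, tile, overlap_ratio))
    return n_tiles, math.ceil(n_tiles / batch_size)


for overlap in (0.0, 0.2, 0.5):
    n_tiles, _ = forward_passes(1920, 1080, overlap_ratio=overlap)
    print(f"overlap {overlap:.1f}: {n_tiles} tiles per frame")

n_tiles, n_calls = forward_passes(1920, 1080, batch_size=8)
print(f"batch of 8: {n_tiles} tiles in {n_calls} model call(s)")
```

Batching reduces the number of model calls, not the amount of arithmetic: every tile still goes through the network, so the gain depends on how much idle hardware the batch can put to use.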


Introducing: SAHI

https://github.com/obss/sahi

SAHI stands for Slicing Aided Hyper Inference and was first published in 2022 at the IEEE International Conference on Image Processing. From the paper abstract:

In this work, an open-source framework called Slicing Aided Hyper Inference (SAHI) is proposed that provides a generic slicing aided inference and fine-tuning pipeline for small object detection.

SAHI implements the techniques discussed in the previous sections abstracted in a simple Python package. Make sure you visit their GitHub repository and check the resources in there as they are awesome.


Improving Detection with SAHI


It's finally time to play with SAHI and tiled predictions! The following section is meant to be an interactive notebook for you to run the commands, modify them and get familiar with the tool.

Make sure you press the 🚀 icon at the top right followed by Live Code to run the commands right within this same page!

Oh, and... be patient! The server hosting this notebook is not so powerful so commands may take a while. Enjoy!



Closing Remarks

  • Detecting small objects can be challenging due to the initial image scaling

  • Tiled inference is a technique to process the image in windows to avoid scaling

  • SAHI is a simple Python package that implements tiled inference

  • SAHI is generic and can be used with any detector (with a little adaptation, of course)

  • You can easily improve your detection performance using SAHI

  • Use it when real-time performance is not your bottleneck
