RidgeRun has been working on using deep learning for action recognition. Two examples are the assembly line action recognition and the MMA (Mixed Martial Arts) action recognition projects. Normally, in action recognition, the network should process only the subjects of interest and not the entire picture, during both training and inference. That said, the first step to a successful action recognition application is to properly identify these subjects in the scene.
In the case of action recognition for MMA, the subjects of interest are the fighters. RidgeRun developed a deep learning model that automatically detects fighters in crowded scenes such as MMA fight events or training gyms. The rest of the actors in the image sequence are ignored. This serves as a preprocessing step for the action detection neural network, which will be used in an automated scoring application. Extracting the fighters by hand would be extremely costly and time-consuming. In this blog, we will go over the details of mixed martial arts fighter detection using deep learning.
Once we had selected a neural network architecture, trained it with a dataset we created, and validated it, we proceeded to deploy it to a Jetson Nano. To do so, we transformed the network into a DeepStream engine and executed it using GStreamer. The resulting pipeline runs at 41 FPS with an IOU of 0.44. On top of this, the pipeline also uses an NVIDIA tracker, specifically the IOU tracker, which identifies fighters throughout the video and increases robustness against spurious occlusions or missed detections.
The present document describes the architecture selection process, the dataset preparation, and the comparison of performance metrics in different scenarios to determine the best model for detecting MMA fighters, considering not only accuracy but also runtime performance on embedded systems.
Dataset
The dataset comprises 3830 images with the following distribution:
3181 images with 2 fighters in the scene.
521 images with 1 fighter in the scene.
128 images with no fighters in the scene.
All images include other people, such as spectators, referees, etc., so the network can learn to ignore them when looking for fighters. The number of fighters per image varies from 0 to 2 to prevent the network from forcibly looking for a fixed number of subjects in the scene.
The dataset was augmented using shear, blur, noise, and exposure variations to generate new samples. Of these images, 172 were reserved for testing the network, ensuring that performance is evaluated on unseen images. The following are some samples taken from the training set; note that they show augmentation effects such as noise and rotation.
Figure 1. Training set samples
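For illustration, an augmentation step like the one described above could be sketched with the Albumentations library as shown below. This is not the exact pipeline used for this dataset; the file name, box coordinates, and parameter values are placeholders.

```python
import cv2
import albumentations as A

# Augmentations roughly matching the ones described above: shear, blur,
# noise, and exposure variations. Bounding boxes (YOLO format) are
# transformed together with the image.
transform = A.Compose(
    [
        A.Affine(shear=(-10, 10), p=0.5),                  # shear
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),          # blur
        A.GaussNoise(p=0.3),                               # noise
        A.RandomBrightnessContrast(brightness_limit=0.3,   # exposure
                                   contrast_limit=0.1, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Hypothetical sample: one frame with two fighter boxes in YOLO format
# (normalized x_center, y_center, width, height).
image = cv2.cvtColor(cv2.imread("fight_frame.jpg"), cv2.COLOR_BGR2RGB)
bboxes = [(0.35, 0.55, 0.20, 0.60), (0.65, 0.50, 0.22, 0.65)]
labels = ["fighter", "fighter"]

augmented = transform(image=image, bboxes=bboxes, class_labels=labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```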
Architecture Exploration
The goal of this stage is to test different neural network architectures used for object detection and determine their performance in terms of precision and system load, towards an embedded application. The architectures tested for this project are listed in the table below.
Table 1. Architectures selected for performance measurement
| Family | Architecture |
|---|---|
| MobileNet | MobileNet SSD v2 |
| RCNN | Faster RCNN Inception |
| YOLO | YOLOv3 and YOLOv5 |
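For reference, the YOLO candidates in this comparison can be loaded directly through PyTorch Hub. The minimal sketch below pulls the public COCO-pretrained checkpoints rather than the fighter model trained for this project, and the exact hub entry points may vary between repository versions.

```python
import torch

# Public YOLOv5 checkpoints from the Ultralytics hub; the custom fighter
# detector would instead be loaded with the 'custom' entry point and its
# own weights file.
yolov5n = torch.hub.load("ultralytics/yolov5", "yolov5n", pretrained=True)
yolov5s = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
yolov5m = torch.hub.load("ultralytics/yolov5", "yolov5m", pretrained=True)

# The YOLOv3 family is published in a separate Ultralytics repository.
yolov3 = torch.hub.load("ultralytics/yolov3", "yolov3", pretrained=True)
yolov3_tiny = torch.hub.load("ultralytics/yolov3", "yolov3_tiny", pretrained=True)
```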
Performance results of fighter detection
We divided the experimentation with the different neural networks into three stages.
First Stage
The neural networks are run in an environment with virtually unconstrained resources, and every model's IOU and inference time are calculated.
Second Stage
The models with the highest average IOU and the lowest inference times are selected for the second stage. During this stage, the models are executed using PyTorch on a Jetson Nano to benchmark the CPU, RAM, and GPU utilization of each model. The one with the lowest resource utilization and the lowest inference time is selected.
Third Stage
The selected model is optimized by converting it into a DeepStream engine, and the same metrics as in the previous stage are gathered.
First Stage
For this stage, we obtained the following inference time results:
Figure 2. Inference time of all networks in the x86 system
As you can see, the TensorFlow models, MobileNet SSD v2 and Faster RCNN Inception, are slower than all the YOLO networks, with the RCNN model being the slowest of them all. As expected, the bigger YOLO models, YOLOv5m and YOLOv3, are the slowest of the YOLO family, but they still outperform the TensorFlow models. Among the smaller networks, YOLOv5n and YOLOv3 tiny are the fastest models, with YOLOv3 tiny being the fastest of all the YOLO networks.
The following figure presents the average IOU of each implementation:
Figure 3. IOU of all the neural networks
The network with the highest IOU is YOLOv5m, followed by YOLOv3 and YOLOv5s.
The faster networks do not fall far behind, with differences of 0.04 (YOLOv5n) and 0.14 (YOLOv3 tiny) compared to the network with the best IOU. The TensorFlow networks lag behind all the others, having the lowest IOU of them all. Taking this and the inference time into account, these networks are discarded and will not be used in the second stage of testing.
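For clarity, the IOU reported throughout this comparison is the standard intersection over union between a predicted box and its ground-truth box, averaged over the test set. A minimal sketch of the metric for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    # Union = sum of both areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: a prediction that covers most of the ground-truth box
print(iou((100, 50, 300, 400), (120, 60, 310, 420)))  # ~0.79
```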
Second Stage
In this stage, we ran the models on a Jetson Nano to check their performance in a system where resources are more constrained. The following image shows the inference time of each model:
Figure 4. Average inference time in Jetson Nano
As you can see, the bigger networks, YOLOv5m and YOLOv3, are penalized by their size, making them the slowest models in this stage. YOLOv5n and YOLOv3 tiny are the fastest networks, but this time the difference between them is smaller, just 8 ms.
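The exact benchmarking harness is not part of this post, but GPU inference time with the PyTorch models can be measured roughly as follows; the weights file and test frame are placeholders, and torch.cuda.synchronize() is needed so asynchronous CUDA work is included in the measurement.

```python
import time

import cv2
import torch

# Placeholder weights and frame; the real benchmark used the trained fighter models.
model = torch.hub.load("ultralytics/yolov5", "custom", path="fighters_yolov5n.pt")
model.to("cuda").eval()

frame = cv2.cvtColor(cv2.imread("fight_frame.jpg"), cv2.COLOR_BGR2RGB)

# Warm-up so CUDA kernels and caches are initialized before measuring
for _ in range(5):
    model(frame)

torch.cuda.synchronize()
start = time.perf_counter()
runs = 100
for _ in range(runs):
    model(frame)
torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
print(f"Average inference time: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```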
The CPU and RAM utilization of each model can be seen in the following images:
Figure 5. CPU and RAM utilization of the selected models in the second stage
The slower models, YOLOv3 and YOLOv5m, consume less CPU and RAM even though they are bigger. This effect is easily explained: since their inference times are much longer, the system processes frames at a slower rate and therefore uses fewer resources.
When it comes to GPU utilization, these are the resulting graphs:
Figure 6. GPU usage of the selected models in the second stage
YOLOv5n never reaches 100% GPU utilization and shows an overall lower GPU usage than the other networks. The slower networks use much more of the GPU, and their longer inference times keep it busy for longer. The valleys visible in all the graphs correspond to the time between inferences. YOLOv5n has the lowest GPU utilization, a better IOU, and an inference time, CPU, and RAM utilization similar to YOLOv3 tiny, which is why it was selected for the last stage of experimentation.
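As a reference for reproducing these measurements, CPU and RAM utilization can be sampled from user space with psutil, while on Jetson boards the GPU load is typically read from the tegrastats utility. The sketch below is an assumption about such a sampler, not the tool used for these graphs, and the GR3D_FREQ field and output format vary between JetPack releases.

```python
import re
import subprocess

import psutil


def sample_utilization(interval_ms=1000, samples=10):
    """Sample CPU, RAM, and GPU (GR3D) utilization on a Jetson board."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE, text=True)
    readings = []
    for _ in range(samples):
        line = proc.stdout.readline()                  # one tegrastats report per interval
        gpu = re.search(r"GR3D_FREQ (\d+)%", line)     # GPU load as reported by tegrastats
        readings.append({
            "cpu_percent": psutil.cpu_percent(),       # overall CPU utilization
            "ram_percent": psutil.virtual_memory().percent,
            "gpu_percent": int(gpu.group(1)) if gpu else None,
        })
    proc.terminate()
    return readings
```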
Third Stage
In this stage, we transformed the YOLOv5n PyTorch model into a DeepStream engine. This model was deployed using GStreamer and the DeepStream plug-ins. The resulting GStreamer pipeline runs at 41 FPS, which means that it takes 24.39 ms from reading a frame from memory to displaying it, making it 3.31 times faster than the PyTorch model. This improvement comes at a cost: the new IOU calculated for this model is 0.44, making it less precise than its PyTorch counterpart. The new model also uses more CPU and RAM resources, as you can see in this image:
Figure 7. CPU and RAM utilization of the engine and YOLOv5n models
When it comes to GPU resources, you can check the following image:
Figure 8. GPU usage of engine model
Once converted to a DeepStream engine, the new model can actually make use of all the GPU resources that the Jetson Nano has. This network, running at 41 FPS on a Jetson Nano with an IOU of 0.44, is considered a success.
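One common way to produce such an engine, sketched below under the assumption of an already exported ONNX file, is to export the PyTorch weights to ONNX (for example with the export script in the YOLOv5 repository) and then build the ONNX graph into a TensorRT engine that DeepStream can load. The file names are placeholders, and the exact builder API depends on the TensorRT version shipped with the JetPack release.

```python
import tensorrt as trt

ONNX_FILE = "fighters_yolov5n.onnx"      # placeholder: exported YOLOv5n weights
ENGINE_FILE = "fighters_yolov5n.engine"  # placeholder: engine consumed by DeepStream

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition
with open(ONNX_FILE, "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)    # FP16 helps on the Nano's GPU
config.max_workspace_size = 1 << 28      # 256 MB of scratch space

# Build and serialize the engine to disk
engine = builder.build_serialized_network(network, config)
with open(ENGINE_FILE, "wb") as f:
    f.write(engine)
```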
Final network selection
YOLO stands for You Only Look Once. This means that, after the network divides the image into n squares, it makes a single pass to predict which objects the boxes contain and to infer bounding boxes for them. Typically, the model outputs several bounding boxes per object, and a post-processing stage removes these duplicates.
The process of removing duplicate predictions for a single object is known as non-max suppression. The concept behind non-max suppression is quite simple. First, the prediction with the highest confidence is selected. Then, the IOU (intersection over union) between this box and the remaining ones is computed. Predictions with an IOU above a selected threshold are removed, and the process is repeated until only independent bounding boxes remain. This effectively gets rid of duplicate low-quality predictions. On the other hand, non-max suppression makes it harder to detect overlapping objects.
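A minimal sketch of this procedure, using the same IOU metric as before, could look like this:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, as defined earlier."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Keep the highest-confidence boxes and drop overlapping duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining candidates that overlap the selected box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


# Two strongly overlapping detections of the same fighter plus one separate box
boxes = [(100, 50, 300, 400), (110, 60, 305, 410), (600, 80, 800, 420)]
scores = [0.92, 0.85, 0.88]
print(non_max_suppression(boxes, scores))  # -> [0, 2]
```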
An example of the final processing can be seen in the following image:
Figure 9. YOLOv5n inference example in DeepStream
This engine runs in DeepStream using the NVIDIA inference plugins at 41 FPS with a video resolution of 1080x1920 pixels. Inference is done at a resolution of 412x412 pixels, so the video is resized and the resulting bounding boxes are scaled back to the original resolution. This is done automatically by the pipeline elements, making the implementation of the neural network even easier.
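A pipeline along these lines can be assembled from the DeepStream GStreamer elements. The sketch below is an approximation for illustration rather than the exact production pipeline: the video file, the nvinfer configuration file, and the tracker library path are placeholders, and the IOU tracker library name differs between DeepStream releases (older versions ship a dedicated libnvds_mot_iou.so, newer ones bundle it into the multi-object tracker library).

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Decode the fight video, batch it for nvinfer, run the YOLOv5n engine,
# track the detected fighters, and draw the bounding boxes on screen.
pipeline = Gst.parse_launch(
    "filesrc location=fight.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=yolov5n_fighters.txt ! "
    "nvtracker ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_mot_iou.so ! "
    "nvvideoconvert ! nvdsosd ! nvegltransform ! nveglglessink"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```

Here nvinfer loads the engine, resizes each frame to the network resolution, and scales the resulting bounding boxes back to the original resolution, while nvtracker provides the IOU tracker mentioned at the beginning of the post.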