ML at Parallel Domain
In recent years, learning-based perception models have made the jump from the research lab to viable commercial products deployed in the real world. While machine learning approaches continue to improve year after year, they remain limited by the data available during training. Even though a robust ML model can generalize what it has learned to unseen data, imbalances in the class distribution still pose a significant challenge to model performance.
Because traditional driving datasets are collected from the real world, they mirror this class imbalance. Cyclists, for example, account for only ~4% of objects in KITTI [1] and ~0.2% in nuImages by Motional [2]. Cars, on the other hand, occur 18x more frequently than cyclists in KITTI and 200x more frequently in nuImages. As a result, models are better at detecting cars than cyclists: the best-performing models on common object detection benchmarks show 13-15% lower average precision on cyclists than on cars. Failing to detect a cyclist can easily lead to a fatal accident, and cyclist fatalities in the United States have increased 36% since 2010.
When generating synthetic data, we have fine-grained control over the class distribution. We show that we can leverage this control to improve the performance of an off-the-shelf 2D object detector, without any architectural changes, by incorporating high-quality synthetic data into the training loop.
To examine the impact of adding synthetic data, we compare results on two publicly available datasets of different sizes: KITTI and nuImages. We generate a separate synthetic dataset for each, named PD-K and PD-N respectively. For each dataset we match the sensor configuration as well as environmental factors such as time of day and weather conditions of the reference data. We also increase the relative frequency of cyclists in each synthetic dataset to ~13-15%.
Dataset | # training images | # of cyclists (train) | # validation images | # of cyclists (val) |
---|---|---|---|---|
KITTI | 3.7K | 734 | 3.7K | 893 |
nuImages | 67.3K | 1,168 | 16.4K | 238 |
PD-K (KITTI) | 11.1K | 43,435 | n/a | n/a |
PD-N (nuImages) | 67.3K | 129,236 | n/a | n/a |
Table 1: Sizes of the datasets and the number of cyclist annotations in each.
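The generation pipeline itself is not shown in this post, but conceptually the control we exercise looks like the following hypothetical configuration sketch. All keys, values, and preset names here are illustrative assumptions, not Parallel Domain's actual API:

```python
# Hypothetical scenario-generation config for PD-N (illustrative only; not
# Parallel Domain's actual API). The key idea: match the reference dataset's
# sensor rig and environmental conditions, but boost cyclist frequency.
pd_n_config = {
    "reference_dataset": "nuImages",
    "num_images": 67_300,                # match the nuImages training set size
    "sensor_rig": "nuimages_front_rgb",  # hypothetical rig preset
    "environment": {
        "time_of_day": ["day", "dusk", "night"],  # mirror the reference data
        "weather": ["clear", "overcast", "rain"],
    },
    # Raise cyclists from ~0.2% of instances (real data) to ~13-15%.
    "class_distribution": {"Car": 0.55, "Pedestrian": 0.31, "Cyclist": 0.14},
}
```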
In order to obtain comparable results, we limit the classes in nuImages to match those available in KITTI, i.e., Cars, Pedestrians, and Cyclists. We then train an off-the-shelf YOLOv3 [3][4] model and measure performance via overall mAP and per-class AP.
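As an illustration, the class restriction reduces to a simple lookup. The nuScenes-style category names below are an assumption about the nuImages taxonomy, and the handling of riders is simplified:

```python
# Sketch of restricting nuImages to KITTI's three classes. Category and
# field names are assumptions; the actual pipeline is not shown in the post.
NUIMAGES_TO_KITTI = {
    "vehicle.car": "Car",
    "human.pedestrian.adult": "Pedestrian",
    "human.pedestrian.child": "Pedestrian",
    "vehicle.bicycle": "Cyclist",  # simplification: counts as Cyclist when ridden
}

def remap_annotations(annotations):
    """Drop annotations outside the KITTI classes and rename the rest."""
    remapped = []
    for ann in annotations:
        kitti_class = NUIMAGES_TO_KITTI.get(ann["category_name"])
        if kitti_class is not None:
            remapped.append({**ann, "category_name": kitti_class})
    return remapped
```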
First, we establish a baseline to see how the detector performs when trained on real data only. To improve upon this baseline, we employ two simple training procedures.
During joint-training, we concatenate the real and synthetic datasets and randomly sample batches from the combined dataset. For fine-tuning, we first train a model purely on PD data and then train all layers on the corresponding real-world dataset with a reduced learning rate.
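A minimal PyTorch sketch of the two procedures might look as follows. Here `synth_ds`, `real_ds`, and `model` are placeholders, and the training loop is simplified compared to a real YOLOv3 pipeline:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def run_epochs(model, loader, lr, epochs=50):
    # Placeholder training loop; assumes the model returns its training
    # loss when given images and targets, as many YOLOv3 ports do.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, targets in loader:
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# --- Option 1, joint-training: concatenate real and synthetic data and
# shuffle, so every batch mixes both sources. The ratios in our tables
# (e.g. 3.0 / 1.0 for PD-K + KITTI) reflect the relative dataset sizes.
joint_loader = DataLoader(ConcatDataset([synth_ds, real_ds]),
                          batch_size=16, shuffle=True)
run_epochs(model, joint_loader, lr=1e-3)

# --- Option 2, fine-tuning: train purely on PD data first, then continue
# training all layers on the real dataset with a reduced learning rate.
run_epochs(model, DataLoader(synth_ds, batch_size=16, shuffle=True), lr=1e-3)
run_epochs(model, DataLoader(real_ds, batch_size=16, shuffle=True), lr=1e-4)
```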
When trained in this way, we see a clear performance boost from both techniques on each dataset. On KITTI, performance improves across all three classes, with the largest gain on Cyclists. While the overall boost shrinks with the larger amount of real-world data in nuImages, the gains on the Cyclist class remain significant. We also notice that fine-tuning outperforms joint-training when less labelled real-world data is available, while the reverse holds on the larger nuImages dataset.
Train | Validation | mAP | Cyclist AP | Car AP | Pedestrian AP |
---|---|---|---|---|---|
KITTI | KITTI | 53.51 | 34.52 | 80.40 | 45.62 |
PD-K + KITTI (3.0 / 1.0) | KITTI | 67.53 | 57.08 | 84.36 | 61.14 |
PD-K → KITTI | KITTI | 70.23 | 60.51 | 87.09 | 63.08 |
Train | Validation | mAP | Cyclist AP | Car AP | Pedestrian AP |
---|---|---|---|---|---|
nuImages | nuImages | 64.37 | 39.50 | 83.20 | 70.42 |
PD-N + nuImages (1.0 / 1.0) | nuImages | 68.53 | 52.11 | 82.86 | 70.63 |
PD-N → nuImages | nuImages | 66.26 | 49.41 | 80.74 | 68.64 |
Table 2: Performance on the KITTI and nuImages validation sets using the default training pipeline. Joint-training is indicated with a ‘+’ sign, fine-tuning with an arrow (→). In parentheses, we indicate the synthetic-to-real dataset ratio used for joint-training.
Another potential gain from using Parallel Domain synthetic data is reducing the amount of real-world data that needs to be collected to reach a given level of performance. In a further experiment, we found that we can reduce the amount of labelled real-world data by two thirds while maintaining performance similar to our baseline nuImages model when fine-tuning on nuImages with PD-N. Similarly, we can reduce the dataset size by at least 50% when using joint-training.
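The reduced-data setups amount to random subsampling before training; a small sketch, with hypothetical id lists:

```python
import random

def subsample(image_ids, fraction, seed=0):
    """Randomly keep a fixed fraction of a labelled dataset."""
    rng = random.Random(seed)
    return rng.sample(image_ids, k=int(len(image_ids) * fraction))

# Fine-tuning variant: keep ~1/3 of nuImages (about 22.5K of 67.3K images)
# before fine-tuning the PD-N-pretrained model on it.
# Joint-training variant (0.5 / 0.5): halve both nuImages and PD-N.
# `nuimages_train_ids` and `pd_n_ids` are hypothetical lists of image ids.
reduced_real = subsample(nuimages_train_ids, 1 / 3)
half_real = subsample(nuimages_train_ids, 0.5)
half_synth = subsample(pd_n_ids, 0.5)
```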
Train | Validation | mAP | Cyclist AP | Car AP | Pedestrian AP |
---|---|---|---|---|---|
nuImages | nuImages | 64.37 | 39.50 | 83.20 | 70.42 |
PD-N + nuImages (0.5 / 0.5) | nuImages | 64.00 | 45.73 | 79.99 | 66.28 |
PD-N → nuImages (22.5K) | nuImages | 63.24 | 46.09 | 77.93 | 65.69 |
Table 3: Performance on the nuImages validation set using the default training pipeline. Joint-training is indicated with a ‘+’ sign, fine-tuning with an arrow. In parentheses, we indicate either the synthetic-to-real dataset ratio for joint-training or the number of real training samples used for fine-tuning.
Adding synthetic data is not the only way to deal with class imbalance. A common approach is to oversample underrepresented classes in the existing dataset, assigning them more weight during training. We evaluated several such strategies; below, we detail an approach based on rebalancing image crops.
Instead of training on full images, we extract image crops around object instances, randomly selecting objects so that we approximate a uniform class distribution during training.
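A minimal sketch of this crop sampling, assuming each annotation carries an image id, class name, and bounding box (the helper and field names are hypothetical):

```python
import random
from collections import defaultdict

def sample_balanced_crops(annotations, num_crops, crop_size=416):
    """Sample object-centered crops with a roughly uniform class mix.

    Each annotation is assumed to be a dict with "image_id", "class_name",
    and "bbox" (x1, y1, x2, y2) keys; these field names are hypothetical.
    """
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["class_name"]].append(ann)
    classes = list(by_class)
    crops = []
    for _ in range(num_crops):
        cls = random.choice(classes)        # uniform over classes, not instances
        ann = random.choice(by_class[cls])
        x1, y1, x2, y2 = ann["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Center the crop on the object; clamping to image bounds is omitted.
        crops.append({
            "image_id": ann["image_id"],
            "crop_box": (cx - crop_size / 2, cy - crop_size / 2,
                         cx + crop_size / 2, cy + crop_size / 2),
        })
    return crops
```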
Crop-based rebalancing provides a significant boost on nuImages (64% mAP without vs. 75% with cropping), but fails to improve performance on KITTI (53% mAP without vs. 51% with cropping). As Table 4 below shows, adding synthetic data provides a further boost even on top of crop-based class rebalancing.
Train | Validation | mAP | Cyclist AP | Car AP | Pedestrian AP |
---|---|---|---|---|---|
KITTI | KITTI | 51.32 | 31.61 | 79.50 | 42.84 |
PD-K + KITTI (3.0 / 1.0) | KITTI | 63.07 | 51.64 | 80.46 | 57.12 |
PD-K → KITTI | KITTI | 68.06 | 55.23 | 86.43 | 62.52 |
nuImages | nuImages | 75.72 | 57.15 | 87.84 | 82.17 |
PD-N + nuImages (1.0 / 1.0) | nuImages | 78.27 | 64.44 | 88.05 | 82.32 |
PD-N → nuImages | nuImages | 77.44 | 62.40 | 87.88 | 80.74 |
Table 4: Performance on the KITTI and nuImages validation sets using cropping and class balancing. See Table 2 for the naming conventions.
Despite making up only 0.2% of miles travelled [5], cyclists accounted for 2.1% of all traffic fatalities in the US in 2017 [6]. Autonomous vehicles have the potential to make our roads safer for all road users, but modern perception models are often held back by the class imbalances found in real-world data. In this post, we showed how to significantly improve the ability of an off-the-shelf object detector to detect Cyclists by using PD synthetic data.
In conclusion, we showed three things:
- Incorporating PD synthetic data into training significantly improves an off-the-shelf detector, with the largest gains on the underrepresented Cyclist class.
- Synthetic data reduces how much real-world data needs to be labelled: we matched baseline performance with two thirds less real data when fine-tuning, and at least 50% less when joint-training.
- These gains persist even on top of class-rebalancing techniques such as crop-based oversampling.
If you’re interested in more detailed experimental results, check out our write-up on Weights & Biases. If you want to see how much Parallel Domain’s data can improve your models, contact us!
We plan to provide the data used in this experiment as an open dataset for non-commercial use. If you are interested in a commercial license for this data, or in a similar dataset built for your own sensors and ontology, contact us!