Michael Galarnyk, Nate Cibik, Omar Maher, Phillip Thomas
Synthetic data can improve perception applications in many ways. The problem is that there are few widely known resources on how to get the best possible results when using it. In this post, we’ll share synthetic data best practices learned from working with many perception teams and from our internal research, covering how to align synthetic data with real data, how to mix and sample it during training, and how to close the remaining domain gap.
In most cases, good synthetic data should visually resemble real-world sensor data and labels; this is important for generalization. The data should reflect a distribution of locations, textures, lighting, backgrounds, objects, and agents (e.g., vehicles or pedestrians) similar to what a model would encounter in real-world situations and in your test sets. Below are three things we recommend perception teams consider.
Visual discrepancies between real and synthetic data don’t always reduce training performance, but generally, the closer the match, the better the results. A quick visual check of the generated data helps catch obvious mismatches; this can be done with various tools or through our Parallel Domain web visualizer. The synthetic dataset should also be generated to match the environmental conditions of the real dataset (e.g., weather conditions and time of day).
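As a quick complement to eyeballing images, a rough statistical check can flag large appearance differences early. The sketch below (not Parallel Domain tooling; folder paths, file extension, and the 100-image cap are illustrative assumptions) compares per-channel color statistics between a folder of real images and a folder of synthetic images.

```python
# Rough sanity check: compare per-RGB-channel mean/std between real and
# synthetic image folders. Paths and limits are placeholders for your own data.
from pathlib import Path

import numpy as np
from PIL import Image


def channel_stats(image_dir: Path, max_images: int = 100):
    """Return per-channel mean and std over up to max_images images."""
    pixels = []
    for path in sorted(image_dir.glob("*.png"))[:max_images]:
        img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
        pixels.append(img.reshape(-1, 3))
    stacked = np.concatenate(pixels, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)


real_mean, real_std = channel_stats(Path("data/real/images"))
syn_mean, syn_std = channel_stats(Path("data/synthetic/images"))
print("real  mean/std:", real_mean, real_std)
print("synth mean/std:", syn_mean, syn_std)
```

Large gaps in these statistics usually show up as an obvious tint or exposure difference when you view the images side by side.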
Before trying to align real and synthetic data labels, we recommend teams become as aware as possible of mistakes in their real data. This will make it easier to identify any post-processing required to align synthetic data annotations with the labeling format of the target real dataset. It is important to note that problems can arise when synthetic data has more detailed annotations than real data. Consider the bounding box example above: if the synthetic dataset draws separate boxes for the rider and the bike while the real dataset merges them into a single box, training can suffer. Another problem we have seen is synthetic bounding boxes on objects that are far away, tiny, or heavily occluded; a way around this is to filter those boxes so the synthetic annotations match what human annotators labeled in the real dataset. For task-specific label alignment recommendations, check out the documentation.
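The sketch below shows one way such post-processing might look for 2D boxes: drop boxes a human annotator would not have labeled, then merge rider and bicycle boxes into a single box. The field names (`box`, `label`, `occlusion`, `group_id`), class names, and thresholds are illustrative assumptions, not a specific dataset schema.

```python
# Minimal sketch of aligning synthetic 2D box annotations with a real dataset's
# labeling policy. Thresholds and annotation fields are illustrative.

MIN_BOX_AREA_PX = 15 * 15   # drop boxes smaller than annotators would label
MAX_OCCLUSION = 0.8         # drop boxes that are more than 80% occluded


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) covering both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def align_annotations(annotations):
    """annotations: list of dicts with 'box', 'label', 'occlusion', 'group_id'."""
    # 1. Filter boxes that would not appear in the human-labeled real data.
    kept = [
        a for a in annotations
        if (a["box"][2] - a["box"][0]) * (a["box"][3] - a["box"][1]) >= MIN_BOX_AREA_PX
        and a["occlusion"] <= MAX_OCCLUSION
    ]
    # 2. Merge rider + bicycle pairs into a single "cyclist" box, mirroring a
    #    real dataset that labels them as one object.
    merged, used = [], set()
    for i, a in enumerate(kept):
        if i in used:
            continue
        if a["label"] in ("rider", "bicycle"):
            partner = next(
                (j for j, b in enumerate(kept)
                 if j != i and j not in used
                 and b["group_id"] == a["group_id"]
                 and b["label"] in ("rider", "bicycle")),
                None,
            )
            if partner is not None:
                used.update({i, partner})
                merged.append({"label": "cyclist",
                               "box": union_box(a["box"], kept[partner]["box"])})
                continue
        used.add(i)
        merged.append(a)
    return merged
```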
To achieve the best results and avoid overfitting, aim for a balanced representation of all relevant classes and scenarios in your synthetic dataset. For example, suppose you are using synthetic data to target a specific edge case, such as emergency vehicles. In this scenario, it is crucial not to have every synthetic scene include that edge case. Emergency vehicles might appear in, say, five out of ten synthetic images, but the set should also include varied, non-edge-case scenes (e.g., scenes without emergency vehicles).
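One simple way to enforce this might be to cap the fraction of edge-case scenes when assembling the synthetic training list, as in the hedged sketch below; the function name, scene lists, and the 50% default are assumptions for illustration.

```python
# Minimal sketch: keep edge-case scenes (e.g. with emergency vehicles) to at
# most a target fraction of the synthetic training set by padding with
# ordinary scenes. Inputs are plain lists of scene identifiers.
import random


def build_scene_list(edge_case_scenes, regular_scenes, edge_case_fraction=0.5, seed=0):
    rng = random.Random(seed)
    n_edge = len(edge_case_scenes)
    # Number of regular scenes needed so edge cases are at most the target fraction.
    n_regular_needed = int(n_edge * (1 - edge_case_fraction) / edge_case_fraction)
    regular = rng.sample(regular_scenes, min(n_regular_needed, len(regular_scenes)))
    scenes = list(edge_case_scenes) + regular
    rng.shuffle(scenes)
    return scenes
```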
To get the most out of model training, it is important to mix and sample your real and synthetic data properly, which you can learn about below.
The best way to mix synthetic data into your real data depends on three questions.
The table below provides a summary of the optimal training strategies that we have identified through our own research, as well as through observing the success of many of our customers.
Some of our other findings include:
Check out the training with synthetic data documentation for more specific training method recommendations.
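As a rough illustration of what mixing can look like in practice, the PyTorch sketch below draws each training batch from a combined pool of real and synthetic samples at a chosen ratio. It is one possible setup, not a prescribed recipe; `real_dataset`, `synthetic_dataset`, and the 70/30 default ratio are assumptions standing in for your own datasets and tuning.

```python
# Minimal sketch of mixing real and synthetic data at a target ratio per epoch
# using a weighted sampler over the concatenated datasets.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler


def mixed_loader(real_dataset, synthetic_dataset, real_fraction=0.7,
                 batch_size=16, num_samples=10_000):
    combined = ConcatDataset([real_dataset, synthetic_dataset])
    # Per-sample weights so that, in expectation, `real_fraction` of each batch
    # comes from the real dataset and the rest from the synthetic dataset.
    real_w = real_fraction / len(real_dataset)
    syn_w = (1.0 - real_fraction) / len(synthetic_dataset)
    weights = torch.tensor([real_w] * len(real_dataset) + [syn_w] * len(synthetic_dataset))
    sampler = WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```

Sampling with replacement keeps the ratio stable even when the two datasets differ greatly in size, at the cost of repeating some samples within an epoch.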
It is common knowledge that test sets should be balanced. In practice, however, we have learned that balancing is not straightforward and can go wrong in a number of ways. Sometimes teams simply do not have enough data. More commonly, teams are unaware of which dimensions they should balance over. For example, when a team uses synthetic data to focus on a specific edge case like emergency vehicles, their original train/test split may not have been stratified by the number of emergency vehicles. In that case, it might have made more sense to move the majority of the existing emergency vehicle samples over to the test set to ensure that any improvement from synthetic data was actually measurable.
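A hedged sketch of such a re-split is shown below: frames containing the attribute of interest are concentrated in the test set, while a smaller share of ordinary frames keeps the test set representative. The fractions and the `has_attribute` predicate are illustrative assumptions.

```python
# Minimal sketch: re-split a real dataset so that frames with a target
# attribute (e.g. "contains an emergency vehicle") mostly land in the test set.
import random


def split_by_attribute(frames, has_attribute, test_pos_fraction=0.8,
                       test_neg_fraction=0.2, seed=0):
    """frames: list of frame records; has_attribute: frame -> bool."""
    rng = random.Random(seed)
    positives = [f for f in frames if has_attribute(f)]
    negatives = [f for f in frames if not has_attribute(f)]
    rng.shuffle(positives)
    rng.shuffle(negatives)
    n_pos = int(len(positives) * test_pos_fraction)   # most positives go to test
    n_neg = int(len(negatives) * test_neg_fraction)   # a slice of negatives too
    test = positives[:n_pos] + negatives[:n_neg]
    train = positives[n_pos:] + negatives[n_neg:]
    return train, test
```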
Balanced test sets should include data representing the full range of variation in the real-world scenario, including a diverse range of object appearances, lighting conditions, backgrounds, weather conditions, object interactions, and times of day. Accounting for these dimensions helps ensure the test set contains a balanced distribution of the scenarios the model will be evaluated against.
From our internal research as well as our work with perception teams, we have found that data sampling can be used to ensure even distributions of different attribute classes. For tasks like optical flow, we have found that sampling based on the magnitude of flow vectors helps models generalize much better. Something similar could be useful for depth estimation, where one could sample based on surface normals and distance values. For classification tasks, sampling the classes the model has seen least so far helps ensure a uniform distribution. In our cyclist detection blog post, this was accomplished by always cropping around the rarest class in a given image. If you would like to learn more about data sampling, cropping, and training optical flow models, check out our documentation.
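The sketch below illustrates the "crop around the rarest class" idea in a generic form; it is not the implementation from the cyclist detection post, and the annotation fields, crop size, and running counter are assumptions for illustration.

```python
# Minimal sketch: pick the object whose class has been seen least so far and
# centre a fixed-size crop on it, so rare classes appear more often in training
# crops. Assumes images are at least crop_size pixels in each dimension.
from collections import Counter


class RarestClassCropper:
    def __init__(self, crop_size=512):
        self.crop_size = crop_size
        self.seen = Counter()   # running count of crops per class

    def crop_window(self, annotations, image_width, image_height):
        """annotations: non-empty list of dicts with 'label' and 'box' (x1, y1, x2, y2)."""
        # Choose the object whose class has been shown to the model least so far.
        target = min(annotations, key=lambda a: self.seen[a["label"]])
        self.seen[target["label"]] += 1
        x1, y1, x2, y2 = target["box"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        half = self.crop_size / 2
        # Clamp the crop window to the image bounds.
        left = int(min(max(cx - half, 0), image_width - self.crop_size))
        top = int(min(max(cy - half, 0), image_height - self.crop_size))
        return left, top, left + self.crop_size, top + self.crop_size
```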
Label alignment is one important way of handling the domain gap. To maximize your model’s performance with synthetic data, you should also consider other aspects of the domain gap, such as the image fidelity gap. Below are some techniques and resources we have found useful in addressing it.
The Toyota Research Institute has published multiple papers (1, 2) using our data; while they focus on unsupervised learning when no real labels are available, the general approach should be applicable to other settings as well. The common idea in both papers is that, instead of training a model directly on the task of interest, you can use an auxiliary training task that can be solved in a self-supervised manner with only real data, for example using depth estimation as a step toward semantic segmentation. If you would like to learn more, check out this section of our documentation on domain adaptation.
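To make the idea concrete, here is a deliberately tiny PyTorch skeleton of how such a setup might be wired. It is an assumption about the general pattern, not the architecture from the cited papers: a shared encoder feeds a depth head trained self-supervised on real images and a segmentation head trained with labels that only exist for synthetic images, so the encoder still learns from real pixels.

```python
# Conceptual skeleton: shared encoder, one self-supervised head for real data,
# one supervised head for synthetic data. Layers are placeholders.
import torch
import torch.nn as nn


class SharedBackboneModel(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(64, 1, 1)          # self-supervised on real images
        self.seg_head = nn.Conv2d(64, num_classes, 1)  # supervised on synthetic labels

    def forward(self, images):
        feats = self.encoder(images)
        return self.depth_head(feats), self.seg_head(feats)


# Per-batch training step (loss functions below are hypothetical placeholders):
# depth_pred_real, _ = model(real_images)
# loss_real = self_supervised_depth_loss(depth_pred_real, real_images)  # e.g. photometric
# _, seg_pred_syn = model(synthetic_images)
# loss_syn = nn.functional.cross_entropy(seg_pred_syn, synthetic_labels)
# (loss_real + loss_syn).backward()
```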
An exciting area of research focuses on matching the visual appearance of target-domain images. This includes GAN-based approaches from the style transfer literature as well as non-model-based approaches that match color value statistics.
We have compared GAN-based approaches with statistical approaches like histogram matching and found that they all yield similar results. Matching color statistics is the most stable approach and the simplest to implement. In other words, even though a method like Enhancing Photorealism Enhancement looks impressive, it is less stable and more complicated while delivering similar results.
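For reference, here is a minimal sketch of histogram matching with scikit-image. The file names are placeholders, and matching each synthetic frame against a single real reference frame is a simplification; in practice you might match against statistics pooled over the whole real dataset.

```python
# Minimal sketch: match the color distribution of a synthetic image to a real
# reference image with per-channel histogram matching.
import numpy as np
from PIL import Image
from skimage.exposure import match_histograms

synthetic = np.asarray(Image.open("synthetic_frame.png").convert("RGB"))
reference = np.asarray(Image.open("real_frame.png").convert("RGB"))

# channel_axis=-1 matches each RGB channel independently.
matched = match_histograms(synthetic, reference, channel_axis=-1)
Image.fromarray(matched.astype(np.uint8)).save("synthetic_frame_matched.png")
```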
When training perception models with synthetic data, there are several best practices to consider for optimal results: match the visual appearance and environmental conditions of your real data, align labels, balance edge cases, mix and sample real and synthetic data thoughtfully, keep test sets balanced, and reduce the remaining domain gap.
Hopefully, these practices help you get the most out of your data. If you want to stay updated with the latest synthetic data best practices, check out our documentation. If you’d like to try out a sample of synthetic data, download our open dataset. Lastly, if you are a perception team ready to break free from the constraints of real data, let’s have a conversation!