March 13 2023

Synthetic Data Best Practices for Perception Applications

Michael Galarnyk, Nate Cibik, Omar Maher, Phillip Thomas

There are many ways synthetic data can improve perception applications. The problem is that there are few widely known resources for learning how to get the best possible results from it. In this post, we'll share synthetic data best practices drawn from our work with many perception teams and from our internal research, covering:

  • Generating Good Synthetic Data
  • Training with Synthetic Data
  • Handling the Domain Gap

Generating Good Synthetic Data

In most cases, good synthetic data should visually resemble real-world sensor data and labels. This is important to ensure generalization. The data should reflect a similar distribution of locations, textures, lighting, backgrounds, objects, and agents (e.g., vehicles or pedestrians) that a model would encounter in real-world situations and your test sets. Below are three things we recommend perception teams consider.

Visual Discrepancies

Visual discrepancies between real and synthetic data don't always result in reduced training performance, but generally, the closer the match, the better the results. A quick visual check of the generated data can help catch these discrepancies early. This can be done using various tools or through our Parallel Domain web visualizer. The synthetic dataset should also be generated to match the environmental conditions of the real dataset (e.g., weather conditions and time of day).

Label Alignment

A synthetic dataset that boxes the rider and the bike separately, while the real dataset merges them into a single box, can lead to problems.

Before trying to align real and synthetic data labels, we recommend teams become as aware as possible of mistakes in their real data. This will make it easier to identify any post-processing required to align synthetic annotations with the labeling format of the target real dataset. It is important to note that problems can arise when synthetic data has more detailed annotations than the real data. Consider the bounding box example above: a synthetic dataset that boxes the rider and the bike separately, while the real dataset merges them into a single box, can lead to problems. Another issue we have seen is synthetic bounding boxes for objects that are far away, tiny, or heavily occluded, which human annotators would typically not label. A way around this is to filter the synthetic annotations so that they match the human annotations in the real dataset. For task-specific label alignment recommendations, check out the documentation.
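To make this concrete, here is a minimal sketch of what such filtering might look like, assuming boxes are stored as (x1, y1, x2, y2) pixel coordinates with per-box distance and occlusion metadata. The field names and thresholds are illustrative assumptions, not values from our pipeline; you would tune them to match your real dataset's labeling policy.

```python
# Minimal sketch: filter synthetic bounding boxes so they match the labeling
# policy of a human-annotated real dataset. Field names and thresholds are
# illustrative assumptions, not values from the post.

def box_area(box):
    """Area of an (x1, y1, x2, y2) pixel-coordinate box."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def keep_box(ann,
             min_area_px=400,        # drop tiny boxes humans would rarely label
             max_distance_m=150.0,   # drop objects far beyond the labeled range
             max_occlusion=0.75):    # drop boxes that are mostly hidden
    """Return True if a synthetic annotation matches the real-data policy."""
    if box_area(ann["box"]) < min_area_px:
        return False
    if ann.get("distance_m", 0.0) > max_distance_m:
        return False
    if ann.get("occlusion", 0.0) > max_occlusion:
        return False
    return True

def filter_annotations(frame_annotations):
    """Keep only the annotations a human annotator would plausibly have drawn."""
    return [ann for ann in frame_annotations if keep_box(ann)]

# Example usage
frame = [
    {"box": (10, 10, 60, 120), "distance_m": 20.0, "occlusion": 0.1},      # kept
    {"box": (500, 300, 508, 306), "distance_m": 210.0, "occlusion": 0.0},  # dropped
]
print(filter_annotations(frame))
```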

Ensure your Synthetic Data is Balanced 

To achieve the best results and avoid overfitting, aim for a balanced representation of all relevant classes and scenarios in your synthetic dataset. For example, suppose you are using synthetic data to target a specific edge case, such as emergency vehicles. In that case, it is crucial that not all of the synthetic scenes include the edge case. The emergency vehicle edge case can be represented in five out of ten images in the synthetic data, but it is important to ensure that the synthetic data also includes variety and non-edge-case scenes (e.g., scenes without emergency vehicles).
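A quick sanity check along these lines is to measure what fraction of your synthetic scenes actually contain the edge case before training. The sketch below assumes simple per-scene class metadata; the format and the 50% threshold are illustrative, not prescriptive.

```python
# Minimal sketch: check what fraction of synthetic scenes contain a target
# edge case (e.g., emergency vehicles). The metadata format and the 0.5
# threshold are illustrative assumptions.

def edge_case_fraction(scenes, target_class="emergency_vehicle"):
    """Fraction of scenes containing at least one instance of target_class."""
    hits = sum(1 for scene in scenes if target_class in scene["classes"])
    return hits / max(len(scenes), 1)

scenes = [
    {"id": 0, "classes": {"car", "pedestrian"}},
    {"id": 1, "classes": {"car", "emergency_vehicle"}},
    {"id": 2, "classes": {"bicycle"}},
    {"id": 3, "classes": {"car", "emergency_vehicle"}},
]

frac = edge_case_fraction(scenes)
print(f"{frac:.0%} of scenes contain the edge case")
if frac > 0.5:
    print("Consider adding more ordinary scenes to keep the set balanced.")
```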

Training with Synthetic Data 

To get the most out of model training, it is important to properly mix and sample your real and synthetic data, which you can learn about below.

How to Mix Datasets

The best way to mix synthetic data into your real data depends on three questions. 

  • Can you align the labels of your synthetic and real-world data? 
  • Do you have a larger synthetic dataset, a larger real dataset, or are they equally sized (roughly the same order of magnitude)? 
  • Is the synthetic dataset targeted at fixing performance on an edge case that is missing in your existing training set (Rare Case Dataset) or is your goal to get a little bit of everything (General Purpose Dataset)?

The table below provides a summary of the optimal training strategies that we have identified through our own research, as well as through observing the success of many of our customers.

Table of recommendations based on dataset size and type.

Some of our other findings include:

  • If your real training dataset is sufficiently large, the biggest improvement potential with synthetic data is on classes, actions, and scenes that are underrepresented in the real data. 
  • Unless your real dataset is small, we generally see more performance improvement when training with targeted rare-case synthetic datasets than with general-purpose ones.
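As a concrete starting point for the mixing strategies above, here is a minimal PyTorch sketch that draws a fixed share of each batch from synthetic data using a weighted sampler. The 80/20 real-to-synthetic split is an illustrative choice, not a recommendation from the table.

```python
# Minimal PyTorch sketch of mixing real and synthetic data at a fixed ratio.
# The 80/20 real/synthetic split is illustrative, not a recommendation.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_mixed_loader(real_ds, synth_ds, synth_share=0.2, batch_size=16):
    """Build a DataLoader drawing ~synth_share of samples from synthetic data."""
    combined = ConcatDataset([real_ds, synth_ds])  # real indices first, then synthetic

    # Per-sample weights: real samples share (1 - synth_share) of the probability
    # mass, synthetic samples share synth_share, regardless of dataset sizes.
    real_w = (1.0 - synth_share) / len(real_ds)
    synth_w = synth_share / len(synth_ds)
    weights = torch.tensor([real_w] * len(real_ds) + [synth_w] * len(synth_ds))

    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```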

Check out the training with synthetic data documentation for more specific training method recommendations. 

Ensure your Test Set is Balanced 

It is common knowledge that test sets should be balanced. However, in practice, we have learned that balancing is not straightforward and can go wrong in a number of ways. Sometimes teams simply don't have enough data. More commonly, teams are not aware of which dimensions they should balance over. For example, when a team is using synthetic data to focus on a specific edge case such as emergency vehicles, their original train/test split may not have accounted for the number of emergency vehicles. In that case, it might have made more sense to move the majority of the existing emergency vehicle samples to the test set so that any improvement from synthetic data was actually measurable.

Balanced test sets should include data representing the full range of variations in the real-world scenario, including a diverse range of object appearances, lighting conditions, backgrounds, weather conditions, object interactions, and times of day. Keeping these dimensions in mind helps ensure the test set contains a balanced distribution of the scenarios the model will be evaluated against.
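As an illustration of splitting along the dimension that matters, the sketch below sends most rare-case frames to the test set rather than splitting purely at random. The metadata layout and split fractions are illustrative assumptions, not a recipe from our research.

```python
# Minimal sketch: split frames so that a rare edge case is well represented in
# the test set. Metadata layout and split fractions are illustrative.
import random

def split_by_edge_case(frames, target_class="emergency_vehicle",
                       test_share_rare=0.7, test_share_common=0.1, seed=0):
    """Send most rare-case frames to the test set; split the rest as usual."""
    rng = random.Random(seed)
    train, test = [], []
    for frame in frames:
        is_rare = target_class in frame["classes"]
        share = test_share_rare if is_rare else test_share_common
        (test if rng.random() < share else train).append(frame)
    return train, test
```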

Data Sampling

Our synthetic data significantly improves performance on rare classes such as bicycles with no changes to model architecture (image source).

From our internal research as well as our work with perception teams, we have found that data sampling can be used to ensure even distributions of different attribute classes. For tasks like optical flow, we have found that sampling based on the magnitude of flow vectors helps models generalize noticeably better. Something similar can be useful for depth estimation, where one could sample based on surface normals and distance values. For classification tasks, to ensure a uniform distribution, sample the classes that have been shown to the model least often so far. In our cyclist detection blog post, this was accomplished by always cropping around the rarest class in a given image. If you would like to learn more about data sampling, cropping, and training optical flow models, check out our documentation.
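To illustrate the rarest-class-first idea, here is a minimal sketch that keeps running counts of how often each class has been sampled and crops around an instance of the least-seen class in the current image. The metadata fields and crop logic are simplified assumptions rather than the exact approach from the cyclist detection post.

```python
# Minimal sketch of rarest-class-first sampling: track how often each class has
# been shown, then crop around an instance of the rarest class present in the
# current image. Field names and crop logic are illustrative assumptions.
from collections import Counter

class RarestClassCropper:
    def __init__(self, crop_size=512):
        self.counts = Counter()
        self.crop_size = crop_size

    def pick_class(self, classes_in_image):
        """Choose the class seen least often so far among those present."""
        return min(classes_in_image, key=lambda c: self.counts[c])

    def crop_window(self, annotations):
        """Return a crop window centered on an instance of the rarest class."""
        target = self.pick_class({a["class"] for a in annotations})
        self.counts[target] += 1
        box = next(a["box"] for a in annotations if a["class"] == target)
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        half = self.crop_size / 2
        # Clamping the window to image bounds is omitted for brevity.
        return int(cx - half), int(cy - half), int(cx + half), int(cy + half)
```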

Handling the Domain Gap

Label alignment is an important approach to handling the domain gap. To maximize your model's performance with synthetic data, you should also consider other aspects of the domain gap, such as the image fidelity gap. Below are some techniques and resources we have found useful in addressing it.

Domain Transfer via Multi-task Learning and Geometric Priors

The Toyota Research Institute has published multiple papers (1, 2) using our data. While these papers focus on unsupervised learning when no labeled real data is available, the general approach should be applicable to other settings as well. The common idea in both papers is that, instead of training a model directly on the task of interest, you train an auxiliary task in a self-supervised manner that can be solved with real data alone; for example, depth estimation can serve as a stepping stone toward semantic segmentation. If you'd like to learn more, check out this section of our documentation on domain adaptation.
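To give a rough sense of the shared-encoder idea, here is a minimal PyTorch sketch with a self-supervised depth head alongside a segmentation head. The layer sizes and training notes are illustrative, and the actual methods in the papers are considerably more involved.

```python
# Minimal PyTorch sketch of a shared encoder with two heads: a depth head that
# can be trained self-supervised on real images, and a segmentation head
# trained on synthetic labels. Sizes and losses are illustrative only.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(64, 1, 1)          # per-pixel depth
        self.seg_head = nn.Conv2d(64, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        feats = self.encoder(x)
        return self.depth_head(feats), self.seg_head(feats)

# Training idea (sketch-level): on real batches, update only through the
# self-supervised depth loss; on synthetic batches, update through both the
# depth loss and the supervised segmentation loss, so the shared encoder
# learns features that transfer across domains.
```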

Style Transfer

Style transfer using Fourier transform (image source).

An exciting area of research focuses on matching the visual appearance of target domain images. This includes GAN-based approaches from the style transfer literature as well as non-model-based approaches that focus on matching color value statistics, such as histogram matching and Fourier-based transforms.

We have compared GAN-based approaches with statistical approaches like histogram matching and found that they all yield similar results. Matching color statistics is the most stable approach and the simplest to implement. In other words, even though photorealism-enhancement approaches look impressive, they are less stable and more complicated while delivering only similar results.
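As an example of how simple the statistical route can be, here is a minimal sketch that matches the color statistics of a synthetic frame to a real reference image using scikit-image's histogram matching. The file paths are placeholders, and the channel_axis argument assumes a recent scikit-image release.

```python
# Minimal sketch: match the color statistics of a synthetic image to a real
# reference image using histogram matching. File paths are placeholders.
import numpy as np
from skimage import io
from skimage.exposure import match_histograms

synthetic = io.imread("synthetic_frame.png")   # placeholder path
real_ref = io.imread("real_reference.png")     # placeholder path

# Match each color channel of the synthetic image to the real reference.
matched = match_histograms(synthetic, real_ref, channel_axis=-1)
io.imsave("synthetic_frame_matched.png", matched.astype(np.uint8))
```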

Conclusion

When training perception models with synthetic data, there are several best practices you should consider to achieve optimal results. These include:

  • Properly align labels; getting this right generally gives you the basis for good results.
  • Generate balanced synthetic data with minimal visual discrepancies.
  • Identify the right training strategy based on label alignment feasibility, the sizes and types of datasets, and the specific use case of interest.
  • Consider reducing the domain gap using style transfer. 

Hopefully, these practices help you best utilize your data. If you want to stay updated with the latest synthetic data best practices, check out our documentation. If you’d like to try out a sample of synthetic data, download our open dataset. Lastly, if you are a perception team ready to break free from the constraints of real data – let’s have a conversation!
