How to best train with synthetic data

The best plan for training with synthetic data can be decided by answering three questions:

  • Is it possible to align the synthetic and real data labels?
  • How do the real and synthetic datasets compare in size? Do you have a larger synthetic dataset, a larger real dataset, or are they equally sized (roughly the same order of magnitude)?
  • Is the PD dataset targeted at fixing performance on an edge case that is missing in your existing training set (rare case dataset), or is the content similar to your existing dataset (general purpose dataset)?

Unless your real dataset is small, we generally see more performance improvement when training with targeted rare-case synthetic datasets. If your existing training dataset is sufficiently large, the clearest benefit of synthetic data is usually improved performance on classes, actions, and scenes that are underrepresented in the real data.

How to mix datasets

Pre-training on synthetic data and then fine-tuning on real samples improves performance in most use cases, and is a good way to establish a baseline before running additional experiments. These preliminary pre-training + fine-tuning experiments can be carried out without aligning labels or image spaces, but if the number of classes in the source and target datasets differs, you must reset or reshape the final layers of your model before fine-tuning.
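As a sketch of the reset/reshape step, here is a minimal PyTorch-style example. All names and layer sizes are illustrative assumptions: the "backbone" stands in for a model pre-trained on a 12-class synthetic label space, while the real dataset has 8 classes.

```python
import torch.nn as nn

# Illustrative stand-in for a backbone pre-trained on synthetic data
# with a 12-class head (all sizes here are hypothetical).
model = nn.Sequential(
    nn.Linear(256, 128),  # "backbone" features
    nn.ReLU(),
    nn.Linear(128, 12),   # head sized for the synthetic label space
)

# The real dataset has a different number of classes (8), so the
# final layer is replaced with a freshly initialized one before fine-tuning.
model[-1] = nn.Linear(128, 8)

# Optionally freeze the pre-trained backbone for the first fine-tuning epochs.
for p in model[0].parameters():
    p.requires_grad = False
```

A CNN backbone works the same way: swap the classifier head for one sized to the real label space and keep the pre-trained feature extractor.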

Depending on your answers to the three questions in the previous section, we recommend the following:

Summary Table (details below)

| Label Alignment | Dataset Sizes | Dataset Type | Joint Train or Pretrain + Finetune? | Data Sampling? |
|---|---|---|---|---|
| Aligned label space | Larger synthetic | General | Pretrain + Finetune | Yes |
| Aligned label space | Larger synthetic | Rare | Pretrain + Finetune | Yes |
| Aligned label space | Equal size | General | Joint Training | Optional |
| Aligned label space | Equal size | Rare | Joint Training | Yes |
| Aligned label space | Larger real | General | Not recommended | Not recommended |
| Aligned label space | Larger real | Rare | Joint Training | Yes |
| Unaligned label space | Doesn't matter | Doesn't matter | Pretrain + Finetune | Yes |

General Recommendations

  • If the model architecture allows it, train on image crops instead of full frames. This helps the model generalize and facilitates domain transfer by reducing divergence in background classes. While the features of interest, e.g. emergency vehicles, are well represented in PD data, there is often a larger domain gap in the environments they are placed in.

  • In joint training with equally sized datasets, it is often better to oversample the real data so that the model sees real data more often. We have found the best real/synthetic ratios to be between 75%/25% and 50%/50%.

  • Changes that improve performance on synthetic-only models generally translate to jointly trained models as well.

  • Use Data Sampling to:

    • tackle class imbalance in classification tasks. Using cropping allows for better class balancing as it can be done on a per-instance basis instead of a per-frame basis, and crops can be targeted to support an even distribution.

    • ensure content diversity during training. This can be done by either using existing Labels (like bounding boxes or semantic segmentation), or just by using knowledge about the content of certain scenes in your dataset.

  • We have found that the harder/denser a task is, the more domain-adaptation work is required. In our experience, a task like 2D Bounding Box Detection works well by simply mixing datasets (see our blog post on cyclist detection). Harder tasks like semantic segmentation require more domain adaptation, using the approaches we list on the next page under Tackling the Render Gap.
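The real/synthetic over-sampling ratio recommended above can be sketched as a simple batch-composition helper. This is a minimal sketch: `mixed_batch` and its parameters are hypothetical names, not an API from this guide.

```python
import random

def mixed_batch(real, synthetic, batch_size=8, real_fraction=0.75):
    """Compose one training batch that over-samples real data.

    real_fraction=0.75 reflects the 75%/25% real/synthetic ratio
    suggested above; tune it between 0.5 and 0.75 per the text.
    """
    n_real = round(batch_size * real_fraction)
    n_synthetic = batch_size - n_real
    # Sample with replacement from each pool, then shuffle the batch.
    batch = random.choices(real, k=n_real) + random.choices(synthetic, k=n_synthetic)
    random.shuffle(batch)
    return batch
```

In a real pipeline the same ratio is usually enforced by the data loader's sampler rather than by hand, but the effect is identical: the model sees real samples more often than synthetic ones.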

Specific Recommendations

  1. Larger Synthetic + General Purpose Dataset
    Pre-train with synthetic dataset and fine-tune on real. In this case, your model can learn most from an evenly balanced PD dataset and then use a real dataset to fine-tune to the target domain. If your existing dataset is very unbalanced, we recommend doing data sampling during fine-tuning to make sure rare classes are not unlearned.

  2. Larger Synthetic + Rare Case Dataset
    If your existing real dataset is small, we generally recommend training with a well-balanced general purpose dataset instead (see the previous case). If you nevertheless have a large rare-case synthetic dataset and only a small amount of real data, pre-training on synthetic and fine-tuning on real will give you the best results. Data sampling will help avoid over-fitting to the rare classes.

  3. Equally Sized + General Purpose Dataset
    Jointly train on both datasets with equal batch splits.
    As mentioned above, you are more likely to get good results by adding synthetic datasets targeted at specific edge cases to your training.

  4. Equally Sized + Rare Case Dataset
    Jointly train on both datasets. To ensure you don't bias your model too much toward the over-sampled rare class, we recommend applying data sampling in this case. Aiming for an even distribution among classes should help the most.

  5. Larger Real + General Purpose Dataset
    We don't recommend this type of mix, since it's likely that the resulting joint dataset would only gain information that was already present, and hence models won't have new complementary information to learn from.

  6. Larger Real + Rare Case Dataset
    Jointly train with data sampling. Make sure the synthetic data does not get drowned out by the much larger real dataset: as in the equally sized case, apply data sampling so the predominant classes in the real dataset don't outweigh the rare synthetic class.
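A common way to implement the data sampling recommended in cases 4 and 6 is inverse-frequency sample weighting, which evens out the class distribution seen during training. This is a minimal sketch; the function name is ours.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights so each class contributes equally overall.

    Feed these weights to a weighted sampler (e.g. PyTorch's
    WeightedRandomSampler) so rare classes are drawn as often
    as predominant ones.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]
```

With these weights, a class with 8 samples and a class with 2 samples each receive the same total sampling mass, so the rare synthetic class is not outweighed by the predominant real classes.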

Data Sampling

Data sampling is a family of techniques that aim to ensure even distributions of different attribute classes during training to avoid biasing the model to the existing distribution in your dataset. For classification tasks, this is usually done by sampling rare class examples. In our work on cyclist detection, this was done by always cropping around the rarest class in a given image.

For other tasks like depth estimation or optical flow, data sampling can be done by ensuring even distribution between day/night scenes or urban/suburban and highway scenes, or any other axis of diversity in the data.

Alternatively, if you have a PD dataset that contains a very rare class, you can make sure that those scenes make up x% of your training batches (x depends on your dataset sizes and on how important the rare class is to your task).
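For attribute-level balance (day/night, urban/highway), a simple round-robin sampler over attribute groups guarantees each group is seen equally often. This is a sketch under the assumption that you can bucket scenes by attribute up front; the function name is ours.

```python
import itertools

def round_robin(groups):
    """Yield samples by cycling evenly across attribute groups.

    groups: dict mapping an attribute value (e.g. "day", "night")
    to the list of samples carrying that attribute.
    """
    cycles = [itertools.cycle(samples) for samples in groups.values()]
    # Visit each group in turn, forever; smaller groups repeat more often.
    for group_cycle in itertools.cycle(cycles):
        yield next(group_cycle)
```

Note the trade-off: samples in small groups (here, night scenes) are repeated more often, which is exactly the over-sampling behavior data sampling is meant to produce.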

Ensure your test set is balanced

Evaluating the model on a balanced test set measures its performance comprehensively and helps identify biases or weaknesses.

A balanced test set should include data that represents the full range of variations in the real-world scenario: a diverse range of object appearances, lighting conditions, backgrounds, weather conditions, object interactions, and times of day. This ensures the test data covers a balanced distribution of the scenarios you want to evaluate the model against.
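To make biases visible, report metrics per condition rather than a single aggregate number. Below is a minimal sketch with a hypothetical record format of (condition, was_prediction_correct) pairs.

```python
from collections import defaultdict

def per_condition_accuracy(records):
    """Accuracy broken down by test condition (weather, time of day, ...)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, correct in records:
        totals[condition] += 1
        hits[condition] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}
```

A model can score well in aggregate while failing badly at night or in rain; the per-condition breakdown surfaces exactly that weakness.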

Ensure your synthetic data is balanced

The same applies to your synthetic data: when using synthetic data to train perception models, it's important to balance it to avoid overfitting. If you are using synthetic data to focus on a specific edge case, such as emergency vehicles, it's crucial not to have all of the synthetic scenes include that edge case.

Having a disproportionate amount of data for one particular class or scenario can cause the model to overfit on that case and decrease accuracy in detecting other classes or scenarios. To achieve the best results, aim for a balanced representation of all relevant classes and scenarios in your synthetic data set. You can have the desired edge case represented by 5X or more in synthetic vs. real data, but you want to ensure that the former includes some variety and non-edge case scenes as well.
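A quick sanity check on a synthetic dataset is to flag any class or scenario that dominates the scene list. The 0.5 threshold below is an arbitrary assumption, and the function name is ours; pick a threshold that suits your task.

```python
from collections import Counter

def overrepresented(scene_labels, max_share=0.5):
    """Return classes whose share of scenes exceeds max_share."""
    counts = Counter(scene_labels)
    total = len(scene_labels)
    return sorted(label for label, n in counts.items() if n / total > max_share)
```

Running this over per-scene edge-case labels tells you at a glance whether the edge case has crowded out the variety and non-edge-case scenes the dataset also needs.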