The Parallel Domain Team
Pedestrians are among the most vulnerable users of our road systems. Despite years of progress in automotive safety, pedestrian fatalities around the world remain alarmingly high. In the United States, pedestrian fatalities have risen since 2009, averaging ~6,500 deaths per year over the most recent five years (2017 to 2021) reported by the National Center for Statistics and Analysis (NCSA), an office of the National Highway Traffic Safety Administration (NHTSA). Over the same period, pedestrians accounted for ~17% of total traffic fatalities (NCSA). These numbers are not just statistics; they’re a call to action for developers of Advanced Driver Assistance Systems (ADAS) and autonomous vehicles.
Recognizing the urgency, NHTSA finalized a rule in 2024 mandating that all new passenger cars and light trucks be equipped with automatic emergency braking (AEB) systems, including pedestrian AEB, by September 2029. The message is clear: pedestrian protection is no longer optional. It is becoming a regulatory requirement, and perception systems must be tested rigorously to ensure they perform reliably in the real world.
But how can teams meet this demand at scale? Real-world testing is expensive, slow, and often fails to capture rare but critical edge cases. Simulation provides a powerful alternative, but only if it can be trusted. In this blog post, we put PD Replica Sim to the test by benchmarking it against real-world data from the Waymo Open Dataset. We focused on pedestrian detection using multiple model architectures and industry-standard metrics to answer a fundamental question: How closely does performance in Parallel Domain Replica Sim align with the real world?
Our results show that pedestrian detection models evaluated on PD Replica Sim performed within a 10% margin of their performance on real-world data. This 10% sim-to-real gap is the largest delta between perception performance on real data and on PD Replica Sim data across the three perception models we tested (Dino r50, GLIP, and Yolo v8). A lower gap is better: it indicates a smaller difference between how a perception model performs on real-world data and how it performs on PD Replica Sim data.
This finding reinforces the value of simulation for validating safety-critical perception systems and supporting the rollout of next-generation ADAS technologies. This is especially powerful because simulations can be run in parallel, allowing developers to test hundreds or thousands of times faster and at a larger scale than real-world testing.
For each off-the-shelf perception model, we define this sim-to-real gap as the average of the absolute deltas in F1-score, PR-AUC, and mAP between real-world and PD Replica data. This extends our prior research, which relied on mAP scores alone for reporting class performance. Below are more details on the datasets, performance metrics, and results.
We extended our research methodology both in the number of metrics we evaluate and in the number of models we test against. We evaluated models using three key metrics:

- mAP (mean Average Precision)
- PR-AUC (area under the precision-recall curve), computed over confidence thresholds from 0.3 to 1.0
- F1-score
Using those metrics, weighted equally, we then averaged the absolute deltas to calculate an overall sim-to-real gap, as shown in the equation below.
Let:

mAP_Real = mAP on real-world data
mAP_PD = mAP on PD Replica data
AUC_Real = PR-AUC on real-world data, over confidence thresholds from 0.3 to 1.0
AUC_PD = PR-AUC on PD Replica data, over confidence thresholds from 0.3 to 1.0
F1_Real = F1-score on real-world data
F1_PD = F1-score on PD Replica data

Then:

Overall Sim-to-Real Gap = ( |mAP_Real − mAP_PD| + |AUC_Real − AUC_PD| + |F1_Real − F1_PD| ) / 3
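For teams reproducing this calculation, here is a minimal Python sketch of the gap computation; the metric values in the example are placeholders, not results from our evaluation.

```python
def overall_sim_to_real_gap(map_real, map_pd, auc_real, auc_pd, f1_real, f1_pd):
    """Average of the absolute per-metric deltas between real-world and PD Replica scores."""
    deltas = [abs(map_real - map_pd), abs(auc_real - auc_pd), abs(f1_real - f1_pd)]
    return sum(deltas) / len(deltas)

# Placeholder scores, for illustration only (not measured results).
gap = overall_sim_to_real_gap(
    map_real=0.62, map_pd=0.55,
    auc_real=0.70, auc_pd=0.64,
    f1_real=0.68, f1_pd=0.63,
)
print(f"Overall sim-to-real gap: {gap:.3f}")  # 0.060
```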
We curated 26 residential scenes from the Waymo Open Dataset validation set, each containing 2 to 20 pedestrians. Scenes were selected to align closely with the visual and contextual complexity of PD Replica environments. Only front-camera views were used, and we sampled 8 to 10 frames per scene, resulting in 245 frames total.
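As a rough illustration of that curation step, the sketch below filters scenes by pedestrian count and samples front-camera frames; `iter_validation_scenes`, `is_residential`, and `count_pedestrians` are hypothetical helpers (not part of the Waymo Open Dataset API), standing in for the actual scene-parsing code.

```python
import random

def curate_scenes(iter_validation_scenes, is_residential, count_pedestrians,
                  frames_per_scene=(8, 10), max_scenes=26):
    """Select residential scenes containing 2-20 pedestrians and sample front-camera frames."""
    curated = []
    for scene in iter_validation_scenes():
        if not is_residential(scene):
            continue
        if not 2 <= count_pedestrians(scene) <= 20:
            continue
        # `scene.front_camera_frames` is likewise a placeholder attribute.
        frames = list(scene.front_camera_frames)
        k = min(len(frames), random.randint(*frames_per_scene))
        curated.append(random.sample(frames, k))
        if len(curated) == max_scenes:
            break
    return curated
```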
Using PD Replica, we generated 30 non-sequential frames from 8 different Replica locations. In each frame:
To increase the robustness of our assessment, we evaluated a broad range of neural network architectures. All models used off-the-shelf COCO-pretrained weights from OpenMMLab’s mmDetection library and were not fine-tuned for the autonomous driving domain. We considered only the object class ‘person’. The models we used are listed below:
| Model | Dino r50 | GLIP | Yolo v8 |
| --- | --- | --- | --- |
| Model type | Single-stage vision transformer | Single-stage vision-language transformer | Single-stage CNN |
| Backbone | Swin / ViT | Swin / CLIP ViT | Custom backbone from Ultralytics |
| Pre-training dataset | COCO | O365, GoldG, CC3M, SBU | COCO |
| Baseline mean Average Precision (mAP) on COCO | 50.1 | 55.2 | 54.0 |
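For reference, a minimal sketch of running one of these off-the-shelf models and keeping only ‘person’ detections is shown below. It assumes the mmdetection 2.x-style API (the return format differs in mmdetection 3.x), and the config and checkpoint paths are placeholders rather than the exact files we used.

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder config/checkpoint paths; any COCO-pretrained mmdetection model is wired up the same way.
CONFIG = "configs/example/coco_pretrained_detector.py"
CHECKPOINT = "checkpoints/coco_pretrained_detector.pth"
PERSON_CLASS_ID = 0  # 'person' is class index 0 in COCO

model = init_detector(CONFIG, CHECKPOINT, device="cuda:0")

def detect_pedestrians(image_path, score_thr=0.3):
    """Run inference and keep only 'person' boxes at or above the confidence threshold."""
    result = inference_detector(model, image_path)
    # mmdetection 2.x returns one [N, 5] array (x1, y1, x2, y2, score) per class.
    person_boxes = result[PERSON_CLASS_ID]
    return person_boxes[person_boxes[:, 4] >= score_thr]
```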
This single-stage vision transformer (Dino r50) had the largest overall sim-to-real gap: 0.1, or 10%. This breaks down into a mAP delta of -0.098, a PR-AUC delta of 0.121, and an average F1-score delta of 0.081. As the precision-recall curve below shows, the curves diverge at lower confidence thresholds but generally follow the same shape.
This single-stage vision-language transformer (GLIP), which had the best baseline mAP score on the COCO dataset, had an overall sim-to-real gap of 0.07, or 7%. This breaks down into a mAP delta of -0.014, a PR-AUC delta of 0.121, and an average F1-score delta of 0.072. The precision-recall curve below follows the same general shape, with PD Replica slightly outperforming real-world data for detection, especially at lower confidence thresholds, as also seen with the Dino r50 model.
This widely used single-stage CNN (Yolo v8) had the lowest overall sim-to-real gap: 0.03, or 3%. This breaks down into a mAP delta of 0.029, a PR-AUC delta of 0.001, and an average F1-score delta of 0.034. The precision-recall curve below follows the same general shape, with the model achieving very high precision at confidence thresholds above 0.2.
Across all three architectures, including recent models like Yolo v8 and GLIP, performance on PD Replica Sim tracked real-world performance remarkably closely. Detailed breakdowns of the mAP, PR-AUC, and F1-score differences are provided in the appendix.
The smallest deltas appeared at higher confidence thresholds, indicating that high-confidence predictions made in Replica Sim are highly trustworthy. Additionally, PD Replica tended to yield higher recall, likely due to minor visual artifacts leading to more bounding box proposals, while maintaining comparable precision.
These findings are critical for any team looking to scale validation of pedestrian detection systems. The 10% performance delta confirms that PD Replica Sim data can serve as a reliable proxy for real-world testing.
Additionally, we find that a 2D perception model is very likely to behave similarly on simulated and real data at high confidence thresholds. Lastly, precision-recall curves are a useful tool for understanding how your model performs on PD Replica data. Developers can generate a small PD Replica test set, compare it against their real data, and establish the confidence threshold up to which performance can be trusted to hold, as sketched below.
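A minimal sketch of that threshold check follows; it assumes per-threshold F1-scores have already been computed for both datasets, and the helper name, values, and 10% tolerance are illustrative only.

```python
import numpy as np

def trusted_threshold(thresholds, f1_real, f1_sim, max_gap=0.10):
    """Lowest confidence threshold from which |F1_real - F1_sim| stays within max_gap."""
    gaps = np.abs(np.asarray(f1_real) - np.asarray(f1_sim))
    for i, t in enumerate(thresholds):
        if np.all(gaps[i:] <= max_gap):
            return t
    return None  # no threshold at which the gap stays within tolerance

# Illustrative values only, not measured results.
ths     = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
f1_real = [0.55, 0.60, 0.64, 0.66, 0.67, 0.66, 0.63, 0.58, 0.45]
f1_sim  = [0.40, 0.49, 0.58, 0.62, 0.65, 0.65, 0.62, 0.57, 0.44]
print(trusted_threshold(ths, f1_real, f1_sim))  # 0.3
```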
With upcoming regulatory pressure for mandatory pedestrian AEB, teams need tools that allow them to test quickly, safely, and thoroughly. PD Replica delivers that capability. Whether you’re training a new detection model or validating an existing one, simulation is no longer just a preliminary step; it’s a trustworthy environment for producing results that carry over to the real world.
Want to explore how Replica Sim can accelerate your perception development?
Some of the discrepancy observed in model performance is likely due to differences in complexity between the two datasets. The real Waymo scenes and our simulated PD Replica scenes are not exact matches, so even though they are closely aligned, some divergence will always remain.
We are currently investigating the performance of PD Replica in a pairwise comparison fashion, mimicking time of day, pedestrian density, and location. Stay tuned as we continue our research efforts to better evaluate and quantify the sim-to-real gap.
Confidence threshold – the cutoff applied to the model’s confidence score, i.e., how confident the model is that a predicted bounding box and its label are correct
Precision – The fraction of correct detections among all detections
Recall – The fraction of all actual objects in an image that the model successfully detected with a correct bounding box.
mAP – mean Average Precision: the sum of precision values, each weighted by the corresponding increase in recall as the confidence threshold varies, blending all confidence thresholds into a single metric
F1-score – 2 × (Precision × Recall) / (Precision + Recall), computed at each confidence threshold individually
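To make these definitions concrete, here is a small Python sketch computing precision, recall, and F1-score at a single confidence threshold, plus a PR-AUC over the precision-recall points collected across thresholds. It assumes matching of predicted boxes to ground truth (e.g. by IoU) has already produced the true positive, false positive, and false negative counts, and the example numbers are illustrative only.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts at one confidence threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def pr_auc(precisions, recalls):
    """Area under the precision-recall curve via trapezoidal integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

# (tp, fp, fn) counts for confidence thresholds 0.3, 0.4, ..., 1.0 (illustrative values only).
counts = [(80, 30, 20), (75, 20, 25), (70, 12, 30), (60, 6, 40),
          (50, 3, 50), (35, 1, 65), (20, 0, 80), (5, 0, 95)]
precisions, recalls, _ = zip(*[precision_recall_f1(*c) for c in counts])
print(f"PR-AUC over thresholds 0.3-1.0: {pr_auc(precisions, recalls):.3f}")
```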