The Parallel Domain Team
Pedestrians are among the most vulnerable users of our road systems. Despite years of progress in automotive safety, pedestrian fatalities around the world remain alarmingly high. In the United States, pedestrian fatalities have risen since 2009, averaging ~6,500 deaths per year over the most recent five years (2017 to 2021) reported by the National Center for Statistics and Analysis (NCSA), an office of the National Highway Traffic Safety Administration (NHTSA). Over the same period, pedestrians accounted for ~17% of total traffic fatalities (NCSA). These numbers are not just statistics; they’re a call to action for developers of Advanced Driver Assistance Systems (ADAS) and autonomous vehicles.
Recognizing the urgency, NHTSA finalized a rule in 2024 mandating that all new passenger cars and light trucks be equipped with automatic emergency braking (AEB) systems, including pedestrian AEB, by September 2029. The message is clear: pedestrian protection is no longer optional. It is becoming a regulatory requirement, and perception systems must be tested rigorously to ensure they perform reliably in the real world.
But how can teams meet this demand at scale? Real-world testing is expensive, slow, and often fails to capture rare but critical edge cases. Simulation provides a powerful alternative, but only if it can be trusted. In this blog post, we put PD Replica Sim to the test by benchmarking it against real-world data from the Waymo Open Dataset. We focused on pedestrian detection using multiple model architectures and industry-standard metrics to answer a fundamental question: How closely does performance in Parallel Domain Replica Sim align with the real world?
Our results show that pedestrian detection models evaluated on PD Replica Sim performed within a 10% margin of their performance on real-world data. This 10% sim-to-real gap is the largest delta between perception performance on real data and on PD Replica Sim data across the three perception models we tested (Dino r50, GLIP, and Yolo v8). A lower gap is better: it indicates a smaller difference between how a perception model performs on real-world data and how it performs on PD Replica Sim data.
This finding reinforces the value of simulation for validating safety-critical perception systems and supporting the rollout of next-generation ADAS technologies. This is especially powerful because simulations can be run in parallel, allowing developers to test hundreds or thousands of times faster and at a larger scale than real-world testing.
For each off-the-shelf perception model, we define this sim-to-real gap as the average of the absolute deltas in F1-score, PR-AUC, and mAP between real-world and PD Replica data. This extends our prior research, which relied on mAP scores alone for reporting class performance. Below are more details on the datasets, performance metrics, and results.
We extended our research methodology both in the number of metrics we evaluate and in the number of models we test against. We evaluated models using three key metrics:

- mAP (mean Average Precision)
- PR-AUC (area under the precision-recall curve), computed over confidence thresholds from 0.3 to 1.0
- F1-score
Using those metrics, weighted equally, we then averaged the absolute deltas to calculate an overall sim-to-real gap, as shown in the equation below.
Let:

mAP_Real = mAP on real-world data
mAP_PD = mAP on PD Replica data
AUC_Real = PR-AUC on real-world data, over confidence thresholds from 0.3 to 1.0
AUC_PD = PR-AUC on PD Replica data, over confidence thresholds from 0.3 to 1.0
F1_Real = F1-score on real-world data
F1_PD = F1-score on PD Replica data

Then:

Overall Sim-to-Real Gap = ( |mAP_Real − mAP_PD| + |AUC_Real − AUC_PD| + |F1_Real − F1_PD| ) / 3
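For teams reproducing this calculation, here is a minimal Python sketch of the gap computation; the metric values in the example are placeholders, not results from our evaluation.

```python
def overall_sim_to_real_gap(map_real, map_pd, auc_real, auc_pd, f1_real, f1_pd):
    """Average of the absolute per-metric deltas between real-world and PD Replica scores."""
    deltas = [abs(map_real - map_pd), abs(auc_real - auc_pd), abs(f1_real - f1_pd)]
    return sum(deltas) / len(deltas)

# Placeholder scores, for illustration only (not measured results).
gap = overall_sim_to_real_gap(
    map_real=0.62, map_pd=0.55,
    auc_real=0.70, auc_pd=0.64,
    f1_real=0.68, f1_pd=0.63,
)
print(f"Overall sim-to-real gap: {gap:.3f}")  # 0.060
```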
We curated 26 residential scenes from the Waymo Open Dataset validation set, each containing 2 to 20 pedestrians. Scenes were selected to align closely with the visual and contextual complexity of PD Replica environments. Only front-camera views were used, and we sampled 8 to 10 frames per scene, resulting in 245 frames total.
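As a rough illustration of that curation step, the sketch below filters scenes by pedestrian count and samples front-camera frames; `iter_validation_scenes`, `is_residential`, and `count_pedestrians` are hypothetical helpers (not part of the Waymo Open Dataset API), standing in for the actual scene-parsing code.

```python
import random

def curate_scenes(iter_validation_scenes, is_residential, count_pedestrians,
                  frames_per_scene=(8, 10), max_scenes=26):
    """Select residential scenes containing 2-20 pedestrians and sample front-camera frames."""
    curated = []
    for scene in iter_validation_scenes():
        if not is_residential(scene):
            continue
        if not 2 <= count_pedestrians(scene) <= 20:
            continue
        # `scene.front_camera_frames` is likewise a placeholder attribute.
        frames = list(scene.front_camera_frames)
        k = min(len(frames), random.randint(*frames_per_scene))
        curated.append(random.sample(frames, k))
        if len(curated) == max_scenes:
            break
    return curated
```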
Using PD Replica, we generated 30 non-sequential frames from 8 different Replica locations. In each frame:
To increase the robustness of our assessment, we evaluated a broad range of neural network architectures. All models used off-the-shelf COCO-pretrained weights from OpenMMLab’s mmDetection library and were not fine-tuned for the autonomous driving domain. We considered only the object class ‘person’. The models we used are listed below:
| Model | Dino r50 | GLIP | Yolo v8 |
| --- | --- | --- | --- |
| Model type | Single-stage vision transformer | Single-stage vision-language transformer | Single-stage CNN |
| Backbone | Swin / ViT | Swin / CLIP ViT | Custom backbone from Ultralytics |
| Pre-training dataset | COCO | O365, GoldG, CC3M, SBU | COCO |
| Baseline mean Average Precision (mAP) on COCO | 50.1 | 55.2 | 54.0 |
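For reference, a minimal sketch of running one of these off-the-shelf models and keeping only ‘person’ detections is shown below. It assumes the mmdetection 2.x-style API (the return format differs in mmdetection 3.x), and the config and checkpoint paths are placeholders rather than the exact files we used.

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder config/checkpoint paths; any COCO-pretrained mmdetection model is wired up the same way.
CONFIG = "configs/example/coco_pretrained_detector.py"
CHECKPOINT = "checkpoints/coco_pretrained_detector.pth"
PERSON_CLASS_ID = 0  # 'person' is class index 0 in COCO

model = init_detector(CONFIG, CHECKPOINT, device="cuda:0")

def detect_pedestrians(image_path, score_thr=0.3):
    """Run inference and keep only 'person' boxes at or above the confidence threshold."""
    result = inference_detector(model, image_path)
    # mmdetection 2.x returns one [N, 5] array (x1, y1, x2, y2, score) per class.
    person_boxes = result[PERSON_CLASS_ID]
    return person_boxes[person_boxes[:, 4] >= score_thr]
```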
This single-stage vision transformer (Dino r50) had the largest overall sim-to-real gap: 0.1, or 10%. This breaks down into a mAP delta of -0.098, a PR-AUC delta of 0.121, and an average F1-score delta of 0.081. As the precision-recall curve below shows, the curves diverge at lower confidence thresholds but generally follow the same shape.
This single-stage vision-language transformer (GLIP), which had the best baseline mAP score on the COCO dataset, had an overall sim-to-real gap of 0.07, or 7%. This breaks down into a mAP delta of -0.014, a PR-AUC delta of 0.121, and an average F1-score delta of 0.072. The precision-recall curve below follows the same general shape, with PD Replica slightly outperforming real-world data for detection, especially at lower confidence thresholds, as also seen with the Dino r50 model.
This widely used single-stage CNN (Yolo v8) had the lowest overall sim-to-real gap: 0.03, or 3%. This breaks down into a mAP delta of 0.029, a PR-AUC delta of 0.001, and an average F1-score delta of 0.034. The precision-recall curve below follows the same general shape, with the model achieving very high precision at confidence thresholds above 0.2.
Across all three architectures, including recent models like Yolo v8 and GLIP, performance on PD Replica Sim tracked real-world performance remarkably closely. Detailed breakdowns of the mAP, PR-AUC, and F1-score differences are provided in the appendix.
The smallest deltas appeared at higher confidence thresholds, indicating that high-confidence predictions made in Replica Sim are highly trustworthy. Additionally, PD Replica tended to yield higher recall, likely due to minor visual artifacts leading to more bounding box proposals, while maintaining comparable precision.
These findings are critical for any team looking to scale validation of pedestrian detection systems. The 10% performance delta confirms that PD Replica Sim data can serve as a reliable proxy for real-world testing.
Additionally, we find that a 2D perception model is very likely to behave similarly on simulated and real data at high confidence thresholds. Lastly, precision-recall curves are a useful tool for understanding how your model performs on PD Replica data. Developers can generate a small PD Replica test set, compare it against their real data, and establish the confidence threshold up to which performance can be trusted to hold, as sketched below.
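A minimal sketch of that threshold check follows; it assumes per-threshold F1-scores have already been computed for both datasets, and the helper name, values, and 10% tolerance are illustrative only.

```python
import numpy as np

def trusted_threshold(thresholds, f1_real, f1_sim, max_gap=0.10):
    """Lowest confidence threshold from which |F1_real - F1_sim| stays within max_gap."""
    gaps = np.abs(np.asarray(f1_real) - np.asarray(f1_sim))
    for i, t in enumerate(thresholds):
        if np.all(gaps[i:] <= max_gap):
            return t
    return None  # no threshold at which the gap stays within tolerance

# Illustrative values only, not measured results.
ths     = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
f1_real = [0.55, 0.60, 0.64, 0.66, 0.67, 0.66, 0.63, 0.58, 0.45]
f1_sim  = [0.40, 0.49, 0.58, 0.62, 0.65, 0.65, 0.62, 0.57, 0.44]
print(trusted_threshold(ths, f1_real, f1_sim))  # 0.3
```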
With upcoming regulatory pressure for mandatory pedestrian AEB, teams need tools that allow them to test quickly, safely, and thoroughly. PD Replica delivers that capability. Whether you’re training a new detection model or validating an existing one, simulation is no longer just a preliminary step; it’s a trustworthy environment for producing results that carry over to the real world.
Want to explore how Replica Sim can accelerate your perception development?
Some of the discrepancy observed in model performance is likely due to differences in complexity between the two datasets. The real Waymo scenes and our simulated PD Replica scenes are not exact matches, so even though they are closely aligned, some divergence will always remain.
We are currently investigating the performance of PD Replica in a pairwise comparison fashion, mimicking time of day, pedestrian density, and location. Stay tuned as we continue our research efforts to better evaluate and quantify the sim-to-real gap.
Confidence threshold – the cutoff applied to the model’s confidence score, i.e., how confident the model is that a predicted bounding box and its label are correct
Precision – The fraction of correct detections among all detections
Recall – The fraction of all actual objects in an image that the model successfully detected with a correct bounding box.
mAP – mean Average Precision: the sum of precision values, each weighted by the corresponding increase in recall as the confidence threshold varies, blending all confidence thresholds into a single metric
F1-score – 2 × (Precision × Recall) / (Precision + Recall), computed at each confidence threshold individually
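To make these definitions concrete, here is a small Python sketch computing precision, recall, and F1-score at a single confidence threshold, plus a PR-AUC over the precision-recall points collected across thresholds. It assumes matching of predicted boxes to ground truth (e.g. by IoU) has already produced the true positive, false positive, and false negative counts, and the example numbers are illustrative only.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts at one confidence threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def pr_auc(precisions, recalls):
    """Area under the precision-recall curve via trapezoidal integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

# (tp, fp, fn) counts for confidence thresholds 0.3, 0.4, ..., 1.0 (illustrative values only).
counts = [(80, 30, 20), (75, 20, 25), (70, 12, 30), (60, 6, 40),
          (50, 3, 50), (35, 1, 65), (20, 0, 80), (5, 0, 95)]
precisions, recalls, _ = zip(*[precision_recall_f1(*c) for c in counts])
print(f"PR-AUC over thresholds 0.3-1.0: {pr_auc(precisions, recalls):.3f}")
```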