January 21, 2021
Why I Joined Parallel Domain: Building the API for Synthetic Data
Written by Jon Wilfong, CRO
I’ve spent the last decade helping companies adopt APIs that solve real business problems. Most recently, I spent three years at Scale AI helping machine learning teams access high-quality training data via easy-to-use APIs.
I had the pleasure of working with both an amazing team and amazing customers at Scale. So when Vinuth Rai, the Head of Product at Parallel Domain and a former customer of Scale during his time at Toyota, reached out to me, I assumed it was just another simulation company wanting to talk about partnering with a data labeling provider. After five minutes of chatting with Vinuth, however, I realized that Parallel Domain wasn’t working on simulation. They were working on something far cooler: The API for Synthetic Data. This wasn’t a partnership call. It was a recruiting pitch.
The API for Synthetic Data
Most computer vision projects happening in the world right now rely on supervised learning techniques, which require massive amounts of labeled data. During my time at Scale, there were consistent themes that emerged during customer conversations around labeled data:
- How can I get close to perfect label quality?
- How can I get my data back faster?
- How can I get more data within my budget?
Better. Faster. Cheaper.
Each of those goals hits an optimization limit when you rely on workers or machines to label real sensor data. Quality, speed, and cost pull against one another when labeling real-world data: the same physical inefficiencies and limitations that apply to the real world apply to data labeling.
This is what made Parallel Domain so exciting to me. Parallel Domain is turning this real-world data pipeline problem into a pure software solution. Imagine a data pipeline that combines perfect label accuracy with tremendous scalability, all generated on a timescale and level of consistency we associate with silicon, not humans. For the customer, that means data in minutes, not months. And not just any data – exactly the distribution of weather, bicyclists, and sensors you requested.
The more I dug into this future, the more I noticed the questions that never even reach a data labeling partner. Questions like:
- “How can I collect the data that I need?”
- “How can I lower my data collection costs?”
- “How can I increase the data class variance I need to improve model performance?”
- “How can I test new sensor models & positions virtually instead of making real-world updates to the robot?”
- “How can I get the ground truth labels that humans just can’t annotate, like dense depth or optical flow?”
Guess what? Parallel Domain is already solving these challenges for some of the world’s leading autonomous vehicle, robotics, and drone teams.
At Parallel Domain we take a holistic view of data for machine learning. Reimagining the end-to-end data pipeline means that our customers can go from “I need data of pedestrians at night” to “my model now detects pedestrians at night” without leaving their desks. That’s the promise of synthetic data.
So why doesn’t everyone just use synthetic data instead of real-world labeled data?
To date, the majority of computer vision systems have been built on real-world data pipelines. For many, collecting and labeling real-world data is the default behavior. It’s a known quantity. In some ways, it’s the path of least resistance. Model not performing? Collect and label more.
There is proven value in real-world data, but it’s also clear that this approach yields diminishing returns over time. It suffers from bottlenecks that slow the pace of iteration and innovation. Ask most machine learning researchers and they will openly admit that current data pipelines do not scale to the degree they need in the long term.
Parallel Domain’s platform addresses these bottlenecks in development. It gives ML engineers the data they need, from start to finish, whether quickly training initial models or generating the diversity needed to squeeze out the last 3% of model performance. Our customers use these complementary strengths, finding that the best-performing and most cost-effective computer vision models are now built on a combination of real-world and Parallel Domain synthetic data.
Why I joined Parallel Domain
I didn’t join Parallel Domain to replace data labeling pipelines. I joined Parallel Domain to boost data pipelines and, ideally, help teams optimize their entire collection, curation, and training strategy. I joined to give ML researchers access to the API-based knobs & dials needed to generate datasets that defy the constraints of reality.
The team at Parallel Domain already has the unique set of talents required to execute on this better than anyone else in the business. Collective decades of experience in the autonomous vehicle and computer graphics spaces have allowed them to create something truly game-changing. Most importantly, they have a founder with a vision. I’m here to help build a technical sales and engagement team that accelerates the adoption of this technology: a group of exceptional people that customers see as an extension of their own team.
Someone once told me that sales is a transfer of enthusiasm…
- Working with autonomous vehicle ML teams ✓
- Procedurally generated state-of-the-art graphics ✓
- Machine Learning and APIs! ✓
I’ve never been so enthused.