Is fake data the real deal when training algorithms? – The Guardian
Youre at the wheel of your car but youre exhausted. Your shoulders start to sag, your neck begins to droop, your eyelids slide down. As your head pitches forward, you swerve off the road and speed through a field, crashing into a tree.
But what if your cars monitoring system recognised the tell-tale signs of drowsiness and prompted you to pull off the road and park instead? The European Commission has legislated that from this year, new vehicles be fitted with systems to catch distracted and sleepy drivers to help avert accidents. Now a number of startups are training artificial intelligence systems to recognise the giveaways in our facial expressions and body language.
These companies are taking a novel approach for the field of AI. Instead of filming thousands of real-life drivers falling asleep and feeding that information into a deep-learning model to learn the signs of drowsiness, theyre creating millions of fake human avatars to re-enact the sleepy signals.
Big data defines the field of AI for a reason. To train deep learning algorithms accurately, the models need to have a multitude of data points. That creates problems for a task such as recognising a person falling asleep at the wheel, which would be difficult and time-consuming to film happening in thousands of cars. Instead, companies have begun building virtual datasets.
Synthesis AI and Datagen are two companies using full-body 3D scans, including detailed face scans, and motion data captured by sensors placed all over the body, to gather raw data from real people. This data is fed through algorithms that tweak various dimensions many times over to create millions of 3D representations of humans, resembling characters in a video game, engaging in different behaviours across a variety of simulations.
In the case of someone falling asleep at the wheel, they might film a human performer falling asleep and combine it with motion capture, 3D animations and other techniques used to create video games and animated movies, to build the desired simulation. You can map [the target behaviour] across thousands of different body types, different angles, different lighting, and add variability into the movement as well, says Yashar Behzadi, CEO of Synthesis AI.
Using synthetic data cuts out a lot of the messiness of the more traditional way to train deep learning algorithms. Typically, companies would have to amass a vast collection of real-life footage and low-paid workers would painstakingly label each of the clips. These would be fed into the model, which would learn how to recognise the behaviours.
The big sell for the synthetic data approach is that its quicker and cheaper by a wide margin. But these companies also claim it can help tackle the bias that creates a huge headache for AI developers. Its well documented that some AI facial recognition software is poor at recognising and correctly identifying particular demographic groups. This tends to be because these groups are underrepresented in the training data, meaning the software is more likely to misidentify these people.
Niharika Jain, a software engineer and expert in gender and racial bias in generative machine learning, highlights the notorious example of Nikon Coolpixs blink detection feature, which, because the training data included a majority of white faces, disproportionately judged Asian faces to be blinking. A good driver-monitoring system must avoid misidentifying members of a certain demographic as asleep more often than others, she says.
The typical response to this problem is to gather more data from the underrepresented groups in real-life settings. But companies such as Datagen say this is no longer necessary. The company can simply create more faces from the underrepresented groups, meaning theyll make up a bigger proportion of the final dataset. Real 3D face scan data from thousands of people is whipped up into millions of AI composites. Theres no bias baked into the data; you have full control of the age, gender and ethnicity of the people that youre generating, says Gil Elbaz, co-founder of Datagen. The creepy faces that emerge dont look like real people, but the company claims that theyre similar enough to teach AI systems how to respond to real people in similar scenarios.
There is, however, some debate over whether synthetic data can really eliminate bias. Bernease Herman, a data scientist at the University of Washington eScience Institute, says that although synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe that synthetic data alone can close the gap between the performance on those groups and others. Although the companies sometimes publish academic papers showcasing how their algorithms work, the algorithms themselves are proprietary, so researchers cannot independently evaluate them.
In areas such as virtual reality, as well as robotics, where 3D mapping is important, synthetic data companies argue it could actually be preferable to train AI on simulations, especially as 3D modelling, visual effects and gaming technologies improve. Its only a matter of time until you can create these virtual worlds and train your systems completely in a simulation, says Behzadi.
This kind of thinking is gaining ground in the autonomous vehicle industry, where synthetic data is becoming instrumental in teaching self-driving vehicles AI how to navigate the road. The traditional approach filming hours of driving footage and feeding this into a deep learning model was enough to get cars relatively good at navigating roads. But the issue vexing the industry is how to get cars to reliably handle what are known as edge cases events that are rare enough that they dont appear much in millions of hours of training data. For example, a child or dog running into the road, complicated roadworks or even some traffic cones placed in an unexpected position, which was enough to stump a driverless Waymo vehicle in Arizona in 2021.
With synthetic data, companies can create endless variations of scenarios in virtual worlds that rarely happen in the real world. Instead of waiting millions more miles to accumulate more examples, they can artificially generate as many examples as they need of the edge case for training and testing, says Phil Koopman, associate professor in electrical and computer engineering at Carnegie Mellon University.
AV companies such as Waymo, Cruise and Wayve are increasingly relying on real-life data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and solar glare. It uses this to train vehicles on normal driving situations, as well as the trickier edge cases. In 2021, Waymo told the Verge that it had simulated 15bn miles of driving, versus a mere 20m miles of real driving.
An added benefit to testing autonomous vehicles out in virtual worlds first is minimising the chance of very real accidents. A large reason self-driving is at the forefront of a lot of the synthetic data stuff is fault tolerance, says Herman. A self-driving car making a mistake 1% of the time, or even 0.01% of the time, is probably too much.
In 2017, Volvos self-driving technology, which had been taught how to respond to large North American animals such as deer, was baffled when encountering kangaroos for the first time in Australia. If a simulator doesnt know about kangaroos, no amount of simulation will create one until it is seen in testing and designers figure out how to add it, says Koopman. For Aaron Roth, professor of computer and cognitive science at the University of Pennsylvania, the challenge will be to create synthetic data that is indistinguishable from real data. He thinks it is plausible that were at that point for face data, as computers can now generate photorealistic images of faces. But for a lot of other things, which may or may not include kangaroos I dont think that were there yet.
Excerpt from:
Is fake data the real deal when training algorithms? - The Guardian
- A.I. VS HUMAN ROAST BATTLE to Pit Machine Learning Against Live Rapper in SF - BroadwayWorld - June 16th, 2026 [June 16th, 2026]
- Machine learning gives the U.S. a 1% chance of winning the World Cup final in its own backyard - Fortune - June 16th, 2026 [June 16th, 2026]
- Machine Learning Reveals Genes That Help Yeasts Resist Stress - Department of Energy (.gov) - June 16th, 2026 [June 16th, 2026]
- Machine Learning Reveals AED Impact on LGG Prognosis - Bioengineer.org - June 16th, 2026 [June 16th, 2026]
- Introducing the Third Generation of Apples Foundation Models - Apple Machine Learning Research - June 12th, 2026 [June 12th, 2026]
- Machine learning model predicts T2D risk up to 10 years before onset - Managed Healthcare Executive - June 12th, 2026 [June 12th, 2026]
- GPU as a Service Market to Reach USD 14.4 Billion by 2033 at 16.0% CAGR, Fueled by Generative AI, Machine Learning, and Cloud Infrastructure Expansion... - June 12th, 2026 [June 12th, 2026]
- Machine learning-guided design of mechanoadaptive bioglues for multitissue trauma and first-aid applications - Nature - June 12th, 2026 [June 12th, 2026]
- OUCRU scientists are using machine learning to forecast the next dengue outbreak - tropicalmedicine.ox.ac.uk - June 12th, 2026 [June 12th, 2026]
- IIT Roorkee invites applications for 11th Batch of Data Science, Machine Learning & Generative AI Programme - Elets Technomedia - June 12th, 2026 [June 12th, 2026]
- RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem - Towards Data Science - June 3rd, 2026 [June 3rd, 2026]
- A reality check on the AI jobs hysteria - Machine Learning Week US - June 3rd, 2026 [June 3rd, 2026]
- STMicroelectronics Releases Vibration Sensor With Integrated Machine Learning for Industrial Monitoring - geneonline.com - June 3rd, 2026 [June 3rd, 2026]
- NAVER LABS Europe is offering a 2026 Research Internship in Large Language Models, focusing on AI Alignment, Controlled Generation, and Machine... - May 29th, 2026 [May 29th, 2026]
- Q&A: A Machine-Learning-Based Tool to Enhance Clinical Care of Patients With Multiple Sclerosis - Physician's Weekly - May 29th, 2026 [May 29th, 2026]
- Evaluating the Diagnostic Performance of AI and Machine Learning in Sickle Cell Disease Detection: A Systematic Review - Cureus - May 29th, 2026 [May 29th, 2026]
- HTC-19 Update: Artificial Intelligence and Machine Learning - Chromatography Online - May 29th, 2026 [May 29th, 2026]
- Multimodal phenotypic classification of generalized anxiety and panic using structural MRI data and psychosocial factors: machine learning results... - May 29th, 2026 [May 29th, 2026]
- Machine Learning Personalizes Depression Treatment with the Help of Wearable Technology - UC San Diego Today - May 27th, 2026 [May 27th, 2026]
- How Machine Learning Makes Complex Knowledge Useable in Real-World Conditions - Supply & Demand Chain Executive - May 25th, 2026 [May 25th, 2026]
- How Airbnbs machine-learning tools aim to prevent Memorial Day weekend parties in Las Vegas - FOX5 Vegas - May 25th, 2026 [May 25th, 2026]
- Artificial Intelligence and Machine Learning in Hospital Quality Management, Patient Safety, and Accreditation Readiness: A Systematic Review and... - May 25th, 2026 [May 25th, 2026]
- Machine learning accelerates analysis of fusion materials - Technology Org - May 25th, 2026 [May 25th, 2026]
- Dr. Kaveh Heidary Presents Innovations in AI, Machine Learning and Multispectral Imaging - aamu.edu - May 25th, 2026 [May 25th, 2026]
- Comparison of Prognostic Performance Between a Machine Learning Model and Manually Measured Grey-White-Matter Ratio on Early Brain Computed Tomography... - May 25th, 2026 [May 25th, 2026]
- Machine learning proves that graphene is hydrophobic - Phys.org - May 13th, 2026 [May 13th, 2026]
- Machine learning algorithm predicts AMD stock price on May 31, 2026 - Finbold - May 13th, 2026 [May 13th, 2026]
- Genetic association and machine learning improve the prediction of type 1 diabetes risk - Nature - May 1st, 2026 [May 1st, 2026]
- What Can We Expect From Machine Learning Predictions in Daily Clinical Neurology? - Neurology Live - May 1st, 2026 [May 1st, 2026]
- How Spam Filters Paved the Way for Adversarial Machine Learning - 150sec - May 1st, 2026 [May 1st, 2026]
- Real-Time Estimation of Numerical Rating Scale (NRS) Scores Using Machine Learning-Based Facial Expression Analysis: A Proof-of-Concept Study - Cureus - May 1st, 2026 [May 1st, 2026]
- Heriot-Watt researcher warns gen AI in machine learning carries serious and underestimated risks - EdTech Innovation Hub - May 1st, 2026 [May 1st, 2026]
- HS-SPME/GCMS and Machine Learning Enable Volatile Fingerprinting and Classification of Commercial Vinegars - Chromatography Online - April 12th, 2026 [April 12th, 2026]
- Role of Artificial Intelligence and Machine Learning in Diagnosing Knee Lesions: Where Are We Now? - Cureus - April 12th, 2026 [April 12th, 2026]
- CMML2AML: machine-learning discovery of co-mutations and specific single mutations predictive of blast transformation in chronic myelomonocytic... - April 12th, 2026 [April 12th, 2026]
- Machine-learning-based reconstruction of Ming-dynasty defensive corridors in Yuxian - Nature - April 12th, 2026 [April 12th, 2026]
- Have you published a disruptive paper? New machine-learning tool helps you check - Physics World - April 12th, 2026 [April 12th, 2026]
- Microsoft is automatically updating Windows 11 24H2 to 25H2 using machine learning - TweakTown - April 5th, 2026 [April 5th, 2026]
- Inside the Magic of Machine Learning That Powers Enemy AI in Arc Raiders - 80 Level - April 3rd, 2026 [April 3rd, 2026]
- We analyzed Philly street scenes and identified signs of gentrification using machine learning trained on longtime residents observations - The... - April 3rd, 2026 [April 3rd, 2026]
- Boston University To Apply Machine Learning To Alzheimers Biomarker And Cognitive Data - Quantum Zeitgeist - April 3rd, 2026 [April 3rd, 2026]
- Sony buys machine-learning company to help "enhance gameplay visuals, improve rendering techniques, and unlock new levels of visual... - April 3rd, 2026 [April 3rd, 2026]
- The Machine Learning Stack Is Being Rebuilt From Scratch Here's What Developers Need to Know in 2026 - HackerNoon - April 3rd, 2026 [April 3rd, 2026]
- Closing the Revenue Gap: Leveraging Machine Learning to Solve the $260 Billion Denial Crisis - vocal.media - April 3rd, 2026 [April 3rd, 2026]
- Machine Learning for Pharmaceuticals Set to Witness Rapid - openPR.com - April 3rd, 2026 [April 3rd, 2026]
- You Must Address These 4 Concerns To Deploy Predictive AI - Machine Learning Week US - March 30th, 2026 [March 30th, 2026]
- Google and the rise of space-based machine learning - Latitude Media - March 30th, 2026 [March 30th, 2026]
- Researchers use machine learning and social network theory to identify formation patterns in digital forums - techxplore.com - March 30th, 2026 [March 30th, 2026]
- Mayo Clinic Study Uses Wearables and Machine Learning to Predict COPD Rehab Participation - HIT Consultant - March 30th, 2026 [March 30th, 2026]
- Machine learning at the edge in retail: constraints and gains - IoT News - March 26th, 2026 [March 26th, 2026]
- AI agents are flashy, but machine learning still pays the bills - TechRadar - March 26th, 2026 [March 26th, 2026]
- Single-cell imaging and machine learning reveal hidden coordination in algae's response to light stress - Phys.org - March 26th, 2026 [March 26th, 2026]
- Machine learning analysis of CT scans - National Institutes of Health (.gov) - March 22nd, 2026 [March 22nd, 2026]
- TransUnion Machine Learning Fraud Tools Tested Against Weak Share Price Momentum - simplywall.st - March 22nd, 2026 [March 22nd, 2026]
- Machine learning could help predict how people with depression respond to treatment - Medical Xpress - March 22nd, 2026 [March 22nd, 2026]
- KR approves machine learning-based fuel reduction methodology - Smart Maritime Network - March 22nd, 2026 [March 22nd, 2026]
- Available solar energy in Andalusia will increase through the end of the century, machine learning model finds - Tech Xplore - March 22nd, 2026 [March 22nd, 2026]
- How Machine Learning Is Reshaping Environmental Policy and Water Governance - Devdiscourse - March 22nd, 2026 [March 22nd, 2026]
- Chemistry student uses machine learning to transform gene therapy production - The University of North Carolina at Chapel Hill - March 13th, 2026 [March 13th, 2026]
- AI and Machine Learning - City of Brownsville to build smart city safety solution - Smart Cities World - March 13th, 2026 [March 13th, 2026]
- AI and Machine Learning - London borough overhauls public safety infrastructure - Smart Cities World - March 13th, 2026 [March 13th, 2026]
- Titan Technology Corp. Responds to Alberta Innovates RFP AI, Machine Learning and Automation Services - TradingView - March 13th, 2026 [March 13th, 2026]
- Vietnam FPT's AI automation solution secures new machine learning patent on overseas market - VnExpress International - March 13th, 2026 [March 13th, 2026]
- AI Healthcare Technology: The Power of Machine Learning Diagnosis in Modern Medicine - Tech Times - March 13th, 2026 [March 13th, 2026]
- Future Perspectives: Key Trends Shaping the Machine Learning Market in Financial Services Until 2030 - openPR.com - March 13th, 2026 [March 13th, 2026]
- How to Build an Autonomous Machine Learning Research Loop in Google Colab Using Andrej Karpathys AutoResearch Framework for Hyperparameter Discovery... - March 13th, 2026 [March 13th, 2026]
- The Arc in Arc Raiders have multiple "brains," and they all love pursuing you because Embark gives them "rewards" in real-time via... - March 13th, 2026 [March 13th, 2026]
- OnPoint AI to Present its Augmented Reality and Machine Learning Surgical Platform at the 2026 Canaccord Genuity Musculoskeletal Conference - Yahoo... - February 27th, 2026 [February 27th, 2026]
- TD Bank continues to develop AI, machine learning tools - Auto Finance News - February 27th, 2026 [February 27th, 2026]
- AI and Machine Learning - Tech companies team to scale private 5G and physical AI - Smart Cities World - February 27th, 2026 [February 27th, 2026]
- AI and Machine Learning in Dating Apps: Smarter Matchmaking Algorithms - Programming Insider - February 27th, 2026 [February 27th, 2026]
- Machine-Learning App Helps Anesthesiologists Navigate Critical Surgical Equipment in Real Time - Carle Illinois College of Medicine - February 24th, 2026 [February 24th, 2026]
- Fractal Launches PiEvolve, an Evolutionary Agentic Engine for Autonomous Machine Learning and Scientific Discovery - Yahoo Finance - February 24th, 2026 [February 24th, 2026]
- How Brain Data and Machine Learning Could Transform the Aging Industry - gritdaily.com - February 24th, 2026 [February 24th, 2026]
- AI and machine learning trends for Arizona leaders to watch in healthcare delivery and traveler services - AZ Big Media - February 24th, 2026 [February 24th, 2026]
- AI and machine learning are the future of Wi-Fi management: WBA report - Telecompetitor - February 22nd, 2026 [February 22nd, 2026]
- Machine learning streamlines the complexities of making better proteins - Science News - February 20th, 2026 [February 20th, 2026]
- WBA Publishes Guidance on Artificial Intelligence and Machine Learning for Intelligent Wi-Fi - ARC Advisory Group - February 20th, 2026 [February 20th, 2026]
- Machine learning-predicted insulin resistance is a risk factor for 12 types of cancer - Nature - February 20th, 2026 [February 20th, 2026]
- Exploring Machine Learning at the DOF - University of the Philippines Diliman - February 20th, 2026 [February 20th, 2026]