Is fake data the real deal when training algorithms? – The Guardian
You're at the wheel of your car but you're exhausted. Your shoulders start to sag, your neck begins to droop, your eyelids slide down. As your head pitches forward, you swerve off the road and speed through a field, crashing into a tree.
But what if your car's monitoring system recognised the tell-tale signs of drowsiness and prompted you to pull off the road and park instead? The European Commission has legislated that from this year, new vehicles be fitted with systems to catch distracted and sleepy drivers to help avert accidents. Now a number of startups are training artificial intelligence systems to recognise the giveaways in our facial expressions and body language.
These companies are taking a novel approach for the field of AI. Instead of filming thousands of real-life drivers falling asleep and feeding that information into a deep-learning model to learn the signs of drowsiness, they're creating millions of fake human avatars to re-enact the sleepy signals.
Big data defines the field of AI for a reason. To train deep learning algorithms accurately, the models need to have a multitude of data points. That creates problems for a task such as recognising a person falling asleep at the wheel, which would be difficult and time-consuming to film happening in thousands of cars. Instead, companies have begun building virtual datasets.
Synthesis AI and Datagen are two companies using full-body 3D scans, including detailed face scans, and motion data captured by sensors placed all over the body, to gather raw data from real people. This data is fed through algorithms that tweak various dimensions many times over to create millions of 3D representations of humans, resembling characters in a video game, engaging in different behaviours across a variety of simulations.
In the case of someone falling asleep at the wheel, they might film a human performer falling asleep and combine it with motion capture, 3D animations and other techniques used to create video games and animated movies, to build the desired simulation. "You can map [the target behaviour] across thousands of different body types, different angles, different lighting, and add variability into the movement as well," says Yashar Behzadi, CEO of Synthesis AI.
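As a rough illustration of the kind of variation Behzadi describes, the sketch below samples random combinations of body type, camera angle, lighting and movement variability for a single target behaviour. It is not either company's pipeline: the `SceneParams` fields, the category lists and the numeric ranges are all hypothetical, chosen only to show the idea.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """One hypothetical rendering configuration for a simulated driver."""
    body_type: str
    camera_angle_deg: float
    lighting: str
    motion_jitter: float  # how much the captured movement is perturbed


# Illustrative categories only; a real system would use far richer controls.
BODY_TYPES = ["slim", "average", "broad"]
LIGHTING = ["daylight", "dusk", "night", "tunnel"]


def sample_scenes(n, seed=0):
    """Draw n random scene configurations, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [
        SceneParams(
            body_type=rng.choice(BODY_TYPES),
            camera_angle_deg=rng.uniform(-30.0, 30.0),
            lighting=rng.choice(LIGHTING),
            motion_jitter=rng.uniform(0.0, 0.2),
        )
        for _ in range(n)
    ]


scenes = sample_scenes(1000)
```

Each sampled configuration would then drive one rendered clip of the same drowsy-driver performance, which is how a single recording could plausibly fan out into many distinct training examples.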
Using synthetic data cuts out a lot of the messiness of the more traditional way to train deep learning algorithms. Typically, companies would have to amass a vast collection of real-life footage and low-paid workers would painstakingly label each of the clips. These would be fed into the model, which would learn how to recognise the behaviours.
The big sell for the synthetic data approach is that it's quicker and cheaper by a wide margin. But these companies also claim it can help tackle the bias that creates a huge headache for AI developers. It's well documented that some AI facial recognition software is poor at recognising and correctly identifying particular demographic groups. This tends to be because these groups are underrepresented in the training data, meaning the software is more likely to misidentify these people.
Niharika Jain, a software engineer and expert in gender and racial bias in generative machine learning, highlights the notorious example of Nikon Coolpix's "blink detection" feature, which, because the training data included a majority of white faces, disproportionately judged Asian faces to be blinking. "A good driver-monitoring system must avoid misidentifying members of a certain demographic as asleep more often than others," she says.
The typical response to this problem is to gather more data from the underrepresented groups in real-life settings. But companies such as Datagen say this is no longer necessary. The company can simply create more faces from the underrepresented groups, meaning they'll make up a bigger proportion of the final dataset. Real 3D face scan data from thousands of people is whipped up into millions of AI composites. "There's no bias baked into the data; you have full control of the age, gender and ethnicity of the people that you're generating," says Gil Elbaz, co-founder of Datagen. The creepy faces that emerge don't look like real people, but the company claims that they're similar enough to teach AI systems how to respond to real people in similar scenarios.
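The rebalancing idea can be sketched in a few lines. This assumes each training example carries a single group label, and it takes equal counts per group as the target. Both are simplifying assumptions for illustration, not a description of Datagen's method, and `synthetic_needed` is a hypothetical helper name.

```python
from collections import Counter


def synthetic_needed(labels):
    """Return how many synthetic examples each group would need so that
    every group matches the size of the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    return {group: target - count for group, count in counts.items()}


# Toy dataset: group "a" dominates, "b" and "c" are underrepresented.
labels = ["a"] * 5 + ["b"] * 2 + ["c"] * 1
shortfall = synthetic_needed(labels)
print(shortfall)  # {'a': 0, 'b': 3, 'c': 4}
```

Equal counts are only one simple notion of balance. They say nothing about whether a model trained on the result performs equally well across groups.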
There is, however, some debate over whether synthetic data can really eliminate bias. Bernease Herman, a data scientist at the University of Washington eScience Institute, says that although synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe that synthetic data alone can close the gap between the performance on those groups and others. Although the companies sometimes publish academic papers showcasing how their algorithms work, the algorithms themselves are proprietary, so researchers cannot independently evaluate them.
In areas such as virtual reality, as well as robotics, where 3D mapping is important, synthetic data companies argue it could actually be preferable to train AI on simulations, especially as 3D modelling, visual effects and gaming technologies improve. "It's only a matter of time until you can create these virtual worlds and train your systems completely in a simulation," says Behzadi.
This kind of thinking is gaining ground in the autonomous vehicle industry, where synthetic data is becoming instrumental in teaching self-driving vehicles' AI how to navigate the road. The traditional approach (filming hours of driving footage and feeding this into a deep learning model) was enough to get cars relatively good at navigating roads. But the issue vexing the industry is how to get cars to reliably handle what are known as "edge cases": events that are rare enough that they don't appear much in millions of hours of training data. Examples include a child or dog running into the road, complicated roadworks or even some traffic cones placed in an unexpected position, which was enough to stump a driverless Waymo vehicle in Arizona in 2021.
With synthetic data, companies can create endless variations of scenarios in virtual worlds that rarely happen in the real world. "Instead of waiting millions more miles to accumulate more examples, they can artificially generate as many examples as they need of the edge case for training and testing," says Phil Koopman, associate professor in electrical and computer engineering at Carnegie Mellon University.
AV companies such as Waymo, Cruise and Wayve are increasingly relying on real-life data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and solar glare. It uses this to train vehicles on normal driving situations, as well as the trickier edge cases. In 2021, Waymo told the Verge that it had simulated 15bn miles of driving, versus a mere 20m miles of real driving.
An added benefit to testing autonomous vehicles out in virtual worlds first is minimising the chance of very real accidents. "A large reason self-driving is at the forefront of a lot of the synthetic data stuff is fault tolerance," says Herman. "A self-driving car making a mistake 1% of the time, or even 0.01% of the time, is probably too much."
In 2017, Volvo's self-driving technology, which had been taught how to respond to large North American animals such as deer, was baffled when encountering kangaroos for the first time in Australia. "If a simulator doesn't know about kangaroos, no amount of simulation will create one until it is seen in testing and designers figure out how to add it," says Koopman. For Aaron Roth, professor of computer and cognitive science at the University of Pennsylvania, the challenge will be to create synthetic data that is indistinguishable from real data. He thinks it is plausible that we're at that point for face data, as computers can now generate photorealistic images of faces. "But for a lot of other things," which may or may not include kangaroos, "I don't think that we're there yet."