Is fake data the real deal when training algorithms? – The Guardian
You're at the wheel of your car but you're exhausted. Your shoulders start to sag, your neck begins to droop, your eyelids slide down. As your head pitches forward, you swerve off the road and speed through a field, crashing into a tree.
But what if your car's monitoring system recognised the tell-tale signs of drowsiness and prompted you to pull off the road and park instead? The European Commission has legislated that from this year, new vehicles be fitted with systems to catch distracted and sleepy drivers to help avert accidents. Now a number of startups are training artificial intelligence systems to recognise the giveaways in our facial expressions and body language.
These companies are taking a novel approach for the field of AI. Instead of filming thousands of real-life drivers falling asleep and feeding that information into a deep-learning model to learn the signs of drowsiness, they're creating millions of fake human avatars to re-enact the sleepy signals.
Big data defines the field of AI for a reason. To train deep learning algorithms accurately, the models need to have a multitude of data points. That creates problems for a task such as recognising a person falling asleep at the wheel, which would be difficult and time-consuming to film happening in thousands of cars. Instead, companies have begun building virtual datasets.
Synthesis AI and Datagen are two companies using full-body 3D scans, including detailed face scans, and motion data captured by sensors placed all over the body, to gather raw data from real people. This data is fed through algorithms that tweak various dimensions many times over to create millions of 3D representations of humans, resembling characters in a video game, engaging in different behaviours across a variety of simulations.
In the case of someone falling asleep at the wheel, they might film a human performer falling asleep and combine it with motion capture, 3D animations and other techniques used to create video games and animated movies, to build the desired simulation. "You can map [the target behaviour] across thousands of different body types, different angles, different lighting, and add variability into the movement as well," says Yashar Behzadi, CEO of Synthesis AI.
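The idea of mapping one captured behaviour across many body types, angles and lighting conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SceneConfig` fields, the value lists and the jitter range are hypothetical, and nothing here reflects how Synthesis AI or Datagen actually build their scenes.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneConfig:
    """One hypothetical rendering of the same captured 'falling asleep' motion."""
    body_type: str
    camera_angle_deg: int
    lighting: str
    motion_jitter: float  # random perturbation applied to the captured movement


# Illustrative value lists (assumptions, not taken from the article).
BODY_TYPES = ["slim", "average", "broad"]
CAMERA_ANGLES_DEG = [-30, 0, 30]
LIGHTING = ["daylight", "dusk", "night"]


def generate_scene_configs(jitter_samples=2, seed=0):
    """Cross every body type, angle and lighting, then add random motion jitter."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    configs = []
    for body, angle, light in itertools.product(BODY_TYPES, CAMERA_ANGLES_DEG, LIGHTING):
        for _ in range(jitter_samples):
            jitter = round(rng.uniform(0.0, 0.2), 3)
            configs.append(SceneConfig(body, angle, light, jitter))
    return configs


configs = generate_scene_configs()
print(len(configs))  # 3 body types x 3 angles x 3 lighting x 2 jitter samples = 54
```

Each added dimension multiplies the number of variations, which is how a single recorded performance can fan out into a very large synthetic dataset.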
Using synthetic data cuts out a lot of the messiness of the more traditional way to train deep learning algorithms. Typically, companies would have to amass a vast collection of real-life footage and low-paid workers would painstakingly label each of the clips. These would be fed into the model, which would learn how to recognise the behaviours.
The big sell for the synthetic data approach is that it's quicker and cheaper by a wide margin. But these companies also claim it can help tackle the bias that creates a huge headache for AI developers. It's well documented that some AI facial recognition software is poor at recognising and correctly identifying particular demographic groups. This tends to be because these groups are underrepresented in the training data, meaning the software is more likely to misidentify these people.
Niharika Jain, a software engineer and expert in gender and racial bias in generative machine learning, highlights the notorious example of Nikon Coolpix's "blink detection" feature, which, because the training data included a majority of white faces, disproportionately judged Asian faces to be blinking. A good driver-monitoring system must avoid misidentifying members of a certain demographic as asleep more often than others, she says.
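One way to make that requirement measurable is to compare, group by group, how often awake drivers are wrongly flagged as asleep. The sketch below is a minimal illustration using invented records; the group names and the record layout are assumptions, not data from any real driver-monitoring system.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Share of awake drivers wrongly flagged as asleep, per demographic group.

    records: iterable of (group, actually_asleep, predicted_asleep) tuples.
    """
    false_positives = defaultdict(int)
    awake_total = defaultdict(int)
    for group, actually_asleep, predicted_asleep in records:
        if not actually_asleep:
            awake_total[group] += 1
            if predicted_asleep:
                false_positives[group] += 1
    return {g: false_positives[g] / awake_total[g] for g in awake_total}


# Invented example records: (group, actually_asleep, predicted_asleep)
records = [
    ("group_a", False, False), ("group_a", False, False),
    ("group_a", False, True), ("group_a", True, True),
    ("group_b", False, True), ("group_b", False, True),
    ("group_b", False, False), ("group_b", True, True),
]
rates = false_positive_rate_by_group(records)
print(rates)  # group_a is wrongly flagged 1 time in 3, group_b 2 times in 3
```

A large gap between the groups' rates, as in this toy data, is the kind of disparity Jain describes.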
The typical response to this problem is to gather more data from the underrepresented groups in real-life settings. But companies such as Datagen say this is no longer necessary. The company can simply create more faces from the underrepresented groups, meaning they'll make up a bigger proportion of the final dataset. Real 3D face scan data from thousands of people is whipped up into millions of AI composites. "There's no bias baked into the data; you have full control of the age, gender and ethnicity of the people that you're generating," says Gil Elbaz, co-founder of Datagen. The creepy faces that emerge don't look like real people, but the company claims that they're similar enough to teach AI systems how to respond to real people in similar scenarios.
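The rebalancing step described here amounts to simple arithmetic: count how many examples each group has, then generate enough synthetic ones to bring every group up to a chosen target. The following sketch assumes the target is the size of the largest group; the group labels and counts are invented, and this is not Datagen's actual procedure.

```python
from collections import Counter


def synthetic_samples_needed(group_labels):
    """How many synthetic examples each group needs to match the largest group."""
    counts = Counter(group_labels)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}


# Invented dataset: one group dominates, two are underrepresented.
labels = ["group_a"] * 700 + ["group_b"] * 200 + ["group_c"] * 100
needed = synthetic_samples_needed(labels)
print(needed)  # {'group_a': 0, 'group_b': 500, 'group_c': 600}
```

Equal counts only equalise representation in the dataset; on their own they do not guarantee that a model performs equally well on every group.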
There is, however, some debate over whether synthetic data can really eliminate bias. Bernease Herman, a data scientist at the University of Washington eScience Institute, says that although synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe that synthetic data alone can close the gap between the performance on those groups and others. Although the companies sometimes publish academic papers showcasing how their algorithms work, the algorithms themselves are proprietary, so researchers cannot independently evaluate them.
In areas such as virtual reality, as well as robotics, where 3D mapping is important, synthetic data companies argue it could actually be preferable to train AI on simulations, especially as 3D modelling, visual effects and gaming technologies improve. "It's only a matter of time until you can create these virtual worlds and train your systems completely in a simulation," says Behzadi.
This kind of thinking is gaining ground in the autonomous vehicle industry, where synthetic data is becoming instrumental in teaching self-driving vehicles' AI how to navigate the road. The traditional approach (filming hours of driving footage and feeding this into a deep learning model) was enough to get cars relatively good at navigating roads. But the issue vexing the industry is how to get cars to reliably handle what are known as "edge cases": events that are rare enough that they don't appear much in millions of hours of training data. Examples include a child or dog running into the road, complicated roadworks or even some traffic cones placed in an unexpected position, which was enough to stump a driverless Waymo vehicle in Arizona in 2021.
With synthetic data, companies can create endless variations of scenarios in virtual worlds that rarely happen in the real world. "Instead of waiting millions more miles to accumulate more examples, they can artificially generate as many examples as they need of the edge case for training and testing," says Phil Koopman, associate professor in electrical and computer engineering at Carnegie Mellon University.
AV companies such as Waymo, Cruise and Wayve are increasingly relying on real-life data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and solar glare. It uses this to train vehicles on normal driving situations, as well as the trickier edge cases. In 2021, Waymo told the Verge that it had simulated 15bn miles of driving, versus a mere 20m miles of real driving.
An added benefit to testing autonomous vehicles out in virtual worlds first is minimising the chance of very real accidents. "A large reason self-driving is at the forefront of a lot of the synthetic data stuff is fault tolerance," says Herman. "A self-driving car making a mistake 1% of the time, or even 0.01% of the time, is probably too much."
In 2017, Volvo's self-driving technology, which had been taught how to respond to large North American animals such as deer, was baffled when encountering kangaroos for the first time in Australia. "If a simulator doesn't know about kangaroos, no amount of simulation will create one until it is seen in testing and designers figure out how to add it," says Koopman. For Aaron Roth, professor of computer and cognitive science at the University of Pennsylvania, the challenge will be to create synthetic data that is indistinguishable from real data. He thinks it is plausible that "we're at that point for face data, as computers can now generate photorealistic images of faces. But for a lot of other things", which may or may not include kangaroos, "I don't think that we're there yet."