When it comes to AI, can we ditch the datasets? – MIT News
Huge amounts of data are needed to train machine-learning models to perform image classification tasks, such as identifying damage in satellite photos following a natural disaster. However, these data are not always easy to come by. Datasets may cost millions of dollars to generate, if usable data exist in the first place, and even the best datasets often contain biases that negatively impact a model's performance.
To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.
Their results show that a contrastive representation learning model trained using only these synthetic data is able to learn visual representations that rival or even outperform those learned from real data.
This special machine-learning model, known as a generative model, requires far less memory to store or share than a dataset. Using synthetic data also has the potential to sidestep some concerns around privacy and usage rights that limit how some real data can be distributed. A generative model could also be edited to remove certain attributes, like race or gender, which could address some biases that exist in traditional datasets.
"We knew that this method should eventually work; we just needed to wait for these generative models to get better and better. But we were especially pleased when we showed that this method sometimes does even better than the real thing," says Ali Jahanian, a research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.
Jahanian wrote the paper with CSAIL grad students Xavier Puig and Yonglong Tian, and senior author Phillip Isola, an assistant professor in the Department of Electrical Engineering and Computer Science. The research will be presented at the International Conference on Learning Representations.
Generating synthetic data
Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing. The training process involves showing the generative model millions of images that contain objects in a particular class (like cars or cats), and then it learns what a car or cat looks like so it can generate similar objects.
Essentially by flipping a switch, researchers can use a pretrained generative model to output a steady stream of unique, realistic images that are based on those in the model's training dataset, Jahanian says.
But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can imagine how a car would look in different situations, including situations it did not see during training, and then output images that show the car in unique poses, colors, or sizes.
Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different.
The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains.
"This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations," he says.
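To make the pairing concrete, here is a minimal Python sketch of the idea. It is not the researchers' implementation. Everything in it is an assumption made for illustration: `toy_generator` is a hypothetical stand-in for a pretrained generative model, two "views" of the same content are produced by nudging a latent vector slightly (one plausible way to get views, which the article does not specify), and the loss is a standard InfoNCE-style contrastive loss. A real system would also train an encoder network to minimize this loss; the sketch applies the loss directly to the generated outputs to stay short.

```python
import math
import random


def toy_generator(z):
    # Hypothetical stand-in for a pretrained generative model:
    # maps a latent vector to an "image" (here just a list of floats).
    return [math.tanh(2.0 * v) for v in z]


def sample_views(rng, dim, sigma=0.1):
    # Assumption: two views of the same content come from two nearby latents.
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z_near = [v + rng.gauss(0.0, sigma) for v in z]
    return toy_generator(z), toy_generator(z_near)


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    # InfoNCE-style loss: each anchor should be most similar to its own
    # positive (same index) among all positives in the batch.
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


rng = random.Random(0)
pairs = [sample_views(rng, dim=16) for _ in range(8)]
anchors = [a for a, _ in pairs]
positives = [p for _, p in pairs]

# Correctly paired views versus deliberately mismatched pairs.
matched = info_nce(anchors, positives)
mismatched = info_nce(anchors, positives[1:] + positives[:1])
print("matched loss:", matched)
print("mismatched loss:", mismatched)
```

The loss is low when each view is paired with its true counterpart and high when the pairs are scrambled, which is the signal a contrastive learner uses to shape its representations.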
Even better than the real thing
The researchers compared their method to several other image classification models that were trained using real data and found that their method performed as well as, and sometimes better than, the other models.
One advantage of using a generative model is that it can, in theory, create an infinite number of samples. So, the researchers also studied how the number of samples influenced the model's performance. They found that, in some instances, generating larger numbers of unique samples led to additional improvements.
"The cool thing about these generative models is that someone else trained them for you. You can find them in online repositories, so everyone can use them. And you don't need to intervene in the model to get good representations," Jahanian says.
But he cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren't properly audited.
He and his collaborators plan to address those limitations in future work. Another area they want to explore is using this technique to generate corner cases that could improve machine-learning models. Corner cases often can't be learned from real data. For instance, if researchers are training a computer vision model for a self-driving car, real data wouldn't contain examples of a dog and his owner running down a highway, so the model would never learn what to do in this situation. Generating that corner case data synthetically could improve the performance of machine-learning models in some high-stakes situations.
The researchers also want to continue improving generative models so they can compose images that are even more sophisticated, he says.
This research was supported, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.