Archive for the ‘Machine Learning’ Category

OVH Groupe : A journey into the wondrous land of Machine Learning, or Cleaning data is funnier than cleaning my flat! (Part 3) – Marketscreener.com

What am I doing here? The story so far

As you might know if you have read our blog for more than a year, a few years ago I bought a flat in Paris. If you don't know it already, the real estate market in Paris is expensive, but despite that, it is so tight that a good flat at a fair price can be on sale for less than a day.

Obviously, you have to make a decision quite fast, and considering the prices, you have to trust that decision. Of course, to trust your decision, you have to take your time, study the market, make some visits, etc. This process can be quite long (in my case it took a year between the time I decided I wanted to buy a flat and the time I actually committed to buying my current flat), and even spending a lot of time will never give you a perfect understanding of the market. What if there was a way to do this very quickly and with better accuracy than the standard process?

As you might also know if you are one of our regular readers, I tried to solve this problem with Machine Learning, using an end-to-end software platform called Dataiku. In a first blog post, we learned how to make basic use of Dataiku, and discovered that just knowing how to click on a few buttons wasn't quite enough: you had to bring some sense into your data and into the training algorithm, or you would get absurd results.

In a second entry, we studied the data a bit more, tweaked a few parameters and values in Dataiku's algorithms and trained a new model. This yielded a much better result, and this new model was, if not accurate, at least relevant: the same flat had a higher predicted price when it was bigger or supposedly in a better neighbourhood. However, it was far from perfect and really lacked accuracy, for several reasons, some of them out of our control.

However, all of this was done on one instance of Dataiku, a licensed piece of software, on a single VM. There are multiple reasons that could push me to do things differently:

What we did very intuitively (and somewhat naively) with Dataiku was actually a quite complex pipeline that is often called ELT, for Extract, Load and Transform.

And obviously, after this ELT process, we added a step to train a model on the transformed data.

So what are we going to do to redo all of that without Dataiku's help?

When ELT becomes ELTT

Now that we know what we are going to do, let us proceed!

Before beginning, we have to properly set up our environment to be able to launch the different tools and products. Throughout this tutorial, we will show you how to do everything with CLIs. However, all these manipulations can also be done on OVHcloud's manager (GUI), in which case you won't have to configure these tools.

For all the manipulations described in the next phase of this article, we will use a Virtual Machine deployed in OVHcloud's Public Cloud. It will serve as the extraction agent that downloads the raw data from the web and pushes it to S3, as well as a CLI machine from which to launch data processing and notebook jobs. It is a d2-4 flavor with 4 GB of RAM, 2 vCores and 50 GB of local storage, running Debian 10 and deployed in the Gravelines datacenter. During this tutorial, I run a few UNIX commands, but you should easily be able to adapt them to whatever OS you use if needed. All the CLI tools specific to OVHcloud's products are available on multiple OSs.

You will also need an OVHcloud NIC (user account) as well as a Public Cloud project created for this account with a quota high enough to deploy a GPU (if that is not the case, you will still be able to deploy a notebook on CPU rather than GPU; the training phase will just take more time). To create a Public Cloud project, you can follow these steps.

Here is a list of the CLI tools (and other software) that we will use during this tutorial, and why:

Additionally, you will find commented code samples for the processing and training steps in this GitHub repository.

In this tutorial, we will use several object storage buckets. Since we will use the S3 API, we will call them S3 buckets, but as mentioned above, if you use OVHcloud's standard Public Cloud Storage, you could also use the Swift API. However, you are restricted to the S3 API only if you use our new high-performance object storage offer, currently in beta.

For this tutorial, we are going to create and use the following S3 buckets:

To create these buckets, use the following commands after having configured your aws CLI as explained above:
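As a rough boto3 equivalent of those aws CLI commands (the endpoint URL and the name of the raw-data bucket are assumptions; only the clean and model bucket names appear later in this article):

    # Create the S3 buckets used in this tutorial on an S3-compatible endpoint.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.gra.cloud.ovh.net",   # assumed regional endpoint
        aws_access_key_id="<access_key>",
        aws_secret_access_key="<secret_key>",
    )

    for bucket in ("transactions-ecoex-raw",    # assumed name for the raw-data bucket
                   "transactions-ecoex-clean",
                   "transactions-ecoex-model"):
        s3.create_bucket(Bucket=bucket)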

Now that you have your environment set up and your S3 buckets ready, we can begin the tutorial!

First, let us download the data files directly on Etalab's website and unzip them:
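A minimal sketch of this download step (the Etalab URL pattern and the exact years are assumptions; adapt them to the files you actually need):

    # Download the yearly DVF transaction files and decompress them locally.
    import gzip
    import shutil
    import urllib.request

    BASE_URL = "https://files.data.gouv.fr/geo-dvf/latest/csv"   # assumed location of the DVF files

    for year in range(2016, 2021):   # assumed: the 5 yearly files mentioned below
        archive = f"full_{year}.csv.gz"
        urllib.request.urlretrieve(f"{BASE_URL}/{year}/full.csv.gz", archive)
        # Decompress to a plain CSV, one file per year.
        with gzip.open(archive, "rb") as src, open(f"transactions_{year}.csv", "wb") as dst:
            shutil.copyfileobj(src, dst)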

You should now have the following files in your directory, each one corresponding to the French real estate transactions of a specific year:

Now, use the S3 CLI to push these files in the relevant S3 bucket:
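A minimal boto3 sketch of this upload step (the article itself does this with the aws CLI), reusing the assumed endpoint, credential placeholders, file names and raw bucket name from the snippets above:

    # Push the decompressed yearly CSV files to the raw-data bucket.
    import glob
    import os
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.gra.cloud.ovh.net",   # assumed regional endpoint
        aws_access_key_id="<access_key>",
        aws_secret_access_key="<secret_key>",
    )

    for path in glob.glob("transactions_*.csv"):
        s3.upload_file(path, "transactions-ecoex-raw", os.path.basename(path))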

You should now have those 5 files in your S3 bucket:

What we just did with a small VM was ingest data into an S3 bucket. In real-life use cases with more data, we would probably use dedicated tools to ingest the data. However, in our example, with just a few GB of data coming from a public website, this does the trick.

Now that you have your raw data in place to be processed, you just have to upload the code necessary to run your data processing job. Our data processing product allows you to run Spark code written in Java, Scala or Python. In our case, we used PySpark on Python. Your code should consist of 3 files:
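The commented code files are available in the GitHub repository mentioned above. As a purely illustrative sketch of what such a PySpark cleaning job could look like (the column names follow the public DVF schema and, like the bucket paths and filters, are assumptions rather than the article's actual code):

    # Read the raw yearly CSVs, keep only usable flat sales, and write a compact
    # cleaned dataset back to object storage.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ecoex-clean").getOrCreate()

    raw = spark.read.csv("s3a://transactions-ecoex-raw/*.csv", header=True, inferSchema=True)

    clean = (
        raw
        # Keep only actual sales of flats with a known price and surface.
        .filter((F.col("nature_mutation") == "Vente") & (F.col("type_local") == "Appartement"))
        .filter(F.col("valeur_fonciere").isNotNull() & F.col("surface_reelle_bati").isNotNull())
        # Keep a handful of useful columns and derive a price-per-square-metre feature.
        .select("date_mutation", "code_postal", "valeur_fonciere",
                "surface_reelle_bati", "nombre_pieces_principales")
        .withColumn("prix_m2", F.col("valeur_fonciere") / F.col("surface_reelle_bati"))
    )

    clean.write.mode("overwrite").parquet("s3a://transactions-ecoex-clean/transactions/")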

Once you have your code files, go to the folder containing them and push them to the appropriate S3 bucket:

Your bucket should now look like that:

You are now ready to launch your data processing job. The following command will allow you to launch this job on 10 executors, each with 4 vCores and 15 GB of RAM.

Note that the data processing product uses the Swift API to retrieve the code files. This is totally transparent to the user, and the fact that we used the S3 CLI to create the bucket has absolutely no impact. When the job is over, you should see the following in your transactions-ecoex-clean bucket:

Before going further, let us look at the size of the data before and after cleaning:

As you can see, from ~2.5 GB of raw data we extracted only ~10 MB of actually useful data (only 0.4%)! What is noteworthy here is that you can easily imagine use cases where you need a large-scale infrastructure to ingest and process the raw data, but where one or a few VMs are enough to work on the clean data. Obviously, this is more often the case when working with text/structured data than with raw sound/image/video.

Before we start training a model, take a look at these two screenshots from OVHcloud's data processing UI to erase any doubt you have about the power of distributed computing:

In the first picture, you see the time taken for this job when launching only 1 executor: 8 minutes and 35 seconds. This duration drops to only 2 minutes and 56 seconds when launching the same job (same code, etc.) on 4 executors: almost 3 times faster. And since you pay as you go, you are billed for the total executor time used, roughly 4 x 2:56 versus 1 x 8:35, so the same operation done 3 times faster only costs you about a third more in this case, without any modification to your code, only one argument in the CLI call. Let us now use this data to train a model.

To train the model, you are going to use OVHcloud AI notebook to deploy a notebook! With the following command, you will:

In our case, we launch a notebook with only 1 GPU because the code samples we provide would not leverage several GPUs for a single job. I could adapt my code to parallelize the training phase on multiple GPUs, in which case I could launch a job with up to 4 parallel GPUs.

Once this is done, just get the URL of your notebook with the following command and connect to it with your browser:

You can now import the real-estate-training.ipynb file into the notebook with just a few clicks. If you don't want to import it from the computer you use to access the notebook (for example if, like me, you work from a VM and have cloned the git repo on that VM rather than on your computer), you can push the .ipynb file to your transactions-ecoex-clean or transactions-ecoex-model bucket and re-synchronize that bucket to your running notebook with the ovhai notebook pull-data command. You will then find the notebook file in the corresponding directory.

Once you have imported the notebook file to your notebook instance, just open it and follow the instructions. If you are interested in the result but don't want to do it yourself, let's sum up what the notebook does:
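Broadly speaking, the notebook loads the cleaned transactions, builds a few features and fits a regression model to predict sale prices. As a very rough, hypothetical illustration of that kind of training (the real notebook lives in the GitHub repository and may use a different model and features; this sketch assumes the Parquet output from the cleaning step has been synced to the notebook's local storage):

    # Fit a simple gradient-boosting regressor on the cleaned transactions.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_parquet("transactions/")   # cleaned data pulled from the S3 bucket

    X = pd.DataFrame({
        # Naively treat the postal code as a number; a real notebook would encode it properly.
        "code_postal": pd.to_numeric(df["code_postal"], errors="coerce"),
        "surface": df["surface_reelle_bati"],
        "rooms": df["nombre_pieces_principales"],
    }).fillna(0)
    y = df["valeur_fonciere"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = GradientBoostingRegressor(n_estimators=300, max_depth=4)
    model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))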

Use the models built in this tutorial at your own risk

So, what can we conclude from all of this? First, even if the second model is obviously better than the first, it is still very noisy: its predictions are not far from correct on average, but there is still a huge variance. Where does this variance come from?

Well, it is not easy to say. To paraphrase the finishing part of my last article:

In this article, I tried to give you a glimpse at the tools that Data Scientists commonly use to manipulate data and train models at scale, in the Cloud or on their own infrastructure:

Hopefully, you now have a better understanding of how Machine Learning algorithms work, what their limitations are, and how Data Scientists work on data to create models.

As explained earlier, all the code used to obtain these results can be found here. Please don't hesitate to replicate what I did or adapt it to other use cases!

Solutions Architect at OVHcloud

Here is the original post:
OVH Groupe : A journey into the wondrous land of Machine Learning, or Cleaning data is funnier than cleaning my flat! (Part 3) - Marketscreener.com

When It Comes to AI, Can We Ditch the Datasets? Using Synthetic Data for Training Machine-Learning Models – SciTechDaily

A machine-learning model for image classification that's trained using synthetic data can rival one trained on the real thing, a study shows.

Huge amounts of data are needed to train machine-learning models to perform image classification tasks, such as identifying damage in satellite photos following a natural disaster. However, these data are not always easy to come by. Datasets may cost millions of dollars to generate, if usable data exist in the first place, and even the best datasets often contain biases that negatively impact a model's performance.

To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.

Their results show that a contrastive representation learning model trained using only these synthetic data is able to learn visual representations that rival or even outperform those learned from real data.

MIT researchers have demonstrated the use of a generative machine-learning model to create synthetic data, based on real data, that can be used to train another model for image classification. This image shows examples of the generative model's transformation methods. Credit: Courtesy of the researchers

This special machine-learning model, known as a generative model, requires far less memory to store or share than a dataset. Using synthetic data also has the potential to sidestep some concerns around privacy and usage rights that limit how some real data can be distributed. A generative model could also be edited to remove certain attributes, like race or gender, which could address some biases that exist in traditional datasets.

"We knew that this method should eventually work; we just needed to wait for these generative models to get better and better. But we were especially pleased when we showed that this method sometimes does even better than the real thing," says Ali Jahanian, a research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

Jahanian wrote the paper with CSAIL grad students Xavier Puig and Yonglong Tian, and senior author Phillip Isola, an assistant professor in the Department of Electrical Engineering and Computer Science. The research will be presented at the International Conference on Learning Representations.

Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing. The training process involves showing the generative model millions of images that contain objects in a particular class (like cars or cats), and then it learns what a car or cat looks like so it can generate similar objects.

Essentially by flipping a switch, researchers can use a pretrained generative model to output a steady stream of unique, realistic images that are based on those in the model's training dataset, Jahanian says.

But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can imagine how a car would look in different situations, situations it did not see during training, and then output images that show the car in unique poses, colors, or sizes.

Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different.

The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains.

"This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations," he says.
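To make the idea concrete, here is a generic, hypothetical sketch of contrastive (InfoNCE-style) training in which the two "views" of a sample come from a pretrained generator rather than from augmented real images. The generator, encoder and latent perturbation are placeholders, not the authors' actual method:

    # Contrastive training where a generator, not a dataset, supplies paired views.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, temperature=0.1):
        # Embeddings of matching views should be more similar than any other pair in the batch.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature          # pairwise similarities
        labels = torch.arange(z1.size(0))           # matching pairs lie on the diagonal
        return F.cross_entropy(logits, labels)

    def training_step(generator, encoder, optimizer, batch_size=256, latent_dim=128):
        z = torch.randn(batch_size, latent_dim)
        view1 = generator(z)
        # A small latent perturbation stands in for the generator's learned transformations.
        view2 = generator(z + 0.1 * torch.randn_like(z))
        loss = info_nce_loss(encoder(view1), encoder(view2))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()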

The researchers compared their method to several other image classification models that were trained using real data and found that their method performed as well as, and sometimes better than, the other models.

One advantage of using a generative model is that it can, in theory, create an infinite number of samples. So, the researchers also studied how the number of samples influenced the model's performance. They found that, in some instances, generating larger numbers of unique samples led to additional improvements.

"The cool thing about these generative models is that someone else trained them for you. You can find them in online repositories, so everyone can use them. And you don't need to intervene in the model to get good representations," Jahanian says.

But he cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren't properly audited.

He and his collaborators plan to address those limitations in future work. Another area they want to explore is using this technique to generate corner cases that could improve machine learning models. Corner cases often can't be learned from real data. For instance, if researchers are training a computer vision model for a self-driving car, real data wouldn't contain examples of a dog and its owner running down a highway, so the model would never learn what to do in this situation. Generating that corner-case data synthetically could improve the performance of machine learning models in some high-stakes situations.

The researchers also want to continue improving generative models so they can compose images that are even more sophisticated, he says.

Reference: "Generative Models as a Data Source for Multiview Representation Learning" by Ali Jahanian, Xavier Puig, Yonglong Tian and Phillip Isola. PDF

This research was supported, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

Original post:
When It Comes to AI, Can We Ditch the Datasets? Using Synthetic Data for Training Machine-Learning Models - SciTechDaily

NORCAT partners with Vector Institute on AI training program – MINING.COM – MINING.com

"Vector's mission to develop and sustain responsible AI-based innovation to help foster the economic growth and improve the lives of Canadians is aligned with NORCAT's goal as a regional innovation centre to accelerate the growth of innovative companies that will drive future economic and social prosperity for Canada," said NORCAT CEO Don Duval in a press release.

"We are proud to collaborate with the Vector Institute to create AI-based innovation, growth and productivity in Canada by focusing on the transformative potential of machine and deep learning," he said. "Together, we will work to advance AI research and drive its application, adoption and commercialization in the global mining industry."

This partnership will allow NORCAT to offer its portfolio of mining technology clients access to Vector's FastLane program. Launched in 2021, the program is tailored to the needs of Canada's growth-oriented small- and medium-sized enterprises (SMEs), delivering leading-edge AI knowledge transfer that allows this unique community to capitalize on the transformative power of artificial intelligence.

In addition to its talent recruitment and workforce development initiatives, Vector works with its industry community through the FastLane program to deliver training and knowledge transfer that improves products and processes, including an expanded suite of programs, training courses and collaborative projects that will enable participants to raise their AI fluency, develop a deeper understanding of AI's business value, experiment with applying AI models to their real-world challenges and acquire the skills to compete and innovate using AI.

"AI applies to every sector of our economy and represents a once-in-a-generation opportunity to improve the lives of Canadians," said Garth Gibson, president and CEO, Vector Institute. "Through the FastLane program, Vector's partnership with NORCAT will help the Canadian mining industry do just that by driving innovation, upskilling workers and recruiting world-class talent."

More information is available here.

Read the rest here:
NORCAT partners with Vector Institute on AI training program - MINING.COM - MINING.com

Johns Hopkins and Amazon collaborate to explore transformative power of AI – The Hub at Johns Hopkins

By Lisa Ercolano

Johns Hopkins University and Amazon are teaming up to harness the power of artificial intelligence to transform the way humans interact online and with the world. The new JHU + Amazon Initiative for Interactive AI, housed in the Johns Hopkins Whiting School of Engineering, will leverage the university's world-class expertise in interactive AI to advance groundbreaking technologies in machine learning, computer vision, natural language understanding, and speech processing; democratize access to the benefits of AI innovations; and broaden participation in research from diverse, interdisciplinary scholars and other innovators.

Amazon's investment will span five years, comprising doctoral fellowships, sponsored research funding, gift funding, and community projects. Sanjeev Khudanpur, an associate professor of electrical and computer engineering at the Whiting School, will serve as the initiative's founding director. Khudanpur is an expert in the application of information-theoretic methods to human language technologies such as automatic speech recognition, machine translation, and natural language processing.

"Hopkins is already renowned for its pioneering work in these areas of AI, and working with Amazon researchers will accelerate the timetable for the next big strides," Khudanpur said. "I often compare humans and AI to Luke Skywalker and R2D2 in Star Wars: They're able to accomplish amazing feats in a tiny X-wing fighter because they interact effectively to align their complementary strengths. I am very excited at the prospect of the Hopkins AI community coming together under the auspices of this initiative, and charting the future of transformational, interactive AI together with Amazon researchers,"

Ed Schlesinger, dean of the Whiting School, said, "We are very excited to work with Amazon in this new initiative. We value the challenges that they bring us and the life-changing potential of the solutions we will create together, and look forward to strengthening our work together over the coming years."

Amazon's funding will support a broad range of activities, including annual fellowships for doctoral students; research projects led by Hopkins Engineering faculty in collaboration with postdoctoral researchers, undergraduate and graduate students, and research staff; and events and activities, such as lectures, workshops, and competitions aimed at making AI activities more accessible to the general public in the Baltimore-Washington region.

Prem Natarajan, Alexa AI vice president of natural understanding, says the partnership underscores Amazon's commitment to addressing the greatest challenges in AI, democratizing access to the benefits of AI innovations, and broadening participation in research from diverse, interdisciplinary scholars and other innovators.

"This initiative brings together the top talent at Amazon and Johns Hopkins in a joint mission to drive groundbreaking advances in interactive and multimodal AI," Natarajan said. "These advances will power the next generation of interactive AI experiences across a wide variety of domainsfrom home productivity to entertainment to health."

The two organizations have teamed up in the past, with four Johns Hopkins faculty members joining Amazon as part of its Scholars program: Ozge Sahin, a professor of operations management and business analytics at the Johns Hopkins Carey Business School, in 2019; and in 2020, Gregory Hager, Mandell Bellmore Professor of Computer Science; René Vidal, Herschel Seder Professor of Biomedical Engineering and director of the Mathematical Institute for Data Science; and Marin Kobilarov, associate professor of mechanical engineering.

The new initiative will build on Hopkins Engineering's existing strengths in the areas of machine learning, computer vision, natural language understanding, and speech processing. Its Mathematical Institute for Data Science conducts cutting-edge research on the mathematical, statistical, and computational foundations of machine learning and computer vision. The Center for Imaging Science and the Laboratory for Computational Sensing and Robotics conduct fundamental and applied research in nearly every area of basic and applied computer vision. The university's Center for Language and Speech Processing, one of the largest and most influential academic research centers of its kind in the world, conducts research in acoustic processing, automatic speech recognition, cognitive modeling, computational linguistics, information extraction, machine translation, and text analysis. CLSP researchers conducted some of the foundational research that led to the development of digital voice assistants.

"AI has tremendous potential to enhance human abilities, and to reach it, AI of the future will interact with humans the same way we naturally interact with each other. What endeared Amazon Alexa to users was the effortlessness of the interaction. I envision that the research done under this initiative will make it possible for us to use much more powerful AI in equally effortless ways, regardless of our own physical limitations," Khudanpur said.

Hager, a director for Amazon Physical Retail, and Vidal, currently an Amazon Scholar in visual search and AR, were instrumental in helping Amazon and JHU establish the collaboration.

"Computer vision and machine learning are transforming the way in which humans shop, share content, and interact with each other," Vidal said. "This partnership will lead to new collaborations between JHU and Amazon scientists that will help translate cutting-edge advances in deep learning and visual recognition into algorithms that help humans interact with the world."

Seth Zonies, a director of business development for Johns Hopkins Technology Ventures, the university's commercialization and industry collaboration arm, said, "This collaboration represents the opportunity to harness academic ingenuity to address needs in society through industry collaboration. The engineering faculty at Johns Hopkins are committed to applied research, and Amazon is at the forefront of product development in this field. We expect this collaboration to result in deployable, high-impact innovation."

Read more:
Johns Hopkins and Amazon collaborate to explore transformative power of AI - The Hub at Johns Hopkins

Research Analyst / Associate / Fellow in Machine Learning and Artificial Intelligence job with NATIONAL UNIVERSITY OF SINGAPORE | 289568 – Times…

The Role

The Sustainable and Green Finance Institute (SGFIN) is a new university-level research institute in the National University of Singapore (NUS), jointly supported by the Monetary Authority of Singapore (MAS) and NUS. SGFIN aspires to develop deep research capabilities in sustainable and green finance, provide thought leadership in the sustainability space, and shape sustainability outcomes across the financial sector and the economy at large.

This role is ideally suited to those wishing to work in academic or industry research in quantitative analysis, particularly in the area of machine learning and artificial intelligence. The responsibilities of the role will include designing and developing various analytical frameworks to analyze structured, unstructured, and non-traditional data related to corporate financial, environmental, and social indicators.

There are no teaching obligations for this position, and the candidate will have the opportunity to develop their research portfolio.

Duties and Responsibilities

The successful candidate will be expected to assume the following responsibilities:

Qualifications

Covid-19 Message

At NUS, the health and safety of our staff and students are among our utmost priorities, and COVID-19 vaccination supports our commitment to ensuring the safety of our community and to making NUS as safe and welcoming as possible. Many of our roles require a significant amount of physical interaction with students, staff, and members of the public. Even for job roles that may be performed remotely, there will be instances where an on-campus presence is required.

In accordance with Singapore's legal requirements, unvaccinated workers will not be able to work on the NUS premises with effect from 15 January 2022. As such, job applicants will need to be fully COVID-19 vaccinated to secure successful employment with NUS.

Read the original here:
Research Analyst / Associate / Fellow in Machine Learning and Artificial Intelligence job with NATIONAL UNIVERSITY OF SINGAPORE | 289568 - Times...