How to do data science in Azure

Azure services to bring your machine learning project to the cloud

It is becoming increasingly difficult to do data science on your own computer. Sure, an ordinary computer has no problem handling data exploration, analysis and visualization. But when it comes to model training, unless you own a GPU or you are using a classical machine learning model rather than a neural network, your computer will likely struggle, both in terms of RAM and computation time.

Some people purchase more RAM, a stronger processor or a GPU. Others use pre-trained models, for example from Hugging Face, which is a very user-friendly way to work with large models. Some people even buy a GPU, train large models and publish them on Hugging Face for fun. But most people don’t have these resources, so they turn to cloud solutions.

In this post I will explain how to do different data science projects with the help of Azure specifically. I might write another post about AWS in the future as well, although I have less experience with AWS SageMaker.

Using generic cloud resources

When you use a cloud service, you can simply rent a virtual machine and pay only for the time you use it. This is true for all the major cloud providers: AWS, Azure and GCP (Google). You can specify the amount of compute power and RAM you want, and the cost varies accordingly. A virtual machine is an IaaS (infrastructure as a service) offering. The cloud provider takes care of the underlying infrastructure: the hardware, networking, cooling, ventilation and security patches for the physical hosts. You just take the server and use it for your project, without worrying about these things.

This service offers a lot of freedom and flexibility: you can install whichever OS you want and whichever software versions you need. But it also means you have to know how to install libraries, frameworks and so on. Imagine you installed PyTorch and other Python libraries, and when you try to train your model you get an incompatibility error between PyTorch and another library. No one from Azure will help you; you will have to find the answer on your own.
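
When you hit an incompatibility error like this, a quick first step is to list which versions are actually installed on the machine. Here is a minimal sketch that uses only Python’s standard library. The package names are placeholders I chose for illustration; replace them with the libraries your project uses.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Placeholder package list (an assumption, not a recommendation).
for name, version in installed_versions(["torch", "torchvision", "numpy"]).items():
    print(f"{name}: {version or 'not installed'}")
```

With the versions in hand, you can compare them against the compatibility notes of each library before reinstalling anything.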

This is the tradeoff with IaaS: it’s not very user friendly, and you need to set up most things yourself. You need some knowledge of the tools you are building with. But if you can manage that, you have the most freedom.

Using specific cloud resources

In recent years, cloud providers have developed specific cloud resources for different applications. Most providers offer services for web apps, IoT, data science, databases, big data, etc. These are of type PaaS: platform as a service. They are more user friendly than IaaS: the cloud provider installs the software for you, and you mostly select your application type, which version of the programming language you want and so on. There is less typing in a terminal and more clicking and selecting options on the website.

Because of this, even someone with little knowledge of machine learning can create an image classification project. There are a lot of tutorials, help buttons and overall guidance to make the process as easy and comfortable as possible. As a drawback, these services are less flexible. You might find that you want to inspect some specific part of your data and the service doesn’t allow it. Make sure that you can export your data and your model from every service you use.

Specific cloud resources are ideal to make a test run of a big project, or to quickly implement small personal projects. In this post I will tell you about the Azure PaaS services for machine learning.

1. Azure Custom Vision

Custom Vision is a very easy-to-use service for computer vision, that is, machine learning with images. You upload at least 20–30 images, label them (or upload images that are already labeled), and train a model. It includes a very useful REST API, so you can send requests to the model and get predictions from any other app.
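
To give an idea of what calling that REST API looks like, here is a rough sketch of a classification request for a published model, written with Python’s standard library only. The URL path and the `Prediction-Key` header follow the v3.0 prediction API as I understand it; the function names are my own, and the endpoint, project ID, iteration name and key are values you would copy from your own Custom Vision project. Treat the exact URL shown in your project’s prediction settings as the source of truth.

```python
import json
import urllib.request


def build_prediction_request(endpoint, project_id, published_name, prediction_key, image_bytes):
    """Build (but do not send) a classification request for a published iteration."""
    url = (
        f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/{project_id}"
        f"/classify/iterations/{published_name}/image"
    )
    headers = {
        "Prediction-Key": prediction_key,
        "Content-Type": "application/octet-stream",  # raw image bytes in the body
    }
    return urllib.request.Request(url, data=image_bytes, headers=headers, method="POST")


def top_prediction(response_body):
    """Return (tag name, probability) of the most likely tag in a prediction response."""
    predictions = json.loads(response_body)["predictions"]
    best = max(predictions, key=lambda p: p["probability"])
    return best["tagName"], best["probability"]


def classify_image(image_path, **request_settings):
    """Send an image file to the service and return the top prediction (needs network access)."""
    with open(image_path, "rb") as f:
        request = build_prediction_request(image_bytes=f.read(), **request_settings)
    with urllib.request.urlopen(request) as response:
        return top_prediction(response.read())
```

Building the request and parsing the response are kept in separate functions so that both can be checked without a network connection or a real key.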

Unfortunately, the image labeling website is not very good; the one in Azure Machine Learning (next section) is much more customizable. But for a quick solution, Custom Vision is very strong.

There are many tutorials, for example about object detection. The service builds on pre-trained models, which means that the model can adapt to your training data very fast, even after you upload only a few images.

2. Azure Machine Learning

Azure Machine Learning (AML) is Azure’s more customizable, more complete solution for machine learning. It is still a PaaS service, but it is more complex than Custom Vision.

There are a lot of articles about how to use AML. You can build ML projects with an SDK or directly through the portal. For example, this article explains how to train a classification model with no-code AutoML in the Azure Machine Learning studio.

You can build an NLP project with text, or a computer vision project with images. AML includes labeling for you to annotate your data, and also capabilities to train a machine learning model on it.

One of the best features of AML, in my opinion, is that the labeling can be ML-enhanced. While you label images, there is a machine learning model in the background learning from your labels. Once you have labeled enough images, the service will recommend labels to you, making labeling much faster and easier. You no longer have to label from scratch; you just have to correct the few mistakes that the automated service makes, and maybe add missing tags. This is especially helpful in an object detection project, where you have many labels per image. Accelerating labeling is a big help.
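
The idea behind assisted labeling can be illustrated with a few lines of Python. This is only a conceptual sketch, not how AML implements it: I assume the background model returns a confidence score for each suggested label, and that suggestions above some threshold are shown as pre-labels to review while the rest are labeled by hand. The function name, the threshold value and the sample data are all made up.

```python
def split_by_confidence(suggestions, threshold=0.8):
    """Split model suggestions into pre-labels to review and images to label by hand.

    suggestions: list of (image_name, suggested_label, confidence) tuples.
    """
    prelabeled, manual = [], []
    for image_name, label, confidence in suggestions:
        if confidence >= threshold:
            prelabeled.append((image_name, label))
        else:
            manual.append(image_name)
    return prelabeled, manual


# Made-up suggestions, only to illustrate the idea.
suggestions = [
    ("img_001.jpg", "cat", 0.97),
    ("img_002.jpg", "dog", 0.55),
    ("img_003.jpg", "dog", 0.91),
]
prelabeled, manual = split_by_confidence(suggestions)
print(prelabeled)  # [('img_001.jpg', 'cat'), ('img_003.jpg', 'dog')]
print(manual)      # ['img_002.jpg']
```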

3. Azure Text Analytics

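This service is an API for NLP, specifically sentiment analysis, key phrase extraction, named entity recognition, and language detection. For example, you can build a Flask app that detects the language and the sentiment of a text; translating the text itself would use the separate Azure Translator service.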
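
A sentiment analysis request to the REST API could look roughly like the sketch below, again with the standard library only. I am assuming the v3.1 route and the `Ocp-Apim-Subscription-Key` header here; the function names are my own, and the endpoint and key come from your own resource in the Azure portal. Check the API version against the current documentation before relying on it.

```python
import json
import urllib.request


def build_sentiment_request(endpoint, subscription_key, texts, language="en"):
    """Build (but do not send) a sentiment analysis request for a list of texts."""
    url = f"{endpoint.rstrip('/')}/text/analytics/v3.1/sentiment"
    # Each document needs its own id so results can be matched back to inputs.
    documents = [
        {"id": str(i), "language": language, "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    body = json.dumps({"documents": documents}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def sentiments_by_id(response_body):
    """Map each document id in the response to its overall sentiment label."""
    return {doc["id"]: doc["sentiment"] for doc in json.loads(response_body)["documents"]}
```

To actually send the request, pass it to `urllib.request.urlopen` and feed the response body to `sentiments_by_id`.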

Conclusion

Azure has three main services for machine learning. Custom Vision and Text Analytics are very user friendly, fast and easy to use; the former is for images and the latter is for text. However, they might be too simple if you want your project to have more capabilities. In that case, I recommend switching to AML, where you have more control over how to label the data and how to train the model. I specifically like AML’s labeling service: even if you don’t use the model training side of AML, you can still label your images there and export them afterwards.

If you want even more power and control over your application, then I recommend using a virtual machine. This requires you to know how to install the frameworks yourself, but you will have the most autonomy.

Thank you for reading! I’m happy to chat on Twitter :)

Support Ane Berasategi: data science and cloud by becoming a sponsor. Any amount is appreciated!