The Business Process of Machine Learning, with AutoML
This is not a technical article. It is a business article written to help executives better understand how to manage machine learning projects. Note that I normally offered this content at paid conferences and in private training sessions.
As the founder of Neuraxio, and having done many machine learning and deep learning projects over my career, I’ve developed a business process of doing machine learning that my 15+ clients use to get results. What follows can be used for natural language processing, time series processing, computer vision, tabular data analysis, and so forth: it is general to all machine learning and deep learning projects, although not applicable to every AI project.
Here is the business process that will be explained in this article:
But first, let’s introduce a few concepts and how training machine learning algorithms works in general to later understand better how to manage these projects.
AI vs. ML vs. DL
As an introduction to the machine learning business process: what is Artificial Intelligence (AI) vs. Machine Learning (ML) vs. Deep Learning (DL)?
Well, Deep Learning is an advanced form of Machine Learning. And Machine Learning is an advanced form of Artificial Intelligence.
Artificial intelligence is a quite general term. It was introduced in the 1950s.
It's a big umbrella term that contains machine learning and deep learning.
You can think of artificial intelligence as algorithms that can play chess. This is the most classical example. There are also classical pathfinding algorithms, used a lot in video games to make enemies intelligent, able to find you and their best paths in 2D and 3D games. AI contains a very varied set of possible algorithms with which the machine can solve problems with some intelligence.
Then there is machine learning, which appeared more recently, in the ’80s. With machine learning, the algorithms you program start learning behaviors and rules from data in order to make predictions or to take actions in the real world.
Most people think machine learning is like robotics, but ML is not necessarily robotics. These are two completely different things. Robotics is often much more about the physical hardware, although it can also involve intelligent algorithms, whereas machine learning is a specific class of algorithms that simply learn from data. An example of a machine learning algorithm is next-word prediction on your cell phone’s keyboard: the algorithm learned from past usage data.
With machine learning, you will hear a lot about statistical models being fitted on data, taking provided inputs to generate outputs. Neuraxio, as a business, is most interested in solving Machine Learning problems for its clients, often using Natural Language Processing (text data) and Time Series Processing (time-based events data).
And most of the time, with machine learning, you will see a supervised learning process in which you give your algorithm training examples, pairing data inputs with expected outputs, and it learns to predict by reducing an error metric on the data.
With machine learning come artificial neural networks. These neural networks learn from data. Machine Learning is not limited to neural networks, though; there are other machine learning algorithms, such as most statistical regressions, like the linear regression you have probably already done in basic statistics classes, and logistic regressions.
Finally, there is deep learning, which is much more recent than machine learning, made possible by some cool mathematical tricks that allow deeper neural networks to learn from data.
A quick rule of thumb is that an artificial neural network (ANN) with more than two layers of stacked neurons in depth is somehow considered deep learning. But it is more complicated than that. Deep learning is inspired by the brain, but can sometimes be quite different in the way it dynamically processes information, in mathematical ways the brain couldn’t even do, using dynamic graphs.
The Supervised Learning Process
The supervised learning process: how does it work? This is, most of the time, how artificial neural networks and other machine learning algorithms learn from data.
You have an artificial neural network represented in the image above (in the center).
You present to it some kind of inputs (to the left). The input you present could be an image, text, sensor readings, transaction histories, or other things like that.
And you want to predict something at the output (to the right): that might be what is found in the image, or the sentiment in the text, or how soon a customer will buy a certain product again in your store, or what products you should recommend next. Things like that.
You also have an expected target (to the right), also known as expected outputs, during the training.
Summary: you present data as inputs, then you predict something as outputs, and you want to compare the prediction outputs to the expected target outputs.
The key to the learning process is that the learning takes place with this comparison to adjust the neurons’ weights in a way that will reduce the error at the output.
It is very similar to solving a math textbook exercise as a student: you read the question, calculate the answer, and write it down on paper. You then correct yourself by comparing your answer to the real one, backpropagating this information in your brain to update the weights between the involved neurons.
Therefore, by comparing what you generated to the real answer, you can adjust the weights of the connections between your neurons, so that you do not make this error again, or to confirm your answer and give you more confidence in your predictions / to have sharper answers in the future.
Note that the error is often called a loss function, and it is minimized. You can also define business evaluation metrics (also known as scoring functions) to evaluate your learning algorithm’s performance in ways that better suit your business’ needs. The loss function and the evaluation metrics are not the same thing.
So this is supervised learning: it is about minimizing the error over time, thus, learning, given many training examples. This implies a lot of things for a software project and the evaluation methodology.
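The supervised learning loop described above can be sketched in a few lines of code. This is a minimal, illustrative example (not any client project’s actual model): a tiny linear model trained with gradient descent on made-up data following the rule y = 2x + 1, showing the predict / compare / adjust-weights cycle.

```python
# A minimal sketch of supervised learning: predict an output,
# compare it to the expected target, adjust the weights to reduce error.

# Made-up training examples: inputs x and expected targets y (y = 2x + 1).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0          # the model's weights, initially untrained
learning_rate = 0.05

for epoch in range(2000):
    for x, target in data:
        prediction = w * x + b          # forward pass: predict an output
        error = prediction - target     # compare to the expected target
        w -= learning_rate * error * x  # adjust weights to reduce the error
        b -= learning_rate * error

print(round(w, 2), round(b, 2))  # the weights converge toward the true rule
```

After training, the weights land close to w = 2 and b = 1: the model has learned the rule from examples alone, which is the essence of minimizing the error over time.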
Phases of a machine learning project
You will have phases for your projects to have your algorithms automatically learn on data and output answers.
Refer to the image just above for the following sections.
Establish a goal
The first phase is to establish a goal. This can be part of the data analysis and problem analysis. Most of the questions you'll ask yourself that will be important on your project later on will be in this phase, as this is where you set a direction for your research and development (R&D), applied research, or application development.
Is the project even possible? If so, clarify the scope and the goal so that your team solves the right problem.
What will be the prototype that your team will program? What will be the required data format and data manipulations to code this prototype? Preparing the data is a step not to be ignored. Your data scientist will have to interact with your database staff. It is best when you have people already specialized in databases in your company.
A problem I often see is that companies have a business goal that differs from the data they have at hand. Make sure you have access to the right data before wondering what kind of machine learning algorithm you could apply to it, because the algorithm will be chosen as a function of both the data and your goal.
So, you probably need to make some predictions or take some actions, and you want to automate this task. This can be very complex, and the problem analysis phase is not to be overlooked. One of the main reasons machine learning projects fail is having a goal and a problem definition that change throughout the project. Therefore, it is a very good thing to define the problem properly at the beginning of the project, to avoid surprises.
Data acquisition and data preparation
The next step is data acquisition and especially data preparation.
Starting a machine learning project usually takes some time to dig through the data, so don’t expect a machine learning firm to commit too deeply to the project before having even looked at the data. This step is often called Exploratory Data Analysis (EDA). There is a chance that your project is impossible to do right now, and they will want to report that to you. A good firm will be straight with you regarding the feasibility of your project and its chances of success. Do the analysis with them, and avoid having to restart from scratch if you change requirements later on.
This can take up to two months, but can vary a lot. Sometimes you are lucky and your data is already well formatted. Most of the time, that is not the case in business projects. And then you do a prototype.
This can take one to three months on average, from what I've seen.
You should aim at doing just the strict minimum in this step to achieve results. Keep in mind at least some clean code and clean architecture concepts, as well as the legal and licensing side of things, to avoid hitting a wall later on and to ensure that the project will be able to move to the next step.
The level of risk matters. If a machine learning project is very risky, in the sense that it may not even be physically possible to attain results with your data, then the effort on the prototype will be small and will not take into account many clean code principles, nor the ability to truly reuse the code later on. On the other hand, if your machine learning project feels safe to your data engineers, data scientists, and machine learning engineers, then they will find that it saves time and money to aim directly at building the right software architecture with the right concepts. In this case, you may find the article Structuring Machine Learning Code: Design Patterns & Clean Code quite interesting for your team, to follow good machine learning coding principles.
It is often the case that 1 to 3 prototypes are built and compared, usually in Python, and often optimized with Automated Machine Learning (AutoML) to tune, compare, and select the best prototypes automatically, dynamically, based on your data, which may grow over time. The fact that your data will grow over time may change which prototype performs better, since, remember, the choice of the model depends on the data. So, with AutoML, you can automatically optimize and pick your models.
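Comparing a few prototypes automatically can be sketched as follows. This is an illustrative example, not a prescribed stack: it assumes scikit-learn and uses a synthetic dataset as a stand-in for your real data, cross-validating two candidate "prototypes" and selecting the one with the best validation score, the way a simple AutoML loop would.

```python
# A hedged sketch of automated prototype comparison and selection.
# The dataset and model choices are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for your business data.
X, y = make_classification(n_samples=300, random_state=0)

prototypes = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Score each prototype with cross-validation, then pick the best.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in prototypes.items()
}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

Because the selection is driven by the data, re-running this loop later, once the data has grown or changed, may well pick a different prototype.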
The Minimum Viable Product (MVP)
The next step is to build a Minimum Viable Product (MVP), which is a bit better than the strict-minimum prototype, in the sense that the MVP is functional. It can be shown to people live, and sometimes even released as an alpha or beta version of your app, feature, product, service, or project. The MVP is discussed in more depth later in this Business Process of Machine Learning article.
Deployment and iterations
This phase can take 3 to 5 months, but can extend up to an infinite amount of time. This can even be the core of the business, such as Google’s search engine.
To summarize, you first get the prototype to reach results, and then you deploy and iterate on the code to make the MVP deployable and usable, usually with some improvements as well compared to the prototype.
What you must know to succeed
Machine Learning projects can take a lot of time. You must know what’s ahead, or have a good consultant, to succeed in a machine learning project. Did you know that 87% of machine learning projects don’t make it to production? So when you begin doing prototypes, data acquisition, and so forth, you need to keep in mind the commercialization of your project at the end.
There are so many things that can go wrong in these projects if you don’t do things properly: the data, the prototype, the licenses, the algorithms, the quantity of data, the complexity of the task, the time required to train the models on the data. So many things to consider. It’s best to have a highly skilled machine learning guru analyze your problem and use their hard-earned knowledge to set your project up the right way at the beginning. The most important part of your project is still the problem definition discussed above in this article.
Working through an example
You may want to program a sentiment classifier for text (or audio) data. Okay, you’ve got your problem, and your inputs and outputs are clear: text to sentiment. You can do supervised learning on that, given the right data points. This means you need text that is pre-labelled with sentiment. Let’s make it simple: the input is text, and the output of your model is a choice between [“happy”, “not happy”] to classify the sentiment. Now you need a metric to evaluate your model, because you will train a lot of models (through AutoML) when working on the problem with the data and the prototypes.
You also need to be able to automatically pick the best prototype or model here, in an optimization loop. Technically speaking, your scoring metric should be a business metric; note that this is different from the training loss. It is a metric that is really useful for selecting the best model. Once you’ve done this, the problem is well-defined, and you are ready for the rest.
Then you can start to acquire, modify and pre-process your data such that the problem can be solved using a supervised learning algorithm (or other algorithms). The data can be presented to the neural network or the actual learning algorithm. And you can in parallel start coding the machine learning algorithms.
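To make the sentiment example concrete, here is a minimal sketch of such a text-to-sentiment model. It assumes scikit-learn, and the six labelled sentences are entirely made up for illustration; a real project would use thousands of pre-labelled examples.

```python
# An illustrative sketch of the sentiment example: text in,
# a choice between "happy" and "not happy" out. The data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this product, it is wonderful",
    "What a great and pleasant experience",
    "Absolutely fantastic, very satisfied",
    "This is terrible, I want a refund",
    "Awful service, very disappointing",
    "I hate it, worst purchase ever",
]
labels = ["happy", "happy", "happy", "not happy", "not happy", "not happy"]

# Pre-processing (text to numbers) and model, chained as one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)  # supervised learning: inputs -> expected outputs

prediction = model.predict(["wonderful experience, I love it"])[0]
print(prediction)
```

Notice how the data preparation (turning raw text into numbers) is part of the pipeline itself: the data is formatted so that it can meet the model, exactly as in the business process image.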
At Neuraxio, with our clients, most of the time we’re programming the models and helping with the data preparation as well (see the model part at the right of the business process image, and the center, where the model meets the data). Our clients provide the data, which we then format in a good way for a learning algorithm to be plugged into, so that the data meets the model (in the middle) and the problem can be solved.
So we will train the model on the data and analyze its errors. Depending on what the errors are, more work will be required, either on improving the quality of the data or on improving the quality of the model.
At the end, we can deploy to production, to process real-life data live if needed.
Let’s finally go through this chart in the next subsections.
Establishing a goal
Establishing a goal is about data analysis and problem analysis, as I've said, because the two meet in the middle, ultimately to solve the business problem. The models need to be adapted to the data, and the data to the type of model as well. Working the data is underrated in the industry.
Machine learning is abstract. It is not obvious.
Even with, as of writing this, more than seven years of experience in the field of machine learning and deep learning, I found it hard at the beginning to link the right algorithms to the right data using the right data preprocessing, where things meet in the middle.
Sometimes it’s not clear. It requires a great deal of creativity, great knowledge of the existing models and algorithms that can be used, and knowledge of the existing data pre-processing techniques. Most of the time, going custom is needed anyway. Just like creating a website, doing machine learning is something that will be adjusted to your data and to the problem to solve, using the right algorithms for that.
A good business overview of the problem is important. After having worked on more than 57 artificial intelligence projects at the time of writing this, I have realized that someone must have at least, let’s say, four to five years of experience to be able to make the right development decisions. The average deep learning project is much more complicated than the average website you will build.
For machine learning and deep learning projects, changing the goal is very hard once coding has begun. Just as in website development, it’s bad to start over because of changing requirements. At the beginning of a machine learning project, the requirements are most of the time even harder to define properly. It can be less hard with Neuraxle, an open source machine learning framework. By using proper clean code in your machine learning projects, you will go faster.
So, as I said, you want to have a metric in your project to automatically score your machine learning algorithms. Will you have reached your goal with your algorithm? Knowing this automatically is important. Moreover, it is especially useful when doing automated machine learning (AutoML).
Basically, AutoML is automated hyperparameter optimization, where the hyperparameters are like the genetic code of your machine learning algorithm. AutoML performs a search for the best model in the space of possible genes (the hyperparameter space), training some of them and intelligently, automatically picking the next ones based on the previous scores, using the evaluation metric. You want to pick the model with the best score.
Establishing a business metric as a way to pick your model is crucial in the goal-setting phase. It will help you pick the right model for the data, among the ones you’ve coded and among their possible hyperparameters, so that the best model picked really solves your problem from a business point of view, as much as possible.
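Here is a hedged sketch of what this looks like in practice, assuming scikit-learn: an F1 scorer stands in for your real business metric, a small grid of hyperparameter values stands in for the full hyperparameter space, and a synthetic, imbalanced dataset stands in for your data.

```python
# A sketch of picking a model by a business-aligned metric instead of
# the raw training loss. Names, metric, and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data (80% one class) standing in for real data.
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)

business_metric = make_scorer(f1_score)  # stand-in for your real metric

# Search the hyperparameter space, scoring each candidate model with
# the business metric on held-out validation folds.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # the "genes" being searched
    scoring=business_metric,
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The key design choice is the `scoring` argument: the model is selected by the metric your business cares about, not by whatever loss the model happened to train on.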
Ideally, you have data, right? Without data, your project is at risk. Data and data preparation often cover more than half of the project. Sometimes it’s okay to use public data, or someone else’s data, or your clients’ data. In fact, that’s most of the time what’s done in the industry: coming up with business deals allowing usage of the data for the purpose of improving your business.
Many things must be considered. After the first analysis phase, continuing a project without data is hard. It's really hard to work on the part at the right of the business process diagram of this article without the data, because data allows you to iterate and improve the whole system from the defined evaluation metric. Data is useful even to test the system.
Data is the new gold. Data is the new oil. Data is the new electricity. Without oil in the system, it’s hard to make it work properly; machine learning is only the vehicle. So if there is no data, synthetic data will have to be created for debugging and development purposes. Sometimes, even when you have lots of data, synthetic data is created anyway just to test the system, such as with unit tests, acceptance tests, functional tests, end-to-end tests, and more. Those tests are different from the tests described in the present article, which are performance tests on the training, validation, and testing data.
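Creating synthetic data to develop and sanity-test a pipeline can be sketched as follows (an illustrative example assuming scikit-learn): generate labelled points from a known, solvable rule, then check, as a unit test would, that the pipeline can actually learn it.

```python
# Synthetic data for development: generate labelled points from a
# known rule, then sanity-check that the pipeline learns it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generated data standing in for the real data that isn't ready yet.
X, y = make_classification(n_samples=200, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# A sanity check, like a unit test: the pipeline must beat chance
# on a problem that was generated to be solvable.
assert accuracy > 0.7, "pipeline failed the synthetic-data sanity check"
print(round(accuracy, 2))
```

A check like this catches plumbing bugs (wrong data shapes, broken preprocessing) early, long before the real data arrives.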
Among the 15+ clients at Neuraxio, I’ve seen at least one company successfully do a project without data in hand at first. However, I always advise having data first, to reduce risks and costs.
It is a good thing to split the data into train, validation and test sets or into cross-validation splits.
Basically, you train your model according to the supervised learning process, using the training data.
Then you use the validation data to pick the best model with the evaluation metric.
And after having done that and selected the best model, you can test it again with the test data, which was held out the whole time, just as the validation data was held out during training.
Overall, you then look at the validation score and the test score for model selection and deployment, and you look at the training score vs. the validation score for debugging purposes.
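The three-way split described above can be sketched in a few lines (assuming scikit-learn; the 70/15/15 proportions and the synthetic dataset are illustrative):

```python
# A minimal sketch of the train / validation / test split,
# done as two successive splits. Proportions are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out 30% of the data, then split that holdout half-and-half
# into a validation set and a test set: 70% / 15% / 15%.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

The training set is for fitting, the validation set is for picking the best model, and the test set stays untouched until the very end, as a final double-check.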
Your validation score is usually worse than your training score. With error analysis, you can know in which direction to go: improve the model or regularize it; get more data or improve the quality of your existing data. This is depicted in the next two images.
In artificial neural networks and a few other similar algorithms, some more error analysis can be done on the evaluation metric to optimize the hyperparameters (the genetic code-like parameters) of your model. This is often done automatically by the usage of AutoML algorithms:
Note that the validation data and test data should have the same statistical distribution. You also want these datasets to resemble the real-life data you’d have in production, so that you don’t have surprises and you really optimize for the right thing when optimizing the model.
So, as you see in the business process image, we split the data into train, validation, and test sets. By testing with a test set in addition to a validation set, you are double-checking that you weren’t just lucky in picking the best validation model. Note that the validation dataset is sometimes called the development dataset, or dev set.
The validation and test scores should be similar: you should attain comparable scores at the validation phase and at the test phase.
Meanwhile, the training set may be augmented with other data. This is called data augmentation. Because you want your validation and test splits to represent the real problem, your chance at making your model learn more and generalize better is to add varied examples at training time.
So you may increase your variety of training data to make your machine learning model more robust to unseen situations for instance. It might help the model a great deal sometimes.
Typical data set splits: we often see 70% of the data in the train set when there is low data. Due to the central limit theorem (CLT), above a certain quantity of data in the validation and test set, you don’t need any more. So when you have very big quantities of data, it is ok to make the training set bigger, such as 99% of the data, and 1% for the validation and test set.
We train and evaluate the model on the training and validation data, and we can analyze the bias and variance. Bias and variance, in this case, are sometimes also referred to as, respectively, underfitting and overfitting, which are described more in the next section. It’s not exactly the same thing, but it’s quite related when evaluating machine learning algorithms.
You may also want to compare the validation set performance to the human performance. The difference will really tell you whether or not you reached the best possible score that you can get on your dataset and with the specific evaluation metric you designed. By seeing the differences, you can decide if you should improve the model or the data.
Don’t overlook the importance of optimizing on the right metric and on the right validation dataset. Transferring a model to another data distribution in production is risky, and it may perform badly. I’ve seen (chatbot BERT) models score 80% on a dataset and then only 20% on another, similar dataset, just because they weren’t optimized on it, for instance.
A rule of thumb is that the more data you have, in general, the better it will be to use a deeper model rather than a simple model. If you have a small amount of data, then you want to use classical machine learning algorithms that are simpler. As the quantity of data increases, deep learning models will start to perform better than classical models.
On one hand, hard problems require more training data and therefore deep learning to have the capacity to fit on all this data and to extract meaning and generalities from it. A hard deep learning problem can be for instance speech to text or text to speech, even machine translation such as Google Translate’s model that uses attention mechanisms.
On the other hand, if you were to solve the problem of an addition, let's say two plus two equals four, then you wouldn't need all of these sophisticated algorithms. In fact, learning algorithms would most likely perform worse than if your model was a simple classical addition algorithm.
Shortly put, complex algorithms should not be used just for fun. The choice of the model depends on the complexity of the task, your data's quantity and quality, and the business goals that are reflected in your evaluation metric. Often, the value is found in the business use-case.
Keep it simple and stupid. You want to pick the simplest model possible that will attain the best results on your data. I’ve had clients insist on deep learning models when classical machine learning models would do the trick. In these instances, I try to reason with them and to get them a first prototype with what I believe is best. A good machine learning consultant won’t be afraid to say no to complexity when simple things work.
It is in situations where your data may vary that automated machine learning (AutoML) comes in most handy, because you can automatically test different models and hyperparameters against your varied or growing quantities of data. You parametrize, for instance, the number of neurons in your models, the number of layers stacked in depth, the learning rate, and so on. And you can pick the best model according to your problem.
You can even configure the data pre-processing techniques in automated machine learning as such. For instance, the window size or exponential decay is often a parameter.
If your model is too complex compared to the amount of data, as on the left of the image, on the blue curve for a complex deep learning model, then your error (the vertical axis) will be low on the training data and high on the validation data. This is a situation of low performance.
What happens when your model is very intelligent (complex) in a low-data scenario is that, as it sees some training data, it will most of the time memorize it perfectly. And it might fail to give good predictions on yet-unseen new data. This is called overfitting.
When your model is not too complex, to come up with the lowest possible training error, it will be forced to learn the underlying rules in the data as it doesn’t have the capacity to simply memorize it all. This is a perfect fit.
Note that it’s possible to use a deep learning model on low data, but only if you regularize the model. Regularizing a model means adding obstacles to learning, so that it won’t be able to memorize the examples “as-is” and will instead need to generalize and draw conclusions and rules of thumb about how the problem is really solved.
Another extreme case is when your model is too simple for a highly complex problem that has a lot of data. It will fare badly on the training error as well as on the validation and test errors. This is called underfitting.
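Under- and overfitting can be demonstrated on toy data. The sketch below (assuming scikit-learn and NumPy; the sine-wave data is made up) fits polynomials of increasing complexity to a small noisy dataset: the simplest model underfits (high error everywhere), the most complex one overfits (near-zero training error, large validation error), and a middle complexity fits well.

```python
# A sketch of under- vs overfitting: compare training error to
# validation error as model complexity grows. Data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 30)).reshape(-1, 1)
y = np.sin(3 * X).ravel() + rng.normal(0, 0.1, 30)  # noisy sine wave
X_train, y_train = X[::2], y[::2]   # every other point for training
X_val, y_val = X[1::2], y[1::2]     # the rest held out for validation

results = {}
for degree in (1, 4, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = np.mean((model.predict(X_train) - y_train) ** 2)
    val_err = np.mean((model.predict(X_val) - y_val) ** 2)
    results[degree] = (train_err, val_err)
    print(degree, round(train_err, 4), round(val_err, 4))
```

Training error always shrinks as complexity grows, which is precisely why the model must be chosen on the validation error instead.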
For sure, choosing the right model is more complicated than just looking at the quantity of data and at the difficulty of the problem to solve. There are lots of things to consider. This is why you need someone who has a lot of experience to come up with the right ideas to try with your team.
Coding a prototype
Ultimately, you must face and realize that you must customize the model to the data.
Bad machine learning engineers are the ones who have found an algorithm they like and just want to reuse it everywhere, like a hammer looking for nails. You may need other tools if your problem is not to drive a nail. Most coders at the beginning of their coding journey are like that, very enthusiastic to try the shiny techniques. Only mature coders with years of experience will resist the temptation to use the holy hammer.
You need to come up with a good model, according to your data and problem. You use the data and metric, as your metric should be aligned with your business’ goal. And you may need to customize the data with some pre-processing for the model. You can often change the model, but not so much the data. And so you train your models on the data and pick the best ones.
To sum up, before the MVP, you might want to re-optimize your prototypes and the data preprocessing for better results, and produce reports to show results to clients and investors with the prototype. Then you build the minimum viable product and deploy. Basically, this is an improvement either to the model on the right side of the methodology chart, or to the data pre-processing on the left side. You will want to retrain the model on the data many times within the automated machine learning loop. This is not magic.
Redeploying and iterating
In the following image, the green and red sections must be worked through again after model changes, data changes, or data preprocessing changes.
These are deployment iterations: you collect the data, you prepare it, you split it into training, validation, and testing sets, you train models with the automated machine learning loop (AutoML), you pick the best validated model, you measure its test performance, and you deploy it if it is good enough and better than your baseline, to which you can also compare in the process. Often, this is done in the cloud, as it is quite convenient to have multiple models retraining in parallel in your AutoML loop.
In many contexts, the data won’t change and you won’t need to redeploy your model. But often, you restart the cycle with data that is changing in quantity or quality, and you’ve got yourself an improvement loop, a virtuous cycle.
By the way, 80% of the machine learning ecosystem is in Python. Chances are that your machine learning scientist will work in Python, or will want to work in Python to go faster. Python is also what I recommend for the deployment.
You’ve finally reached the end. Congratulations, your project works and you are doing redeployment loops to improve it! Celebrate your success. To summarize: you first need to define your problem, which may require analyzing your data. Then you need to define an evaluation metric for the models that will be built. You then work on having the right data pre-processing for the machine learning models to be trained on. You then have your team work on the machine learning models to come up with the best ones; they will be automatically selected with AutoML. Once a prototype is ready and good enough, you can deploy a minimum viable product (MVP) of your machine learning model, and then iterate on it to reach a fully working product (from alpha to beta version, and beyond), which may require re-running the AutoML loop if you are in a situation where new data is constantly acquired.
And also don't forget to connect with us guys at Neuraxio. Cheers!