This is not a technical article. It is a business article for executives to better understand how to manage machine learning projects. Note that I normally offered this content through paid conferences and private training.
As the founder of Neuraxio, and having done many machine learning and deep learning projects, I've developed a business process for doing machine learning that my 15+ clients use to get results. What follows can be used for natural language processing, time series processing, computer vision, tabular data analysis, and so forth: it is general to all machine learning and deep learning projects, although not applicable to every AI project.
Here is the business process that will be explained in this article:
But first, let's introduce a few concepts and explain how training machine learning algorithms works in general, so that we can later better understand how to manage these projects.
As an introduction to the machine learning business process: what is Artificial Intelligence (AI) vs. Machine Learning (ML) vs. Deep Learning (DL)?
Well, Deep Learning is an advanced form of Machine Learning. And Machine Learning is an advanced form of Artificial Intelligence.
Artificial Intelligence is quite a general term. It was introduced in the 1950s.
It's a big umbrella term that contains machine learning and deep learning.
You can think of artificial intelligence as algorithms that can solve chess. This is the most classical example. You also have classical pathfinding algorithms, used a lot in video games to make the enemies intelligent and able to find you and their best paths in 2D and 3D games. AI contains a very varied set of possible algorithms where the machine can solve problems with some intelligence.
Then there is machine learning, which appeared more recently, in the '80s. With machine learning, the algorithms you program start learning behaviors and rules from data to make predictions or to take actions in the real world.
Most people think that machine learning is like robotics. ML is not necessarily robotics. These are two completely different things. Robotics is often a lot more about the physical materials but can also be about the intelligence of the algorithms, whereas machine learning is a specific class of algorithms that simply learn from data. An example of machine learning algorithms is next word prediction on your cell phone’s keyboard. The algorithm learned from past usage data.
With machine learning, we often talk about statistical models being fitted on data with provided inputs to generate outputs. Neuraxio, as a business, is most interested in solving Machine Learning problems for its clients, often using Natural Language Processing (text data) and Time Series Processing (time-based events data).
Most of the time, with machine learning, you will see a supervised learning process: you give your algorithm training examples mapping data inputs to expected outputs, and it learns to predict those outputs by reducing an error metric on the data.
With machine learning come artificial neural networks. These neural networks learn from data. Machine Learning is not limited to neural networks; there are other machine learning algorithms, such as most statistical regressions, like the linear regression that you probably already have done in basic statistics classes, and logistic regressions.
Finally, there is deep learning, which is much more recent than machine learning, made possible by some cool mathematical tricks that allow deeper neural networks to learn from data.
A quick rule of thumb could be that a neural network with more than two layers of stacked neurons in depth is deep learning. But it is more complicated than that. Deep learning is inspired by the brain, but can be quite different sometimes in the way it can dynamically process information in mathematical ways that the brain couldn't even do, using dynamic graphs.
But a simple rule of thumb is that if you have an artificial neural network (ANN) with more than two layers, then it is somehow considered deep-learning.
The supervised learning process - how does it work? This is most of the time how the artificial neural networks, or other machine learning algorithms, can learn on data.
You have an artificial neural network represented in the image above (in the center).
You present to it some kind of inputs (to the left). The input you present could be an image, text, sensor readings, transaction histories, or other things like that.
And you want to predict something at the output (to the right) - that might be what is found in the image. Or the sentiment in the text. Or, in how much time a consumer would buy a certain product again in your store. What products should we recommend next? Things like that.
You also have an expected target (to the right), also known as expected outputs, during the training.
Summary: you present data as inputs, then you predict something as outputs, and you want to compare the prediction outputs to the expected target outputs.
The key to the learning process is that the learning takes place with this comparison to adjust the neurons’ weights in a way that will reduce the error at the output.
It is very similar to solving a mathematical textbook exercise as a student, where you read the question, calculate the answer, and then write it as an output on paper. You then correct yourself by comparing your answer to the real one and then backpropagating this information in your brain to update your neural networks’ weights in between the involved neurons.
Therefore, by comparing what you generated to the real answer, you can adjust the weights of the connections between your neurons, so that you do not make this error again, or to confirm your answer and give you more confidence in your predictions / to have sharper answers in the future.
Note that often, the error is called a loss function and is minimized. You can define other business evaluation metrics (also known as scoring functions) to evaluate your learning algorithm's performance in ways that suit your business' needs better. The loss function and the evaluation metrics are not the same thing.
So this is supervised learning: it is about minimizing the error over time, thus, learning, given many training examples. This implies a lot of things for a software project and the evaluation methodology.
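To make the loss-vs-metric distinction concrete, here is a minimal sketch (not from the original article's code, using assumed toy data): the model's fit() minimizes its own training loss internally, while we judge success with a separate, business-oriented evaluation metric on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy data: inputs X and expected targets y (hypothetical sales amounts).
rng = np.random.RandomState(42)
X = rng.rand(500, 3)
y = 100.0 * X[:, 0] + 20.0 * X[:, 1] + rng.randn(500) * 5.0

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)  # internally minimizes a squared-error loss

predictions = model.predict(X_val)  # outputs compared to the expected targets
business_score = mean_absolute_error(y_val, predictions)  # e.g. "average error in dollars"
print(f"Average prediction error: {business_score:.2f} $")
```

The loss drives the learning; the business metric tells you whether the learned model is actually useful.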
You will have phases for your projects to have your algorithms automatically learn on data and output answers.
Refer to the image just above for the following sections.
The first phase is to establish a goal. This can be part of the data analysis and problem analysis. Most of the questions you'll ask yourself that will be important on your project later on will be in this phase, as this is where you set a direction for your research and development (R&D), applied research, or application development.
Is the project even possible? If so, clarify the scope and the goal to have your team solve the right problem.
What will be the prototype that your team will program? What will be the required data format and data manipulations to code this prototype? Preparing the data is a step not to be ignored. Your data scientist will have to interact with your database staff. It is best when you have people already specialized in databases in your company.
A problem that I often see is that companies have a business goal that might differ from the data they have at hand. Make sure that you have access to the right data before wondering what kind of machine learning algorithm you could apply to it to solve your business goal, because the algorithm will be chosen based on the data and your goal.
So, you probably need to make some predictions or take some actions, and you want to automate this task. This can be very complex, and the problem analysis phase is not to be overlooked. One of the main reasons machine learning projects fail is having a goal and a problem definition that change throughout the project. Therefore, it is a very good thing to define the problem properly at the beginning of the project to avoid surprises.
The next step is data acquisition and especially data preparation.
Starting a machine learning project usually takes some time to dig through the data, so don't expect a machine learning firm to commit too much to the project too early, before even having looked at the data. This step is often called Exploratory Data Analysis (EDA). There is a chance that your project is not feasible right now, and they will want to report that to you. A good firm will talk to you straight regarding the feasibility of your project and its chances of success. Do the analysis with them and avoid having to restart from scratch if you change requirements later on.
This can take up to two months, but can vary a lot. Sometimes, you are lucky and your data is already well formatted. Most of the time, it is not the case in business projects. And then you do a prototype.
This can take one to three months on average, from what I've seen.
You should aim at doing just the strict minimum in this step to achieve results. Keep in mind at least some clean code and clean architecture concepts, as well as the legal and licensing part of things here to avoid hitting a wall later on and ensure that the project will be able to move to the next step.
Depending on the level of risk: if a machine learning project is very risky, in the sense that it may not even be physically possible to attain results with your data, then the effort on the prototype will be small and will not take into account many of the clean code principles, nor the ability to truly reuse the code later on. On the other hand, if your machine learning project feels safe to your data engineers, data scientists, and machine learning engineers, then they will find that it saves time and money to aim directly at building the right software architecture with the right concepts. In this case, you may find the article Structuring Machine Learning Code: Design Patterns & Clean Code quite interesting for your team to follow good machine learning coding principles.
It is often the case that 1 to 3 prototypes are built and compared, often optimized with Automated Machine Learning (AutoML) to tune, compare and select the best prototypes automatically, in Python, and dynamically based on your data that may grow over time. The fact that your data will grow over time may change which prototype performs better since, remember, the choice of the model depends on the data. So, with AutoML, you can automatically optimize and pick your models.
The next step is to do a Minimum Viable Product (MVP), which is a bit better than the strict minimum prototype, in the sense that the MVP is functional. It can be shown to people live, and even sometimes as an alpha or beta version of your app, feature, product, service or project. The MVP is discussed more in depth later in this Business Process of Machine Learning article.
This phase can take 3 to 5 months, but can extend up to an infinite amount of time. This can even be the core of the business, such as Google’s search engine.
To summarize, you first get the prototype to reach results, and then you deploy and iterate on the code to make the MVP deployable or usable, usually with some improvements as well compared to the prototype.
Machine Learning projects can take a lot of time. You must know what's ahead or have a good consultant to succeed in a machine learning project. Did you know that 87% of machine learning projects don’t make it to production? So when you begin to do prototypes, data acquisition, and so forth, you need to keep in mind the commercialization of your project at the end.
There are so many things that can go wrong in those projects if you don't do things properly. The data, the prototype, the licenses, the algorithms, the quantity of data, the complexity of the task, the time required to train the models on the data. So many things to consider. It's best to have a highly skilled machine learning guru to analyze your problem and to use their hard-earned knowledge to set your project up the right way at the beginning. The most important part of your project is still the problem definition that was discussed above in this article.
You may want to program a sentiment classifier for text (or audio) data. Okay, you've got your problem, and your inputs and outputs are clear: text to sentiment. You can do supervised learning on that, given the right data points. This means that you need text that is pre-labelled with sentiment. Let's make it simple: the input is text, and the output of your model is a choice between [“happy”, “not happy”] to classify the sentiment. Now you need a metric to evaluate your model, because you will train a lot of models (through AutoML) when working on the problem with the data and the prototypes.
You also need to be able to automatically pick the best prototype or model here, in an optimization loop. Technically speaking, your scoring metric should be a business metric, and note that this is different from the training loss. It is a metric that is really useful to select the best model. Once you've done this, the problem is well-defined, and you are ready for the rest.
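As an illustration only (with assumed, made-up labelled examples), here is a sketch of such a text-to-sentiment setup: a small pipeline plus an explicit scoring metric that a model selection loop such as AutoML could use to compare candidate prototypes.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train_texts = ["great product, love it", "terrible, broke after a day",
               "works as expected", "awful customer service"]
train_labels = ["happy", "not happy", "happy", "not happy"]  # pre-labelled data

val_texts = ["really satisfied with this", "worst purchase ever"]
val_labels = ["happy", "not happy"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),   # text -> numeric features
    ("clf", LogisticRegression()),  # features -> ["happy", "not happy"]
])
model.fit(train_texts, train_labels)

# The evaluation metric (here, F1 on the "not happy" class, as one possible
# business-aligned choice) is what the selection loop would use to pick a model.
predictions = model.predict(val_texts)
score = f1_score(val_labels, predictions, pos_label="not happy")
print(f"Validation F1: {score:.2f}")
```

In a real project, the metric and the labelled data would of course come from your own business context rather than from this toy example.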
Then you can start to acquire, modify and pre-process your data such that the problem can be solved using a supervised learning algorithm (or other algorithms). The data can be presented to the neural network or the actual learning algorithm. And you can in parallel start coding the machine learning algorithms.
At Neuraxio, with our clients, most of the time we're programming the models and helping with the data preparation as well (see the model part at the right of the business process image, and the center, where the model meets the data). Our clients provide the data, which we then format properly so that a learning algorithm can be plugged into it, having the data meet the model (in the middle) and therefore being able to solve the problem.
So we will train the model on the data, and analyze its errors. Depending on what the errors are, some more work will be required to be done either on improving the quality of the data or the quality of the model.
At the end, we can deploy to production, to process real-life data live if needed.
Let’s finally go through this chart in the next subsections.
Establishing a goal is about data analysis and problem analysis, as I've said, because the two meet in the middle, ultimately to solve the business problem. The models need to be adapted to the data, and the data to the type of model as well. Working on the data is underrated in the industry.
Machine learning is abstract. It is not obvious.
And even having, as of writing this, more than seven years of experience in the field of machine learning and deep learning, I found it hard at the beginning to link the right algorithms to the right data and to use the right data preprocessing, where things meet in the middle.
Sometimes it's not clear. It requires a great deal of creativity, and a great knowledge of the existing models and algorithms that can be used, as well as of the existing data pre-processing techniques. Most of the time, going custom is needed anyway. Just like creating a website, doing machine learning is something that will be adjusted to your data and to the problem to solve, using the right algorithms for that.
A good business overview of the problem is important. After having worked on more than 57 artificial intelligence projects at the time of writing this, I realized that someone must have at least, let's say, four to five years of experience to be able to make the right development decisions. The average deep learning project is much more complicated than the average website you will build.
For machine learning and deep learning projects, changing the goal is very hard once coding has begun. Just like in website development, it's bad to be starting over with changing requirements. At the beginning of the project, the requirements for properly doing machine learning are most of the time even harder to define. It can be less hard with Neuraxle, which is an open source machine learning framework. By using proper clean code in your machine learning projects, you will go faster.
So, as I said, you want to have a metric in your project to automatically score your machine learning algorithms. Will you have reached your goal with your algorithm? Knowing this automatically is important. Moreover, it is especially useful when doing automated machine learning (AutoML).
Basically, AutoML is automated hyperparameter optimization, where hyperparameters are like the genetic code of your machine learning algorithm. AutoML is like performing a search for the best model in the space of possible genes (the hyperparameter space), training some of them, and intelligently and automatically picking the next ones based on the previous scores, using the evaluation metric. You want to pick the model with the best score.
Establishing a business metric as a way to pick your model is crucial in the goal-setting phase: it will help you pick the right model on the data, amongst the ones that you've coded and amongst their possible hyperparameters, so that the best model picked really solves your problem from a business point of view as much as possible.
Ideally, you have data, right? Without data, your project is at risk. Data and data preparation often cover more than half of the project. Sometimes it's okay to use public data, or someone else's data, or your clients' data. In fact, it's most of the time what's done in the industry, coming up with some business deals allowing usage of the data for the purpose of improving your business.
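As a hedged sketch of this idea, here is one way to express a hyperparameter space and an automated search in plain scikit-learn (using RandomizedSearchCV as a simple stand-in for a fuller AutoML loop, on made-up data, with F1 standing in for whatever business metric you defined).

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The "genetic code" of the model: each hyperparameter gets a distribution.
hyperparameter_space = {
    "n_estimators": randint(10, 300),
    "max_depth": randint(2, 20),
    "max_features": uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=hyperparameter_space,
    n_iter=20,      # number of sampled candidate models
    scoring="f1",   # the evaluation metric aligned with the business goal
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The key point is that the scoring argument is where your business metric plugs into the automated selection loop.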
Many things must be considered. After the first analysis phase, continuing a project without data is hard. It's really hard to work on the part at the right of the business process diagram of this article without the data, because data allows you to iterate and improve the whole system from the defined evaluation metric. Data is useful even to test the system.
Data is the new gold. Data is the new oil. Data is the new electricity. Without the oil in the system, it's hard to make it work properly. Machine learning is only the vehicle. So if there is no data, it will be required to create synthetic data for debugging and development purposes. Sometimes, even when you have lots of data, synthetic data is created anyway just to test the system, such as with unit tests, acceptance tests, functional tests, end-to-end tests, and more. Those tests are different from the tests described in the present article. The tests of the present article are performance tests on the training, validation, and testing data.
Among the 15+ clients at Neuraxio, I've seen at least one company successfully do a project without data in hand at first. However, I always advise to first have data, to reduce risks and costs.
It is a good thing to split the data into train, validation and test sets or into cross-validation splits.
Basically, you train your model according to the supervised learning process, using the training data.
Then you use the validation data to pick the best model with the evaluation metric.
And after having done that and selected the best model, you can test it again with the test data that was held out the whole time, just as the validation data was held out during training.
Overall, you then look at the validation score and the test score for model selection and deployment. You look at the training score vs. the validation score for debugging purposes.
Your validation score is usually worse than your training score. With error analysis, you can know in which direction to go: improve the model or regularize it, get more data or improve the quality of your existing data. This is depicted in the next two images.
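As a minimal sketch (with placeholder data standing in for your own), the three-way split can be done like this: train to fit, validation to select the best model, and test held out until the very end to double-check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # placeholder inputs
y = np.random.randint(0, 2, size=1000)   # placeholder expected outputs

# First carve out 30%, then split that half-and-half into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
# Result: roughly 70% train, 15% validation, 15% test.
```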
In artificial neural networks and a few other similar algorithms, some more error analysis can be done on the evaluation metric to optimize the hyperparameters (the genetic code-like parameters) of your model. This is often done automatically by the usage of AutoML algorithms:
Note that it is desired that the validation data and test data have the same statistical distribution. You also want these datasets to fit the real-life data you'd have in production, so that you don't have surprises and so that you really optimize for the right thing when optimizing the model.
So as you see in the business process image, we split the data into train, validation, and test. So what you're doing when you're testing with a test set as well as a validation set is that you're double checking that you weren't just lucky picking the best validation model. Note that the validation dataset is sometimes called the development dataset, or development set.
The validation and test scores should be similar: you should attain comparable scores at the validation phase and at the test phase.
Meanwhile, the training set may be augmented with other data. This is called data augmentation. Because you want your validation and test splits to represent the real problem, your chance to make your model learn more and generalize better is to add varied examples at training time.
So you may increase your variety of training data to make your machine learning model more robust to unseen situations for instance. It might help the model a great deal sometimes.
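Here is a small, hedged sketch of one such augmentation technique (jittering numeric training examples with noise; the function and parameters are illustrative, not a prescription), applied to the training set only so that the validation and test sets keep representing the real problem.

```python
import numpy as np

def augment_with_noise(X_train, y_train, noise_level=0.01, copies=2, seed=0):
    """Create jittered copies of training examples to add variety."""
    rng = np.random.RandomState(seed)
    X_aug = [X_train]
    y_aug = [y_train]
    for _ in range(copies):
        X_aug.append(X_train + rng.randn(*X_train.shape) * noise_level)
        y_aug.append(y_train)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X_train = np.random.rand(100, 10)        # placeholder training inputs
y_train = np.random.randint(0, 2, 100)   # placeholder training labels
X_bigger, y_bigger = augment_with_noise(X_train, y_train)
print(X_bigger.shape)  # (300, 10): the original plus two noisy copies
```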
Typical data set splits: we often see 70% of the data in the train set when there is low data. Due to the central limit theorem (CLT), above a certain quantity of data in the validation and test set, you don’t need any more. So when you have very big quantities of data, it is ok to make the training set bigger, such as 99% of the data, and 1% for the validation and test set.
We train and evaluate the model on the training and validation data, and we can analyze the bias and variance. Bias and variance, in this case, are also sometimes referred to as, respectively, underfitting and overfitting, which are described more in the next section. It's not exactly the same thing, but it's quite related in the case of machine learning algorithm evaluation.
You may also want to compare the validation set performance to the human performance. The difference will really tell you whether or not you reached the best possible score that you can get on your dataset and with the specific evaluation metric you designed. By seeing the differences, you can decide if you should improve the model or the data.
Don’t overlook the importance of optimizing on the good metric and on the good validation dataset. Transferring a model to another data distribution in production is risky and it may perform badly. I’ve seen (chatbot BERT) models score 80% on a dataset, and then only 20% on another similar dataset, just because it wasn’t optimized on it, for instance.
A rule of thumb is that the more data you have, in general, the better it will be to use a deeper model rather than a simple model. If you have a small amount of data, then you want to use classical machine learning algorithms that are simpler. As the quantity of data increases, deep learning models will start to perform better than classical models.
On one hand, hard problems require more training data and therefore deep learning to have the capacity to fit on all this data and to extract meaning and generalities from it. A hard deep learning problem can be for instance speech to text or text to speech, even machine translation such as Google Translate’s model that uses attention mechanisms.
On the other hand, if you were to solve the problem of an addition, let's say two plus two equals four, then you wouldn't need all of these sophisticated algorithms. In fact, learning algorithms would most likely perform worse than if your model was a simple classical addition algorithm.
Shortly put, complex algorithms should not be used just for fun. The choice of the model depends on the complexity of the task, your data's quantity and quality, and the business goals that are reflected in your evaluation metric. Often, the value is found in the business use-case.
Keep it simple and stupid. You want to pick the simplest model possible that will attain the best results on your data. I've had clients insist on deep learning models while classical machine learning models would do the trick. In these instances, I try to reason with them and to get them a first prototype with what I believe is best. A good machine learning consultant won't be afraid to say no to complexity when simple things work.
It is in situations where your data may vary that automated machine learning (AutoML) comes in the most handy, because you can automatically test different models and hyperparameters with your varied or growing quantities of data. You parametrize things like the number of neurons in your models, the number of layers stacked one onto another in depth, the learning rate, and so on. And you can pick the best model according to your problem.
You can even configure the data pre-processing techniques in automated machine learning as such. For instance, the window size or exponential decay is often a parameter.
If your model is too complex compared to the amount of data, as on the left of the image and on the blue curve for a complex deep learning model, your error, on the vertical axis, will be low on the training data and high on the validation data. This is a situation of low performance.
What happens when your model is really intelligent (complex) for low data scenarios is that as it sees some training data, it will most of the time memorize it perfectly. And it might fail to give predictions on some yet-unseen new data. This is called overfitting.
When your model is not too complex, it will be forced to learn the underlying rules in the data to come up with the lowest possible training error, as it doesn't have the capacity to simply memorize it all. This is a perfect fit.
Note that it's possible to use a deep learning model on low data, but only if you regularize the model. To regularize a model means to add obstacles to learning, so that it won't be able to memorize the examples “as-is” and will need to generalize, drawing conclusions and rules of thumb on how the problem is really solved.
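As a hedged illustration (made-up data, and using an L2 penalty as one simple form of regularization among many), here is the same small neural network with and without regularization, so you can see how the penalty curbs pure memorization on low data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# "alpha" is the L2 regularization strength in scikit-learn's MLPClassifier.
unregularized = MLPClassifier(hidden_layer_sizes=(128, 128), alpha=0.0, max_iter=2000, random_state=0)
regularized = MLPClassifier(hidden_layer_sizes=(128, 128), alpha=1.0, max_iter=2000, random_state=0)

for name, model in [("no regularization", unregularized), ("L2 regularization", regularized)]:
    model.fit(X_train, y_train)
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "validation:", round(model.score(X_val, y_val), 3))
```

The training score alone looks great either way; the validation score is what reveals whether the model memorized or generalized.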
Another extreme case is when your model is too simple on a highly complex problem that has a lot of data. It'll fare badly on the training error as well as on the validation and test errors. This is called underfitting.
For sure, choosing the right model is more complicated than just looking at the quantity of data and at the difficulty of the problem to solve. There are lots of things to consider. This is why you need someone who has a lot of experience to come up with the right ideas to try with your team.
Ultimately, you must face and realize that you must customize the model to the data.
Bad machine learning engineers are the ones who have found an algorithm that they like and just want to reuse it, just like a hammer looking for nails. You may need other tools to solve your problem if your problem is not to drive a nail. Most coders at the beginning of their coding journey are like that and are very enthusiastic to try these shiny techniques. Only mature coders who have endured years of experience will resist the temptation to use the holy hammer.
You need to come up with a good model, according to your data and problem. You use the data and metric, as your metric should be aligned with your business’ goal. And you may need to customize the data with some pre-processing for the model. You can often change the model, but not so much the data. And so you train your models on the data and pick the best ones.
To sum up, before the MVP, you might want to reoptimize your prototypes and the data preprocessing for better results and do some reports to show results to clients and investors with the prototype. Then you do a minimum viable product and deploy. Basically, this is an improvement to either the model on the right side, or to the data pre-processing to the left side of the methodology chart. You will want to retrain the model on the data many times with the automated machine learning loop. This is not magic.
In the following image, the green and red sections must be worked through again after model changes, data changes, or data preprocessing changes.
These are deployment iterations: so you collect the data, you prepare it, you split the data sets into training, validation and testing sets, you train models with the automated machine learning loop (AutoML), you pick the best validated model, measure its test performance and deploy it if it is good enough and better than your baseline to which you can compare as well in the process. Often, this is done in the cloud, as it is quite convenient to have multiple models retraining in parallel in your AutoML loop.
In lots of context, the data won’t change and you won’t need to redeploy your model. But often, you then restart the cycle with data that is changing in quantity or quality, and you got yourself an improvement loop, a virtuous cycle.
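A simplified sketch of that redeployment cycle follows: retrain on fresh data, compare to the currently deployed baseline on the held-out test set, and only deploy if the new model is at least as good. The function names here are hypothetical placeholders for your own data access, AutoML, evaluation, and deployment code.

```python
def redeployment_cycle(load_fresh_data, train_with_automl, evaluate, deploy, baseline_score):
    # All arguments are placeholder callables you would provide yourself.
    X_train, y_train, X_test, y_test = load_fresh_data()

    candidate = train_with_automl(X_train, y_train)   # AutoML loop picks the best validated model
    test_score = evaluate(candidate, X_test, y_test)  # measured once, on held-out data

    if test_score >= baseline_score:
        deploy(candidate)             # e.g. push to a cloud endpoint
        return candidate, test_score
    return None, baseline_score       # keep the current model in production
```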
By the way, 80% of the machine learning ecosystem is in Python. The chances are that your machine learning scientist will work in Python, or will want to work in Python to go faster. Keep this in mind when planning the deployment.
You’ve finally reached the end. Congratulations, your project works and you are doing redeployment loops to improve it! Celebrate your success. To summarize, you first need to define your problem, which may require analysis of your data. Then you need to define an evaluation metric for the models that will be built. You then work at having the right data pre-processing for the machine learning models to be trained on. You then have your team work on the machine learning models to come up with the best ones. They will be automatically selected with AutoML. Once a prototype is ready and good enough, you can deploy a minimum viable product (MVP) of your machine learning model, and then iterate on it to reach a fully working product (from alpha to beta version, to more), which may require to re-run the AutoML loop if you are in a situation in which new data is constantly acquired.
Below, in the video, there are good examples of how to build proper machine learning pipelines, following clean code OOP principles such as the SOLID principles for software design.
Several design patterns are discussed with practical examples and their implications. So not only you want to build neural networks and other machine learning algorithms, but also you want to find the best hyperparameters for them automatically. We’ll here demonstrate how it’s possible in a clean code way.
This will help you structure how to handle the flow of data from one step to another. Steps are chained one after another in a machine learning pipeline. For instance, you can override behavior inherited from "BaseStep" classes to change the flow of data. You can save state in objects with the fit method, transform the data with the transform method, and more. Overall, it's possible to build really powerful steps that can edit and change the execution flow, and in multiple dimensions.
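To make the "step" idea concrete, here is an illustrative, simplified sketch of the pattern (it mirrors the idea only; Neuraxle's actual BaseStep API has more lifecycle features than shown here): each step can keep state in fit() and change the data flowing through it in transform(), and steps chain into a pipeline.

```python
import numpy as np

class StandardizeStep:
    """Learns a mean/std during fit, then rescales data during transform."""
    def fit(self, data_inputs, expected_outputs=None):
        self.mean_ = np.mean(data_inputs, axis=0)
        self.std_ = np.std(data_inputs, axis=0) + 1e-8
        return self

    def transform(self, data_inputs):
        return (data_inputs - self.mean_) / self.std_

class SimplePipeline:
    """Chains steps: data flows from one step's transform into the next."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, data_inputs, expected_outputs=None):
        for step in self.steps:
            data_inputs = step.fit(data_inputs, expected_outputs).transform(data_inputs)
        return data_inputs

# Usage sketch on placeholder data:
pipeline = SimplePipeline([StandardizeStep()])
outputs = pipeline.fit_transform(np.random.rand(100, 4))
print(outputs.shape)  # (100, 4)
```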
At the end of the video, there is also a fast and almost comprehensive tutorial where the usage of advanced features of the Neuraxle framework is shown. The design patterns applied to machine learning are brilliant. Clean machine learning pipeline design is shown through an example applied to time series processing. This can certainly be applied to deep learning as well.
You may find the pages overviewed in the video here:
As part of the Fourth Industrial Revolution, Artificial Intelligence (AI) and Machine Learning (ML) have become part of our daily lives. Across industries, companies have learned to rely on the convenience and insights that these innovations bring. As of 2020, almost 50% of all companies use AI and machine learning to improve operational quality. Companies that have fully integrated AI-driven tech are also estimated to earn 13% more thanks to improved services.
However, aside from enhancing external operations and increasing consumer satisfaction rates, machine learning can accelerate another critical component of business success: the human workforce.
Despite concerns that AI and machine learning will eventually replace human workers, studies prove how the aforementioned technology can fuel profitable and timely changes. For instance, in Kathleen Walch’s report on global AI dominance, she mentions Japan as a first adopter of machine learning. Though most of the country’s efforts have been focused on robotics, this is seen as an integral solution to alleviate the aging population's workforce shortage. This is important in sectors like the strained healthcare system, wherein healthcare workers are vastly outnumbered by older patients. With the use of machine learning applications, tedious tasks like patient record maintenance can be automated.
Two more countries leading the adoption of machine learning within the workforce are China and the United States. After all, these countries have the largest and most well-backed AI ventures worldwide. In the U.S., there's an emphasis on machine learning complementing human workers. Among the more notable examples of this is highlighted in Ben Eubank’s book on AI within HR. He explains that companies are empowering their internal operations by using smart solutions that simplify processes and clear up production backlogs. This creates a more efficient workplace, that recent surveys show is important for attracting and retaining top talent.
Meanwhile, former Google China president Kai-Fu Lee’s book on today’s AI superpowers, explains that China’s machine learning initiatives are market-driven. This means that rather than being based on abstract ideas, the country’s efforts provide tangible benefits that the growing entrepreneurial market appreciates. For instance, since China is one of the most mobile-driven nations, machine learning providers offer companies a means to streamline their customer transactions. Rather than using an employee’s valuable time fulfilling a customer’s order, for example, their AI-powered systems can fulfill this instead. This is expected to create a faster-moving revenue stream, which in turn, can support the creation of 300 million jobs.
How to Introduce Machine Learning to Your Workforce
Of course, while machine learning will undoubtedly increase profitability and scalability for companies, it may still pose concerns for employees. After all, since the dawn of AI in the 50s, there has been a fear that machines may eventually replace us all. Though understandable, employers can address and assuage these worries to enjoy a seamless and beneficial machine learning integration.
So, before you roll out your machine learning efforts, do offer some classes to familiarize your workforce with AI. In this consultant-led training, your employees will be able to understand the technology as well as the rationale behind why you’re adopting them. Plus, here, they'll also learn about the benefits to them, as noted by Guillaume Chevalier’s article on AutoML. When your employees are aware of the technology you’re about to apply, implementation will be faster and less prone to hiccups.
Moreover, emphasize that your employees aren’t being pushed out. As stated in Kevin Cashman’s review of predictions for future automation, history has proven that technology does breed opportunities for growth and transformation. Should you be expecting consequent changes in your labor demands, explain how this will only change their job description, but not their employment.
Admittedly, there will be changes in the workforce following the mainstream adoption of AI. However, so long as companies aim to use machine learning as a means to enhance rather than replace their current workforce, there will be more long-term wins than losses. For more information on how to integrate machine learning within your workforce, visit Neuraxio.
Article exclusively for neuraxio.com by Olivia Rowe.
CC-BY
I suggest that you grab a good coffee while you read what follows. If you write AI code at Neuraxio, or if you write AI code using software that Neuraxio distributed, this article is especially important for you to grasp what's going on with the testing and how it works.
Have you ever heard of the testing pyramid? Martin Fowler has a nice article on this topic here. To summarize what it is: you should have LOTS OF small "unit" tests that test small components of your software, then a FEW "integration" tests that are medium-sized (and will probably test your service application layer), and then VERY FEW "end-to-end" (E2E) tests that test the whole thing at once (probably using your AI backend's REST API) with a real and complete use-case that does everything, to see if everything works together. It makes a pyramid: unit tests at the bottom, integration tests in the middle, and end-to-end tests at the top.
Why this different quantity of tests at these granularities? So we have a pyramid of tests like this:
Note that the integration tests are sometimes also called acceptance tests. The terms may differ depending on where you work, as different terminology is used. I personally prefer acceptation tests, so as to refer to the business acceptation of a test case, as if an acceptation test case were a business requirement written into code.
Suppose that in your daily work routine, you edit some code to either fix a bug, measure something in your code, or introduce new features. You will change something thinking that it helps. The following will eventually happen, as you are not perfect and probably make errors and mistakes from time to time. How often has your code worked on the first try?
To sum up: unit testing gives you, and especially your team, some considerable speed. Rare are the programmers who like to be stuck just debugging software. Cut the debugging times by using unit tests, and not only will everyone be happy, but also everyone will code faster.
"Understanding code is by far the activity at which professional developers spend most of their time."
Example #1 of the AAA in a ML unit test:
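The original code file is not reproduced here; as an illustration only, here is a minimal sketch of what a parametrized, AAA-structured ML unit test can look like (the normalize step under test is a hypothetical example, not the original article's code).

```python
import numpy as np
import pytest

def normalize(data):
    """Hypothetical step under test: scales values to the [0, 1] range."""
    data = np.asarray(data, dtype=float)
    return (data - data.min()) / (data.max() - data.min())

@pytest.mark.parametrize("raw_data", [[0, 5, 10], [2, 4, 8], [-1, 0, 1]])
def test_normalize_outputs_are_between_zero_and_one(raw_data):
    # Arrange: the input data comes from the test's parameters.
    expected_min, expected_max = 0.0, 1.0

    # Act: run the step being tested.
    result = normalize(raw_data)

    # Assert: the behavior matches what we expect.
    assert result.min() == expected_min
    assert result.max() == expected_max
```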
See how the test is first set up (arranged) at the beginning? The test above is even further set up using an argument in the test function, meaning that this test can be run again and again with different arguments, using PyTest's parametrize. Here is a good example of a well-parametrized unit test that also makes use of the AAA.
Example #2 of the AAA in a ML unit test:
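Again, this is not the original example but an illustrative sketch: a second AAA test, this time checking that a small scikit-learn pipeline keeps its contract on data shapes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def test_pipeline_transform_keeps_rows_and_reduces_columns():
    # Arrange: fixed random data and the pipeline under test.
    data_inputs = np.random.RandomState(0).rand(50, 10)
    pipeline = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=3))])

    # Act: fit and transform the data through the pipeline.
    outputs = pipeline.fit_transform(data_inputs)

    # Assert: the number of rows is preserved and the columns are reduced.
    assert outputs.shape == (50, 3)
```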
ATDD: Write an acceptance test first, and then do many TDD loops to fulfill this acceptance test.
"One difference between a smart programmer and a professional programmer is that the professional understands that clarity is king. Professionals use their powers for good and write code that others can understand."- Source: Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship
Lucky you, we've launched a series of curated resources to help you get better and to work like a pro in Machine Learning (ML) projects.
You'll learn:
If you successfully pass the quiz that will be sent to you at the end of this training, you'll be able to purchase the certificate to showcase your skills for $17 CAD, if you wish to. This certificate can be showcased on LinkedIn as "Neuraxio AI Programmer".
Remember, you're only one ML project away from achieving success.
And it starts here and now.
Original description from The Commerce Show:
In this episode, we are talking about AI technologies for eCommerce. Guillaume Chevalier has been working in artificial intelligence for over 7 years now and has been involved in over 57 machine learning projects.
We cover several applications of AI in the eCommerce industry such as Personalized shopping experience, Sales/Inventory forecasting, Automated customer service/chatbots, Visual search and powerful synonyms search, Price optimization, Understanding customers better (persona) and Recommendation algorithms.
After the first 30 minutes talking about eCommerce, we also go a bit deeper into AI from a developer's point of view. Guillaume explains why he decided to develop Neuraxle, an AI framework for machine learning projects, over the years. Guillaume also gives tips about « How to start and plan an AI transformation for a non-tech business. »
This podcast is amazing and you'll discover Guillaume's passion for eCommerce, ML (machine learning) and NLP (natural language processing).
Listen to the full podcast:
Follow The Commerce Show.
Applying clean code and SOLID principles to your ML projects is crucial, and is so often overlooked. Successful artificial intelligence projects require good programmers to work in pairs with the mathematicians.
Ugly research code simply won’t do it. You need to do Clean Machine Learning at the moment you begin your project.
Despite all the hype being about the deep learning algorithms, we decided at Neuraxio to do a training about Clean Machine Learning, because it is what we feel the industry really needs.
Clean code is excessively hard to achieve in a codebase that is already dirty; action truly must be taken at the beginning of the project. It must not be postponed.
We’re glad to have organized this event at Le Camp in March just before the COVID-19 outbreak. It was a fantastic event.
Thanks to participants from Thales, Shutterstock, Novatize, Artifici, Spress.ai, La Cité, LP, IA groupe financier, LGS - An IBM Company, Ville de Québec, STICKÔBOT INC., and Levio.
And also big thanks to the other event organizers including William Simetin Grenon, Francis B. Lemay, Maxime Bouchard Roy and Alexandre Brillant, as well as the other speakers outside of Neuraxio: Jérôme Bédard from Umaneo, and Vincent Bergeron from ROBIC.
It was fun, thank you all!
- Guillaume Chevalier, Founder & Machine Learning Expert @ Neuraxio
You can interact with the present post on social media:
You can also check out our Machine Learning trainings.
Daily, what does a data scientist do? And how can Automated Machine Learning save you from babysitting your AI, practically?
Here is a metaphor: your data scientist is a mom. A babysitter.
The data scientist creates a nice artificial neural network and trains it on data. Then he’s going to supervise the learning. The data scientist will make sure that the learning converges in the right way so that the artificial neural network (or model) can give good predictions and then flourish.
Seriously, that’s all well and good, but it costs time, and it costs money.
Is there anything we can do to automate the process of being a mom - actually being a data scientist? Actually, we can use Automated Machine Learning.
Automated Machine Learning allows us to automate the process of being a mom.
Firstly, when we define a model, an artificial neural network for example, we have to define the hyperparameters: The number of neurons, the number of layers of neurons on top of each other.
So we’re going to define things like the learning rate, and then the way the data is formatted to send it to the Artificial Neural Network (ANN).
Those are hyperparameters and they are all very well, but above that, to do Automated Machine Learning, we can actually define a space of hyperparameters. E.g.: the number of neurons varies between this and that.
The data can be formatted to send a certain amount of data of a certain length or a certain shape. You can disable or enable certain pre-processing steps in the data.
We can have a space in which, if we pick a point, we get a specific configuration, like sampling a gene: a hyperparameter sample for our artificial neural network.
In other words, every point (sample) in the space is a different setting (or gene) to try.
With Automated Machine Learning, we can finally iterate in this space to pick new points and try them out, in a somewhat random but still intelligent way, so that after several attempts, we converge towards a result.
There’s no need for the data scientist to be around all the time. AutoML can run for weeks, even months, for larger models if you want, and so on. With all this, eventually, then we can get the best results.
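To make the idea tangible, here is a deliberately simple, hedged sketch of that loop on made-up data: sample a point (a set of hyperparameters) from the space, train the candidate, score it, and keep the best one, with no data scientist babysitting each trial. Real AutoML tools do this more intelligently than the pure random sampling shown here.

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best_score, best_params = -1.0, None
for trial in range(10):
    # Pick a point in the hyperparameter space (here, purely at random;
    # smarter AutoML algorithms use past scores to pick the next point).
    params = {
        "hidden_layer_sizes": (random.choice([16, 32, 64, 128]),) * random.choice([1, 2]),
        "learning_rate_init": 10 ** random.uniform(-4, -1),
    }
    model = MLPClassifier(max_iter=500, random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```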
Moreover, it makes it easier to reuse the code of your last project in your next one, which provides even more speed in the long run.
We can also analyze the effect of hyperparameters on the neural network's performance. That's a problem in data science: the neural network or model that's going to perform best on a set of data isn't the model with the most neurons, nor the model with the most of everything.
In fact, it's not the one with the least either. It has to be somewhere in between. You have to find what's best: there's a trade-off between bias and variance.
In the end, that’s one of the things we’re going to automate with Automated Machine Learning: finding the best model with a good bias/variance tradeoff.
That’s why we need a data scientist or Automated Machine Learning to supervise the artificial neural network, and then try and retry different hyperparameters, as there are no free lunches (NFL theorem).
There are different algorithms that allow you to choose the next point - not by chance - you can analyze what you’ve tried, and what the results were, and then pick the next point in space, and try it all out in an intelligent way.
Machine Learning software has more value if it has the ability to be automatically adapted to new data or a new dataset later on when things will change. The ability to adapt quickly to new data and changes in requirements is important. Those two factors that are often ignored are important in explaining why 87% of data science projects never make it into production.
In our projects, we use the free tool Neuraxle to optimize our Machine Learning algorithms' hyperparameters. The trick is to define a hyperparameter space for our models, and even for our data preprocessing functions, using the right software abstractions for ML.
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, which was a pre-deep-learning era. It's one of the most known and adopted machine learning libraries, and it is still growing. On top of it all, it uses the Pipe and Filter design pattern as a software architectural style - it's what makes Scikit-Learn so fabulous, added to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should be able to do in 2020 already:
Let's first clarify what's missing exactly, and then let's see how we solved each of those problems by building new design patterns based on the ones Scikit-Learn already uses.
TL;DR: How could things work to allow us to do what's in the above list with the Pipe and Filter design pattern / architectural style that is particular to Scikit-Learn? The API must be redesigned to include broader functionalities, such as allowing the definition of hyperparameter spaces, and allowing more comprehensive object lifecycle and data flow functionalities in the steps of a pipeline. We coded a solution: that is Neuraxle.
Don’t get me wrong, I used to love Scikit-Learn, and I still love to use it. It is a nice status quo: it offers useful features such as the ability to define pipelines with a panoply of premade machine learning algorithms. However, there are serious problems that they just couldn’t see in 2007, when deep learning wasn’t a thing.
Some of the problems are highlighted by the top core developer of Scikit-Learn himself at a Scipy conference. He calls for new libraries to solve those problems instead of doing that within Scikit-Learn:
Source: the top core developer of Scikit-Learn himself - Andreas C. Müller @ SciPy Conference
In Scikit-Learn, the hyperparameters and the search space of the models are awkwardly defined.
Think of built-in hyperparameter spaces and AutoML algorithms. With Scikit-Learn, although a pipeline step can have hyperparameters, they don't each have a hyperparameter distribution.
It’d be really good to have get_hyperparams_space as well as get_params in Scikit-Learn, for instance.
This lack of an ability to define distributions for hyperparameters is the root of much of the limitations of Scikit-Learn with regards to doing AutoML, and there are more technical limitations out there regarding constructor arguments of pipeline steps and nested pipelines.
Think about the following features:
Scikit-Learn does almost none of the above, and hardly allows it as their API is too strict and wasn’t built with those considerations in mind: for instance they are mostly lacking in the original Scikit-Learn Pipeline. Yet, all of those things are required for Deep Learning algorithms to be trained (and thereafter deployed).
Plus, Scikit-Learn lacks some things to do proper serialization, and it also lacks compatibility with Deep Learning frameworks (i.e.: TensorFlow, Keras, PyTorch, Poutyne). It also fails to provide lifecycle methods to manage resources and GPU memory allocation. Think of lifecycle methods as methods that each object has: __init__, fit, transform. For instance, picture adding also setup, teardown, mutate, introspect, save, load, and more, to manage the events of the life of each algorithm's object in a pipeline.
You’d also want some pipeline steps to be able to manipulate labels, for instance in the case of an autoregressive autoencoder where some “X” data is extracted to “y” data during the fitting phase only, or in the case of applying a one-hot encoder to the labels to feed them as integers.
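As a small, hedged sketch of that label-manipulation case (with made-up labels, and using scikit-learn's LabelBinarizer as one possible tool), here is how string labels can be turned into one-hot vectors and back before feeding them to a model.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["happy", "neutral", "not happy", "happy"]  # hypothetical string labels

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(labels)       # shape (4, 3): one column per class
decoded = binarizer.inverse_transform(one_hot)  # back to the original string labels

print(one_hot)
print(decoded)
```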
Parallelism and serialization are convoluted in Scikit-Learn: it's hard, not to say broken. When some steps of your pipeline import libraries coded in C++, those objects aren't always serializable, and they don't work with the usual way of saving in Scikit-Learn, which is to use the joblib serialization library.
Also, when you build pipelines that are meant to run in production, there are more things you’ll want to add on top of the previous ones. Think about:
Shortly put: it's hard to code metaestimators using Scikit-Learn's base classes. Metaestimators are algorithms that wrap other algorithms in a pipeline to change the behavior of the wrapped algorithm (e.g.: the decorator design pattern). Examples of metaestimators:
- A RandomSearch holds another step to optimize. A RandomSearch is itself also a step.
- A Pipeline holds several other steps. A Pipeline is itself also a step (as it can be used inside other pipelines: nested pipelines).
- A ForEachDataInputs holds another step. A ForEachDataInputs is itself also a step (as it is a replacement of one to just change the dimensionality of the data, such as adapting a 2D step to 3D data by wrapping it).
- An ExpandDim holds another step. An ExpandDim is itself also a step (inversely to the ForEachDataInputs, it augments the dimensionality instead of lowering it).

Metaestimators are crucial for advanced features. For instance, a ParallelTransform step could wrap a step to dispatch computations across different threads. A ClusteringWrapper could dispatch computations of the step it wraps to different worker computers within a pipeline. Upon receiving a batch of data, a ClusteringWrapper would work by first sending the step to the workers (if it wasn't already sent) and then a subset of the data to each worker. A pipeline is itself a metaestimator, as it contains many different steps. There are many metaestimators out there. We also name those "meta steps" as a synonym.
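To illustrate the "meta step" idea in isolation (this mirrors the concept only, not any library's exact API), here is a sketch of a wrapper that holds another step and changes how data reaches it, applying a 2D step to each item of a 3D batch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

class ForEachItem:
    """Wraps a step that expects 2D data and applies it to every 2D slice of a 3D array."""
    def __init__(self, wrapped_step):
        self.wrapped_step = wrapped_step

    def fit(self, data_inputs_3d):
        # Fit the wrapped step once, on the data flattened to 2D.
        stacked = np.concatenate(list(data_inputs_3d), axis=0)
        self.wrapped_step.fit(stacked)
        return self

    def transform(self, data_inputs_3d):
        # Apply the wrapped 2D step to each item, then restack to 3D.
        return np.stack([self.wrapped_step.transform(item) for item in data_inputs_3d])

# Usage sketch with a scikit-learn scaler as the wrapped 2D step:
batch = np.random.rand(8, 100, 4)   # 8 time series, 100 time steps, 4 features
meta_step = ForEachItem(StandardScaler())
result = meta_step.fit(batch).transform(batch)
print(result.shape)  # (8, 100, 4)
```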
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!
Unfortunately, most Machine Learning pipelines and frameworks, such as Scikit-Learn, fail at combining Deep Learning algorithms within neat pipeline abstractions allowing for clean code, automatic machine learning, parallelism & cluster computing, and deployment in production. Scikit-Learn has those nice pipeline abstractions already, but it lacks the features to do AutoML, deep learning pipelines, and more complex pipelines such as for deploying to production.
Fortunately, we found some design patterns and solutions that allow all the techniques we named to work together within a pipeline, making it easy for coders, bringing concepts from the most recent frontend frameworks (e.g.: component lifecycle) into machine learning pipelines with the right abstractions, and allowing for more possibilities such as better memory management, serialization, and mutating dynamic pipelines. We also break past Scikit-Learn and Python's parallelism limitations with a neat trick, allowing straightforward parallelization and serialization of pipelines for deployment in production.
We're glad we've found a clean way to solve the most widespread problems out there related to machine learning pipelines, and we hope that our solutions to those problems will be beneficial to many machine learning projects, as well as to projects that can actually be deployed to production.
If you liked this reading, subscribe to Neuraxio’s updates to be kept in the loop! Also thanks to the Dot-Layer (.Layer) organization’s blog committee and administrators for their generous peer-review of the present article.
Would you like to see the future? This post aims at predicting what will happen to the field of Deep Learning. Scroll on.
Who doesn’t like to see the real cause of trends?
Some people have said that Moore’s Law was coming to an end. A version of this law is that every 18 months, computers have twice the computing power as before, at a constant price. However, as seen on the chart, it seems like improvements in computing came to a halt between 2000 and 2010.
This halt is in fact due to reaching the minimum size of transistors, an essential part of CPUs. Making them smaller than this limit will introduce computing errors because of quantum behavior. Quantum computing will be a good thing; however, it won’t replace the function of classical computers as we know them today.
Moore’s Law isn’t broken yet in another aspect: the number of transistors we can stack in parallel. This means that we can still have a speedup of computing when doing parallel processing. In simpler words: having more cores. GPUs are growing in this direction: it’s already fairly common to see GPUs with 2000 cores in the computing world.
Luckily for Deep Learning, it mostly comprises matrix multiplications. This means that deep learning algorithms can be massively parallelized, and will benefit from future improvements from what remains of Moore’s Law.
See also: Awesome Deep Learning Resources
Ray Kurzweil predicts that the singularity will happen in 2029. That is, as he defines it, the moment when a $1000 computer may contain as much computing power as 1000x the human brain. He is confident that this will happen, and he insists that what needs to be worked on to reach true singularity is better algorithms.
So we’d be mostly limited by not having found the best mathematical formulas yet. Until then, for learning to properly take place, deep learning algorithms need to be fed a lot of data.
We, at Neuraxio, predict that Deep Learning algorithms built for time series processing will be a very good thing to build upon to get closer to where the future of deep learning is headed.
Yes, this keyword is so 2014. It still remains relevant.
It is reported by IBM New Vantage that 90% of the financial data was accumulated in the past 2 years. That’s a lot. At this rate of growth, we’ll be able to feed deep learning algorithms abundantly, more and more.
That is what The Guardian reports, according to big data statistics from IDC. In contrast, only 0.5% of all data was analyzed in 2012, according to the same source. Information is more and more structured, and organizations are now more conscious of tools to analyze their data. This means that deep learning algorithms will soon have access to the data more easily, whether the data is stored locally or in the cloud.
It is about what defines us, humans, compared to all previous species: our intelligence.
The key to intelligence and cognition is a very interesting subject to explore and is not yet well understood. Technologies related to this field are promising, and simply, interesting. Many are driven by passion.
On top of that, deep learning algorithms may use quantum computing and may be applied to machine-brain interfaces in the future. Trend stacking at its finest: a recipe for success is to align as many stars as possible while working on practical matters.
We predict that deep learning in 10 years may be more about Spiking Neural Networks (SNNs).
Those types of artificial neural networks may overcome the limitations of Deep Learning, as they are closer to natural neurons, although they require more (parallelizable) computing power. If you’re interested in learning more on that topic, see my other article on the limits and future of research in deep learning and my other article on Spiking Neural Networks (SNNs).
Although I did some research on SNNs, they are a far shot and aren’t useful yet. For now, regular Artificial Neural Networks, such as LSTMs, are good for solving a plethora of tasks. Until we reach the point where SNNs will be useful, it’s very practical to have and use the right tools to do deep learning when it comes to deploying deep learning production pipelines, such as using a good machine learning framework in Python to correctly integrate deep learning algorithms within computing environments.
First, Moore’s Law and computing trends indicate that more and more things will be parallelized. Deep Learning will exploit that.
Second, the AI singularity is predicted to happen in 2029 according to Ray Kurzweil. Advancing Deep Learning research is a way to get there to reap the rewards and do good.
Third, data doesn’t sleep. More and more data is accumulated every day. Deep Learning will exploit big data.
Finally, deep learning is about intelligence. It is about technology, it is about the brain, it is about learning, it is about what defines us, humans, compared to all previous species: our intelligence. Curious people will know their way around deep learning.
If you liked this article, consider following us for more!
Machine Learning competition & research code sucks. What to do about it?
As a frequent reader of source code coming from Kaggle competitions, I’ve come to realize that it wasn’t full of rainbows, unicorns, and leprechauns. It’s rather like Frankenstein’s monster: a work made of parts of other works, glued together and badly integrated. Machine learning competition code in general, as well as machine learning research code, suffers from deep architectural issues. What to do about it? Using neat design patterns can change a lot of things for the better.
EDIT - NOTE TO THE READER: this article is written with a context in mind where said competition code is to be reused and put in production. The arguments in this article are oriented towards this end goal. We are conscious that it’s natural and time-efficient for Kagglers and researchers to write dirty code, as their code is for a one-off thing. Reusing such code to build a production-ready pipeline is another thing, and the road to get there is bumpy.
TL;DR: don’t directly reuse competition code. Instead, create a new, clean project on the side, and refactor the old code into it.
It’s so common to see code coming from Kaggle competitions that lacks the proper object-oriented abstractions. Moreover, such code also lacks the abstractions that would allow later deploying the pipeline to production - and with reason: Kagglers have no incentive to prepare for deploying code to production, as they only need to win the competition.
The situation is roughly the same in academia, where researchers too often just try to get results, beating a benchmark to publish a paper and ditching the code after. Worse: oftentimes, those researchers use overused datasets, which requires them to use all sorts of very specific post-processing tricks that won’t generalize to any other dataset, nor to a production version of the algorithm for real-world usage.
Unfortunately, companies often rely on beating public benchmarks, only to later discover that just having a working algorithm first may be of better value, without maniacally overtuning the algorithm on one very specific dataset. To make things worse, many machine learning coders and data scientists didn’t learn to code properly in the first place, so those prototypes are often full of technical debt.
Here are a few examples of bad patterns we’ve seen:
Those bad patterns don’t only apply to code written in programming competition environments (such as this code of mine written in a rush - yes, I can do it too when unavoidably pressured). Here are some examples of code with checkpoints using the disk:
Companies can sometimes draw inspiration from code on Kaggle, but I’d advise them to code their own pipelines to be production-proof, as taking such competition code as-is is risky. There is a saying that competition code is the worst code for companies to use, and even that the people winning competitions are the worst ones to hire - because they write poor code.
I wouldn’t go as far as that saying (as I myself earn podiums in coding competitions most of the time) - it’s rather that competition code is written without thinking of the future, as the goal is to win. Ironically, it’s at that moment that reading Clean Code and The Clean Coder gets important. Using good pipeline abstractions helps machine learning projects survive.
So you still want to use competition and research code for your production pipeline. You’ll probably need to start anew and refactor the old code into the proper new abstractions. Here are the things you want when building a machine learning pipeline whose goal is to be sent to production:
And if your goal is instead to continue to do competitions, please at least note that I (personally) started winning more competitions after reading the book Clean Code. So the solutions above should apply in competitions as well: you’ll have more mental clarity and, as a result, more speed too, even if designing and thinking about your code beforehand seems to take a lot of precious time. You’ll start saving time quickly, even in the short run.
Now that we’ve built a machine learning framework that eases the process of writing clean pipelines, we have a hard time picturing how we’d get back to our previous habits anytime soon. Clean is our new habit, and it now doesn’t cost much more time to start new projects with the right abstractions from the start, as we’ve already thought through them.
Another upside, if you’re a researcher, is that if your code is developer-friendly and has a good API, it has more chances of being reused, thus you’ll be more likely to be cited. It’s always sad to discover that some research results can’t be reproduced even when using the same code that generated those results.
In all cases, using good patterns and good practices will almost always save time even in the short or medium term. For instance, using the pipe and filter design pattern with Neuraxle is the simplest and cleanest thing to do.
It’s hard to write good code when pressured by the deadline of a competition. We created Neuraxle to allow the right abstractions to be used easily when in a rush. As a result, it’s a good thing for competition code to be refactored into Neuraxle code, and it’s a good idea to write all your future code using a framework like Neuraxle.
The future is now. If you’d like to support Neuraxle, we’ll be glad that you get in touch with us. You can also register to our updates and follow us. Cheers!
Coding Machine Learning Pipelines - the right way.
Have you ever coded an ML pipeline which was taking a lot of time to run? Or worse: have you ever got to the point where you needed to save intermediate parts of the pipeline on disk to be able to focus on one step at a time by using checkpoints? Or even worse: have you ever tried to refactor such poorly-written machine learning code to put it in production, and it took you months? Well, we’ve all been there if we’ve worked on machine learning pipelines for long enough. So how should we build a good pipeline that will give us flexibility and the ability to easily refactor the code to put it in production later?
First, we’ll define machine learning pipelines and explore the idea of using checkpoints between the pipeline’s steps. Then, we’ll see how we can implement such checkpoints in a way that won’t make you shoot yourself in the foot when it comes to putting your pipeline in production. We’ll also discuss data streaming, and then the Object-Oriented Programming (OOP) encapsulation tradeoffs that can happen in pipelines when specifying hyperparameters.
A pipeline is a series of steps in which data is transformed. It comes from the old “pipe and filter” design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:
Pipelines (or steps in the pipeline) must have those two methods: “fit” to learn from the data, and “transform” to actually process the data and make predictions.
Note: if a step of a pipeline doesn’t need to have one of those two methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
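As a hedged illustration of that interface, a minimal transform-only step could look roughly like the following; the import path and the mixin behavior are assumptions based on the description above, not a definitive reference for the library’s API:

from neuraxle.base import BaseStep, NonFittableMixin  # assumed import path

class MultiplyByTwo(NonFittableMixin, BaseStep):
    # A step that only transforms data: the mixin is assumed to provide a do-nothing fit.
    def transform(self, data_inputs):
        return [2 * di for di in data_inputs]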
It is possible for pipelines or their steps to also optionally define those methods:
The following methods are provided by default to allow for managing hyperparameters:
For example, a hyperparameter space for the number of layers of a neural network could be defined as RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it (a small sketch of this is shown after the next paragraph).
For mini-batched algorithms, like when training Deep Neural Networks (DNNs), or for online learning algorithms such as Reinforcement Learning (RL) algorithms, it is ideal if the pipelines or the pipeline steps can update themselves by chaining several calls to “fit” one after another, re-fitting on the mini-batches on the fly. Some pipelines and some pipeline steps can support that; however, some other steps will reset themselves upon having “fit” called anew. It depends on how you coded your pipeline step. It is ideal if your pipeline step only resets upon calling the “teardown” method, then “setup” again before the next fit, and doesn’t reset between each fit nor during transform.
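Coming back to hyperparameters, here is a rough sketch of what defining a space and sampling from it could look like; the import paths, the HyperparameterSpace container, and the MyNeuralNetworkStep step are assumptions for illustration:

from neuraxle.hyperparams.distributions import RandInt  # assumed import path
from neuraxle.hyperparams.space import HyperparameterSpace  # assumed import path

step = MyNeuralNetworkStep()  # hypothetical step
step.set_hyperparams_space(HyperparameterSpace({"n_layers": RandInt(1, 3)}))

# Sample one set of hyperparameters for a trial, then train with it.
sampled = step.get_hyperparams_space().rvs()  # e.g. {"n_layers": 2}
step.set_hyperparams(sampled)
step = step.fit(data_inputs, expected_outputs)  # hypothetical data variables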
It is a good idea to use checkpoints in your pipelines - until you need to re-use that code for something else and change the data. You might be shooting yourself in the foot if you don’t use the proper abstractions in your code.
“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton
Programming frameworks and design patterns are known to be limiting by the simple fact that they enforce some design rules. That is hopefully with the goal of managing things for you in an easy way, keeping you from making mistakes, and avoiding dirty code. Here is my shot at it for pipelines and managing state:
This should be managed by a pipelining library which can deal with all of this for you.
Why should pipeline steps not manage checkpointing their data output? Well, it’s for all these valid reasons that you’ll prefer to use a library or framework instead of doing it yourself:
This is cool. With the proper abstractions, you can now code your Machine Learning pipeline with a huge speed-up when tuning hyperparameters by caching every trial’s intermediate result, skipping steps of the pipeline trial after trial when the hyperparameters of the intermediate pipeline steps are the same. Not only that, but once you’re ready to move the code to production, you can now disable caching completely without having to try to refactor code for a month. Avoid hitting that wall.
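To picture what such framework-managed caching could do, here is a simplified, hypothetical sketch: transformed outputs are checkpointed on disk under a key derived from the step’s hyperparameters and its input data, so an identical trial can skip recomputation. The CachedStep name and its interface are made up for illustration; in practice, this should be handled by the pipelining library rather than by each step.

import hashlib
import os
import pickle

class CachedStep:
    # Hypothetical wrapper that caches the wrapped step's transform results on disk.
    def __init__(self, wrapped_step, hyperparams, cache_dir="cache"):
        self.wrapped_step = wrapped_step
        self.hyperparams = hyperparams
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def transform(self, data_inputs):
        # The cache key covers both the hyperparameters and the incoming data.
        key = hashlib.sha256(pickle.dumps((self.hyperparams, data_inputs))).hexdigest()
        path = os.path.join(self.cache_dir, key + ".pkl")
        if os.path.exists(path):  # cache hit: skip recomputation for this trial
            with open(path, "rb") as f:
                return pickle.load(f)
        outputs = self.wrapped_step.transform(data_inputs)
        with open(path, "wb") as f:  # cache miss: compute, then checkpoint to disk
            pickle.dump(outputs, f)
        return outputs

Disabling caching for production then amounts to not wrapping the steps anymore, instead of refactoring checkpoint logic scattered inside every step.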
In parallel processing theory, pipelines are known as a way to stream data such that a pipeline’s steps can all run in parallel. The laundry example is good at picturing the problem and the solution. For example, a streaming pipeline’s second step could start processing partial data out of the first pipeline step while the first step still computes more data, without the first pipeline step having to completely finish processing all the data. Let’s call those special pipelines streaming pipelines (see streaming 101, streaming 102).
Don’t get us wrong, scikit-learn pipelines are nice to use. However, they don’t allow for streaming. Not only scikit-learn, but most machine learning pipelining libraries out there don’t make use of streaming whereas they could. The whole Python ecosystem has threading problems. In most pipeline libraries, each step is completely blocking and must transform all the data at once. There are just a few which enable streaming.
Enabling streaming could be as simple as using a StreamingPipeline class instead of a Pipeline class to chain steps one after the other, providing a mini-batch size and a queue size between steps (to avoid taking too much RAM, which makes things stable in production environments). The whole would also ideally require threaded queues with semaphores as described in the producer-consumer problem to pass info from one pipeline step to another.
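As a plain-Python illustration of that producer-consumer mechanism (independent of any pipeline library), two steps can be connected by a bounded queue so that the second step starts consuming mini-batches while the first one is still producing them:

import threading
from queue import Queue

queue_between_steps = Queue(maxsize=4)  # bounded queue to keep RAM usage stable

def step_1_producer(mini_batches):
    for batch in mini_batches:
        transformed = [x * 2 for x in batch]   # stand-in for step 1's transform
        queue_between_steps.put(transformed)   # blocks if the queue is full
    queue_between_steps.put(None)              # sentinel: no more data

def step_2_consumer(results):
    while True:
        batch = queue_between_steps.get()
        if batch is None:
            break
        results.append([x + 1 for x in batch])  # stand-in for step 2's transform

results = []
producer = threading.Thread(target=step_1_producer, args=([[1, 2], [3, 4], [5, 6]],))
consumer = threading.Thread(target=step_2_consumer, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # [[3, 5], [7, 9], [11, 13]]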
One thing that Neuraxle already does better than scikit-learn is to have mini-batch sequential pipelines, which can be used through the MiniBatchSequentialPipeline class. This is not threaded yet (but it is well in our plans). At least, we already pass the data to the pipeline in mini-batches during fit or during transform before collecting results, which allows for big pipelines like the ones in scikit-learn, but here with mini-batching. And with all our extra features like hyperparameter spaces, setup methods, automatic machine learning, and so forth.
Threading such a pipeline requires ensuring that the “setup” methods are called throughout the pipeline. Otherwise, the pipeline needs to be serialized, cloned, and reloaded with pipeline step savers, which is something we already coded and which would be ready for use. Code that uses TensorFlow, and other imported code that was built in other languages such as C++, is hard to thread in Python, especially when it uses GPU memory. Even joblib can’t easily fix some of those issues. Avoiding that with proper serialization is good.
Not only that, but the way to make every object threadable in Python is to make them serializable and reloadable. That said, in Neuraxle we plan to code this very soon. It will allow for dynamically sending code to be executed remotely on any worker (be it another computer or a process), even if that worker doesn’t have the code itself. This is done with a chain of serializers that are specific to each pipeline step class. By default, each of those steps has a serializer that can handle regular Python code, and for more wicked code using GPUs and imported code in other languages, models are just serialized with those savers and then reloaded on the worker. If the worker is local, objects can be serialized to a RAM disk or a folder mounted in RAM.
There is one thing that still annoys us in most machine learning pipeline libraries: how hyperparameters are treated. Take scikit-learn for example. Hyperparameter spaces (a.k.a. statistical distributions of hyperparameters’ values) must often be specified outside of the pipeline, with double underscores joining the names of nested steps and pipelines, and so on. While the Random Search and the Grid Search can search hyperparameter grids or hyperparameter probability spaces such as those defined with scipy distributions, scikit-learn does not provide a default hyperparameter space for each classifier and transformer. This could be the responsibility of each object of a pipeline. This way, an object is self-contained and also contains its hyperparameters, which doesn’t break the Single Responsibility Principle (SRP) and the Open-Closed Principle (OCP) of the SOLID principles of Object-Oriented Programming (OOP). Using Neuraxle is a good solution to avoid breaking those OOP principles.
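For comparison, this is the scikit-learn convention being described, where the hyperparameter grid lives outside of the pipeline and the nesting is encoded with double underscores (the step names below are only an example):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("pca", PCA()),
    ("classifier", LogisticRegression()),
])

# Hyperparameters are declared outside the steps; "__" encodes the nesting.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "classifier__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)

Letting each step own its hyperparameters and a sensible default space for them keeps the step self-contained, which is the alternative described above.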
A good thing to keep in mind when coding machine learning pipelines is to have them be compatible with lots of things. As of now, Neuraxle is compatible with scikit-learn, TensorFlow, Keras, PyTorch, and many other machine learning and deep learning libraries.
For instance, Neuraxle has a method .tosklearn() which allows a step or a whole pipeline to be made into a scikit-learn BaseEstimator - that is, a basic scikit-learn object. For other machine learning libraries, it’s as simple as creating a new class that inherits from Neuraxle’s BaseStep, overriding at least your own fit and transform methods (and perhaps also the setup and teardown methods), and defining a saver to save and load your model. Just read BaseStep’s documentation to learn how to do that, and also read the related Neuraxle examples in the documentation.
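As a hedged sketch of that recipe (the import path, the lifecycle method names, and the stand-in external model are assumptions for illustration, not the library’s definitive API):

from neuraxle.base import BaseStep  # assumed import path

class ExternalModel:
    # Stand-in for a model coming from another library.
    def train(self, x, y=None):
        self.mean_ = sum(x) / len(x)

    def predict(self, x):
        return [xi - self.mean_ for xi in x]

class ExternalModelStep(BaseStep):
    # Adapts the external model to the pipeline's fit/transform step interface.
    def __init__(self):
        BaseStep.__init__(self)
        self.model = None

    def setup(self):
        self.model = ExternalModel()  # allocate resources here rather than in __init__
        return self

    def fit(self, data_inputs, expected_outputs=None):
        self.model.train(data_inputs, expected_outputs)
        return self

    def transform(self, data_inputs):
        return self.model.predict(data_inputs)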
To conclude, writing a production-level machine learning pipeline requires meeting many quality criteria, which hopefully can all be met by using the right design patterns and the right structure in your code. To sum up:
Thanks to Vaughn DiMarco for brainstorming on this with me and motivating me to write this article. Also thanks to our contributors, clients, and supporters who openly support the project.
The future is now. If you’d like to support this project too, we’ll be glad that you get in touch with us. You can also register to our updates and follow us.
def print_hello_world():
    print("Hello World!")

Output: Hello World!
We’ll be releasing Neuraxle 0.2.0 very soon on PyPI (so you’ll be able to pip install neuraxle). We’ll also post tutorials, articles, and updates here. Stay tuned, register below!