This is not a technical article. It is a business article for executives to better understand how to manage machine learning projects. Note that I normally offered this content through paid conferences and private training.
As the founder of Neuraxio, and having done many machine learning and deep learning projects, I've developed a business process for doing machine learning that my 15+ clients use to get results. What follows can be used for natural language processing, time series processing, computer vision, tabular data analysis, and so forth: it is general to all machine learning and deep learning projects, although not applicable to every AI project.
Here is the business process that will be explained in this article:
But first, let's introduce a few concepts and explain how training machine learning algorithms works in general, so that we can later better understand how to manage these projects.
As an introduction to the machine learning business process: what is Artificial Intelligence (AI) vs. Machine Learning (ML) vs. Deep Learning (DL)?
Well, Deep Learning is an advanced form of Machine Learning. And Machine Learning is an advanced form of Artificial Intelligence.
Artificial Intelligence is quite a general term. It was introduced in the 1950s.
It's a big umbrella term that contains machine learning and deep learning.
You can think of artificial intelligence as algorithms that can solve chess. This is the most classical example. You also have classical pathfinding algorithms, used a lot in video games to make the enemies intelligent and able to find you and their best paths in 2D and 3D games. AI contains a very varied set of possible algorithms where the machine can solve problems with some intelligence.
Then there is machine learning, which appeared more recently, in the '80s. With machine learning, the algorithms you program start learning behaviors and rules from data to make predictions or to take actions in the real world.
Most people think that machine learning is like robotics. ML is not necessarily robotics. These are two completely different things. Robotics is often a lot more about the physical materials but can also be about the intelligence of the algorithms, whereas machine learning is a specific class of algorithms that simply learn from data. An example of machine learning algorithms is next word prediction on your cell phone’s keyboard. The algorithm learned from past usage data.
With machine learning, we often talk about statistical models being fitted on data with provided inputs to generate outputs. Neuraxio, as a business, is most interested in solving Machine Learning problems for its clients, often using Natural Language Processing (text data) and Time Series Processing (time-based events data).
Most of the time, with machine learning, you will see a supervised learning process: you give your algorithm training examples mapping data inputs to expected outputs, and it learns to predict those outputs by reducing an error metric on the data.
With machine learning come artificial neural networks. These neural networks learn from data. Machine Learning is not limited to neural networks; there are other machine learning algorithms, such as most statistical regressions, like the linear regression that you probably already have done in basic statistics classes, and logistic regressions.
Finally, there is deep learning, which is much more recent than machine learning, made possible by some cool mathematical tricks that allow deeper neural networks to learn from data.
A quick rule of thumb could be that a neural network with more than two layers of stacked neurons in depth is deep learning. But it is more complicated than that. Deep learning is inspired by the brain, but can be quite different sometimes in the way it can dynamically process information in mathematical ways that the brain couldn't even do, using dynamic graphs.
But a simple rule of thumb is that if you have an artificial neural network (ANN) with more than two layers, then it is somehow considered deep-learning.
The supervised learning process - how does it work? This is most of the time how the artificial neural networks, or other machine learning algorithms, can learn on data.
You have an artificial neural network represented in the image above (in the center).
You present to it some kind of inputs (to the left). The input you present could be an image, text, sensor readings, transaction histories, or other things like that.
And you want to predict something at the output (to the right) - that might be what is found in the image. Or the sentiment in the text. Or, in how much time a consumer would buy a certain product again in your store. What products should we recommend next? Things like that.
You also have an expected target (to the right), also known as expected outputs, during the training.
Summary: you present data as inputs, then you predict something as outputs, and you want to compare the prediction outputs to the expected target outputs.
The key to the learning process is that the learning takes place with this comparison to adjust the neurons’ weights in a way that will reduce the error at the output.
It is very similar to solving a mathematical textbook exercise as a student, where you read the question, calculate the answer, and then write it as an output on paper. You then correct yourself by comparing your answer to the real one and then backpropagating this information in your brain to update your neural networks’ weights in between the involved neurons.
Therefore, by comparing what you generated to the real answer, you can adjust the weights of the connections between your neurons, so that you do not make this error again, or to confirm your answer and give you more confidence in your predictions / to have sharper answers in the future.
Note that often, the error is called a loss function and is minimized. You can define other business evaluation metrics (also known as scoring functions) to evaluate your learning algorithm's performance in ways that suit your business' needs better. The loss function and the evaluation metrics are not the same thing.
So this is supervised learning: it is about minimizing the error over time, thus, learning, given many training examples. This implies a lot of things for a software project and the evaluation methodology.
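To make the loss-vs-metric distinction concrete, here is a minimal sketch (not from the original article's code, using assumed toy data): the model's fit() minimizes its own training loss internally, while we judge success with a separate, business-oriented evaluation metric on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy data: inputs X and expected targets y (hypothetical sales amounts).
rng = np.random.RandomState(42)
X = rng.rand(500, 3)
y = 100.0 * X[:, 0] + 20.0 * X[:, 1] + rng.randn(500) * 5.0

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)  # internally minimizes a squared-error loss

predictions = model.predict(X_val)  # outputs compared to the expected targets
business_score = mean_absolute_error(y_val, predictions)  # e.g. "average error in dollars"
print(f"Average prediction error: {business_score:.2f} $")
```

The loss drives the learning; the business metric tells you whether the learned model is actually useful.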
You will have phases for your projects to have your algorithms automatically learn on data and output answers.
Refer to the image just above for the following sections.
The first phase is to establish a goal. This can be part of the data analysis and problem analysis. Most of the questions you'll ask yourself that will be important on your project later on will be in this phase, as this is where you set a direction for your research and development (R&D), applied research, or application development.
Is the project even possible? If so, clarify the scope and the goal to have your team solve the right problem.
What will be the prototype that your team will program? What will be the required data format and data manipulations to code this prototype? Preparing the data is a step not to be ignored. Your data scientist will have to interact with your database staff. It is best when you have people already specialized in databases in your company.
A problem that I often see is that companies have a business goal that might differ from the data they have at hand. Make sure that you have access to the right data before wondering what kind of machine learning algorithm you could apply to it to solve your business goal, because the algorithm will be chosen based on the data and your goal.
So, you probably need to make some predictions or take some actions, and you want to automate this task. This can be very complex, and the problem analysis phase is not to be overlooked. One of the main reasons machine learning projects fail is having a goal and a problem definition that change throughout the project. Therefore, it is a very good thing to define the problem properly at the beginning of the project to avoid surprises.
The next step is data acquisition and especially data preparation.
Starting a machine learning project usually takes some time to dig through the data, so don't expect a machine learning firm to commit too much to the project too early, before even having looked at the data. This step is often called Exploratory Data Analysis (EDA). There is a chance that your project is not feasible right now, and they will want to report that to you. A good firm will talk to you straight regarding the feasibility of your project and its chances of success. Do the analysis with them and avoid having to restart from scratch if you change requirements later on.
This can take up to two months, but can vary a lot. Sometimes, you are lucky and your data is already well formatted. Most of the time, it is not the case in business projects. And then you do a prototype.
This can take one to three months on average, from what I've seen.
You should aim at doing just the strict minimum in this step to achieve results. Keep in mind at least some clean code and clean architecture concepts, as well as the legal and licensing part of things here to avoid hitting a wall later on and ensure that the project will be able to move to the next step.
Depending on the level of risk: if a machine learning project is very risky, in the sense that it may not even be physically possible to attain results with your data, then the effort on the prototype will be small and will not take into account many of the clean code principles, nor the ability to truly reuse the code later on. On the other hand, if your machine learning project feels safe to your data engineers, data scientists, and machine learning engineers, then they will find that it saves time and money to aim directly at building the right software architecture with the right concepts. In this case, you may find the article Structuring Machine Learning Code: Design Patterns & Clean Code quite interesting for your team to follow good machine learning coding principles.
It is often the case that 1 to 3 prototypes are built and compared, often optimized with Automated Machine Learning (AutoML) to tune, compare and select the best prototypes automatically, in Python, and dynamically based on your data that may grow over time. The fact that your data will grow over time may change which prototype performs better since, remember, the choice of the model depends on the data. So, with AutoML, you can automatically optimize and pick your models.
The next step is to do a Minimum Viable Product (MVP), which is a bit better than the strict minimum prototype, in the sense that the MVP is functional. It can be shown to people live, and even sometimes as an alpha or beta version of your app, feature, product, service or project. The MVP is discussed more in depth later in this Business Process of Machine Learning article.
This phase can take 3 to 5 months, but can extend up to an infinite amount of time. This can even be the core of the business, such as Google’s search engine.
To summarize, you first get the prototype to reach results, and then you deploy and iterate on the code to make the MVP deployable or usable, usually with some improvements as well compared to the prototype.
Machine Learning projects can take a lot of time. You must know what's ahead or have a good consultant to succeed in a machine learning project. Did you know that 87% of machine learning projects don’t make it to production? So when you begin to do prototypes, data acquisition, and so forth, you need to keep in mind the commercialization of your project at the end.
There are so many things that can go wrong in those projects if you don't do things properly. The data, the prototype, the licenses, the algorithms, the quantity of data, the complexity of the task, the time required to train the models on the data. So many things to consider. It's best to have a highly skilled machine learning guru to analyze your problem and to use their hard-earned knowledge to set your project up the right way at the beginning. The most important part of your project is still the problem definition that was discussed above in this article.
You may want to program a sentiment classifier for text (or audio) data. Okay, you've got your problem, and your inputs and outputs are clear: text to sentiment. You can do supervised learning on that, given the right data points. This means that you need text that is pre-labelled with sentiment. Let's make it simple: the input is text, and the output of your model is a choice between [“happy”, “not happy”] to classify the sentiment. Now you need a metric to evaluate your model, because you will train a lot of models (through AutoML) when working on the problem with the data and the prototypes.
You also need to be able to automatically pick the best prototype or model here, in an optimization loop. Technically speaking, your scoring metric should be a business metric, and note that this is different from the training loss. It is a metric that is really useful to select the best model. Once you've done this, the problem is well-defined, and you are ready for the rest.
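As an illustration only (with assumed, made-up labelled examples), here is a sketch of such a text-to-sentiment setup: a small pipeline plus an explicit scoring metric that a model selection loop such as AutoML could use to compare candidate prototypes.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train_texts = ["great product, love it", "terrible, broke after a day",
               "works as expected", "awful customer service"]
train_labels = ["happy", "not happy", "happy", "not happy"]  # pre-labelled data

val_texts = ["really satisfied with this", "worst purchase ever"]
val_labels = ["happy", "not happy"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),   # text -> numeric features
    ("clf", LogisticRegression()),  # features -> ["happy", "not happy"]
])
model.fit(train_texts, train_labels)

# The evaluation metric (here, F1 on the "not happy" class, as one possible
# business-aligned choice) is what the selection loop would use to pick a model.
predictions = model.predict(val_texts)
score = f1_score(val_labels, predictions, pos_label="not happy")
print(f"Validation F1: {score:.2f}")
```

In a real project, the metric and the labelled data would of course come from your own business context rather than from this toy example.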
Then you can start to acquire, modify and pre-process your data such that the problem can be solved using a supervised learning algorithm (or other algorithms). The data can be presented to the neural network or the actual learning algorithm. And you can in parallel start coding the machine learning algorithms.
At Neuraxio, with our clients, most of the time we're programming the models and helping with the data preparation as well (see the model part at the right of the business process image, and the center, where the model meets the data). Our clients provide the data, which we then format properly so that a learning algorithm can be plugged into it, having the data meet the model (in the middle) and therefore being able to solve the problem.
So we will train the model on the data, and analyze its errors. Depending on what the errors are, some more work will be required to be done either on improving the quality of the data or the quality of the model.
At the end, we can deploy to production, to process real-life data live if needed.
Let’s finally go through this chart in the next subsections.
Establishing a goal is about data analysis and problem analysis, as I've said, because the two meet in the middle, ultimately to solve the business problem. The models need to be adapted to the data, and the data to the type of model as well. Working on the data is underrated in the industry.
Machine learning is abstract. It is not obvious.
And even having, as of writing this, more than seven years of experience in the field of machine learning and deep learning, I found it hard at the beginning to link the right algorithms to the right data and to use the right data preprocessing, where things meet in the middle.
Sometimes it's not clear. It requires a great deal of creativity, and a great knowledge of the existing models and algorithms that can be used, as well as of the existing data pre-processing techniques. Most of the time, going custom is needed anyway. Just like creating a website, doing machine learning is something that will be adjusted to your data and to the problem to solve, using the right algorithms for that.
A good business overview of the problem is important. After having worked on more than 57 artificial intelligence projects at the time of writing this, I realized that someone must have at least, let's say, four to five years of experience to be able to make the right development decisions. The average deep learning project is much more complicated than the average website you will build.
For machine learning and deep learning projects, changing the goal is very hard once coding has begun. Just like in website development, it's bad to be starting over with changing requirements. At the beginning of the project, the requirements for properly doing machine learning are most of the time even harder to define. It can be less hard with Neuraxle, which is an open source machine learning framework. By using proper clean code in your machine learning projects, you will go faster.
So, as I said, you want to have a metric in your project to automatically score your machine learning algorithms. Will you have reached your goal with your algorithm? Knowing this automatically is important. Moreover, it is especially useful when doing automated machine learning (AutoML).
Basically, AutoML is automated hyperparameter optimization, where hyperparameters are like the genetic code of your machine learning algorithm. AutoML is like performing a search for the best model in the space of possible genes (the hyperparameter space), training some of them, and intelligently and automatically picking the next ones based on the previous scores, using the evaluation metric. You want to pick the model with the best score.
Establishing a business metric as a way to pick your model is crucial in the goal-setting phase: it will help you pick the right model on the data, amongst the ones that you've coded and amongst their possible hyperparameters, so that the best model picked really solves your problem from a business point of view as much as possible.
Ideally, you have data, right? Without data, your project is at risk. Data and data preparation often cover more than half of the project. Sometimes it's okay to use public data, or someone else's data, or your clients' data. In fact, it's most of the time what's done in the industry, coming up with some business deals allowing usage of the data for the purpose of improving your business.
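As a hedged sketch of this idea, here is one way to express a hyperparameter space and an automated search in plain scikit-learn (using RandomizedSearchCV as a simple stand-in for a fuller AutoML loop, on made-up data, with F1 standing in for whatever business metric you defined).

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The "genetic code" of the model: each hyperparameter gets a distribution.
hyperparameter_space = {
    "n_estimators": randint(10, 300),
    "max_depth": randint(2, 20),
    "max_features": uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=hyperparameter_space,
    n_iter=20,      # number of sampled candidate models
    scoring="f1",   # the evaluation metric aligned with the business goal
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The key point is that the scoring argument is where your business metric plugs into the automated selection loop.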
Many things must be considered. After the first analysis phase, continuing a project without data is hard. It's really hard to work on the part at the right of the business process diagram of this article without the data, because data allows you to iterate and improve the whole system from the defined evaluation metric. Data is useful even to test the system.
Data is the new gold. Data is the new oil. Data is the new electricity. Without the oil in the system, it's hard to make it work properly. Machine learning is only the vehicle. So if there is no data, it will be required to create synthetic data for debugging and development purposes. Sometimes, even when you have lots of data, synthetic data is created anyway just to test the system, such as with unit tests, acceptance tests, functional tests, end-to-end tests, and more. Those tests are different from the tests described in the present article. The tests of the present article are performance tests on the training, validation, and testing data.
Among the 15+ clients at Neuraxio, I've seen at least one company successfully do a project without data in hand at first. However, I always advise to first have data, to reduce risks and costs.
It is a good thing to split the data into train, validation and test sets or into cross-validation splits.
Basically, you train your model according to the supervised learning process, using the training data.
Then you use the validation data to pick the best model with the evaluation metric.
And after having done that and selected the best model, you can test it again with the test data that was held out the whole time, just as the validation data was held out during training.
Overall, you then look at the validation score and the test score for model selection and deployment. You look at the training score vs. the validation score for debugging purposes.
Your validation score is usually worse than your training score. With error analysis, you can know in which direction to go: improve the model or regularize it, get more data or improve the quality of your existing data. This is depicted in the next two images.
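As a minimal sketch (with placeholder data standing in for your own), the three-way split can be done like this: train to fit, validation to select the best model, and test held out until the very end to double-check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # placeholder inputs
y = np.random.randint(0, 2, size=1000)   # placeholder expected outputs

# First carve out 30%, then split that half-and-half into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
# Result: roughly 70% train, 15% validation, 15% test.
```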
In artificial neural networks and a few other similar algorithms, some more error analysis can be done on the evaluation metric to optimize the hyperparameters (the genetic code-like parameters) of your model. This is often done automatically by the usage of AutoML algorithms:
Note that it is desired that the validation data and test data have the same statistical distribution. You also want these datasets to fit the real-life data you'd have in production, so that you don't have surprises and so that you really optimize for the right thing when optimizing the model.
So as you see in the business process image, we split the data into train, validation, and test. So what you're doing when you're testing with a test set as well as a validation set is that you're double checking that you weren't just lucky picking the best validation model. Note that the validation dataset is sometimes called the development dataset, or development set.
The validation and test scores should be similar: you should attain comparable scores at the validation phase and at the test phase.
Meanwhile, the training set may be augmented with other data. This is called data augmentation. Because you want your validation and test splits to represent the real problem, your chance to make your model learn more and generalize better is to add varied examples at training time.
So you may increase your variety of training data to make your machine learning model more robust to unseen situations for instance. It might help the model a great deal sometimes.
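Here is a small, hedged sketch of one such augmentation technique (jittering numeric training examples with noise; the function and parameters are illustrative, not a prescription), applied to the training set only so that the validation and test sets keep representing the real problem.

```python
import numpy as np

def augment_with_noise(X_train, y_train, noise_level=0.01, copies=2, seed=0):
    """Create jittered copies of training examples to add variety."""
    rng = np.random.RandomState(seed)
    X_aug = [X_train]
    y_aug = [y_train]
    for _ in range(copies):
        X_aug.append(X_train + rng.randn(*X_train.shape) * noise_level)
        y_aug.append(y_train)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X_train = np.random.rand(100, 10)        # placeholder training inputs
y_train = np.random.randint(0, 2, 100)   # placeholder training labels
X_bigger, y_bigger = augment_with_noise(X_train, y_train)
print(X_bigger.shape)  # (300, 10): the original plus two noisy copies
```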
Typical data set splits: we often see 70% of the data in the train set when there is low data. Due to the central limit theorem (CLT), above a certain quantity of data in the validation and test set, you don’t need any more. So when you have very big quantities of data, it is ok to make the training set bigger, such as 99% of the data, and 1% for the validation and test set.
We train and evaluate the model on the training and validation data, and we can analyze the bias and variance. Bias and variance, in this case, are also sometimes referred to as, respectively, underfitting and overfitting, which are described more in the next section. It's not exactly the same thing, but it's quite related in the case of machine learning algorithm evaluation.
You may also want to compare the validation set performance to the human performance. The difference will really tell you whether or not you reached the best possible score that you can get on your dataset and with the specific evaluation metric you designed. By seeing the differences, you can decide if you should improve the model or the data.
Don’t overlook the importance of optimizing on the good metric and on the good validation dataset. Transferring a model to another data distribution in production is risky and it may perform badly. I’ve seen (chatbot BERT) models score 80% on a dataset, and then only 20% on another similar dataset, just because it wasn’t optimized on it, for instance.
A rule of thumb is that the more data you have, in general, the better it will be to use a deeper model rather than a simple model. If you have a small amount of data, then you want to use classical machine learning algorithms that are simpler. As the quantity of data increases, deep learning models will start to perform better than classical models.
On one hand, hard problems require more training data and therefore deep learning to have the capacity to fit on all this data and to extract meaning and generalities from it. A hard deep learning problem can be for instance speech to text or text to speech, even machine translation such as Google Translate’s model that uses attention mechanisms.
On the other hand, if you were to solve the problem of an addition, let's say two plus two equals four, then you wouldn't need all of these sophisticated algorithms. In fact, learning algorithms would most likely perform worse than if your model was a simple classical addition algorithm.
Shortly put, complex algorithms should not be used just for fun. The choice of the model depends on the complexity of the task, your data's quantity and quality, and the business goals that are reflected in your evaluation metric. Often, the value is found in the business use-case.
Keep it simple and stupid. You want to pick the simplest model possible that will attain the best results on your data. I've had clients insist on deep learning models while classical machine learning models would do the trick. In these instances, I try to reason with them and to get them a first prototype with what I believe is best. A good machine learning consultant won't be afraid to say no to complexity when simple things work.
It is in situations where your data may vary that automated machine learning (AutoML) comes in the most handy, because you can automatically test different models and hyperparameters with your varied or growing quantities of data. You parametrize things like the number of neurons in your models, the number of layers stacked one onto another in depth, the learning rate, and so on. And you can pick the best model according to your problem.
You can even configure the data pre-processing techniques in automated machine learning as such. For instance, the window size or exponential decay is often a parameter.
If your model is too complex compared to the amount of data, as on the left of the image and on the blue curve for a complex deep learning model, your error, on the vertical axis, will be low on the training data and high on the validation data. This is a situation of low performance.
What happens when your model is really intelligent (complex) for low data scenarios is that as it sees some training data, it will most of the time memorize it perfectly. And it might fail to give predictions on some yet-unseen new data. This is called overfitting.
When your model is not too complex, it will be forced to learn the underlying rules in the data to come up with the lowest possible training error, as it doesn't have the capacity to simply memorize it all. This is a perfect fit.
Note that it's possible to use a deep learning model on low data, but only if you regularize the model. To regularize a model means to add obstacles to learning, so that it won't be able to memorize the examples “as-is” and will need to generalize, drawing conclusions and rules of thumb on how the problem is really solved.
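As a hedged illustration (made-up data, and using an L2 penalty as one simple form of regularization among many), here is the same small neural network with and without regularization, so you can see how the penalty curbs pure memorization on low data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# "alpha" is the L2 regularization strength in scikit-learn's MLPClassifier.
unregularized = MLPClassifier(hidden_layer_sizes=(128, 128), alpha=0.0, max_iter=2000, random_state=0)
regularized = MLPClassifier(hidden_layer_sizes=(128, 128), alpha=1.0, max_iter=2000, random_state=0)

for name, model in [("no regularization", unregularized), ("L2 regularization", regularized)]:
    model.fit(X_train, y_train)
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "validation:", round(model.score(X_val, y_val), 3))
```

The training score alone looks great either way; the validation score is what reveals whether the model memorized or generalized.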
Another extreme case is when your model is too simple on a highly complex problem that has a lot of data. It'll fare badly on the training error as well as on the validation and test errors. This is called underfitting.
For sure, choosing the right model is more complicated than just looking at the quantity of data and at the difficulty of the problem to solve. There are lots of things to consider. This is why you need someone who has a lot of experience to come up with the right ideas to try with your team.
Ultimately, you must face and realize that you must customize the model to the data.
Bad machine learning engineers are the ones who have found an algorithm that they like and just want to reuse it, just like a hammer looking for nails. You may need other tools to solve your problem if your problem is not to drive a nail. Most coders at the beginning of their coding journey are like that and are very enthusiastic to try these shiny techniques. Only mature coders who have endured years of experience will resist the temptation to use the holy hammer.
You need to come up with a good model, according to your data and problem. You use the data and metric, as your metric should be aligned with your business’ goal. And you may need to customize the data with some pre-processing for the model. You can often change the model, but not so much the data. And so you train your models on the data and pick the best ones.
To sum up, before the MVP, you might want to reoptimize your prototypes and the data preprocessing for better results and do some reports to show results to clients and investors with the prototype. Then you do a minimum viable product and deploy. Basically, this is an improvement to either the model on the right side, or to the data pre-processing to the left side of the methodology chart. You will want to retrain the model on the data many times with the automated machine learning loop. This is not magic.
In the following image, the green and red sections must be worked through again after model changes, data changes, or data preprocessing changes.
These are deployment iterations: so you collect the data, you prepare it, you split the data sets into training, validation and testing sets, you train models with the automated machine learning loop (AutoML), you pick the best validated model, measure its test performance and deploy it if it is good enough and better than your baseline to which you can compare as well in the process. Often, this is done in the cloud, as it is quite convenient to have multiple models retraining in parallel in your AutoML loop.
In lots of context, the data won’t change and you won’t need to redeploy your model. But often, you then restart the cycle with data that is changing in quantity or quality, and you got yourself an improvement loop, a virtuous cycle.
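A simplified sketch of that redeployment cycle follows: retrain on fresh data, compare to the currently deployed baseline on the held-out test set, and only deploy if the new model is at least as good. The function names here are hypothetical placeholders for your own data access, AutoML, evaluation, and deployment code.

```python
def redeployment_cycle(load_fresh_data, train_with_automl, evaluate, deploy, baseline_score):
    # All arguments are placeholder callables you would provide yourself.
    X_train, y_train, X_test, y_test = load_fresh_data()

    candidate = train_with_automl(X_train, y_train)   # AutoML loop picks the best validated model
    test_score = evaluate(candidate, X_test, y_test)  # measured once, on held-out data

    if test_score >= baseline_score:
        deploy(candidate)             # e.g. push to a cloud endpoint
        return candidate, test_score
    return None, baseline_score       # keep the current model in production
```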
By the way, 80% of the machine learning ecosystem is in Python. The chances are that your machine learning scientist will work in Python, or will want to work in Python to go faster. Keep this in mind when planning the deployment.
You’ve finally reached the end. Congratulations, your project works and you are doing redeployment loops to improve it! Celebrate your success. To summarize, you first need to define your problem, which may require analysis of your data. Then you need to define an evaluation metric for the models that will be built. You then work at having the right data pre-processing for the machine learning models to be trained on. You then have your team work on the machine learning models to come up with the best ones. They will be automatically selected with AutoML. Once a prototype is ready and good enough, you can deploy a minimum viable product (MVP) of your machine learning model, and then iterate on it to reach a fully working product (from alpha to beta version, to more), which may require to re-run the AutoML loop if you are in a situation in which new data is constantly acquired.
Below, in the video, there are good examples of how to build proper machine learning pipelines, following clean code OOP principles such as the SOLID principles for software design.
Several design patterns are discussed with practical examples and their implications. So not only you want to build neural networks and other machine learning algorithms, but also you want to find the best hyperparameters for them automatically. We’ll here demonstrate how it’s possible in a clean code way.
This will help you structure how to handle the flow of data from one step to another. Steps are chained one after another in a machine learning pipeline. For instance, you can override behavior inherited from "BaseStep" classes to change the flow of data. You can save state in objects with the fit method, transform the data with the transform method, and more. Overall, it's possible to build really powerful steps that can edit and change the execution flow, and in multiple dimensions.
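To make the "step" idea concrete, here is an illustrative, simplified sketch of the pattern (it mirrors the idea only; Neuraxle's actual BaseStep API has more lifecycle features than shown here): each step can keep state in fit() and change the data flowing through it in transform(), and steps chain into a pipeline.

```python
import numpy as np

class StandardizeStep:
    """Learns a mean/std during fit, then rescales data during transform."""
    def fit(self, data_inputs, expected_outputs=None):
        self.mean_ = np.mean(data_inputs, axis=0)
        self.std_ = np.std(data_inputs, axis=0) + 1e-8
        return self

    def transform(self, data_inputs):
        return (data_inputs - self.mean_) / self.std_

class SimplePipeline:
    """Chains steps: data flows from one step's transform into the next."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, data_inputs, expected_outputs=None):
        for step in self.steps:
            data_inputs = step.fit(data_inputs, expected_outputs).transform(data_inputs)
        return data_inputs

# Usage sketch on placeholder data:
pipeline = SimplePipeline([StandardizeStep()])
outputs = pipeline.fit_transform(np.random.rand(100, 4))
print(outputs.shape)  # (100, 4)
```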
At the end of the video, there is also a fast and almost comprehensive tutorial where the usage of advanced features of the Neuraxle framework is shown. The design patterns applied to machine learning are brilliant. Clean machine learning pipeline design is shown through an example applied to time series processing. This can certainly be applied to deep learning as well.
You may find the pages overviewed in the video here:
As part of the Fourth Industrial Revolution, Artificial Intelligence (AI) and Machine Learning (ML) have become part of our daily lives. Across industries, companies have learned to rely on the convenience and insights that these innovations bring. As of 2020, almost 50% of all companies use AI and machine learning to improve operational quality. Companies that have fully integrated AI-driven tech are also estimated to earn 13% more thanks to improved services.
However, aside from enhancing external operations and increasing consumer satisfaction rates, machine learning can accelerate another critical component of business success: the human workforce.
Despite concerns that AI and machine learning will eventually replace human workers, studies prove how the aforementioned technology can fuel profitable and timely changes. For instance, in Kathleen Walch’s report on global AI dominance, she mentions Japan as a first adopter of machine learning. Though most of the country’s efforts have been focused on robotics, this is seen as an integral solution to alleviate the aging population's workforce shortage. This is important in sectors like the strained healthcare system, wherein healthcare workers are vastly outnumbered by older patients. With the use of machine learning applications, tedious tasks like patient record maintenance can be automated.
Two more countries leading the adoption of machine learning within the workforce are China and the United States. After all, these countries have the largest and most well-backed AI ventures worldwide. In the U.S., there's an emphasis on machine learning complementing human workers. Among the more notable examples of this is highlighted in Ben Eubank’s book on AI within HR. He explains that companies are empowering their internal operations by using smart solutions that simplify processes and clear up production backlogs. This creates a more efficient workplace, that recent surveys show is important for attracting and retaining top talent.
Meanwhile, former Google China president Kai-Fu Lee’s book on today’s AI superpowers, explains that China’s machine learning initiatives are market-driven. This means that rather than being based on abstract ideas, the country’s efforts provide tangible benefits that the growing entrepreneurial market appreciates. For instance, since China is one of the most mobile-driven nations, machine learning providers offer companies a means to streamline their customer transactions. Rather than using an employee’s valuable time fulfilling a customer’s order, for example, their AI-powered systems can fulfill this instead. This is expected to create a faster-moving revenue stream, which in turn, can support the creation of 300 million jobs.
How to Introduce Machine Learning to Your Workforce
Of course, while machine learning will undoubtedly increase profitability and scalability for companies, it may still pose concerns for employees. After all, since the dawn of AI in the 50s, there has been a fear that machines may eventually replace us all. Though understandable, employers can address and assuage these worries to enjoy a seamless and beneficial machine learning integration.
So, before you roll out your machine learning efforts, do offer some classes to familiarize your workforce with AI. In this consultant-led training, your employees will be able to understand the technology as well as the rationale behind why you’re adopting them. Plus, here, they'll also learn about the benefits to them, as noted by Guillaume Chevalier’s article on AutoML. When your employees are aware of the technology you’re about to apply, implementation will be faster and less prone to hiccups.
Moreover, emphasize that your employees aren’t being pushed out. As stated in Kevin Cashman’s review of predictions for future automation, history has proven that technology does breed opportunities for growth and transformation. Should you be expecting consequent changes in your labor demands, explain how this will only change their job description, but not their employment.
Admittedly, there will be changes in the workforce following the mainstream adoption of AI. However, so long as companies aim to use machine learning as a means to enhance rather than replace their current workforce, there will be more long-term wins than losses. For more information on how to integrate machine learning within your workforce, visit Neuraxio.
Article exclusively for neuraxio.com by Olivia Rowe.
CC-BY
I suggest that you grab a good coffee while you read what follows. If you write AI code at Neuraxio, or if you write AI code using software that Neuraxio distributed, this article is especially important for you to grasp what's going on with the testing and how it works.
Have you ever heard of the testing pyramid? Martin Fowler has a nice article on this topic here. To summarize what it is: you should have LOTS OF small "unit" tests that test small components of your software, then a FEW "integration" tests that are medium-sized (and will probably test your service application layer), and then VERY FEW "end-to-end" (E2E) tests that test the whole thing at once (probably using your AI backend's REST API) with a real and complete use-case that does everything, to see if everything works together. It makes a pyramid: unit tests at the bottom, integration tests in the middle, and end-to-end tests at the top.
Why this different quantity of tests at these granularities? So we have a pyramid of tests like this:
Note that the integration tests are sometimes also called acceptance tests. The terms may differ depending on where you work, as different terminology is used. I personally prefer acceptation tests, so as to refer to the business acceptation of a test case, as if an acceptation test case were a business requirement written into code.
Suppose that in your daily work routine, you edit some code to either fix a bug, measure something in your code, or introduce new features. You will change something thinking that it helps. The following will eventually happen, as you are not perfect and probably make errors and mistakes from time to time. How often has your code worked on the first try?
To sum up: unit testing gives you, and especially your team, some considerable speed. Rare are the programmers who like to be stuck just debugging software. Cut the debugging times by using unit tests, and not only will everyone be happy, but also everyone will code faster.
"Understanding code is by far the activity at which professional developers spend most of their time."
Example #1 of the AAA in a ML unit test:
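The original code file is not reproduced here; as an illustration only, here is a minimal sketch of what a parametrized, AAA-structured ML unit test can look like (the normalize step under test is a hypothetical example, not the original article's code).

```python
import numpy as np
import pytest

def normalize(data):
    """Hypothetical step under test: scales values to the [0, 1] range."""
    data = np.asarray(data, dtype=float)
    return (data - data.min()) / (data.max() - data.min())

@pytest.mark.parametrize("raw_data", [[0, 5, 10], [2, 4, 8], [-1, 0, 1]])
def test_normalize_outputs_are_between_zero_and_one(raw_data):
    # Arrange: the input data comes from the test's parameters.
    expected_min, expected_max = 0.0, 1.0

    # Act: run the step being tested.
    result = normalize(raw_data)

    # Assert: the behavior matches what we expect.
    assert result.min() == expected_min
    assert result.max() == expected_max
```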
See how the test is first set up (arranged) at the beginning? The test above is even further set up using an argument in the test function, meaning that this test can be run again and again with different arguments, using PyTest's parametrize. Here is a good example of a well-parametrized unit test that also makes use of the AAA.
Example #2 of the AAA in a ML unit test:
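Again, this is not the original example but an illustrative sketch: a second AAA test, this time checking that a small scikit-learn pipeline keeps its contract on data shapes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def test_pipeline_transform_keeps_rows_and_reduces_columns():
    # Arrange: fixed random data and the pipeline under test.
    data_inputs = np.random.RandomState(0).rand(50, 10)
    pipeline = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=3))])

    # Act: fit and transform the data through the pipeline.
    outputs = pipeline.fit_transform(data_inputs)

    # Assert: the number of rows is preserved and the columns are reduced.
    assert outputs.shape == (50, 3)
```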
ATDD: Write an acceptance test first, and then do many TDD loops to fulfill this acceptance test.
"One difference between a smart programmer and a professional programmer is that the professional understands that clarity is king. Professionals use their powers for good and write code that others can understand."- Source: Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship
Lucky you, we've launched a series of curated resources to help you get better and to work like a pro in Machine Learning (ML) projects.
You'll learn:
If you successfully pass the quiz that will be sent to you at the end of this training, you'll be able to purchase the certificate to showcase your skills for $17 CAD, if you wish to. This certificate can be showcased on LinkedIn as "Neuraxio AI Programmer".
Remember, you're only one ML project away from achieving success.
And it starts here and now.
Original description from The Commerce Show:
In this episode, we are talking about AI technologies for eCommerce. Guillaume Chevalier has been working in artificial intelligence for over 7 years now and has been involved in over 57 machine learning projects.
We cover several applications of AI in the eCommerce industry such as Personalized shopping experience, Sales/Inventory forecasting, Automated customer service/chatbots, Visual search and powerful synonyms search, Price optimization, Understanding customers better (persona) and Recommendation algorithms.
After the first 30 minutes talking about eCommerce, we also go a bit deeper into AI from a developer's point of view. Guillaume explains why he decided to develop Neuraxle, an AI framework for machine learning projects, over the years. Guillaume also gives tips about « How to start and plan an AI transformation for a non-tech business. »
This podcast is amazing and you'll discover Guillaume's passion for eCommerce, ML (machine learning) and NLP (natural language processing).
Listen to the full podcast:
Follow The Commerce Show.
Applying clean code and SOLID principles to your ML projects is crucial, and is so often overlooked. Successful artificial intelligence projects require good programmers to work in pairs with the mathematicians.
Ugly research code simply won’t do it. You need to do Clean Machine Learning at the moment you begin your project.
Despite all the hype being about the deep learning algorithms, we decided at Neuraxio to do a training about Clean Machine Learning, because it is what we feel the industry really needs.
Clean code is excessively hard to achieve in a codebase that is already dirty; action truly must be taken at the beginning of the project. It must not be postponed.
We’re glad to have organized this event at Le Camp in March just before the COVID-19 outbreak. It was a fantastic event.
Thanks to participants from Thales, Shutterstock, Novatize, Artifici, Spress.ai, La Cité, LP, IA groupe financier, LGS - An IBM Company, Ville de Québec, STICKÔBOT INC., and Levio.
And also big thanks to the other event organizers including William Simetin Grenon, Francis B. Lemay, Maxime Bouchard Roy and Alexandre Brillant, as well as the other speakers outside of Neuraxio: Jérôme Bédard from Umaneo, and Vincent Bergeron from ROBIC.
It was fun, thank you all!
- Guillaume Chevalier, Founder & Machine Learning Expert @ Neuraxio
You can interact with the present post on social media:
You can also check out our Machine Learning trainings.
Daily, what does a data scientist do? And how can Automated Machine Learning save you from babysitting your AI, practically?
Here is a metaphor: your data scientist is a mom. A babysitter.
The data scientist creates a nice artificial neural network and trains it on data. Then he’s going to supervise the learning. The data scientist will make sure that the learning converges in the right way so that the artificial neural network (or model) can give good predictions and then flourish.
Seriously, that’s all well and good, but it costs time, and it costs money.
Is there anything we can do to automate the process of being a mom - actually being a data scientist? Actually, we can use Automated Machine Learning.
Automated Machine Learning allows us to automate the process of being a mom.
Firstly, when we define a model, an artificial neural network for example, we have to define the hyperparameters: The number of neurons, the number of layers of neurons on top of each other.
So we’re going to define things like the learning rate, and then the way the data is formatted to send it to the Artificial Neural Network (ANN).
Those are hyperparameters and they are all very well, but above that, to do Automated Machine Learning, we can actually define a space of hyperparameters. E.g.: the number of neurons varies between this and that.
The data can be formatted to send a certain amount of data of a certain length or a certain shape. You can disable or enable certain pre-processing steps in the data.
We can have a space in which, if we pick a point, we get a specific configuration, like sampling a gene: a hyperparameter sample for our artificial neural network.
In other words, every point (sample) in the space is a different setting (or gene) to try.
With Automated Machine Learning, we can finally iterate in this space to pick new points and try them out, in a somewhat random but still intelligent way, so that after several attempts, we converge towards a result.
There’s no need for the data scientist to be around all the time. AutoML can run for weeks, even months, for larger models if you want, and so on. With all this, eventually, then we can get the best results.
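To make the idea tangible, here is a deliberately simple, hedged sketch of that loop on made-up data: sample a point (a set of hyperparameters) from the space, train the candidate, score it, and keep the best one, with no data scientist babysitting each trial. Real AutoML tools do this more intelligently than the pure random sampling shown here.

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best_score, best_params = -1.0, None
for trial in range(10):
    # Pick a point in the hyperparameter space (here, purely at random;
    # smarter AutoML algorithms use past scores to pick the next point).
    params = {
        "hidden_layer_sizes": (random.choice([16, 32, 64, 128]),) * random.choice([1, 2]),
        "learning_rate_init": 10 ** random.uniform(-4, -1),
    }
    model = MLPClassifier(max_iter=500, random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```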
Moreover, it makes it easier to reuse the code of your last project in your next one, which provides even more speed in the long run.
We can also analyze the effect of hyperparameters on the neural network's performance. That's a problem in data science: the neural network or model that's going to perform best on a set of data isn't the model with the most neurons, nor the model with the most of everything.
In fact, it's not the one with the least either. It has to be somewhere in between. You have to find what's best: there's a trade-off between bias and variance.
In the end, that’s one of the things we’re going to automate with Automated Machine Learning: finding the best model with a good bias/variance tradeoff.
That’s why we need a data scientist or Automated Machine Learning to supervise the artificial neural network, and then try and retry different hyperparameters, as there are no free lunches (NFL theorem).
There are different algorithms that allow you to choose the next point - not by chance - you can analyze what you’ve tried, and what the results were, and then pick the next point in space, and try it all out in an intelligent way.
Machine Learning software has more value if it has the ability to be automatically adapted to new data or a new dataset later on when things will change. The ability to adapt quickly to new data and changes in requirements is important. Those two factors that are often ignored are important in explaining why 87% of data science projects never make it into production.
In our projects, we use the free tool Neuraxle to optimize our Machine Learning algorithms' hyperparameters. The trick is to define a hyperparameter space for our models, and even for our data preprocessing functions, using the right software abstractions for ML.
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, which was a pre-deep-learning era. It's one of the most known and adopted machine learning libraries, and it is still growing. On top of it all, it uses the Pipe and Filter design pattern as a software architectural style - it's what makes Scikit-Learn so fabulous, added to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should be able to do in 2020 already:
Let's first clarify what's missing exactly, and then let's see how we solved each of those problems by building new design patterns based on the ones Scikit-Learn already uses.
TL;DR: How could things work to allow us to do what's in the above list with the Pipe and Filter design pattern / architectural style that is particular to Scikit-Learn? The API must be redesigned to include broader functionalities, such as allowing the definition of hyperparameter spaces, and allowing more comprehensive object lifecycle and data flow functionalities in the steps of a pipeline. We coded a solution: that is Neuraxle.
Don’t get me wrong, I used to love Scikit-Learn, and I still love to use it. It is a nice status quo: it offers useful features such as the ability to define pipelines with a panoply of premade machine learning algorithms. However, there are serious problems that they just couldn’t see in 2007, when deep learning wasn’t a thing.
Some of the problems are highlighted by the top core developer of Scikit-Learn himself at a Scipy conference. He calls for new libraries to solve those problems instead of doing that within Scikit-Learn:
Source: the top core developer of Scikit-Learn himself - Andreas C. Müller @ SciPy Conference
In Scikit-Learn, the hyperparameters and the search space of the models are awkwardly defined.
Think of built-in hyperparameter spaces and AutoML algorithms. With Scikit-Learn, although a pipeline step can have hyperparameters, they don't each have a hyperparameter distribution.
It’d be really good to have get_hyperparams_space as well as get_params in Scikit-Learn, for instance.
This lack of an ability to define distributions for hyperparameters is the root of much of the limitations of Scikit-Learn with regards to doing AutoML, and there are more technical limitations out there regarding constructor arguments of pipeline steps and nested pipelines.
Think about the following features:
Scikit-Learn does almost none of the above, and hardly allows it as their API is too strict and wasn’t built with those considerations in mind: for instance they are mostly lacking in the original Scikit-Learn Pipeline. Yet, all of those things are required for Deep Learning algorithms to be trained (and thereafter deployed).
Plus, Scikit-Learn lacks some things to do proper serialization, and it also lacks compatibility with Deep Learning frameworks (i.e.: TensorFlow, Keras, PyTorch, Poutyne). It also fails to provide lifecycle methods to manage resources and GPU memory allocation. Think of lifecycle methods as methods that each object has: __init__, fit, transform. For instance, picture adding also setup, teardown, mutate, introspect, save, load, and more, to manage the events of the life of each algorithm's object in a pipeline.
You’d also want some pipeline steps to be able to manipulate labels, for instance in the case of an autoregressive autoencoder where some “X” data is extracted to “y” data during the fitting phase only, or in the case of applying a one-hot encoder to the labels to feed them as integers.
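As a small, hedged sketch of that label-manipulation case (with made-up labels, and using scikit-learn's LabelBinarizer as one possible tool), here is how string labels can be turned into one-hot vectors and back before feeding them to a model.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["happy", "neutral", "not happy", "happy"]  # hypothetical string labels

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(labels)       # shape (4, 3): one column per class
decoded = binarizer.inverse_transform(one_hot)  # back to the original string labels

print(one_hot)
print(decoded)
```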
Parallelism and serialization are convoluted in Scikit-Learn: it's hard, not to say broken. When some steps of your pipeline import libraries coded in C++, those objects aren't always serializable, and they don't work with the usual way of saving in Scikit-Learn, which is to use the joblib serialization library.
Also, when you build pipelines that are meant to run in production, there are more things you’ll want to add on top of the previous ones. Think about:
Shortly put: it's hard to code metaestimators using Scikit-Learn's base classes. Metaestimators are algorithms that wrap other algorithms in a pipeline to change the behavior of the wrapped algorithm (e.g.: the decorator design pattern). Examples of metaestimators:
- A RandomSearch holds another step to optimize. A RandomSearch is itself also a step.
- A Pipeline holds several other steps. A Pipeline is itself also a step (as it can be used inside other pipelines: nested pipelines).
- A ForEachDataInputs holds another step. A ForEachDataInputs is itself also a step (as it is a replacement of one to just change the dimensionality of the data, such as adapting a 2D step to 3D data by wrapping it).
- An ExpandDim holds another step. An ExpandDim is itself also a step (inversely to the ForEachDataInputs, it augments the dimensionality instead of lowering it).

Metaestimators are crucial for advanced features. For instance, a ParallelTransform step could wrap a step to dispatch computations across different threads. A ClusteringWrapper could dispatch computations of the step it wraps to different worker computers within a pipeline. Upon receiving a batch of data, a ClusteringWrapper would work by first sending the step to the workers (if it wasn't already sent) and then a subset of the data to each worker. A pipeline is itself a metaestimator, as it contains many different steps. There are many metaestimators out there. We also name those "meta steps" as a synonym.
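To illustrate the "meta step" idea in isolation (this mirrors the concept only, not any library's exact API), here is a sketch of a wrapper that holds another step and changes how data reaches it, applying a 2D step to each item of a 3D batch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

class ForEachItem:
    """Wraps a step that expects 2D data and applies it to every 2D slice of a 3D array."""
    def __init__(self, wrapped_step):
        self.wrapped_step = wrapped_step

    def fit(self, data_inputs_3d):
        # Fit the wrapped step once, on the data flattened to 2D.
        stacked = np.concatenate(list(data_inputs_3d), axis=0)
        self.wrapped_step.fit(stacked)
        return self

    def transform(self, data_inputs_3d):
        # Apply the wrapped 2D step to each item, then restack to 3D.
        return np.stack([self.wrapped_step.transform(item) for item in data_inputs_3d])

# Usage sketch with a scikit-learn scaler as the wrapped 2D step:
batch = np.random.rand(8, 100, 4)   # 8 time series, 100 time steps, 4 features
meta_step = ForEachItem(StandardScaler())
result = meta_step.fit(batch).transform(batch)
print(result.shape)  # (8, 100, 4)
```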
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!
Unfortunately, most Machine Learning pipelines and frameworks, such as Scikit-Learn, fail at combining Deep Learning algorithms within neat pipeline abstractions allowing for clean code, automatic machine learning, parallelism & cluster computing, and deployment in production. Scikit-Learn has those nice pipeline abstractions already, but it lacks the features to do AutoML, deep learning pipelines, and more complex pipelines such as for deploying to production.
Fortunately, we found some design patterns and solutions that allow all the techniques we named to work together within a pipeline, making it easy for coders, bringing concepts from the most recent frontend frameworks (e.g.: component lifecycle) into machine learning pipelines with the right abstractions, and allowing for more possibilities such as better memory management, serialization, and mutating dynamic pipelines. We also break past Scikit-Learn and Python's parallelism limitations with a neat trick, allowing straightforward parallelization and serialization of pipelines for deployment in production.
We're glad we've found a clean way to solve the most widespread problems out there related to machine learning pipelines, and we hope that our solutions to those problems will be beneficial to many machine learning projects, as well as to projects that can actually be deployed to production.
If you liked this reading, subscribe to Neuraxio’s updates to be kept in the loop! Also thanks to the Dot-Layer (.Layer) organization’s blog committee and administrators for their generous peer-review of the present article.
Would you like to see the future? This post aims at predicting what will happen to the field of Deep Learning. Scroll on.
Who doesn’t like to see the real cause of trends?
Some people have said that Moore’s Law was coming to an end. A version of this law is that every 18 months, computers have twice the computing power as before, at a constant price. However, as seen on the chart, it seems like improvements in computing came to a halt between 2000 and 2010.
This halt is in fact due to reaching the minimum size of transistors, an essential part of CPUs. Making them smaller than this limit will introduce computing errors because of quantum behavior. Quantum computing will be a good thing; however, it won’t replace the function of classical computers as we know them today.
Moore’s Law isn’t broken yet in another aspect: the number of transistors we can stack in parallel. This means that we can still have a speedup of computing when doing parallel processing. In simpler words: having more cores. GPUs are growing in this direction: it’s already fairly common to see GPUs with 2000 cores in the computing world.
Luckily for Deep Learning, it mostly comprises matrix multiplications. This means that deep learning algorithms can be massively parallelized, and will benefit from future improvements from what remains of Moore’s Law.
See also: Awesome Deep Learning Resources
Ray Kurzweil predicts that the singularity will happen in 2029. That is, as he defines it, the moment when a $1000 computer may contain as much computing power as 1000x the human brain. He is confident that this will happen, and he insists that what needs to be worked on to reach true singularity is better algorithms.
So we’d be mostly limited by not having found the best mathematical formulas yet. Until then, for learning to properly take place, deep learning algorithms need to be fed a lot of data.
We, at Neuraxio, predict that Deep Learning algorithms built for time series processing will be a very good thing to build upon to get closer to where the future of deep learning is headed.
Yes, this keyword is so 2014. It still remains relevant.
It is reported by IBM New Vantage that 90% of the financial data was accumulated in the past 2 years. That’s a lot. At this rate of growth, we’ll be able to feed deep learning algorithms abundantly, more and more.
That is what The Guardian reports, according to big data statistics from IDC. In contrast, only 0.5% of all data was analyzed in 2012, according to the same source. Information is more and more structured, and organizations are now more conscious of tools to analyze their data. This means that deep learning algorithms will soon have access to the data more easily, whether the data is stored locally or in the cloud.
It is about what defines us, humans, compared to all previous species: our intelligence.
The key to intelligence and cognition is a very interesting subject to explore and is not yet well understood. Technologies related to this field are promising, and simply, interesting. Many are driven by passion.
On top of that, deep learning algorithms may use quantum computing and may be applied to machine-brain interfaces in the future. Trend stacking at its finest: a recipe for success is to align as many stars as possible while working on practical matters.
We predict that deep learning in 10 years may be more about Spiking Neural Networks (SNNs).
Those types of artificial neural networks may overcome the limitations of Deep Learning, as they are closer to natural neurons, although they require more (parallelizable) computing power. If you’re interested in learning more on that topic, see my other article on the limits and future of research in deep learning and my other article on Spiking Neural Networks (SNNs).
Although I did some research on SNNs, they are a far shot and aren’t useful yet. For now, regular Artificial Neural Networks, such as LSTMs, are good for solving a plethora of tasks. Until we reach the point where SNNs will be useful, it’s very practical to have and use the right tools to do deep learning when it comes to deploying deep learning production pipelines, such as using a good machine learning framework in Python to correctly integrate deep learning algorithms within computing environments.
First, Moore’s Law and computing trends indicate that more and more things will be parallelized. Deep Learning will exploit that.
Second, the AI singularity is predicted to happen in 2029 according to Ray Kurzweil. Advancing Deep Learning research is a way to get there to reap the rewards and do good.
Third, data doesn’t sleep. More and more data is accumulated every day. Deep Learning will exploit big data.
Finally, deep learning is about intelligence. It is about technology, it is about the brain, it is about learning, it is about what defines us, humans, compared to all previous species: our intelligence. Curious people will know their way around deep learning.
If you liked this article, consider following us for more!
Machine Learning competition & research code sucks. What to do about it?
As a frequent reader of source code coming from Kaggle competitions, I’ve come to realize that it wasn’t full of rainbows, unicorns, and leprechauns. It’s rather like Frankenstein’s monster: a work made of parts of other works, glued together and badly integrated. Machine learning competition code in general, as well as machine learning research code, suffers from deep architectural issues. What to do about it? Using neat design patterns can change a lot of things for the better.
EDIT - NOTE TO THE READER: this article is written with a context in mind where said competition code is to be reused and put in production. The arguments in this article are oriented towards this end goal. We are conscious that it’s natural and time-efficient for Kagglers and researchers to write dirty code, as their code is for a one-off thing. Reusing such code to build a production-ready pipeline is another thing, and the road to get there is bumpy.
TL;DR: don’t directly reuse competition code. Instead, create a new, clean project on the side, and refactor the old code into it.
It’s so common to see code coming from Kaggle competitions that lacks the proper object-oriented abstractions. Moreover, such code also lacks the abstractions that would allow later deploying the pipeline to production - and with reason: Kagglers have no incentive to prepare for deploying code to production, as they only need to win the competition.
The situation is roughly the same in academia, where researchers too often just try to get results, beating a benchmark to publish a paper and ditching the code after. Worse: oftentimes, those researchers use overused datasets, which requires them to use all sorts of very specific post-processing tricks that won’t generalize to any other dataset, nor to a production version of the algorithm for real-world usage.
Unfortunately, companies often rely on beating public benchmarks, only to later discover that just having a working algorithm first may be of better value, without maniacally overtuning the algorithm on one very specific dataset. To make things worse, many machine learning coders and data scientists didn’t learn to code properly in the first place, so those prototypes are often full of technical debt.
Here are a few examples of bad patterns we’ve seen:
Those bad patterns don’t only apply to code written in programming competition environments (such as this code of mine written in a rush - yes, I can do it too when unavoidably pressured). Here are some examples of code with checkpoints using the disk:
Companies can sometimes draw inspiration from code on Kaggle, but I’d advise them to code their own pipelines to be production-proof, as taking such competition code as-is is risky. There is a saying that competition code is the worst code for companies to use, and even that the people winning competitions are the worst ones to hire - because they write poor code.
I wouldn’t go as far as that saying (as I myself earn podiums in coding competitions most of the time) - it’s rather that competition code is written without thinking of the future, as the goal is to win. Ironically, it’s at that moment that reading Clean Code and The Clean Coder gets important. Using good pipeline abstractions helps machine learning projects survive.
So you still want to use competition and research code for your production pipeline. You’ll probably need to start anew and refactor the old code into the proper new abstractions. Here are the things you want when building a machine learning pipeline whose goal is to be sent to production:
And if your goal is instead to continue to do competitions, please at least note that I (personally) started winning more competitions after reading the book Clean Code. So the solutions above should apply in competitions as well: you’ll have more mental clarity and, as a result, more speed too, even if designing and thinking about your code beforehand seems to take a lot of precious time. You’ll start saving time quickly, even in the short run.
Now that we’ve built a machine learning framework that eases the process of writing clean pipelines, we have a hard time picturing how we’d get back to our previous habits anytime soon. Clean is our new habit, and it now doesn’t cost much more time to start new projects with the right abstractions from the start, as we’ve already thought through them.
Another upside, if you’re a researcher, is that if your code is developer-friendly and has a good API, it has more chances of being reused, thus you’ll be more likely to be cited. It’s always sad to discover that some research results can’t be reproduced even when using the same code that generated those results.
In all cases, using good patterns and good practices will almost always save time even in the short or medium term. For instance, using the pipe and filter design pattern with Neuraxle is the simplest and cleanest thing to do.
It’s hard to write good code when pressured by the deadline of a competition. We created Neuraxle to allow the right abstractions to be used easily when in a rush. As a result, it’s a good thing for competition code to be refactored into Neuraxle code, and it’s a good idea to write all your future code using a framework like Neuraxle.
The future is now. If you’d like to support Neuraxle, we’ll be glad that you get in touch with us. You can also register to our updates and follow us. Cheers!
Coding Machine Learning Pipelines - the right way.
Have you ever coded an ML pipeline which was taking a lot of time to run? Or worse: have you ever got to the point where you needed to save intermediate parts of the pipeline on disk to be able to focus on one step at a time by using checkpoints? Or even worse: have you ever tried to refactor such poorly-written machine learning code to put it in production, and it took you months? Well, we’ve all been there if we’ve worked on machine learning pipelines for long enough. So how should we build a good pipeline that will give us flexibility and the ability to easily refactor the code to put it in production later?
First, we’ll define machine learning pipelines and explore the idea of using checkpoints between the pipeline’s steps. Then, we’ll see how we can implement such checkpoints in a way that won’t make you shoot yourself in the foot when it comes to putting your pipeline in production. We’ll also discuss data streaming, and then the Object-Oriented Programming (OOP) encapsulation tradeoffs that can happen in pipelines when specifying hyperparameters.
A pipeline is a series of steps in which data is transformed. It comes from the old “pipe and filter” design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:
Pipelines (or steps in the pipeline) must have those two methods: “fit” to learn from the data, and “transform” to actually process the data and make predictions.
Note: if a step of a pipeline doesn’t need to have one of those two methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
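As a hedged illustration of that interface, a minimal transform-only step could look roughly like the following; the import path and the mixin behavior are assumptions based on the description above, not a definitive reference for the library’s API:

from neuraxle.base import BaseStep, NonFittableMixin  # assumed import path

class MultiplyByTwo(NonFittableMixin, BaseStep):
    # A step that only transforms data: the mixin is assumed to provide a do-nothing fit.
    def transform(self, data_inputs):
        return [2 * di for di in data_inputs]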
It is possible for pipelines or their steps to also optionally define those methods:
The following methods are provided by default to allow for managing hyperparameters:
For example, a hyperparameter space for the number of layers of a neural network could be defined as RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it (a small sketch of this is shown after the next paragraph).
For mini-batched algorithms, like when training Deep Neural Networks (DNNs), or for online learning algorithms such as Reinforcement Learning (RL) algorithms, it is ideal if the pipelines or the pipeline steps can update themselves by chaining several calls to “fit” one after another, re-fitting on the mini-batches on the fly. Some pipelines and some pipeline steps can support that; however, some other steps will reset themselves upon having “fit” called anew. It depends on how you coded your pipeline step. It is ideal if your pipeline step only resets upon calling the “teardown” method, then “setup” again before the next fit, and doesn’t reset between each fit nor during transform.
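Coming back to hyperparameters, here is a rough sketch of what defining a space and sampling from it could look like; the import paths, the HyperparameterSpace container, and the MyNeuralNetworkStep step are assumptions for illustration:

from neuraxle.hyperparams.distributions import RandInt  # assumed import path
from neuraxle.hyperparams.space import HyperparameterSpace  # assumed import path

step = MyNeuralNetworkStep()  # hypothetical step
step.set_hyperparams_space(HyperparameterSpace({"n_layers": RandInt(1, 3)}))

# Sample one set of hyperparameters for a trial, then train with it.
sampled = step.get_hyperparams_space().rvs()  # e.g. {"n_layers": 2}
step.set_hyperparams(sampled)
step = step.fit(data_inputs, expected_outputs)  # hypothetical data variables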
It is a good idea to use checkpoints in your pipelines - until you need to re-use that code for something else and change the data. You might be shooting yourself in the foot if you don’t use the proper abstractions in your code.
“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton
Programming frameworks and design patterns are known to be limiting by the simple fact that they enforce some design rules. That is hopefully with the goal of managing things for you in an easy way, keeping you from making mistakes, and avoiding dirty code. Here is my shot at it for pipelines and managing state:
This should be managed by a pipelining library which can deal with all of this for you.
Why should pipeline steps not manage checkpointing their data output? Well, it’s for all these valid reasons that you’ll prefer to use a library or framework instead of doing it yourself:
This is cool. With the proper abstractions, you can now code your Machine Learning pipeline with a huge speed-up when tuning hyperparameters by caching every trial’s intermediate result, skipping steps of the pipeline trial after trial when the hyperparameters of the intermediate pipeline steps are the same. Not only that, but once you’re ready to move the code to production, you can now disable caching completely without having to try to refactor code for a month. Avoid hitting that wall.
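To picture what such framework-managed caching could do, here is a simplified, hypothetical sketch: transformed outputs are checkpointed on disk under a key derived from the step’s hyperparameters and its input data, so an identical trial can skip recomputation. The CachedStep name and its interface are made up for illustration; in practice, this should be handled by the pipelining library rather than by each step.

import hashlib
import os
import pickle

class CachedStep:
    # Hypothetical wrapper that caches the wrapped step's transform results on disk.
    def __init__(self, wrapped_step, hyperparams, cache_dir="cache"):
        self.wrapped_step = wrapped_step
        self.hyperparams = hyperparams
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def transform(self, data_inputs):
        # The cache key covers both the hyperparameters and the incoming data.
        key = hashlib.sha256(pickle.dumps((self.hyperparams, data_inputs))).hexdigest()
        path = os.path.join(self.cache_dir, key + ".pkl")
        if os.path.exists(path):  # cache hit: skip recomputation for this trial
            with open(path, "rb") as f:
                return pickle.load(f)
        outputs = self.wrapped_step.transform(data_inputs)
        with open(path, "wb") as f:  # cache miss: compute, then checkpoint to disk
            pickle.dump(outputs, f)
        return outputs

Disabling caching for production then amounts to not wrapping the steps anymore, instead of refactoring checkpoint logic scattered inside every step.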
In parallel processing theory, pipelines are known as a way to stream data such that a pipeline’s steps can all run in parallel. The laundry example is good at picturing the problem and the solution. For example, a streaming pipeline’s second step could start processing partial data out of the first pipeline step while the first step still computes more data, without the first pipeline step having to completely finish processing all the data. Let’s call those special pipelines streaming pipelines (see streaming 101, streaming 102).
Don’t get us wrong, scikit-learn pipelines are nice to use. However, they don’t allow for streaming. Not only scikit-learn, but most machine learning pipelining libraries out there don’t make use of streaming whereas they could. The whole Python ecosystem has threading problems. In most pipeline libraries, each step is completely blocking and must transform all the data at once. There are just a few which enable streaming.
Enabling streaming could be as simple as using a StreamingPipeline class instead of a Pipeline class to chain steps one after the other, providing a mini-batch size and a queue size between steps (to avoid taking too much RAM, which makes things stable in production environments). The whole would also ideally require threaded queues with semaphores as described in the producer-consumer problem to pass info from one pipeline step to another.
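As a plain-Python illustration of that producer-consumer mechanism (independent of any pipeline library), two steps can be connected by a bounded queue so that the second step starts consuming mini-batches while the first one is still producing them:

import threading
from queue import Queue

queue_between_steps = Queue(maxsize=4)  # bounded queue to keep RAM usage stable

def step_1_producer(mini_batches):
    for batch in mini_batches:
        transformed = [x * 2 for x in batch]   # stand-in for step 1's transform
        queue_between_steps.put(transformed)   # blocks if the queue is full
    queue_between_steps.put(None)              # sentinel: no more data

def step_2_consumer(results):
    while True:
        batch = queue_between_steps.get()
        if batch is None:
            break
        results.append([x + 1 for x in batch])  # stand-in for step 2's transform

results = []
producer = threading.Thread(target=step_1_producer, args=([[1, 2], [3, 4], [5, 6]],))
consumer = threading.Thread(target=step_2_consumer, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # [[3, 5], [7, 9], [11, 13]]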
One thing that Neuraxle already does better than scikit-learn is to have mini-batch sequential pipelines, which can be used through the MiniBatchSequentialPipeline class. This is not threaded yet (but it is well in our plans). At least, we already pass the data to the pipeline in mini-batches during fit or during transform before collecting results, which allows for big pipelines like the ones in scikit-learn, but here with mini-batching. And with all our extra features like hyperparameter spaces, setup methods, automatic machine learning, and so forth.
Threading such a pipeline requires ensuring that the “setup” methods are called throughout the pipeline. Otherwise, the pipeline needs to be serialized, cloned, and reloaded with pipeline step savers, which is something we already coded and which would be ready for use. Code that uses TensorFlow, and other imported code that was built in other languages such as C++, is hard to thread in Python, especially when it uses GPU memory. Even joblib can’t easily fix some of those issues. Avoiding that with proper serialization is good.
Not only that, but the way to make every object threadable in Python is to make them serializable and reloadable. That said, in Neuraxle we plan to code this very soon. It will allow for dynamically sending code to be executed remotely on any worker (be it another computer or a process), even if that worker doesn’t have the code itself. This is done with a chain of serializers that are specific to each pipeline step class. By default, each of those steps has a serializer that can handle regular Python code, and for more wicked code using GPUs and imported code in other languages, models are just serialized with those savers and then reloaded on the worker. If the worker is local, objects can be serialized to a RAM disk or a folder mounted in RAM.
There is one thing that still annoys us in most machine learning pipeline libraries: how hyperparameters are treated. Take scikit-learn for example. Hyperparameter spaces (a.k.a. statistical distributions of hyperparameters’ values) must often be specified outside of the pipeline, with double underscores joining the names of nested steps and pipelines, and so on. While the Random Search and the Grid Search can search hyperparameter grids or hyperparameter probability spaces such as those defined with scipy distributions, scikit-learn does not provide a default hyperparameter space for each classifier and transformer. This could be the responsibility of each object of a pipeline. This way, an object is self-contained and also contains its hyperparameters, which doesn’t break the Single Responsibility Principle (SRP) and the Open-Closed Principle (OCP) of the SOLID principles of Object-Oriented Programming (OOP). Using Neuraxle is a good solution to avoid breaking those OOP principles.
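For comparison, this is the scikit-learn convention being described, where the hyperparameter grid lives outside of the pipeline and the nesting is encoded with double underscores (the step names below are only an example):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("pca", PCA()),
    ("classifier", LogisticRegression()),
])

# Hyperparameters are declared outside the steps; "__" encodes the nesting.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "classifier__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)

Letting each step own its hyperparameters and a sensible default space for them keeps the step self-contained, which is the alternative described above.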
A good thing to keep in mind when coding machine learning pipelines is to have them be compatible with lots of things. As of now, Neuraxle is compatible with scikit-learn, TensorFlow, Keras, PyTorch, and many other machine learning and deep learning libraries.
For instance, Neuraxle has a method .tosklearn() which allows a step or a whole pipeline to be made into a scikit-learn BaseEstimator - that is, a basic scikit-learn object. For other machine learning libraries, it’s as simple as creating a new class that inherits from Neuraxle’s BaseStep, overriding at least your own fit and transform methods (and perhaps also the setup and teardown methods), and defining a saver to save and load your model. Just read BaseStep’s documentation to learn how to do that, and also read the related Neuraxle examples in the documentation.
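As a hedged sketch of that recipe (the import path, the lifecycle method names, and the stand-in external model are assumptions for illustration, not the library’s definitive API):

from neuraxle.base import BaseStep  # assumed import path

class ExternalModel:
    # Stand-in for a model coming from another library.
    def train(self, x, y=None):
        self.mean_ = sum(x) / len(x)

    def predict(self, x):
        return [xi - self.mean_ for xi in x]

class ExternalModelStep(BaseStep):
    # Adapts the external model to the pipeline's fit/transform step interface.
    def __init__(self):
        BaseStep.__init__(self)
        self.model = None

    def setup(self):
        self.model = ExternalModel()  # allocate resources here rather than in __init__
        return self

    def fit(self, data_inputs, expected_outputs=None):
        self.model.train(data_inputs, expected_outputs)
        return self

    def transform(self, data_inputs):
        return self.model.predict(data_inputs)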
To conclude, writing a production-level machine learning pipeline requires meeting many quality criteria, which hopefully can all be met by using the right design patterns and the right structure in your code. To sum up:
Thanks to Vaughn DiMarco for brainstorming on this with me and motivating me to write this article. Also thanks to our contributors, clients, and supporters who openly support the project.
The future is now. If you’d like to support this project too, we’ll be glad that you get in touch with us. You can also register to our updates and follow us.
def print_hello_world():
    print("Hello World!")

Output: Hello World!
We’ll be releasing Neuraxle 0.2.0 very soon on PyPI (so you’ll be able to pip install neuraxle). We’ll also post tutorials, articles, and updates here. Stay tuned, register below!