I am not a deep machine learning expert. This post is intended for those who want to learn more about the practical aspects of machine learning – where is the effort, really? This series of posts records my experiences as I explore the space.
What is Machine Learning?
So what is Machine Learning in practice? Most developers, I think, understand the basics: you get lots of data, throw it at the computer, and it “learns”. But how? And how clever is it when learning? My answer is “it’s not clever at all” – at least not “clever” in the way humans think of being clever. Machines are number crunchers. They can do lots of computations very fast. But machine learning is not about a computer “thinking”. Machine learning is about a human coming up with a theory, modelling it in a way a computer can understand, and then letting the computer optimize the model’s parameters to best fit the data by doing lots of number crunching.
For example, I have a data set from the net that records people viewing products, adding them to the cart, then purchasing. Not every viewed product is added to a cart, and not every cart is purchased. My question is “what should I do to get more people to purchase?” Well, that is too hard a question to ask. A more realistic question is “does viewing a product more than once indicate that the user is more likely to buy? If they view it three times, is that an indication they may want it, but it’s too expensive?” That is where machine learning can help.
Machine Learning Models
For example, you can come up with a model that predicts the probability of purchase based on the number of product views. You don’t know the numbers, but you have a rough idea of the shape of the curve and a formula that follows that general shape with a few parameters whose values you don’t know. You then feed all the observations so far to the model to “train” it, so it can work out the probability based on past events. Training is where the machine does lots of number crunching to work out the best parameter values to use.
Then consider a formula such as y = a * x + b, a linear equation – you feed in ‘x’ and you get your answer ‘y’. But what should ‘a’ and ‘b’ be? Machine learning can work out the best values of ‘a’ and ‘b’ to fit your sample input data. But if ‘y’ is a probability, the library won’t do things like make sure the line never gets larger than 1. You might need a different formula to guarantee the values stay between 0 and 1.
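To make the number crunching concrete, here is a minimal sketch with made-up data (not the real data set from this post): the machine’s job is just to find the ‘a’ and ‘b’ that best fit the samples.

```python
import numpy as np

# Made-up data generated from y = 0.5*x + 2 plus a little noise,
# so the fit should recover values close to a=0.5 and b=2.
rng = np.random.default_rng(42)
x = np.arange(0, 100, dtype=float)
y = 0.5 * x + 2 + rng.normal(0, 0.1, size=x.size)

# Least-squares fit of a degree-1 polynomial (a straight line):
# this is the "lots of number crunching" that works out 'a' and 'b'.
a, b = np.polyfit(x, y, 1)
print(round(a, 2), round(b, 2))
```

Note the machine did not “understand” anything – it just minimized the total error between the line and the sample points.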
Then you could go a step further. The view data has timestamps. Maybe the length of time between views also plays a factor. So I could consider both the number of views and the length of time since the first view. Machine learning can help here again – it can handle multiple parameters and optimize across them.
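Extending the idea to two inputs is the same kind of work for the library. A sketch with invented numbers (using scikit-learn’s LogisticRegression, which is my choice of tool here and also keeps the predicted probability between 0 and 1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: each row is [number_of_views, days_since_first_view]
# and the label is 1 if that user purchased, 0 otherwise.
X = np.array([[1, 0.1], [2, 0.5], [5, 7.0], [3, 1.0], [6, 10.0],
              [2, 0.2], [4, 1.5], [1, 3.0], [7, 14.0], [3, 0.8]])
y = np.array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1])

# The library optimizes one coefficient per input column plus an intercept;
# predict_proba guarantees the output stays between 0 and 1.
model = LogisticRegression().fit(X, y)
p = model.predict_proba([[3, 1.0]])[0, 1]  # estimated probability of purchase
print(0.0 <= p <= 1.0)
```

The human still decided which two inputs to feed in; the machine only tuned the numbers.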
Further, the machine learning library can return an estimate of reliability based on the data. It may be that the library comes back with the best ‘a’ and ‘b’ values, but the result is actually terrible because you tried to fit a straight line to data that is a curve. Sometimes this is where the value lies – not in the answer to the question, but in knowing whether your model is working well or not. Is there a relationship between ‘x’ (number of product views) and ‘y’ (probability of purchase)? If not, then move on to something more productive!
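To make the “terrible fit” case concrete, here is a sketch with synthetic data: the fit still dutifully returns its best ‘a’ and ‘b’, but a hand-computed R-squared score reveals the straight line explains almost nothing.

```python
import numpy as np

# Curved (sine-shaped) data: a straight line is clearly the wrong model.
x = np.linspace(0, 4 * np.pi, 200)
y = np.sin(x)

# polyfit still returns its "best" slope and intercept...
a, b = np.polyfit(x, y, 1)

# ...but R-squared (1 - residual variance / total variance) is near zero,
# telling us the linear model is a bad fit for this data.
residuals = y - (a * x + b)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(r2 < 0.2)  # a low score: time to pick a different model
```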
But how to express the model? The different machine learning packages come with a set of tools. It is up to a human to work out how best to adapt those tools to implement the model you want. “Linear regression”, for example, is where you fit a straight line through a series of points. But linear regression might not be the right approach if you know the line is not straight. You either need to use some clever maths equations to turn it into a straight line, or pick another tool from the set available. Again, this needs a human with skill.
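One example of the “clever maths equations” trick: if you suspect exponential growth, y = c·e^(k·x), then taking logs gives log(y) = k·x + log(c) – a straight line again, so plain linear regression applies. A sketch with synthetic, noise-free data:

```python
import numpy as np

# Synthetic exponential data: y = 3 * e^(0.8 * x).
x = np.linspace(0, 5, 50)
y = 3.0 * np.exp(0.8 * x)

# Fit a straight line to log(y) instead of y: the slope is k and
# the intercept is log(c), so we recover k = 0.8 and c = 3.0.
k, log_c = np.polyfit(x, np.log(y), 1)
print(round(k, 2), round(np.exp(log_c), 2))  # recovers k ≈ 0.8 and c ≈ 3.0
```

A human had to spot that the data looked exponential and know the log transform; the library only did the straight-line fitting.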
So far, then, a human has to come up with a question they want to answer, then work out a model for that question that is supported by the machine learning toolkit they are using. Great! What’s next?
Well, the next challenge I hit was that the input data I have is not in the right format to feed into the library I picked. In my reading around, a number of people have quoted figures like 70% to 80% of the effort these days actually going into massaging the data into the right format for the machine learning libraries to use. This is not “clever” work – it’s just necessary grunt work to clean up the data. Huge progress has been made in the machine learning libraries themselves, with lots of clever ways to work out optimal parameter values efficiently – so much so that that is no longer where the main effort lies.
For example, I am playing with “Datalab” on Google Cloud. It is built on Jupyter (a nice live-notebook environment for writing up your experiments). Python can be embedded directly amongst Markdown syntax, with the results of the Python code displayed directly in the page. Python is used as a glue language to call out to the different libraries (including the machine learning libraries).
One of the libraries for data massaging I have been using is called Pandas. It allows you to do some data manipulation without having to write too much code, but I have been bitten a few times as well. In practice, you can write down how you want the data massaged, but you still need to worry about performance. For example, I took the original input data (views, add-to-carts, purchases – with timestamps) and annotated each view with the purchase it was associated with (or noted that no purchase was made). I then tried to group the data by purchase to count the number of views per purchase. Well, you have to write the code “just right” or else performance kills you. I used some provided “group by” functionality, but the next morning the code had still not finished. So simple data cleanup worked well, but for fancier transformations it seems I just have to roll up my sleeves and write some serious code.
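For the simple counting case, the idea looks like this tiny sketch (invented rows in the shape described above, not my real data set). One thing I have since learned: built-in aggregations such as size() stay on Pandas’ fast compiled path, whereas pushing a Python function into the group loop (e.g. via apply()) runs per group in the interpreter and can be dramatically slower.

```python
import pandas as pd

# Invented rows: each product view is annotated with the purchase it
# was associated with (None where the view led to no purchase).
views = pd.DataFrame({
    "product": ["a", "a", "b", "c", "b", "b"],
    "purchase_id": ["p1", "p1", "p2", None, "p2", "p2"],
})

# Count views per purchase with a built-in aggregation; by default
# groupby also drops the None (no-purchase) rows for us.
views_per_purchase = views.groupby("purchase_id").size()
print(views_per_purchase.to_dict())  # {'p1': 2, 'p2': 3}
```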
To wrap up, this post was not trying to go into detail, but rather to give a feel for where the effort is with machine learning. Humans are still needed to come up with theories and test them. Humans need to know the different forms of models the libraries support and pick the right one for the problem. They also need to draw conclusions from the results (e.g. determining whether the model is a good predictor for the question you have). The machine learning libraries help for sure – but at the number crunching level. And don’t underestimate the effort to get data into the right format for a machine learning library to use. The libraries are still fussy. They won’t clean up the data for you, even for the simplest of tasks.
The tools are, however, getting better, and quickly. The machine learning engines are getting more and more accessible. Using Google Cloud Datalab I did not have to install anything locally – I am doing all the coding in a web browser directly in Datalab (I am sure the other platforms have similar capabilities). You may have noticed that I am also talking about pretty simple questions – nothing like speech or image recognition. But there are more and more libraries from clever people providing this functionality for you. That is what makes this field so exciting. Google, Microsoft, Amazon, etc. are in a race at present, all providing better and better tools, commoditizing what would have been a dream just a few years back.
The other aspect I have not touched on is what to do with such a model. One example of an action would be to offer a discount if the model predicts a user is interested but not likely to purchase. Another would be to include them in follow-up email campaigns. Doing something for a specific user based on their actions is one form of personalization. But that opens up a new can of worms – experimentation. If you offer a discount, how do you get confidence that the discount helped close a sale you would have lost, rather than just reducing your profit margin on something they were going to purchase anyway? Running split tests across different users can help here (offer the discount to some users but not all) – but this assumes you get enough customers to be able to run the test and get meaningful results.
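As a sketch of the “meaningful results” question, here is a hand-rolled two-proportion z-test with made-up numbers. A z value above roughly 1.96 suggests the lift between the two groups is unlikely to be pure chance at the usual 5% level – and with fewer customers the same percentage lift would fall below that bar.

```python
import math

# Made-up split test: group A was offered the discount, group B was not.
n_a, buys_a = 1000, 120   # discount group: 12% purchased
n_b, buys_b = 1000, 90    # control group:   9% purchased

# Pooled two-proportion z-test: how many standard errors apart
# are the two purchase rates?
p_a, p_b = buys_a / n_a, buys_b / n_b
p_pool = (buys_a + buys_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
print(round(z, 2))  # z ≈ 2.19, above 1.96, so the lift looks real
```

Run the same numbers with 100 customers per group instead of 1000 and z drops well below 1.96 – exactly the “enough customers” problem.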
For me, the next step is back to the drawing board for my data cleansing step. Using Pandas saved me lots of code for the easy steps, but when things got complicated it proved not to be so useful. The abstraction hides really slow computations from me, meaning I never know how fast or slow the next thing I try will be. (Remember I am a newbie here!) Once I have that done, I plan to do a quick write-up on Jupyter – it’s pretty cool.