Tuesday, February 17, 2015

Introduction to Prediction.IO, an open-source Machine Learning framework



 What is Prediction.IO in a nutshell?


            Building a machine learning application from scratch is hard: you need to work with your own data and train your algorithms on it, build a layer to serve the prediction results, manage the different algorithms you are running and their evaluations, deploy your application to production, manage the dependencies with your other tools, and so on.
Prediction.io is an open source Machine Learning server that addresses these concerns. It aims to be the “LAMP stack” for data analytics.

Current state of Machine Learning frameworks

           
            Let's first review some of the tools that are currently popular in the Machine Learning (ML) community. Some widely used tools are: Mahout in the Hadoop ecosystem, MLlib in the Spark community, H2O, and DeepLearning4j.

These libraries generally work well and provide implementations of the main ML algorithms. However, what is missing in order to use them in a production environment?
-       An integration layer to bring in your data sources
-       A framework to roll a prototype into production
-       A simple API to query the results

Example

            Let's take a classic recommender as an example: predictive modeling is usually based on users' behavior to predict product recommendations.

We will convert the raw data (in JSON) into the binary Avro format for storage. Reading it back for training looks something like:
// Read training data and parse each comma-separated line
val trainingData = sc.textFile("trainingData.txt").map(_.split(',') match { ... })

which yields something like:
user1 purchases product1, product2
user2 purchases product2

Then build a predictive model with an algorithm:
// collaborative filtering algorithm
val model = ALS.train(trainingData, 10, 20, 0.01) // rank = 10, iterations = 20, lambda = 0.01

Then start using the model:
// recommend the top 5 products for every user
allUsers.foreach { user => model.recommendProducts(user, 5) }

This recommends 5 products for each user.
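To make the idea concrete without a Spark cluster, here is a minimal, self-contained Scala sketch of the simplest behavior-based recommender: item co-occurrence counting rather than ALS matrix factorization. The data and names are purely illustrative, not part of any Prediction.io API.

```scala
// Minimal co-occurrence recommender: products bought together are related.
object CooccurrenceDemo {
  // (user, product) purchase events, as in the example above
  val purchases = Seq(
    ("user1", "product1"), ("user1", "product2"),
    ("user2", "product2"), ("user2", "product3")
  )

  // For a given product, count how often other products appear
  // in the purchase histories of users who bought it.
  def related(product: String): Seq[(String, Int)] = {
    val buyers = purchases.collect { case (u, p) if p == product => u }.toSet
    purchases
      .collect { case (u, p) if buyers(u) && p != product => p }
      .groupBy(identity)
      .map { case (p, hits) => (p, hits.size) }
      .toSeq
      .sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    // Users who bought product2 also bought product1 and product3
    println(related("product2"))
  }
}
```

ALS does much better than raw co-occurrence on sparse data, but the input and output shapes are the same: behavior events in, ranked product lists out.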

This code works in a development environment, but wouldn't work in production. Why?
- How do you integrate with your existing data?
- How do you unify data from multiple sources?
- How do you deploy a scalable service that responds to dynamic prediction queries?
- How do you persist the predictive model in a distributed environment?
- How do you make your storage layer, Spark, and the algorithms talk to each other?
- How do you prepare the data for model training?
- How do you update the model with new data, without downtime?
- Where does the business logic get added?
- How do you make the code configurable, reusable, and manageable?
- How do you build all of this with separation of concerns (SoC), as on the web development side of things?
- How do you make things work in a real-time environment?
- How do you customize the recommender on a per-location basis? How do you discard products that are out of inventory?
- How about running different tests on the algorithms you selected?


Prediction.io to the rescue!


Let’s address the above questions.
Prediction.io provides an event server for storage: it collects data (say, from a mobile app, a website, etc.) in a unified way, from multiple channels.

You can plug multiple engines within Prediction.io; each engine represents a type of prediction problem. Why is that important?
In a production system, you will typically use multiple engines. Take the archetypal Amazon example: "if you bought this, we recommend that." But you may also run a different algorithm on the front page for article discovery, and yet another one for email retargeting campaigns based on what you browsed. Prediction.io handles this very well.

How do you deploy a predictive model as a service? A typical mobile app will send user actions as behavior data. Your prediction model is trained on these events, and the Prediction.io engine is deployed as a web service, so your mobile app can communicate with the engine through a REST API. If that is not sufficient, SDKs are available for several languages. The engine returns a list of results in JSON format.
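As a hedged sketch of the client side, the query is just a small JSON object POSTed to the deployed engine (conventionally on port 8000 at `/queries.json`). The field names `user` and `num` follow the recommendation template's convention; check your engine's Query class for the actual fields.

```scala
// Sketch of the JSON query body a client would POST to a deployed
// engine, e.g. to http://localhost:8000/queries.json.
object QueryDemo {
  case class Query(user: String, num: Int)

  // Hand-rolled serialization to keep the sketch dependency-free;
  // a real client would use one of the SDKs or a JSON library.
  def toJson(q: Query): String =
    s"""{"user":"${q.user}","num":${q.num}}"""

  def main(args: Array[String]): Unit = {
    // Ask the engine for 5 recommendations for user1
    println(toJson(Query("user1", 5)))
  }
}
```

The engine's JSON response is then a ranked list of item/score pairs that the app can render directly.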
[Figure: Prediction.io interacting with a mobile app]


Prediction.io manages the dependencies between Spark, HBase, and the algorithms automatically. You can launch it with a one-line command.

When using the framework, it doesn't act as a black box: Prediction.io is one of the most popular ML projects on GitHub (5,000+ contributors).

The framework is open-source and written in Scala, which takes advantage of JVM support and is a natural fit for distributed computing; R, in comparison, is not so easy to scale. Prediction.io also uses Spark, currently one of the best distributed computing frameworks, proven to scale in production. Algorithms are implemented with MLlib. Lastly, events are stored in Apache HBase, the NoSQL storage layer.


Preparing the data for model training is a matter of running the event server (launched via `pio eventserver`) and interacting with it, by defining the action (e.g. change the product price), the product (e.g. give rating A to product x), the product name, and attribute names, all in free format.
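A sketch of what gets sent to the event server's `/events.json` endpoint: each event names an action (`event`), the acting entity, and optionally a target entity. The field names below follow the event server's API; the user and product IDs are illustrative.

```scala
// Sketch of a "buy" event as the event server expects it:
// an action, the acting entity, and the target entity.
object EventDemo {
  def buyEvent(userId: String, productId: String): String =
    s"""{"event":"buy","entityType":"user","entityId":"$userId",""" +
    s""""targetEntityType":"item","targetEntityId":"$productId"}"""

  def main(args: Array[String]): Unit =
    println(buyEvent("user1", "product1"))
}
```

POSTing such payloads from your app (or via an SDK) is all the integration the storage side needs; the engine reads them back at training time.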

Building the engine is made easy because Prediction.io offers templates for recommendation and classification. The engine is built on a DASE architecture (think MVC for machine learning) and has the following components:

- Data source: data comes from any backing store and is preprocessed automatically into the desired format. Data is prepared and cleansed according to what the engine expects, following the separation of concerns principle.
- Algorithm(s): the ML algorithms at your disposal, with the ability to combine several of them.
- Serving layer: serves results based on the predictions, and lets you add custom business logic to them.
- Evaluator layer: evaluates the performance of the predictions so you can compare algorithms.
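The separation above can be pictured as a few small interfaces. This is a deliberately simplified, hypothetical sketch: the framework's real component traits carry more type parameters and Spark contexts, and the toy "algorithm" here just echoes the training data.

```scala
// Simplified sketch of the engine components and how they compose.
object DaseDemo {
  trait DataSource[TD] { def readTraining(): TD }
  trait Algorithm[TD, M, Q, P] {
    def train(data: TD): M          // build a model from training data
    def predict(model: M, query: Q): P // answer a query with the model
  }
  trait Serving[Q, P] { def serve(query: Q, predictions: Seq[P]): P }

  // Toy instance: the "model" is just a map from user to products.
  object Source extends DataSource[Map[String, Seq[String]]] {
    def readTraining() = Map("user1" -> Seq("product1", "product2"))
  }
  object Algo extends Algorithm[Map[String, Seq[String]], Map[String, Seq[String]], String, Seq[String]] {
    def train(data: Map[String, Seq[String]]) = data
    def predict(model: Map[String, Seq[String]], user: String) =
      model.getOrElse(user, Seq.empty)
  }

  // Wire the components together: read, train, predict.
  def run(user: String): Seq[String] =
    Algo.predict(Algo.train(Source.readTraining()), user)
}
```

Because each concern sits behind its own interface, you can swap the algorithm (or run several) without touching the data source or the serving logic.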

Of note, MLlib has recently improved its API to address some of these concerns (e.g. the ML pipeline API).

In summary, Prediction.io believes the functions of an engine should be to:
-       Train deployable predictive model(s)
-       Respond to dynamic queries
-       Evaluate the algorithm being used


How to get started?


The best way to start is to:
-       Download the code from GitHub
-       Get one of the templates: everything you need will already be laid out and set up, and the template can be modified to your needs.

The whole stack can be installed with a single command. You can then start and deploy the event server, and update the engine's model with new data.




