AWS Machine Learning on AWS Redshift Data
After I built the data warehouse on AWS Redshift and visualized the data with AWS QuickSight, I wondered what else I could do to analyze the data or predict patterns in it. So I tried AWS Machine Learning (ML) to make predictions on the data. In this post, I will show you how to build and train a predictive model on the sales data from the AWS Redshift data warehouse that we created in the previous post. AWS ML algorithms discover patterns in data and construct mathematical models from those discoveries. You can then use the models to make either real-time or batch predictions on future data. From the predictive model and batch predictions, we will see how AWS Machine Learning (ML) can help us make better business decisions.
Overview
Definition
Amazon Machine Learning (Amazon ML) is a robust, cloud-based service that makes it easy for developers of all skill levels to use machine learning technology. It provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. Amazon Machine Learning is a managed service for building ML models and generating predictions that enable the development of robust, scalable smart applications.
Key Concepts
AWS ML has five key concepts:
- Datasources contain the metadata associated with data inputs to Amazon ML. You can import data from either S3 or Redshift; in our case study, the input data comes from Redshift.
- ML Models generate predictions using the patterns extracted from the input data.
- Evaluations measure the quality of ML models.
- Batch Predictions asynchronously generate predictions for multiple input data observations. I will demo batch predictions on the training data.
- Real-time Predictions synchronously generate predictions for individual data observations.
Processing Flow
Here is the processing flow diagram of this test case:
After the data warehouse is launched on Redshift, load the raw data from S3 into Redshift. Then set up the data set to do analytics and visualization on QuickSight. If you want to know how to set up Redshift and QuickSight visualization, please review my online training course AWS Data Warehouse – Build with Redshift and QuickSight.
We can then create a datasource from Redshift for ML to run batch predictions, and store the prediction results in an S3 bucket.
AWS Redshift
Please review the previous post for launching the data warehouse on Redshift and loading the data.
AWS Machine Learning Prediction Steps
1. IAM Permission Setup
Before you can create a datasource with Amazon Redshift data, you must set up IAM permissions that allow Amazon ML to export data from Amazon Redshift. So you need to do:
- First, define an IAM role for AWS Machine Learning and attach the AmazonMachineLearningRoleforRedshiftDataSource policy to this new role.
- Then create a trust policy that allows Amazon ML to assume the role, and attach it to the role.
- In addition, if you want your IAM user to pass the role that you just created to Amazon ML, you must attach a permissions policy with the iam:PassRole permission to your IAM user.
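As a sketch, the trust policy attached to the role would allow the Amazon ML service principal to assume it (this is the standard IAM trust-policy shape; adapt it to your account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```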
Please review the step-by-step instruction in the video.
2. Create Datasource for ML
Since the data is ready in Redshift for this demo, let's use the Create Datasource wizard in the Amazon Machine Learning (Amazon ML) console to create a datasource object. When you create a datasource from Amazon Redshift data, you specify the cluster that contains your data and the SQL query to retrieve it. Amazon ML executes the query by invoking the Redshift UNLOAD command on the cluster, stores the results in the Amazon Simple Storage Service (Amazon S3) location of your choice, and then uses the data in S3 to create the datasource. The datasource, the Amazon Redshift cluster, and the S3 bucket must all be in the same region.
Please review the video on the following steps:
- Enter the required parameters in the Create Datasource wizard (e.g. cluster, database name/password, IAM role, query, S3 staging location)
- Review the data types of all attributes and select the target column
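The wizard parameters above map directly onto the Amazon ML API. As a sketch, assuming the boto3 `machinelearning` client and placeholder identifiers throughout (cluster, database, bucket, and datasource names are all hypothetical), the same datasource could be created programmatically:

```python
# Sketch: creating an Amazon ML datasource from Redshift with boto3.
# All identifiers below are placeholders for illustration.

def build_redshift_data_spec(cluster_id, database, username, password,
                             query, s3_staging):
    """Assemble the DataSpec for create_data_source_from_redshift."""
    return {
        "DatabaseInformation": {
            "DatabaseName": database,
            "ClusterIdentifier": cluster_id,
        },
        "SelectSqlQuery": query,
        "DatabaseCredentials": {"Username": username, "Password": password},
        "S3StagingLocation": s3_staging,
    }

def create_datasource(role_arn, data_spec):
    """Create the datasource; Amazon ML runs UNLOAD and stages data in S3."""
    import boto3  # imported lazily so the spec builder has no AWS dependency
    ml = boto3.client("machinelearning")
    return ml.create_data_source_from_redshift(
        DataSourceId="sales-redshift-ds",
        DataSourceName="Sales data from Redshift",
        DataSpec=data_spec,
        RoleARN=role_arn,
        ComputeStatistics=True,  # statistics are required before model training
    )
```

The IAM role from the previous step is passed as `RoleARN`, which is why the `iam:PassRole` permission matters.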
3. Create ML Model
In this case study, I will use the default path in the Create ML model wizard, which splits the input datasource and uses the first 70% as the training datasource and the remaining 30% as the evaluation datasource. You can also customize the split by using the Custom option in the Create ML model wizard, for example selecting a random 70% sample for training and using the remaining 30% for evaluation.
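Under the hood, the split is expressed as a DataRearrangement JSON string on each datasource. A minimal sketch, assuming the Amazon ML splitting shape (the helper name is our own):

```python
# Sketch: building the DataRearrangement strings Amazon ML uses to split
# one input datasource into training and evaluation portions.
import json

def split_rearrangement(percent_begin, percent_end,
                        strategy="sequential", random_seed=None):
    """Return a DataRearrangement JSON string for a percentage-based split."""
    splitting = {
        "percentBegin": percent_begin,
        "percentEnd": percent_end,
        "strategy": strategy,
    }
    if random_seed is not None:
        splitting["strategyParams"] = {"randomSeed": random_seed}
    return json.dumps({"splitting": splitting})

# Default path: first 70% for training, remaining 30% for evaluation.
training_split = split_rearrangement(0, 70)
evaluation_split = split_rearrangement(70, 100)
```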
4. Evaluate ML Model
Evaluating the ML model helps you determine whether it will do a good job of predicting the target on new data. Ours is a regression ML model, so the evaluation reports the root mean square error (RMSE) metric: a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the RMSE, the better the predictive accuracy of the model; a model with perfectly correct predictions would have an RMSE of 0. We can tune the learning process through data rearrangement. For example, the evaluation datasource defaults to a sequential split, {"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"sequential"}}, which can be changed to a random split: {"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"random", "strategyParams":{"randomSeed":"RANDOMSEED"}, "complement":"true"}}
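The RMSE described above is straightforward to compute yourself if you want to sanity-check an evaluation:

```python
# RMSE: the distance between predicted values and the ground truth.
import math

def rmse(predictions, targets):
    """Root mean square error between predicted and actual numeric values."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)
```

A model with perfectly correct predictions scores 0, e.g. `rmse([3.0, 5.0], [3.0, 5.0]) == 0.0`.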
5. Create Batch Prediction
The batch prediction uses the Amazon ML model and selects from the datasource that we used to build the model. The datasource you use to obtain batch predictions must have the same schema as the datasource you used to train the ML model.
In our case, it took one minute to compute 30,382 records. The console provides two metrics under processing information: Records seen and Records failed to process. Records seen tells you how many records Amazon ML looked at when it ran your batch prediction; Records failed to process tells you how many records Amazon ML could not process. Download the prediction file (.gz) from the S3 bucket and unzip it as a .csv file. The file contains two columns: trueLabel (the actual totalsold in our case) and score (the predicted sales). The score is the raw numeric prediction for each observation in the input data; the values are reported in scientific notation.
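Once downloaded, the file is easy to work with programmatically. A small sketch, assuming the two-column layout described above (the helper name is our own; scientific-notation scores parse directly with `float`):

```python
# Sketch: reading an Amazon ML batch prediction file (.csv.gz) into
# (trueLabel, score) pairs for further analysis.
import csv
import gzip

def load_predictions(path_or_file):
    """Parse the prediction file into a list of (actual, predicted) floats."""
    with gzip.open(path_or_file, mode="rt", newline="") as f:
        reader = csv.DictReader(f)
        # score is written in scientific notation, e.g. 1.234E1 -> 12.34
        return [(float(row["trueLabel"]), float(row["score"]))
                for row in reader]
```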
Cost
It cost me $3.72 to run this experiment; yours will likely be around $3 to $5. Here is the cost summary:
- Predict fee: $0.10 per block of 1,000 batch predictions, rounded up to the next 1,000. In this case, 30,382 records round up to 31 blocks, totaling $3.10
- Compute fee: The compute price is $0.42 per hour
- DataStats: 1.372Hrs x $0.42 = $0.58
- EvaluateModel: 0.068 Hrs x $0.42 = $0.03
- TrainModel: 0.039 Hrs x $0.42 = $0.02
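The bill above can be reproduced from the listed prices (prices at the time of this post; check current AWS pricing):

```python
# Reproducing the cost breakdown from the listed prices.
import math

PREDICT_PRICE_PER_1000 = 0.10  # per block of 1,000 batch predictions
COMPUTE_PRICE_PER_HOUR = 0.42

def batch_prediction_fee(num_records):
    """Blocks are rounded up to the next 1,000 records."""
    return math.ceil(num_records / 1000) * PREDICT_PRICE_PER_1000

compute_hours = 1.372 + 0.068 + 0.039  # DataStats + EvaluateModel + TrainModel
total = batch_prediction_fee(30382) + compute_hours * COMPUTE_PRICE_PER_HOUR
print(f"${total:.2f}")  # $3.72
```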
Conclusion
This is just an experiment on sample data. AWS Machine Learning (ML) is definitely worth trying with your real application and data. To define a useful business evaluation model, you should work closely with your domain experts to avoid false assumptions and identify the baseline. You can increase the model's quality by training it on more recent, higher-quality data. To improve the model's predictive accuracy, you can tune the learning process, for example by rearranging the data with a different split strategy, adding additional types of information, and transforming the data to optimize learning.