Machine Learning

Ultimate step-by-step tutorial for machine learning on AWS

January 23, 2020

Hey! You are back again for another great machine learning tutorial with us. This time, we will be using Amazon Web Services (AWS) to build our machine learning model. The Amazon Machine Learning platform lets you build predictive applications from your own data using mathematical models and algorithms. You can import and export your data from Redshift, RDS, and Amazon S3, then visualize it on the AWS Management Console. It has emerged as a popular platform largely due to the ease of building ML models, affordable pricing, and high-performance output (with which you can run billions of predictions!). Sounds cool, right?

Let us now explore what you can get out of this tutorial and what we are going to learn!

At the end of this tutorial, you will be able to:

  1. Know what AWS is and how it is used to build machine learning models.
  2. Understand the concepts involved in Amazon Machine Learning.
  3. Know the steps involved in building an ML model through AWS.
  4. Request and process real-time predictions using your model on the web.

Let us get started!

1. What are the concepts of Amazon Machine Learning?

In this part, you will get to know the concepts underlying machine learning modeling on AWS. Understanding these key concepts is crucial, as it will help us build the right predictive model using Amazon Machine Learning (AML). Here they are:

A. Datasources

Datasources contain the metadata about the input data we use. Amazon ML computes descriptive statistics of the input data and stores them along with a schema. This information is stored as a datasource object. The stored datasource object is then utilized by the Amazon ML engine to train a model, evaluate it, and generate output predictions. It must be noted that a ‘datasource’ stores only a reference to the location of the data in Amazon S3, not a copy of the input data itself. If the location of the data file on S3 changes, the Amazon ML engine will not be able to run the ML model.
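As a minimal sketch, here is how a datasource might be created programmatically through boto3’s machinelearning client; the bucket, file names, and IDs below are placeholders of our own choosing, not values from this tutorial:

```python
import boto3

# The 'machinelearning' client exposes the Amazon ML API.
client = boto3.client('machinelearning', region_name='us-east-1')

client.create_data_source_from_s3(
    DataSourceId='ds-banking-data-1',      # hypothetical ID
    DataSourceName='Banking data 1',
    DataSpec={
        # Only a reference to this location is stored, not a copy.
        'DataLocationS3': 's3://xyz-bucket/banking.csv',
        # Schema describing each attribute; it can also be passed
        # inline as a JSON string via the 'DataSchema' key.
        'DataSchemaLocationS3': 's3://xyz-bucket/banking.csv.schema',
    },
    ComputeStatistics=True,  # compute the descriptive statistics
)
```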

B. ML Models

ML models are mathematical models that run learning algorithms to find patterns in the input data and then generate predictions based on their ‘training.’ A user can create three types of ML models on AML (a short code sketch follows this list). They are:

  • Binary classification models: Such an ML model predicts one of two possible outcomes, for example ‘yes’ or ‘no’ results.
  • Multi-class classification models: As the name suggests, these models predict one of three or more possible outcomes, for example predicting the income category from a country’s Census data.
  • Regression models: These ML models are used to find relationships between the features of a given dataset. They output a numeric value as their prediction, for example a model predicting the sales of a particular product based on previous data.
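To make this concrete, here is a hedged sketch of creating a model through the API; the IDs are hypothetical, and MLModelType would be 'MULTICLASS' or 'REGRESSION' for the other two model types:

```python
import boto3

client = boto3.client('machinelearning', region_name='us-east-1')

# MLModelType selects which of the three model types Amazon ML trains:
# 'BINARY', 'MULTICLASS', or 'REGRESSION'.
client.create_ml_model(
    MLModelId='ml-banking-data-1',             # hypothetical ID
    MLModelName='ML model: Banking data 1',
    MLModelType='BINARY',                      # two possible outcomes
    TrainingDataSourceId='ds-banking-data-1',  # datasource from earlier
)
```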

C. Evaluations

Evaluations are used to determine the performance and prediction accuracy of an ML model. The following terms are used while evaluating a model (a sketch computing the binary metrics follows this list):

  • Model insights: This feature of Amazon ML allows the user to gain insights into and evaluate their predictive ML models.
  • Macro-averaged F1-score: This evaluation parameter is used to determine the prediction accuracy of a multi-class ML model.
  • Area under the ROC curve (AUC): This metric determines the performance of a binary ML model by measuring its ability to score positive examples higher than negative ones.
  • Root Mean Square Error (RMSE): This evaluation parameter is used to determine the prediction accuracy of a regression ML model.
  • Accuracy: Accuracy measures the percentage of correct predictions.
  • Precision: This metric represents the fraction of examples predicted as positive that are actually positive.
  • Recall: This parameter represents the fraction of actual positives that are predicted as positive.
  • Cut-off: The cut-off value converts the numeric prediction scores of the ML model into 0 or 1.
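Here is a minimal, self-contained sketch of the binary metrics above, computed on toy data (not data from this tutorial):

```python
# Toy labels and prediction scores to illustrate the metrics.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1]
cutoff = 0.5  # the cut-off converts scores into 0/1 labels
y_pred = [1 if s > cutoff else 0 for s in scores]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy  = (tp + tn) / len(y_true)  # share of correct predictions
precision = tp / (tp + fp)           # predicted positives that are real
recall    = tp / (tp + fn)           # real positives that were found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```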

D. Batch predictions

A batch prediction runs predictions for a whole set of observations at once. Batch predictions are ideal for analyzing data whose predictions are not required in real time. In AML, the batch prediction results are saved in the output location of an S3 bucket. A ‘manifest file,’ located in the same S3 output location, maps each batch prediction result file to its input data file.
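A hedged sketch of requesting a batch prediction through the API; all IDs and the output URI below are placeholders:

```python
import boto3

client = boto3.client('machinelearning', region_name='us-east-1')

client.create_batch_prediction(
    BatchPredictionId='bp-banking-batch-1',      # hypothetical ID
    BatchPredictionName='Batch prediction: Banking data 1',
    MLModelId='ml-banking-data-1',
    BatchPredictionDataSourceId='ds-banking-batch-1',
    # The results and the manifest file land in this S3 location.
    OutputUri='s3://xyz-bucket/batch-output/',
)
```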

E. Real-time predictions

If an application has low-latency requirements, it can utilize real-time predictions. Such models are employed in mobile, desktop, or web applications. You can query an ML model for real-time predictions using the low-latency real-time prediction API. In AML, these are the key terms associated with real-time predictions (a sketch follows this list):

  • Real-time prediction API: This API enables the model to take in a single observation through the request payload and generate the prediction output in real time.
  • Real-time prediction endpoint: A real-time prediction endpoint is created to use the ML model with the real-time prediction API. Creating this endpoint gives you a URL that can be used for requesting real-time predictions.
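A minimal sketch of both steps with boto3, reusing the hypothetical model ID from earlier; note that a newly created endpoint can take a little while to become ready:

```python
import boto3

client = boto3.client('machinelearning', region_name='us-east-1')

# Creating the endpoint returns a URL to send prediction requests to.
endpoint = client.create_realtime_endpoint(MLModelId='ml-banking-data-1')
url = endpoint['RealtimeEndpointInfo']['EndpointUrl']

# A single observation is passed as a dict of attribute name -> string
# value; the attribute names here are hypothetical.
response = client.predict(
    MLModelId='ml-banking-data-1',
    Record={'age': '32', 'job': 'services', 'marital': 'divorced'},
    PredictEndpoint=url,
)
print(response['Prediction'])  # predicted label and scores
```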

Those were some key concepts for ML on AWS. Your familiarity with these concepts will help as we build the ML model coming up. Let us start by looking at the steps involved in building a machine learning model on the Amazon ML engine.

2. Steps involved in building an ML model on AWS

In this part, we will go through the basic steps involved in building an ML model on AWS. These steps will be followed by an example model, where we will build something meaningful using this knowledge. Let us keep rolling!

These are the basic steps to build an ML application on AWS:

Step 1. Formulate the problem

Before creating a machine learning application, we must know what we want our model to predict. These predictions are known as ‘target’ or ‘label’ answers. For example, we can build a machine learning model to predict ‘potential sales’ of a product before manufacturing it. In this case, our label answer predictions would be ‘Potential sales.’ Defining a problem depends upon the business need or other use cases.

We can further specify our problem for meaningful predictions. Suppose we decide to go with ‘prediction of sales’ of a product. We can then get a numerical prediction of ‘the number of purchases’ customers will possibly make. Or, we can get a prediction against a ‘threshold value, say 50’ to see if the product will cross that sales threshold of 50. In both cases, different algorithms will be employed depending upon our use case: the former prediction calls for a ‘regression’ model, and the latter for a ‘binary classification’ model.

Step 2. Gathering the ‘labeled’ data

A large amount of data is a prerequisite for any machine learning model. We must possess ‘labeled’ data, i.e., data for which we already know the target answer. We will then use this labeled data to train our ML model with supervised learning. To train our model for predictions, the labeled data must have the following two attributes:

  • Label or target: We need to make sure that the data we are using has labeled (correct answers) examples. Our ML model will be trained with this data and the correct labeled answers will guide it to make accurate predictions. 
  • Features or variables: Features represent the ‘attributes’ or ‘variable’ properties of an observation/example. Take for instance a demographic dataset. Individuals in that data set will represent ‘examples’ and their age, height, weight, date-of-birth, etc. will represent the features/variables. An ML model makes predictions on the basis of correlations and patterns existing between different features.

Pre-processing the data

To train machine learning models, you must have both positive and negative examples. For instance, if we are building an ML model that classifies an email as ‘spam’ or ‘not spam,’ then we need training examples for both spam and not-spam emails. Often, we get ‘raw’ data to run through ML models. In order to get the right results, we must prepare/pre-process the data (or clean it!). Once the labeling is done, the data must then be converted into the format suitable for our model’s algorithm. Amazon ML takes data in CSV format, where each row represents an example, each column represents a feature, and one column contains the target answer.
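For instance, a tiny CSV in the expected shape might look like the following; the columns are hypothetical, with the last column holding the target answer:

```
age,job,marital,duration,y
32,services,divorced,110,0
41,admin,married,240,1
```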

Step 3. Data analysis before training the model

The quality of predictions by your model depends upon the quality of the data it is fed. It is good practice to analyze the data for any possible issues before we use it for our ML model. Here is a checklist to keep in mind:

  • Know the feature and target data summaries: Know the features of the examples and the values associated with each. You must know which values are dominant in order to evaluate whether the data was collected properly, and check for missing or incorrect values to make the data reliable.
  • Feature-label correlation: Having an idea of the target answer and its relation to the variable/feature values will help us build an effective model. Variables with high correlation to the label answer must be included for correct predictions, as they carry high predictive power (or signal).

Amazon ML allows you to create a data source and analyze it directly through a data report.
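As a quick, hedged illustration of these checks with pandas (outside Amazon ML), assuming the banking CSV used in the tutorial below, with ‘y’ as the target column:

```python
import pandas as pd

df = pd.read_csv('banking.csv')

# Feature and target summaries: dominant values, ranges, missing entries.
print(df.describe(include='all'))
print(df.isna().sum())

# Correlation of each numeric feature with the binary target 'y'.
print(df.select_dtypes('number').corr()['y'].sort_values(ascending=False))
```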

Step 4. Transforming features/Processing variables

Feature transformation, or feature processing, is the procedure of ‘changing’ or ‘transforming’ the variables to suit our ML model. For example, in a given dataset, a ‘date’ value can be transformed into the hour of the day, day of the week, and month to become meaningful. The feature processing checklist includes the following (a sketch of binning and Cartesian products follows this list):

  • Replacing missing values and invalid values in the dataset.
  • Non-linear transformation: This may include operations like transforming numeric variables into categories. This is done when the correlation between the target and a numeric feature is not linear. Numeric features are ‘binned’ to represent a ‘range’ or ‘category’ for target prediction. The ML model can then find a linear relationship between the bin category feature and the predicted value.
  • Cartesian product of one variable with another: Two different feature values can be combined into a Cartesian product, which represents a new feature to aid the model. For example, student education level (High school, College, Post-graduate) is one feature, and city (Delhi, Mumbai, Dehradun) is another. The Cartesian product of the two will be something like High school_Delhi, College_Delhi, Post-graduate_Delhi, High school_Mumbai, College_Mumbai, Post-graduate_Mumbai, High school_Dehradun, College_Dehradun, and Post-graduate_Dehradun.
  • Domain/field-specific variable transformation: For example, given features of length and breadth, a new feature value ‘area’ can be derived in the data columns.
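A minimal pandas sketch of the binning and Cartesian product ideas on a toy frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    'age':       [22, 35, 58, 41],
    'education': ['High school', 'College', 'Post-graduate', 'College'],
    'city':      ['Delhi', 'Mumbai', 'Dehradun', 'Delhi'],
})

# Binning: turn the numeric 'age' into categorical ranges so a linear
# model can capture a non-linear relationship.
df['age_bin'] = pd.cut(df['age'], bins=[0, 30, 50, 100],
                       labels=['young', 'middle', 'senior'])

# Cartesian product: combine two categorical features into one new
# feature, e.g. 'College_Delhi'.
df['education_city'] = df['education'] + '_' + df['city']
print(df)
```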

Transforming the existing feature values and introducing new meaningful variables can boost a model’s prediction ability.

Step 5. Splitting the data

Why do we need to split the data? Simple: because we are going to use a major chunk of the data to train the ML model. The other part of the data will be used for predictions and for testing the model’s accuracy.

The general practice for splitting the data into ‘training’ and ‘evaluation’ subsets is to split the labeled data in a 70/30 ratio. This ratio can also be 80/20; however, a 70/30 split is usually preferred. The ML model will be trained on the ‘training’ dataset (70%) and make predictions on the ‘evaluation’ dataset (30%). To choose a model that will predict accurately on ‘fresh data’ (once the model is deployed), the one that performs best on the ‘evaluation’ subset is preferred.

Amazon ML splits the dataset in a 70/30 ratio by default. A custom split can also be specified using the Amazon ML APIs, and you can choose to split the data in the said ratio ‘randomly’ or ‘sequentially,’ as per your choice.
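A hedged sketch of what such a custom split might look like, passed as the DataRearrangement string of a datasource’s DataSpec (the percentages and strategy shown are just an example):

```python
import json

# Take the first 70% of the data, sampled randomly, for training;
# an evaluation datasource would use percentBegin=70, percentEnd=100.
rearrangement = json.dumps({
    'splitting': {
        'percentBegin': 0,
        'percentEnd': 70,
        'strategy': 'random',
    }
})
```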

Step 6. Training your ML model

Now, it is time to plug a learning algorithm into our model. We will begin by feeding our prepared data to this algorithm so that it can learn. Our algorithm will eventually be able to map the relationships and patterns between the variables and the target value. Once the model’s algorithm is ‘trained,’ it can be used to predict label values on new data.

Amazon ML employs linear models for building predictive applications. A linear model is simply a model specified for linear feature relationships. During training, the learning algorithm estimates a ‘weight’ for each variable, which it then uses to compute the predicted target value.

Learning algorithms

A learning algorithm’s job is to figure out the ‘weights’ of the model, i.e., to determine whether the occurring patterns are actual ‘relationships’ between the features and the target value. A learning algorithm is equipped with a ‘loss function’ and an ‘optimization function’ to calibrate the output prediction. The loss function reflects by how much the predicted value was off from the real, known value; the optimization function aims at minimizing the loss function.

On Amazon ML, the following learning algorithms are available (a minimal SGD sketch follows this list):

  • Logistic regression for binary classification models
  • Multinomial logistic regression for multiclass classification
  • Linear regression algorithm for regression models
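To make the loss/optimization idea concrete, here is a tiny, self-contained sketch of stochastic gradient descent for a linear regression model with squared loss; it illustrates the principle only and is not Amazon ML’s actual implementation:

```python
# One SGD step for a linear model y_hat = w * x + b with
# squared loss L = (y_hat - y)^2.
def sgd_step(w, b, x, y, learning_rate=0.01):
    error = (w * x + b) - y              # how far off the prediction was
    w -= learning_rate * 2 * error * x   # dL/dw = 2 * error * x
    b -= learning_rate * 2 * error       # dL/db = 2 * error
    return w, b

# Repeated over many examples ('passes'), the weights converge toward
# values that minimize the loss. Toy data follows y ~= 2x.
w, b = 0.0, 0.0
for x, y in [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)] * 100:
    w, b = sgd_step(w, b, x, y)
print(w, b)  # close to slope ~2 and intercept ~0
```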

Step 7. Understanding the training parameters

In Amazon ML, you can improve your ML model’s predictive accuracy by manipulating some parameters. These are called ‘hyperparameters’ and are used during model training. Amazon ML’s default hyperparameter values are a good starting point; however, they can be adjusted as per the user’s will to get the required prediction output.

The following are the hyperparameters for boosting a model’s performance (a sketch of setting them through the API follows this list):

  • Learning rate: This value is a constant that defines how quickly a model converges to the optimal model weights. The learning rate is used by the Stochastic Gradient Descent (SGD) algorithm.
  • Model size: Too many input features and large data can make a model ‘heavy’ or ‘large’ in size, requiring more RAM and processing power for model training and predictive analysis. In Amazon ML, you can resolve this by reducing the model size, which is achieved with the L1 regularization function. Alternatively, you can simply define the ‘maximum model size’ to be handled.
  • Number of passes: This parameter determines the number of times the algorithm runs over the training data. These sequential passes over the training data result in a model that fits the data better. However, fewer passes are typically used as the dataset grows larger.
  • Data shuffling: The SGD algorithm can be influenced by the ‘order of examples’ in a given dataset. Therefore, to optimize the prediction ability of a model without any such bias, the data is shuffled. Training the model on freshly shuffled data boosts its reliability and performance.
  • Regularization function: Linear ML models can ‘memorize’ training data observations instead of ‘generalizing’ from them, which hurts the model’s reliability. The L1 regularization function is used to train models the right way, i.e., to generalize rather than memorize: it ‘zeroes’ the weights of features with little correlation to the target, which helps prevent overfitting of the model. Similarly, L2 regularization drives all weights toward small overall values. The amount of regularization must be chosen carefully so that the model’s learning is not hindered.
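A hedged sketch of passing these training parameters through the API; the parameter keys are the documented ‘sgd.*’ names, while the values shown are illustrative, not recommendations:

```python
import boto3

client = boto3.client('machinelearning', region_name='us-east-1')

client.create_ml_model(
    MLModelId='ml-banking-data-tuned',        # hypothetical ID
    MLModelName='ML model: Banking data 1 (tuned)',
    MLModelType='BINARY',
    TrainingDataSourceId='ds-banking-data-1',
    Parameters={
        'sgd.maxMLModelSizeInBytes': '33554432',  # cap model size at 32 MiB
        'sgd.maxPasses': '10',                    # passes over the training data
        'sgd.shuffleType': 'auto',                # shuffle examples between passes
        'sgd.l1RegularizationAmount': '1e-6',     # L1: zero out weak weights
    },
)
```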

Step 8. Evaluating the model’s accuracy

Many parameters and metrics are used to evaluate a model’s predictive accuracy. Since the ML model will be deployed to predict unseen examples, it must be thoroughly evaluated using standard techniques, all of which are based on comparing the predicted output values with the actual values of the examples. Let us quickly see how accuracy is evaluated for the different model types (a short sketch follows this list):

  • Binary classification model: This type of ML model is evaluated using accuracy (ACC), recall, precision, the false-positive rate, and the F1-measure. Another parameter, AUC, is also used for accuracy evaluation. These metrics reflect the deviation of the predicted values from the actual values.
  • Multiclass classification model: A multiclass classification model is evaluated with a ‘confusion matrix,’ a table that shows the percentage of correct and incorrect predictions made by the model.
  • Regression model: The accuracy metrics used to evaluate a regression model are Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE).
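As a hedged sketch with scikit-learn on toy arrays (not tutorial data), the multiclass and regression evaluations might look like this:

```python
from sklearn.metrics import confusion_matrix, mean_squared_error

# Multiclass: rows are actual classes, columns are predicted classes.
y_true = ['low', 'mid', 'high', 'mid', 'low']
y_pred = ['low', 'mid', 'mid', 'mid', 'low']
print(confusion_matrix(y_true, y_pred, labels=['low', 'mid', 'high']))

# Regression: root mean square error between actual and predicted values.
actual = [3.0, 5.0, 2.5]
predicted = [2.8, 5.4, 2.1]
print(mean_squared_error(actual, predicted) ** 0.5)  # RMSE
```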

Refer to the following visuals to get an idea of the evaluation graphs for the different ML models:

Score distribution: Binary classification ML model

Confusion matrix: Evaluating the Multi-class classification ML model

Residual distribution: Evaluating the Regression ML model

Based on the accuracy evaluation, an ML model can be tuned with hyperparameter values.

Step 9. Making predictions

Now that you have your predictive ML model trained and tested, it is time to use it for some predictions. In Amazon ML, you can make model predictions in the following ways:

  1. Batch predictions: Batch predictions are used when we need to generate predictions for a full dataset at once. Such an ML application does not require low latency to function. For example, you can run a batch prediction to optimize an ad campaign by targeting customers on the basis of ‘predicted potential sales.’
  2. Online predictions: Applications that must respond with low latency use online predictions. In such cases, data flows in real time and is analyzed by the model as it arrives. For example, online predictions are used in ‘fraud detection’ engines.

At this point, you should have a grasp of the basic steps to build an ML model on AWS. We will now use the ‘know-how’ of this procedure to build a meaningful ML model as a tutorial example. Keep hiking!

Tutorial example: Building an ML application on AWS

As an example, we will build a predictive model to identify potential customers for a targeted marketing campaign. Our ML model will be predicting the responses to a specific marketing offer. 

We will use a dataset of customers, with information about their responses to previous, similar marketing campaigns. This is a publicly available banking and marketing dataset. You can source it via this link.

Our problem

We have launched a new product, a bank term deposit (certificate of deposit, or CD). Based on historical banking data, we need to figure out which customers are most likely to subscribe to our new product. Let us start building. Follow these steps:

Step 1. Set up Amazon ML account

  • Visit https://aws.amazon.com/ and select ‘Sign up’ option.
  • Follow the on-screen instructions and create the AWS account.

Step 2. Pre-process the data

The datasource we are using in this tutorial has been formatted to be used with AWS. To download it from the Amazon S3 location and upload it to our own S3 bucket, follow these steps:

  • Download the relevant data via https://s3.amazonaws.com/aml-sample-data/banking.csv and save it as ‘banking.csv.’
  • Download the evaluation dataset (to predict whether the customers will buy your product) via https://s3.amazonaws.com/aml-sample-data/banking-batch.csv and save it as ‘banking-batch.csv.’
  • Open the ‘banking.csv’ file. The header row has feature names representing each customer’s information. We need to ask our model, “Will this customer subscribe to my new product?” The target answer values lie in the column ‘y,’ where ‘1’ means yes and ‘0’ means no. (The ‘yes’ and ‘no’ values have been transformed to ‘1’ and ‘0’ for binary processing.)
  • The ‘banking-batch.csv’ file does not contain the column ‘y.’ We will use this dataset to test our model’s predictions.

The next step is to upload the downloaded files to the Amazon S3 location.

  • Open S3 console from the AWS Management console via https://console.aws.amazon.com/s3/ .
  • Choose where you want to upload the files in the ‘All Buckets’ list. 
  • Select ‘Upload’ from the navigation bar.
  • Select ‘Add Files’ option.
  • Browse to the download location of ‘banking.csv’ and ‘banking-batch.csv’ on your computer and ‘Open’ these files to upload them.

Step 3. Creating the training datasource

Amazon ML represents training data as datasource objects, which contain the location and metadata of the input data; these are utilized during model training and evaluation. To create a datasource on Amazon ML:

  • On the ‘Input data’ page, select the ‘S3’ option beside the ‘Where is your data located?’ setting.
  • Type the full location of ‘banking.csv’ (uploaded in Step 2 of this part) in ‘S3 Location.’ Example: xyz-bucket/banking.csv
  • Type ‘Banking data 1’ in ‘Datasource name.’
  • Select the ‘Verify’ option.
  • Click ‘Yes’ in the ‘S3 Permission’ box.
  • If the location can be accessed and read successfully, a page with the message ‘The validation is successful. To go to the next step, choose Continue’ will appear.
  • Select ‘Continue’ to proceed.

Now, it’s time to provide a ‘schema’ for our model. AML requires the schema information to interpret the input data for an ML model. In AML, you can upload a separate schema file or let Amazon ML create one for you. For this tutorial, let us have Amazon ML infer the schema.

  • Go to the ‘Schema’ page, where Amazon ML displays the schema it inferred. Attributes must be assigned the correct data types, as this enables correct feature processing on the data columns. Attributes with a binary value (yes or no) must be marked as ‘Binary.’
  • Attributes that take one of a limited set of values, whether numbers or strings, must be tagged as ‘Categorical.’
  • Features with numeric quantities must be marked as ‘Numeric.’
  • Features representing free-form text strings must be marked as ‘Text.’

Now, we need to select the ‘target’ attribute: the one we want the model algorithm to train on and predict. We saw that column ‘y’ is the target label attribute. To select it as the target attribute:

  • Jump to the last page of the table where ‘y’ appears, using the ‘arrows’ near the page number indication.
  • Choose ‘y’ from the ‘Target’ column. Amazon ML will confirm that ‘y’ is your target value.
  • Click ‘Continue.’
  • In the ‘Row ID’ page, make sure ‘No’ is selected for ‘Does your data contain an identifier?’
  • Select ‘Review’ and then select ‘Continue.’

We just created a training datasource on AWS successfully! Let us proceed with building the ML application.

Step 4. Building our predictive ML model

To build an ML model on AML:

  • After setting up the training datasource, AML will navigate you to the ‘ML model settings’ page. In ‘ML model name,’ ensure that ‘ML model: Banking data 1’ is displayed.
  • Select ‘Default’ in the ‘Training and evaluation settings.’ 
  • Accept the default option ‘Evaluation: ML model: Banking data 1’ in ‘Name this evaluation’ tab.
  • Select ‘Review’ then click ‘Finish.’

Once you click ‘Finish,’ Amazon ML will set up your model for processing, performing these actions by default:

  • Splitting the data in a 70/30 ratio.
  • ‘Training’ the model on the 70% subset.
  • ‘Evaluating’ the model on the 30% subset.

While our ML model is queued for processing, the status on the dashboard will read ‘Pending.’ Once AML starts processing the model, the status changes to ‘In progress,’ and when the model is finished building, ‘Completed’ shows up.

Step 5. Evaluating the model’s predictive performance

Our ML model has been built using AML; it is time to see whether it can really be used for making the required predictions. Since ours is a ‘binary’ classification model, it will be evaluated using the AUC metric (as we saw in ‘Steps involved in building an ML model on AWS’). To check the AUC metric for our model:

  • Go to the ‘ML model summary’ page.
  • Select ‘Evaluations’ option from the ‘ML model report’ pane. 
  • Now, select ‘Evaluation: ML model: Banking data 1.’
  • Select the ‘Summary’ option.
  • The ‘Evaluation summary’ page will display our model’s AUC metric.

We can also change our model’s ‘score threshold.’ This threshold number converts the scores into binary labels of ‘0’ or ‘1,’ so changing it changes the way our model assigns labels to target values.

To change the ‘score threshold’ for our model:

  • Select ‘Adjust score threshold’ from the ‘Evaluation summary’ page.
  • Here, you can manipulate the score threshold (the cut-off) and watch how the model’s performance metrics respond. The threshold controls a model’s prediction confidence and its tolerance for false-positive and false-negative results.

To predict, let’s say, the top 3% of customers that will buy our product, follow this:

  • Slide the vertical selector to the threshold value at which 3% of the predictions are labeled ‘1.’
  • Now, set the score threshold value to 0.77.

This means that every time our model makes predictions, records scoring over 0.77 will be predicted as ‘1.’
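The same threshold change can also be made programmatically; a minimal sketch, assuming the hypothetical model ID used in the earlier sketches:

```python
import boto3

client = boto3.client('machinelearning', region_name='us-east-1')

# From now on, records scoring above 0.77 are labeled '1'.
client.update_ml_model(
    MLModelId='ml-banking-data-1',  # hypothetical ID
    ScoreThreshold=0.77,
)
```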

Step 6. Generating predictions

We already know that Amazon ML allows for two types of predictions: real-time predictions and batch predictions.

To run a real-time prediction:

  • Select ‘Try real-time predictions’ from the ‘ML model report’ pane.
  • Select ‘Paste a record.’
  • Paste the following observation into the dialogue box: 32,services,divorced,basic.9y,no,unknown,yes,cellular,dec,mon,110,1,11,0,nonexistent,-1.8,94.465,-3 (or any other observation you need to predict).
  • Click ‘Submit’ to confirm.
  • Choose ‘Create prediction’ at the bottom of the page.

Similarly, to create a batch prediction:

  • Select ‘Amazon Machine Learning’, then select ‘Batch predictions.’
  • Click ‘Create a new batch prediction’ option.
  • Select ‘ML model: Banking Data 1’ from the ‘ML model for batch predictions’ page.
  • Click ‘Continue.’
  • Select ‘My data is in S3, and I need to create a datasource’ for the ‘Locate the input data’ query.
  • Type ‘Banking data 2’ as datasource name.
  • Type the full location of ‘banking-batch.csv’ file in the ‘S3 Location.’
  • Choose ‘Yes’ for ‘Does this first line in your CSV contain the column names?’
  • Click ‘Verify.’
  • Click ‘Continue.’
  • In ‘S3 destination,’ type the location where you uploaded the files in Step 2.
  • For ‘Batch prediction name’, accept the default name displayed.
  • Click ‘Review.’
  • Select ‘Yes’ in the pop-up dialog box.
  • Select ‘Finish’ on the ‘Review’ page.

To view the batch predictions:

  • Go to ‘Amazon Machine Learning’ and select ‘Batch predictions.’
  • Choose ‘Batch prediction: ML model: Banking data 1’; the ‘Batch prediction info’ page appears.
  • Go to the Amazon S3 console and visit the ‘Output S3’ URL.
  • Download the predictions file, uncompress it, and view it.

Bravo! We just created a successful ML model on AWS. Use the above procedure to create more meaningful models. See you again with another awesome ML tutorial.

With more practice and knowledge, any developer can use AWS for machine learning more efficiently.