Predictive Modeling Overview

Raw data offers little value in its unprocessed state. With the right set of tools, however, we can pull powerful insights from it. With data in hand, you can begin doing analytics, which can be used to make smart business decisions, compete, and innovate.

There are three types of data analysis:

  • Descriptive (business intelligence and data mining)
  • Predictive (forecasting)
  • Prescriptive (optimization and simulation)

Descriptive analytics is the simplest of all analytics.  It involves summarizing and reporting on what is happening.  Reporting on what is happening in your business right now is the first step to making smart business decisions. This is the core of KPI scorecards or business intelligence (BI).

Predictive analytics is the next step: it finds patterns in data and builds models that can then be used for forecasting.

Prescriptive analytics goes beyond descriptive and predictive models and is used when we need to prescribe an action or decision for management to make. It recommends one or more courses of action and shows the likely outcome of each decision.

What is predictive modeling?

Predictive analytics involves searching for meaningful relationships among variables and representing those relationships in models. The goal of predictive modeling is to develop a model that can infer the predicted (response) variable from some combination of the predictor variables.

Predictive analytics brings together skills and understanding from various domains such as management, information technology, and modeling. Data scientists working in the field of predictive analytics should understand business: accounting, finance, marketing, and management. They should be familiar with information technology, including data structures, algorithms, and object-oriented programming. They should know statistical modeling, machine learning, and mathematical programming.

Depending on the type of the response variable, a prediction task is described as:

  • Classification
  • Regression

Classification is the task of assigning objects or individuals to the appropriate class, given a set of correct, past assignments. Classification involves predicting a categorical response. Pattern classification tasks can be grouped into two main sub-categories: supervised and unsupervised learning. In supervised learning, the class labels in the dataset used to build the classification model are known. In contrast, unsupervised learning tasks deal with unlabeled instances, and the classes have to be inferred from the unstructured dataset. Typically, unsupervised learning employs a clustering technique in order to group the unlabeled samples based on the similarity between them.

Regression involves predicting a quantitative response.
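To make the distinction concrete, here is a minimal sketch with one feature per sample. The data, the nearest-neighbor rule, and the function names are illustrative assumptions, not a recommended method: the first function predicts a categorical response, the second fits a line to predict a quantitative one.

```python
def nearest_neighbor_classify(train, query):
    """train: list of (feature, label) pairs; return the label of the closest feature."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Classification: labelled past assignments, categorical response (hypothetical data)
labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
label = nearest_neighbor_classify(labelled, 7.5)

# Regression: quantitative response (hypothetical data lying on y = 1 + 2x)
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = a + b * 5.0
```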

Predictive modeling process:

The predictive modeling process involves the following steps at the broad level:

  1. Retrieve and Organize the Data
  2. Explore the Data
  3. Process the Data
  4. Split the Data and Build a Model
  5. Evaluate the Model’s Performance
  6. Validate the Model

Retrieve and Organize the Data

Raw data can come in many forms, such as structured data from relational databases, NoSQL databases, distributed file systems, JSON files, text files, etc. Depending on how the data is stored, various tools such as queries or MapReduce jobs may be used to organize it into a form that can be analyzed.

Deal with missing data, if any. If the number of empty cells in the dataset is low, the sample rows or the attribute columns with missing data can be removed. Missing data can also be replaced using certain statistics rather than removed completely. This is called imputation. For categorical data, a missing value can be imputed with the most frequent category, and the sample average can be used to impute missing values for numerical attributes. In general, imputation via k-nearest neighbors is considered superior to replacing missing data with the overall sample mean.
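The two simple imputation rules described above might look like the following sketch. It assumes missing cells are represented as None; the attribute names and values are hypothetical, and k-nearest neighbor imputation is not shown.

```python
from collections import Counter
from statistics import mean


def impute_numeric(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


ages = [34, None, 29, 41, None]           # hypothetical numerical attribute
plans = ["basic", "pro", None, "basic"]   # hypothetical categorical attribute

ages_filled = impute_numeric(ages)
plans_filled = impute_categorical(plans)
```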

Explore the Data

Before building a model, explore and understand the data. Look at the data dictionary to see which data is available. This includes understanding the characteristics of the predictor variables as well as the relationship(s) that might exist between them. What type of data does each predictor represent (nominal, ordinal, quantitative)? Are there missing or invalid values, or outliers? How are the predictor’s values distributed, and what are its units? Then decide on the type of model required, e.g. regression for a numerical response and a classifier for a categorical response.

The data can be investigated by the following means:

  • cross tabulations on categorical variables to understand the coding and volumes
  • summary statistics to understand the distribution of the continuous variables
  • simple visualization techniques for exploratory data analysis to see any underlying patterns
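The first two items can be sketched with the Python standard library alone. The records below are hypothetical, and in practice a data-frame library would usually do this work; visualization is not shown.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records: (region, plan, monthly_spend)
records = [
    ("north", "basic", 20.0),
    ("north", "pro", 55.0),
    ("south", "basic", 18.0),
    ("south", "basic", 22.0),
    ("south", "pro", 300.0),   # a possible outlier worth a closer look
]

# Cross tabulation: count of each (region, plan) combination
crosstab = Counter((region, plan) for region, plan, _ in records)

# Summary statistics for the continuous variable
spend = [s for _, _, s in records]
summary = {
    "min": min(spend),
    "max": max(spend),
    "mean": mean(spend),
    "median": median(spend),
    "stdev": stdev(spend),
}
```

A mean far above the median, as in this made-up sample, is one hint that the distribution is skewed or contains outliers.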

Visualization can reveal which variables contain the most discriminatory information, and this can be used for feature selection in order to remove noise and reduce the size of the dataset.

Process the Data

Data processing includes things like feature engineering, transformation, normalization, dimensionality reduction, etc.

A transformation creates a new variable by applying a mathematical function, such as an exponential or a logarithm, to an existing variable.
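As a small illustration, a logarithmic transformation of a right-skewed attribute could look like this. The values are hypothetical, and the natural logarithm assumes all values are strictly positive.

```python
import math

# Hypothetical, right-skewed attribute (e.g. annual income)
incomes = [20_000, 35_000, 50_000, 1_200_000]

# New variable created by applying a logarithm to the existing one
log_incomes = [math.log(x) for x in incomes]

# Ratio of the largest to the smallest value, before and after
raw_ratio = max(incomes) / min(incomes)
log_ratio = max(log_incomes) / min(log_incomes)
```

The transformed values are much closer together, so a single extreme value dominates less.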

Normalization and other feature scaling techniques are often necessary in order to compare different attributes (e.g., to compute distances or similarities in cluster analysis), especially if the attributes were measured on different scales. Many machine learning algorithms require properly scaled features. Two common techniques are:

  • Min-Max scaling: The scaling of attributes in a certain range, e.g., 0 to 1.
  • Standardization or scaling to unit variance: The attribute’s mean is subtracted from every sample and the result is divided by the standard deviation, so that the attribute has zero mean and unit standard deviation (μ=0, σ=1), like a standard normal distribution.
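Both scaling techniques can be sketched in a few lines. This version assumes the attribute is not constant (otherwise the denominators are zero) and uses the population standard deviation, which is a choice made for the example; the attribute values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so that they span [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical attribute
scaled = min_max_scale(heights_cm)
z = standardize(heights_cm)
```

To keep the test set unseen, the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused for the test data.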

Feature selection is a part of feature engineering. The main purpose of Feature selection and Dimensionality reduction is to remove noise, increase computational efficiency by retaining only “useful” (discriminatory) information, and to avoid overfitting.

Feature selection and dimensionality reduction appear similar because both result in a smaller feature space. The difference lies in whether the features are transformed. In feature selection, we keep a subset of the original features: only those that are “meaningful” and can help to build a “good” classifier. Dimensionality reduction, on the other hand, applies a transformation that projects the data onto a new, smaller set of features. Commonly used dimensionality reduction techniques are linear transformations such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
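The contrast can be sketched on a tiny, hypothetical dataset. The selection rule used here (dropping zero-variance columns) is only an illustrative assumption, and the first principal component is found by plain power iteration rather than by a library PCA routine.

```python
import math
from statistics import mean, pvariance

# Hypothetical dataset: rows are samples, columns are three features
X = [
    [2.0, 4.1, 1.0],
    [3.0, 6.0, 1.0],
    [4.0, 7.9, 1.0],
    [5.0, 10.1, 1.0],
]
columns = list(zip(*X))

# Feature selection: keep a subset of the ORIGINAL columns
# (illustrative rule: drop columns that never vary)
kept = [j for j, col in enumerate(columns) if pvariance(col) > 1e-9]

# Dimensionality reduction: build a NEW feature as a linear combination
means = [mean(col) for col in columns]
centered = [[v - m for v, m in zip(row, means)] for row in X]
n, d = len(X), len(means)
cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]

# Power iteration converges to the direction of largest variance
w = [1.0] * d
for _ in range(100):
    w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

# One-dimensional representation of each sample along that direction
projected = [sum(v * wi for v, wi in zip(row, w)) for row in centered]
```

Selection returns column indices of existing features, while the projection produces values that do not correspond to any single original column.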

Split the Data and Build a Model

After we have extracted the relevant features from our raw data, we randomly split the dataset into a training dataset and a test dataset (for example, in an 80%/20% ratio), sampling without replacement. The larger set is the training dataset and is used to train the model; the test dataset is used to evaluate the performance of the final model at the very end.
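A random 80/20 split without replacement can be sketched as follows. The function name, the fixed seed, and the toy dataset are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly partition rows into (train, test) without replacement."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)               # each row lands in exactly one set
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


dataset = list(range(100))             # hypothetical samples
train_set, test_set = train_test_split(dataset)
```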

There is an enormous number of learning algorithms that can be used for training the model, e.g. Support Vector Machines (SVM), Naive Bayes, Artificial Neural Networks (ANN), and decision tree classifiers.

Once the model is ready, its performance is evaluated on the test data at the very end. It is important that we use the test dataset only once in order to avoid overfitting when we compute the prediction-error metrics. Overfitting leads to classifiers that perform well on training data but do not generalize well to new data.

Techniques such as cross-validation are used in the model creation and refinement steps to evaluate the classification performance.  Cross-validation is one of the most useful techniques to evaluate different combinations of feature selection, dimensionality reduction, and learning algorithms.

There are multiple flavors of cross-validation:

  • K-fold cross-validation, where you pick a number K and split the training set into K groups (folds). Each fold in turn is held out as the test set while the other K-1 folds are used to train the model.
  • Leave-one-out cross-validation, where every data point is its own fold: you repeatedly train on every data point except one and then test on that one.

K-fold is quicker and is preferred by some theoreticians. Leave-one-out is more stable and avoids the issue of how to select folds.
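Both flavors can be expressed as index generators, with leave-one-out being the special case where K equals the number of samples. This sketch uses contiguous folds, so it assumes the rows have already been shuffled; the function names are made up for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(10, 5))
loo = list(leave_one_out_indices(4))
```

The model is trained and scored once per fold, which is why leave-one-out becomes expensive on large datasets.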

There are also several cross-validation variants:

  • Flat cross-validation: each point has an equal chance of being placed into each fold. Flat cross-validation is often just called cross-validation.
  • Student-level cross-validation: folds are selected so that no student’s data is represented in two folds. In other words, a student is at any time either in a training fold or in a test fold.

Student-level cross-validation allows you to test model generalizability to new students, which is often something we want. By contrast, flat cross-validation or even stratified cross-validation can only test model generalizability to new data from the same students.
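A minimal sketch of student-level fold assignment is shown below. It assumes each row carries a student identifier; the round-robin assignment of students to folds and the sample identifiers are illustrative choices.

```python
from collections import defaultdict


def student_level_folds(student_ids, k):
    """Assign whole students to folds; return a list of (train_idx, test_idx)."""
    rows_by_student = defaultdict(list)
    for row, sid in enumerate(student_ids):
        rows_by_student[sid].append(row)

    students = sorted(rows_by_student)
    folds = [students[i::k] for i in range(k)]   # round-robin over students

    splits = []
    for held_out in folds:
        held = set(held_out)
        test = [r for s in held_out for r in rows_by_student[s]]
        train = [r for s in students if s not in held for r in rows_by_student[s]]
        splits.append((train, test))
    return splits


# Hypothetical data: one student id per row of the dataset
student_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s2"]
splits = student_level_folds(student_ids, k=2)
```

Because folds are built from students rather than rows, fold sizes can be uneven when students contribute different amounts of data.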

Evaluate the Model’s Performance

 “The quality of model predictions calls for at least two numbers: one number to indicate accuracy of prediction …., and another number to reflect its generalizability …” (Bruer, 2006)

There are many techniques for evaluating the performance of a model. The techniques vary as well according to the type of model (regression, classification) and the problem domain. In general, though, we’re looking to see how accurate the model is at predicting the desired outcome. Concretely, we’re looking for ‘goodness of fit’ and indications of bias or variance in the model.
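Two common metrics, one for each model type, are sketched below as an illustration: accuracy for a classifier and root mean squared error (RMSE) for a regression model. The choice of these two metrics and the example labels and values are assumptions, not the only options.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Classification: three of four hypothetical labels are predicted correctly
acc = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])

# Regression: hypothetical true values and predictions
err = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Computing the same metric on the training data and on held-out data gives a rough view of how well the model generalizes.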

Evaluating the performance of a model will often give us a good sense of where to go next. Do we need more data? Do we just need to tweak the model we’re working on? Or is an entirely different model necessary?

After each attempt at building a model, we want to keep track of how well it performs. It may also be desirable to try a few different models to see if one has a better chance of predicting outcomes than other candidate models do. Ultimately, we choose the model that best predicts the intended outcome.

Validate the Model

Validation implies checking that the model is good enough. There are many dimensions of validity, and it’s important to address them all for a model that you can really consider valid.

  1. Generalizability. The model remains predictive when used on a new data set.
  2. Ecological validity. The model applies to real-life situations outside of research settings.
  3. Construct validity. The model actually measures what it was intended to measure.
  4. Predictive validity. The model predicts not just the present, but the future as well.
  5. Substantive validity. The variable that the model is predicting should matter and have an impact in real life.
  6. Content validity. The model should cover the full range of values it’s intended to.
  7. Conclusion validity. Your conclusions based on the model should be justified based on the evidence.


