Predictive modeling involves finding good subsets of predictors or explanatory variables. Models that fit the data well are better than models that fit the data poorly. Simple models are better than complex models. Working with a list of useful predictors, we can fit many models to the available data and then evaluate those models by their simplicity and by how well they fit the data.
Qualities of an effective model
Accuracy is the main objective of a predictive model. The whole purpose of building a model is to predict outcomes accurately; a model that gives inaccurate results is worthless and a waste of time.
However, apart from accuracy there are other qualities that make a model effective. We want our predictor to be interpretable, so that it can be easily understood and practically implemented.
A model is interpretable if it is simple. A simple model has fewer predictor variables, is easy to explain, and is often better than one that is really complicated and hard to understand.
Often there is a tradeoff between accuracy and other qualities of a predictor. The right balance has to be struck depending on the situation and context.
A model should also be fast and scalable. Fast means it should be easy to build the model and to test it on small samples. Scalable means it should be easy to apply the model to a large data set.
Different components of building a model are:
Question -> data -> features -> algorithms
We start with a question we are trying to answer. Then we collect some data. We compress our data into a set of features to feed to our model. And then we build a model based on those features to predict things.
The results of predictive modeling are a function of the model we choose, the data we have, and the features we prepare.
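As a rough illustration, the question -> data -> features -> algorithm flow can be sketched end to end. Everything here is invented for illustration: the question ("is a message urgent?"), the tiny labeled data set, the `URGENT_WORDS` list, and the threshold rule standing in for a real learning algorithm.

```python
# Question: "is a message urgent?"
# Data: a handful of labeled messages (made up for this sketch)
data = [
    ("server down fix now", 1),
    ("weekly newsletter", 0),
    ("urgent: disk full", 1),
    ("lunch menu attached", 0),
]

# a hypothetical word list used to build a feature
URGENT_WORDS = {"urgent", "now", "down", "fix"}

def features(text):
    # Features: compress the raw text into a small numeric vector --
    # here, a count of urgency words and the message length
    words = text.lower().replace(":", " ").split()
    return [sum(w in URGENT_WORDS for w in words), len(words)]

def predict(text):
    # Algorithm: a trivially simple rule standing in for a trained model --
    # flag the message if it contains at least one urgency word
    return 1 if features(text)[0] >= 1 else 0
```

A real project would replace the hand-written rule with a model fit to the data, but the same four-stage structure remains.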
Some of the things which contribute towards effective modeling are:
- Asking the right question
- High-quality data
- Ground truth
- Feature engineering
- Knowledge engineering
- Avoiding overfitting
Question: Just as we should identify our destination before we actually start travelling, the question is the most important component of building a good machine learning algorithm. We should be asking the right questions to get the right answers; vague questions do not lead us anywhere.
High-quality data: The next most important step is to be able to actually collect data. The key point here is garbage in, garbage out: if the data is bad, then no matter how good our machine learning algorithm is, we will get bad results. We must have data that is relevant to the question we are trying to answer. The key idea is to use related data to make a prediction; if the data is unrelated to the problem at hand, the prediction will not make sense.
Apart from the quality of data, the quantity of data also has a role to play. For a lot of problems, very simple machine learning techniques can be leveraged into incredibly powerful classifiers with the addition of loads of data. However, adding data does not always improve a classifier beyond a certain point.
If the problem is a high-bias problem, collecting more data will not help beyond a certain limit; rather, the model itself needs to be improved.
On the other hand, when fitting a complex model, if the training data is insufficient compared to the number of model parameters, a problem of high variance occurs, resulting in over-fitting.
If high variance is due to over-estimating the model complexity when a simpler model could fit the data, then adding more data will NOT help. However, when the underlying process is in fact complex and can only be explained by a complicated model, then errors due to high variance can be reduced by adding more data.
In machine learning, the term “ground truth” refers to the correct labels of the training set for supervised learning techniques. Ground truth means that you have data instances labeled in accordance with your goal. If this information is available for a small subset of the entire collection, it can be used to build a model, which can then be used to label the rest of the (unseen) documents in the collection.
Ground truth also limits how good we can realistically expect our models to be. If your training labels have an inter-rater agreement of Kappa = 0.62, you probably should not expect, or want, your detector to achieve Kappa = 0.75.
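The Kappa mentioned here (Cohen's Kappa) measures agreement corrected for chance. A minimal sketch of the computation, written from the standard definition rather than any particular library:

```python
def cohen_kappa(rater_a, rater_b):
    # observed agreement: fraction of items the two raters label identically
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: probability the raters would agree if each labeled
    # at random according to their own label frequencies
    labels = set(rater_a) | set(rater_b)
    p_chance = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)
```

A Kappa of 1 means perfect agreement; 0 means agreement no better than chance.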
Bayesian spam filtering is a common example of supervised learning. In this system, the algorithm is manually taught the differences between spam and non-spam. This depends on the ground truth of the messages used to train the algorithm; inaccuracies in that ground truth will correlate to inaccuracies in the resulting spam/non-spam verdicts.
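A minimal naive Bayes filter along those lines can be sketched in a few lines. The four training messages are invented, and their labels play the role of the ground truth the filter inherits, mistakes included:

```python
import math
from collections import Counter

# invented training messages; the labels are the "ground truth"
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# tally word frequencies per class and class frequencies overall
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    # pick the class with the highest log posterior, using add-one
    # (Laplace) smoothing so unseen words do not zero out a class
    scores = {}
    total_docs = sum(class_counts.values())
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

If a training message were mislabeled, those wrong counts would flow straight into the verdicts, which is exactly the ground-truth dependence described above.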
Feature engineering is when you use your knowledge about the data to create fields that capture the relevant and valuable information in the data and make machine learning algorithms work better. An effectively designed feature is one where the likelihood of a certain class goes up as the value of the field increases.
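As a small, hypothetical example: suppose raw login records contain only timestamps, but we believe fraudulent logins cluster at night. Deriving an hour-of-day field, and from it a night-time flag, exposes that pattern directly to the learner instead of asking it to discover the encoding itself:

```python
from datetime import datetime

# hypothetical raw records: login timestamps as plain strings
raw_logins = ["2021-03-01 02:14:00", "2021-03-01 14:30:00"]

# engineered feature 1: hour of day, extracted from the raw timestamp
hours = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour
         for t in raw_logins]

# engineered feature 2: a binary night-time flag (10pm-6am), an even more
# direct encoding of the pattern we believe matters
is_night = [1 if h < 6 or h >= 22 else 0 for h in hours]
```

The raw string carries the same information, but the engineered fields make the relationship between input and class far easier for a model to pick up.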
Knowledge engineering is where your model is created by a smart human being rather than by a computer that exhaustively searches through all the possibilities. Great things happen in machine learning when human and machine work together, combining an understanding of the construct with detailed knowledge of your specific data and of how to create relevant features from that data.
Feature engineering is very closely related to knowledge engineering and careful study of a construct is going to lead to better features and ultimately to better models.
Overfitting leads to classifiers that perform well on training data but do not generalize well to new data. Every model is over-fit in some fashion; overfitting cannot be completely eliminated. The important consideration is where you want to be able to use your model and in what context.
However you want your model to generalize, be sure to cross-validate at that level. If you ignore this issue, you might end up with a model that does not work nearly as well as you think it does in the setting where you actually use it, and that is a problem.
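For example, if the model should generalize to new users, the held-out fold must contain whole users, not just random rows. A sketch of that kind of group-level cross-validation split (the record format and helper name are made up for illustration):

```python
def leave_one_group_out(records, group_key):
    # hold out one whole group (e.g. one user) per fold, so the model is
    # always tested on a group it never saw -- the level we want to
    # generalize to
    groups = sorted({group_key(r) for r in records})
    for g in groups:
        train = [r for r in records if group_key(r) != g]
        test = [r for r in records if group_key(r) == g]
        yield train, test
```

Randomly shuffled row-level folds would leak each user's rows into both train and test, giving an optimistic estimate that evaporates on genuinely new users.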
Simpler models are less prone to overfitting. A few things you can do to reduce overfitting are:
- prefer models with fewer variables
- use less complex functions, which corresponds to the minimum description length principle
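One way to act on both points is to compare candidate model complexities on a held-out validation set and keep the simplest one that scores well. A sketch assuming numpy is available; the data, noise level, and candidate polynomial degrees are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# noisy data from a simple (linear) process, invented for this sketch
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.2, 40)
x_tr, y_tr = x[:30], y[:30]   # training split
x_va, y_va = x[30:], y[30:]   # held-out validation split

def val_error(degree):
    # fit a polynomial of the given complexity on the training split,
    # then score it on data the fit never saw
    c = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((np.polyval(c, x_va) - y_va) ** 2))

# choose the model complexity by held-out error, not training error
best_degree = min([1, 3, 9], key=val_error)
```

Training error alone would always favor degree 9, since extra parameters can soak up noise; the validation split is what penalizes that needless complexity.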