Imagine you are working as a data scientist at a company. The business has given you a data set of historical cases (X), where the right answer, i.e. the output variable (Y), is also provided for each example. Your job is to use this data as a training set and build a model that can predict the output value (Y) for a new case in the future. Eventually, the business will use your model to make better decisions on new (business) cases.
Like any data scientist, your first step would be preprocessing and exploratory analysis of the data. This lets you handle missing values, transform features and understand how the features relate to each other and to the output variable (Y). Consequently, you can select a good set of candidate features to build the model upon.
Let us imagine that the output variable is real-valued. Since the right answers are known, this is a supervised learning problem, and since the output is real-valued, you decide to treat it as a regression problem. Remember, linear regression is a simple, easy-to-understand, yet powerful model.
By now the data has been preprocessed, the features are selected and linear regression is the target model. Now is the time to split the data set into a training set and a test set. For instance, you may use 80% of the preprocessed data as the training set and the remaining 20% as the test set. Using the training set, you learn the hypothesis H(X) by minimizing the cost function J(Q). You may use gradient descent to minimize the cost function J over the model parameters Q.
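The split and the gradient descent step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the post does not spell out the form of J(Q), so the sketch assumes the usual squared-error cost, and the data is synthetic. The function names, learning rate and iteration count are hypothetical choices made for the example.

```python
import numpy as np


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def cost(X, y, Q):
    """Assumed cost: J(Q) = (1 / 2m) * sum((H(X) - Y)^2), with H(X) = X @ Q."""
    residual = X @ Q - y
    return float(residual @ residual) / (2 * len(y))


def gradient_descent(X, y, learning_rate=0.5, n_iters=2000):
    """Minimize J(Q) with batch gradient descent, starting from Q = 0."""
    Q = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (X @ Q - y) / len(y)  # dJ/dQ for the squared-error cost
        Q -= learning_rate * gradient
    return Q


# Synthetic data (an assumption for illustration): y = 4 + 3x + small noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=200)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=200)
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept term

X_train, X_test, y_train, y_test = train_test_split(X, y)  # 80% / 20%
Q = gradient_descent(X_train, y_train)
print("learned Q:", Q)
print("training cost:", cost(X_train, y_train, Q))
```

The learned Q should land close to the intercept and slope used to generate the data. On real data, features on very different scales usually need to be standardized first, otherwise a learning rate this large may not converge.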
Once the cost function has been minimized and the optimal set of parameters Q (of the hypothesis) has been learned, next comes the most critical step: hypothesis evaluation. This is done by feeding the test set examples to the learned hypothesis, which predicts an output value (Ypredict) for each test example. You then calculate the test error (based on the predicted and actual values of the test set examples) and compare it with the training error. This may lead to one of three possible scenarios: a) if both training and test error are high, our learning algorithm is suffering from high bias (underfitting); b) a low training error with a high test error is a sign of high variance (overfitting); and c) the ideal scenario, where both training and test error are low.
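The comparison of training and test error can be sketched as follows. This is a rough illustration, not a definitive diagnostic: mean squared error is assumed as the error measure, and what counts as a "high" error depends on the problem, so `acceptable_error` is a hypothetical threshold you would have to choose yourself. The function names are likewise made up for the example.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def diagnose(train_error, test_error, acceptable_error):
    """Map a (training error, test error) pair onto the three scenarios.

    `acceptable_error` is a hypothetical, problem-specific threshold.
    Simplification: any high training error is reported as high bias,
    regardless of the test error.
    """
    if train_error > acceptable_error:
        return "high bias (underfitting)"
    if test_error > acceptable_error:
        return "high variance (overfitting)"
    return "good fit"


# Error for a tiny hand-made example: only the last prediction is off by 2.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4/3, about 1.33

# The three scenarios, with made-up error values and a threshold of 1.0.
print(diagnose(train_error=9.0, test_error=9.5, acceptable_error=1.0))  # high bias (underfitting)
print(diagnose(train_error=0.2, test_error=8.0, acceptable_error=1.0))  # high variance (overfitting)
print(diagnose(train_error=0.2, test_error=0.3, acceptable_error=1.0))  # good fit
```

In practice the two errors are compared relative to each other and to a baseline rather than against a single fixed cut-off; the threshold here only keeps the three cases easy to tell apart.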
In the next post, we will look into possible ways to diagnose and fix underfitting and overfitting (in the context of simple and complex supervised learning problems).