How to Evaluate a Machine Learning Model?

Imagine you are working as a data scientist for a business. The business has given you a data set of historical cases (X), where the right answer, i.e. the output variable (Y), is also provided for each example or case. Your job is to use this data as a training set and build a model that can predict the output value (Y) for a given case in the future. Eventually, the business will use your model to make better decisions on new (business) cases.

Like any data scientist, your first step would be preprocessing and exploratory analysis of the data. This allows you to handle missing values, transform features, and understand their relationships with each other and with the output variable (Y). Consequently, you can select a good set of candidate features to build the model upon.

Let us imagine the output variable is real-valued. Because the right answers are known and the output is real-valued, you decide to treat this as a regression problem (supervised learning). Remember, linear regression is a simple but still powerful and easy-to-understand model.

By now the data has been preprocessed, the features have been selected, and linear regression is the target model. Now is the time to split the data set into a training set and a test set. For instance, you may use 80% of the preprocessed data as the training set and the remaining 20% as the test set. Using the training set, you would learn the hypothesis H(X) by minimizing the cost function J(Q). You may use gradient descent to minimize the cost function J over the model parameters Q.
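The split and the gradient descent step described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a production recipe: the synthetic data, the helper names (train_test_split, cost, gradient_descent), the squared-error form of J, and the learning rate and iteration count are all assumptions made for the example. Q mirrors the parameter notation used in this post.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def cost(X, y, Q):
    """Squared-error cost J(Q) = 1/(2m) * sum((X @ Q - y)^2) (assumed form)."""
    residuals = X @ Q - y
    return float(residuals @ residuals) / (2 * len(y))

def gradient_descent(X, y, learning_rate=0.5, n_iters=2000):
    """Minimize J(Q) with batch gradient descent, starting from Q = 0."""
    Q = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (X @ Q - y) / len(y)
        Q -= learning_rate * gradient
    return Q

# Synthetic data for illustration only: y = 3 + 2x plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.1, size=200)

# The column of ones lets Q[0] act as the intercept of the hypothesis H(X) = X @ Q.
X = np.column_stack([np.ones_like(x), x])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_fraction=0.2)
Q = gradient_descent(X_train, y_train)
print("learned parameters Q:", Q)
print("training cost J(Q):", cost(X_train, y_train, Q))
```

The learning rate here works because the single feature already lies between 0 and 1. With features on larger scales you would likely need to rescale them or use a smaller learning rate.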

Once the cost function has been minimized and the optimal set of parameters Q (of the hypothesis) has been learned, next comes the most critical step: hypothesis evaluation. This is done by feeding the test set examples to the learned hypothesis, which predicts an output value (Ypredict) for each test example. Then you calculate the test error (based on the predicted and actual values of the test set examples) and compare it with the training error. This may lead to one of three possible scenarios: a) if both training and test error are high, the learning algorithm is suffering from high bias (underfitting); b) low training error and high test error is a sign of high variance (overfitting); and c) the ideal scenario, low training and test error.
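The comparison of training and test error can be sketched as follows. The choice of mean squared error as the error measure, the helper names, and above all the acceptable_error threshold are assumptions for illustration. What counts as a "high" error depends on the problem and on what the business can tolerate.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def diagnose(train_error, test_error, acceptable_error):
    """Map a (training error, test error) pair to one of the three scenarios.

    `acceptable_error` is a hypothetical, problem-specific threshold:
    errors above it are treated as "high", errors at or below it as "low".
    """
    train_high = train_error > acceptable_error
    test_high = test_error > acceptable_error
    if train_high and test_high:
        return "high bias (underfitting)"
    if not train_high and test_high:
        return "high variance (overfitting)"
    if not train_high and not test_high:
        return "good fit"
    # High training error with low test error is not one of the three
    # scenarios; it usually suggests an unlucky or unrepresentative split.
    return "inconclusive"

# Illustrative numbers only, not results from a real model.
y_train_actual = [2.0, 4.0, 6.0, 8.0]
y_train_predicted = [2.1, 3.9, 6.1, 7.9]
y_test_actual = [10.0, 12.0]
y_test_predicted = [13.0, 9.0]

train_error = mean_squared_error(y_train_actual, y_train_predicted)
test_error = mean_squared_error(y_test_actual, y_test_predicted)
print(diagnose(train_error, test_error, acceptable_error=1.0))
```

In this toy case the training error is small while the test error is large, so the sketch reports the overfitting scenario.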

In the next blog post, we will look into possible ways to diagnose and fix underfitting and overfitting (in the context of simple and complex supervised learning problems).

Big Data in Remote Sensing–a Big Picture


Finally, the point has come where we need to discuss the term “big data in remote sensing” for more effective and long-term consistent monitoring of the Earth's surface. The volume of data that spaceborne satellites collect on a daily basis is in petabytes. Distributing such big data is a big challenge too. For that reason, more efficient, reliable and highly optimized methods are required to process this huge amount of data, so that only the valuable information extracted from it is distributed to the user community and decision/policy makers.

The term big data is relatively new in the field of applied remote sensing, and remote sensing data sets are a bit different from typical big data sets like financial time series. Remote sensing time series have both a temporal and a spatial dimension, with multiple ground targets that have different signatures in the remote sensing signal. Newly launched and follow-up satellite missions like Radarsat–2, Tandem-L, ALOS PALSAR–2, the COSMO–SkyMed constellation, Landsat–8 and the Sentinel series have improved spatial, temporal (short revisit time) and spectral resolution. This huge inflow of satellite remote sensing data will help to develop more efficient and more consistent monitoring and decision support systems. It will now be possible to develop applications based on dense and consistent multitemporal time series. New frameworks are required to handle this upcoming inflow of remote sensing data with fine resolution and wide coverage. Below is the big picture of the potential and applicability of remote sensing driven big data.