What is concept drift?
In predictive analytics, machine learning, and incremental learning, concept drift occurs when the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.
In stream data mining, where input data arrives over time, this issue is prominent. In other words, online learning suffers most from the concept drift problem. Let us take a real-world example. Three to four decades ago, owning a car was a status symbol and was considered a luxury. Now, in India at least, owning a car does not mean that the owner is rich. It has become a necessity. The number of cars on the road has also increased tremendously. So if you want to predict the financial wealth of the people in a particular region, the 'owns a car' feature is no longer significant.
Let us discuss the technical details of this issue. Before reading further, please make sure that you have read the basics of machine learning and incremental learning as a prerequisite.
How does concept drift occur?
Data is always generated by some function. Conventional data mining algorithms assume that each dataset is produced by a specific, static, hidden function. This means that the same function is assumed to generate both the training data and the testing data. This assumption may fail for stream data, i.e. the function which produces instances at time step t may not be the same function as the one that produces instances at time step t+1 (the next time step). This possible variation in the underlying function is referred to as concept drift. Concept drift can also be viewed more abstractly as a difficulty caused by inadequate, unknown or unobservable features in a dataset, a phenomenon known as hidden context.
Here, the underlying phenomenon that would give a true and static description of each class over time is hidden from the learner. New data is generated by some hidden function, the learner is not aware of this, and therefore the concept drift is unpredictable. If the generating function for the drifting concepts were already known, one could simply learn a suitable classifier for each relevant concept and use the appropriate classifier for all new data (which is known as the multitask learning problem).
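To make the idea of a changing generating function concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the make_stream generator, the threshold at 0.5, the drift point at step 1000 and the abrupt flip of the labels are hypothetical and not taken from any real dataset. It evaluates a model that matches the first concept and is never updated, before and after the hidden function changes.

```python
import random


def make_stream(n, drift_at, seed=0):
    """Yield (x, y) pairs; the hidden labelling function changes at drift_at."""
    rng = random.Random(seed)
    for t in range(n):
        x = rng.random()
        if t < drift_at:
            y = int(x > 0.5)    # concept in force before the drift
        else:
            y = int(x <= 0.5)   # concept after the drift (labels flipped)
        yield x, y


def static_model(x):
    """A 'model' that fits the first concept and is never retrained."""
    return int(x > 0.5)


def accuracy(pairs):
    """Fraction of pairs on which the static model predicts the label."""
    pairs = list(pairs)
    return sum(static_model(x) == y for x, y in pairs) / len(pairs)


stream = list(make_stream(n=2000, drift_at=1000))
acc_before = accuracy(stream[:1000])   # instances from time steps 0..999
acc_after = accuracy(stream[1000:])    # instances from time steps 1000..1999
print(acc_before, acc_after)           # prints 1.0 0.0
```

In this deliberately extreme setup the static model is always right before the drift and always wrong after it. Real drift is usually more gradual and much less clean, but the cause of the accuracy loss is the same.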
In the absence of such knowledge, we must design a single classifier that is able to follow such changes in concepts over time. The diagram below shows the relationship among concept drift, model adaptation, knowledge (wisdom) transfer and time series analysis. It is clear from the figure that all these features depend on time.
Non-stationary environments are the usual victims of the concept drift problem and need to be handled carefully.
There are some general guidelines for designing a system that learns in non-stationary environments:
- Learning must be truly incremental, or one-pass: access to previous data is strictly not allowed for future training. Each instance is processed only once during training, and the knowledge it carries must be extracted and summarized at that time so that it can be used in the model building process.
- The most recent dataset is a representation of the present environment. So knowledge must be organized according to its relevance to the present environment, and be dynamically updated as new data arrives.
- The learner should have a mechanism to reconcile former and recently learned knowledge when they conflict with each other. Moreover, there should be a mechanism for keeping track of both the incoming data and the learner's performance on recent and older data, for the purpose of complexity reduction, pruning, and forgetting.
- The learner must have a mechanism to forget or discard information that is no longer relevant, but selectively, with the added capability to recall such information if the drift or change follows a cyclical pattern.
- Knowledge should be stored incrementally and at regular intervals, so that the learner can produce its best hypothesis for an unknown (or unlabeled) data instance at any point in the learning process.
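Some of the guidelines above can be sketched in a few lines of Python. The FadingCounter class below is a hypothetical toy learner written for this post, not an algorithm from the literature: it sees each instance exactly once, keeps only summary counts, and multiplies old counts by an assumed decay factor so that outdated evidence fades. It does not cover the recall of cyclical concepts. The data reuses the 'owns a car' example, with made-up labels.

```python
class FadingCounter:
    """Toy one-pass learner for one discrete feature and a binary label.

    It stores exponentially decayed counts of (feature value, label) pairs,
    so recent instances weigh more than old ones. Raw data is never kept.
    """

    def __init__(self, decay=0.99):
        self.decay = decay   # 1.0 means "never forget"
        self.counts = {}     # (x, y) -> decayed count

    def predict(self, x, default=0):
        c0 = self.counts.get((x, 0), 0.0)
        c1 = self.counts.get((x, 1), 0.0)
        if c0 == c1:
            return default   # no evidence, or a tie
        return int(c1 > c0)

    def learn_one(self, x, y):
        # Fade all existing knowledge, then add the new instance.
        for key in self.counts:
            self.counts[key] *= self.decay
        self.counts[(x, y)] = self.counts.get((x, y), 0.0) + 1.0


def run(decay):
    learner = FadingCounter(decay=decay)
    # Old concept: car owners (x=1) are labelled wealthy (y=1).
    for _ in range(300):
        learner.learn_one(1, 1)
    # New concept: car owners are no longer labelled wealthy (y=0).
    for _ in range(200):
        learner.learn_one(1, 0)
    return learner.predict(1)


with_forgetting = run(decay=0.9)      # old evidence fades away
without_forgetting = run(decay=1.0)   # old evidence is never discounted
print(with_forgetting, without_forgetting)   # prints 0 1
```

With forgetting, the learner follows the new concept. Without it, the 300 older instances still outvote the 200 recent ones and the prediction stays stale. The decay value trades adaptation speed against stability and would need tuning on real data.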
Approaches for handling concept drift
Concept drift handling algorithms can be categorized in different ways, such as:
1) Online vs. batch approaches
2) Single classifier vs. ensemble-based approaches
3) Incremental vs. non-incremental approaches
4) Active vs. passive approaches
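To illustrate the last distinction: active approaches are generally understood as those that explicitly detect drift and then react, while passive approaches keep updating the model without any explicit detection. The sketch below is a hypothetical, simplified detector written for this post, not a published method. The WindowErrorDetector name, the window size and the threshold are all assumptions. It compares the model's error rate over the most recent window with the error rate of the first full window and signals drift when the gap is large.

```python
from collections import deque


class WindowErrorDetector:
    """Toy active drift detector based on a sliding window of errors."""

    def __init__(self, window=50, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.reference = None                # error rate of the first full window
        self.recent = deque(maxlen=window)   # most recent 0/1 errors

    def update(self, error):
        """Feed 1 if the model's prediction was wrong, else 0.

        Returns True when the recent error rate exceeds the reference
        error rate by more than the threshold.
        """
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False                     # not enough evidence yet
        rate = sum(self.recent) / self.window
        if self.reference is None:
            self.reference = rate            # remember the baseline
            return False
        return rate - self.reference > self.threshold


# Made-up error sequence: the model is right for 200 steps, then always wrong.
errors = [0] * 200 + [1] * 200
detector = WindowErrorDetector(window=50, threshold=0.3)
first_alarm = None
for t, e in enumerate(errors):
    if detector.update(e) and first_alarm is None:
        first_alarm = t
print(first_alarm)   # a step shortly after 200, where the errors begin
```

An active system would respond to the alarm, for example by retraining or replacing the model, whereas a passive one would rely on continuous updates alone. A smaller window reacts faster but is more easily fooled by noise.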
In one of the next posts, we will discuss the ensemble approach, or multiple classifier system, in detail.