What is the Missing Features Problem?
The guiding principle of machine learning and AI-based applications is “garbage in, garbage out”. Hence, data cleaning and exploratory data analysis are key steps of the data preparation phase. Real-world data comes with noise, which may include incorrect data, unnecessary extra data, missing values, and missing attributes/features. Amongst these, the missing features problem is particularly difficult to handle, and it is one of the most common problems in data preparation/pre-processing.
The integrity and completeness of data are necessary for any classification system. A classifier trained on complete data cannot process examples with missing features; it needs special training to address this challenge.
Missing data occurs frequently in real-world applications: malfunctioning sensors, wrong pixel information, blank answers to survey questions, failed equipment, medical tests that cannot be carried out under certain conditions, etc. All of these are common real-life situations that result in missing features. Feature values that fall beyond a certain dynamic minimum or maximum limit of the data, caused by extreme noise, signal saturation, data corruption, etc., can also be considered missing features.
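As a rough illustration of the last point, out-of-range readings can be flagged as missing before any further processing. The sketch below is plain Python; the function name `mask_out_of_range`, the fixed limits, and the sensor readings are all made up for the example (the text speaks of dynamic limits, which would have to be estimated from the data).

```python
from typing import List, Optional


def mask_out_of_range(
    values: List[float], lower: float, upper: float
) -> List[Optional[float]]:
    """Replace readings outside [lower, upper] with None (treated as missing)."""
    return [v if lower <= v <= upper else None for v in values]


# Hypothetical temperature readings: 999.0 mimics a saturated sensor,
# -40.5 mimics a corrupted value. The limits are assumed, not derived.
readings = [21.3, 22.1, 999.0, 20.8, -40.5]
cleaned = mask_out_of_range(readings, lower=-10.0, upper=50.0)
print(cleaned)  # [21.3, 22.1, None, 20.8, None]
```

Once such values are marked as missing, they can be handled by the same techniques as any other missing feature.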
A decent machine learning system should be able to cope when there is no information about an attribute. For example, if all missing values are set to some constant (-1, for example), the learning technique should be able to discover that the output does not depend on the attribute when it has this specific value.
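A minimal sketch of this constant encoding, in plain Python. The name `encode_missing`, the `ages` column, and the choice of -1 as the constant are illustrative assumptions; the constant only works if it can never occur as a genuine value of the attribute.

```python
from typing import List, Optional

# Assumption: -1 never occurs as a real value of this attribute.
SENTINEL = -1.0


def encode_missing(
    column: List[Optional[float]], sentinel: float = SENTINEL
) -> List[float]:
    """Replace every missing entry (None) with a fixed sentinel value."""
    return [sentinel if v is None else v for v in column]


# Hypothetical "age" column with two missing entries.
ages = [34.0, None, 51.0, None, 27.0]
encoded = encode_missing(ages)
print(encoded)  # [34.0, -1.0, 51.0, -1.0, 27.0]
```

Whether a learner can actually pick up on the constant depends on the model: one that can isolate a single value of an attribute is more likely to do so than one that treats -1 as an ordinary magnitude.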
How to deal with the missing features problem?
The first thing to understand is that there is NO single best approach to handling missing data. One needs to try several approaches to see which one works better.
Some methods deal with missing values quite well by themselves, such as Bayesian classification methods and the Expectation-Maximization (EM) technique. In addition, some algorithms, such as classification and regression trees and XGBoost, can treat a missing value as a distinct value of its own when constructing the predictive model.
The simplest way to deal with missing data is to ignore the instances that have missing attributes, an approach generally known as filtering or listwise deletion. When a large portion of the data has missing features, filtering is suboptimal (and impractical, because the conditions it needs in order to work well rarely hold).
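Filtering can be sketched in a few lines of plain Python. The helper name `drop_incomplete_rows` and the toy data are hypothetical; the example also reports how much data survives, which is the number to watch before committing to this approach.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def drop_incomplete_rows(rows: List[Row]) -> List[Row]:
    """Keep only the rows in which every attribute is present."""
    return [row for row in rows if all(v is not None for v in row)]


# Toy dataset: None marks a missing attribute.
data = [
    (5.1, 3.5, 1.4),
    (4.9, None, 1.4),
    (6.2, 2.9, None),
    (5.9, 3.0, 5.1),
]
complete = drop_incomplete_rows(data)
retained = len(complete) / len(data)  # fraction of rows that survive
print(complete)  # [(5.1, 3.5, 1.4), (5.9, 3.0, 5.1)]
print(retained)  # 0.5
```

Here half of the records are lost even though only two individual values are missing, which illustrates why filtering scales badly as missingness grows.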
How missing data should be treated in machine learning depends on the algorithm you choose to work with. Feature reduction can be a good starting point: it identifies the important features that form your subgroups, and it can confirm whether features with missing values are safe to remove.
General methods to deal with the missing features problem are:
- Discard the record (row/column) that has missing values
Simply eliminate the feature(s) from your input data. If a feature is frequently missing in the dataset you have (training and test), it does not make much sense to use it for training a supervised classification algorithm anyway.
- In the case of numerical data, or if NA is present in th[...], encode NAs as -1 or -9999
- Use the number of missing values in a given record to create a newly engineered attribute in the dataset. Missingness can carry a lot of useful signal, and this is a good way to encode that information.
- Imputation of missing values
This means substituting the missing value with a different, sensible value. There are several ways to choose the replacement value:
- set the missing value to zero
- set the missing value to the average of the attribute over the available instances (replace with the mean/median value)
- imputation using K Nearest Neighbours technique
R code for Imputation using KNN:
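library(DMwR)                         # provides knnImputation()
knnOutput <- knnImputation(mydata)    # fills each NA from its nearest neighbours (default k = 10)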
- Preprocess the data into another format that handles missing values better.
- Use a classification algorithm that handles the missing values problem itself during the model building phase, such as KNN, XGBoost, regression trees, decision trees and random forests.
- There are several other approaches, such as neural network based methods and neuro-fuzzy algorithms. Algorithms based on the general fuzzy min-max neural network or ARTMAP, and fuzzy c-means clustering, are some examples. Ensemble based approaches have also been proposed. An algorithm such as DECORATE, which generates artificial data (without missing values) from the existing data (with missing values), is robust to missing features.
Combining an ensemble of one-class classifiers, each trained on a single feature, is another technique found in the literature. This method can handle any combination of missing data with a smaller number of classifiers. It can be quite powerful as long as a single feature is able to estimate the underlying decision boundaries. However, that is often not a reasonable assumption.
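Three of the methods listed above (the missing-value count as an engineered attribute, mean imputation, and KNN imputation) can be sketched in plain Python as follows. All function names and the toy data are made up for illustration. The KNN variant uses an unweighted average of the k closest rows, measured only on attributes both rows have, and is not a port of the R function `knnImputation`. The mean imputation assumes every column has at least one observed value.

```python
import math
from statistics import mean
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing attribute


def missing_count(row: Row) -> int:
    """Number of missing attributes in a record (usable as an extra feature)."""
    return sum(v is None for v in row)


def impute_mean(rows: List[Row]) -> List[List[float]]:
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a: Row, b: Row) -> float:
    """Root mean squared difference over attributes observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # nothing to compare on: treat as infinitely far
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def impute_knn(rows: List[Row], k: int = 2) -> List[Row]:
    """Fill each missing value with the mean of that attribute over the
    k nearest other rows in which the attribute is observed."""
    result = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (r for idx, r in enumerate(rows) if idx != i and r[j] is not None),
                    key=lambda r: _distance(row, r),
                )[:k]
                if donors:  # leave the gap if no row has this attribute
                    filled[j] = mean(r[j] for r in donors)
        result.append(filled)
    return result


# Toy dataset: the second record is missing its second attribute.
rows = [[1.0, 2.0], [1.5, None], [5.0, 10.0], [2.0, 3.0]]

counts = [missing_count(r) for r in rows]
print(counts)               # [0, 1, 0, 0]
print(impute_mean(rows)[1])  # [1.5, 5.0]
print(impute_knn(rows)[1])   # [1.5, 2.5]
```

On this toy data the two imputations disagree noticeably: the column mean (5.0) is pulled up by the distant third record, while the KNN estimate (2.5) only uses the two records closest to the incomplete one. Which of the two is preferable depends on the data, in line with the point that no single approach is best.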
Most popular R packages that handle missing data are:
- My review paper that describes the problem in an incremental learning environment