In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods. It is typically an initial step in analyzing data from an experiment. EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed. In other words, Exploratory Data Analysis (EDA) is a philosophy of data analysis that employs a number of methods (mostly graphical) to:
1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models
7. determine optimal factor settings
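As one illustration of item 4 in the list above, outliers are often flagged with the interquartile-range (IQR) rule that also underlies box plots. The snippet below is a minimal sketch, not a prescribed method: the function name `tukey_outliers`, the multiplier `k=1.5`, and the sample `readings` are all assumptions made up for this example.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule.

    Hypothetical helper for illustration; k=1.5 is a common convention,
    not a requirement.
    """
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # anything outside the fences is flagged for a closer look
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# made-up sample data with one suspicious reading
readings = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 19.7, 12.2]
lower, upper, outliers = tukey_outliers(readings)
print(f"fences: [{lower:.3f}, {upper:.3f}]  flagged: {outliers}")
```

In the exploratory spirit, a flagged value is a prompt to look at the data more closely, not an automatic reason to delete it.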
Exploratory Data Analysis is not the same as statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on a single aspect of characterizing the data. Exploratory Data Analysis covers more ground; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of letting the data themselves reveal their underlying structure and model. EDA is not a mere collection of techniques; it is a philosophy of how we dissect a data set, what we look for, how we look, and how we interpret what we find. It is true that EDA makes heavy use of the techniques known as “statistical graphics”, but it is not identical to statistical graphics per se.
EDA aims to:
1. Suggest hypotheses about the causes of observed phenomena
2. Assess the assumptions on which statistical inference will be based
3. Support the selection of appropriate statistical tools and techniques
4. Provide a basis for further data collection through surveys or experiments
Many EDA methods have been adopted into data mining and big data analytics. They are also taught to young students as a way of introducing them to statistical thinking.
Most Exploratory Data Analysis techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that the main role of EDA is to explore open-mindedly, and graphics gives analysts unmatched power to do so, enticing the data to reveal their structural secrets and keeping the analyst ready to gain some new, often unexpected, insight. Combined with the natural pattern-recognition abilities we all possess, graphics provides unparalleled power to carry this out.
The graphical techniques used in Exploratory Data Analysis are usually quite simple, consisting of various ways of:
1. Plotting the raw data
Examples: data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots.
2. Plotting simple statistics
Examples: mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
3. Positioning such plots to maximize natural pattern-recognition capabilities
Example: multiple plots per page.
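To make the idea of plotting the raw data concrete, here is a rough text-only sketch of a histogram, one of the raw-data plots listed above. It uses only the Python standard library; the helper name `text_histogram`, the bin width, and the sample values are assumptions chosen for illustration, and a plotting library would normally be used instead.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count values per fixed-width bin and render one row of '#' per bin.

    Hypothetical helper; assumes non-negative integer data and an
    integer bin width to keep the binning arithmetic exact.
    """
    counts = Counter(v // bin_width for v in values)
    rows = []
    # walk every bin between the smallest and largest, so empty bins show up
    for b in range(min(counts), max(counts) + 1):
        lo = b * bin_width
        hi = lo + bin_width - 1
        rows.append(f"{lo:>3}-{hi:<3} | {'#' * counts[b]}")
    return counts, rows

# made-up sample data
values = [2, 3, 5, 7, 8, 8, 9, 12, 13, 21]
counts, rows = text_histogram(values, bin_width=5)
print("\n".join(rows))
```

Keeping the empty bin in the output is deliberate: a gap in the distribution is exactly the kind of feature an exploratory plot should make visible.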
Common techniques used in Exploratory Data Analysis include:
3. Run chart
4. Pareto chart
5. Scatter plot
6. Stem-and-leaf plot
7. Parallel coordinates
8. Odds ratio
9. Multidimensional scaling
10. Targeted projection pursuit
11. Principal component analysis
12. Multilinear PCA
13. Projection methods such as grand tour, guided tour and manual tour
14. Interactive versions of these plots
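The stem-and-leaf plot from the list above is simple enough to sketch in a few lines. The version below is a minimal illustration under stated assumptions: it handles only non-negative integers, splits each value into a tens stem and a units leaf, and uses a hypothetical function name (`stem_and_leaf`) and made-up scores.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by tens digit (stem) and units digit (leaf).

    Hypothetical helper for illustration only.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    # include empty stems so gaps in the data remain visible
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

# made-up sample data
scores = [47, 52, 58, 61, 63, 63, 69, 85]
display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, this display keeps every original value readable while still showing the overall shape of the distribution.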
Common quantitative techniques in Exploratory Data Analysis include:
1. Median polish
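Median polish fits an additive model to a two-way table (value ≈ overall + row effect + column effect + residual) by repeatedly sweeping out row and column medians. The following is a simplified sketch of that idea, not a reference implementation: the function name `median_polish`, the `max_iter` and `tol` parameters, and the example table are assumptions introduced here.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-9):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j].

    Hypothetical helper; stops when all swept medians are below tol
    or after max_iter passes.
    """
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # sweep row medians out of the residuals into the row effects
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        # move the median of the column effects into the overall term
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # sweep column medians out of the residuals into the column effects
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # move the median of the row effects into the overall term
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# made-up additive table with one contaminated cell (90 instead of 9)
table = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 90]]
overall, rows, cols, residuals = median_polish(table)
print(overall, rows, cols)
print(residuals)
```

Because medians are used instead of means, the contaminated cell ends up isolated in the residuals instead of distorting the row and column effects, which is why the technique suits exploratory work.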
Various tools can be used to carry out EDA but, needless to say, EDA is characterized more by the attitude taken than by particular techniques or tools.
Open-source tools available for Exploratory Data Analysis include:
1. Weka: includes visualization and Exploratory Data Analysis tools such as targeted projection pursuit
2. KNIME: Konstanz Information Miner – data exploration platform based on Eclipse.
3. Orange: data mining and machine learning software suite
Note: This post assumes the reader is familiar with the basics of statistics and data analysis.