Exploratory Data Analysis (EDA) Techniques
Introduction
Exploratory Data Analysis (EDA) is a crucial phase in the data cleaning and preprocessing process. Its primary goal is to understand the underlying structure of the data, identify potential issues, and formulate hypotheses for further analysis. EDA helps to discover patterns, spot anomalies, and check assumptions through visual and quantitative methods.

Key EDA Techniques
1. Descriptive Statistics
Descriptive statistics provide a summary of the data, offering a quick overview of its central tendency, dispersion, and shape. Common metrics include:

- Mean: The average value.
- Median: The middle value, which is less affected by outliers.
- Mode: The most frequently occurring value.
- Standard Deviation: A measure of the amount of variation or dispersion.

Example
```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 29, 35, 29, 30, 22],
    'Salary': [50000, 54000, 58000, 62000, 59000, 57000, 50000]
})

descriptive_stats = df.describe()
print(descriptive_stats)
# Displays count, mean, std, min, 25%, 50%, 75%, max
```
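The summary from describe() does not report the mode, and it does not show how differently the mean and the median react to an outlier. The following is a minimal sketch of both points. The sample values, including the 250000 salary added as a deliberate outlier, are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries; the last value is a deliberate outlier.
salaries = pd.Series([50000, 54000, 58000, 62000, 59000, 57000, 50000, 250000])

mean_salary = salaries.mean()            # pulled upward by the outlier
median_salary = salaries.median()        # middle of the sorted values
mode_salary = salaries.mode().tolist()   # mode() returns a Series because ties are possible
std_salary = salaries.std()              # sample standard deviation (ddof=1 by default)

print(f"mean:   {mean_salary:.1f}")
print(f"median: {median_salary:.1f}")
print(f"mode:   {mode_salary}")
print(f"std:    {std_salary:.1f}")
```

With this data the mean is 80000 while the median stays at 57500, which is why the median is often preferred as a summary of skewed data.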
2. Data Visualization
Visual representations of data can reveal insights that are not immediately obvious from descriptive statistics.

- Histograms: Useful for understanding the distribution of a numerical variable.
- Box Plots: Great for visualizing the range and identifying outliers.
- Scatter Plots: Help in identifying relationships between two numerical variables.

Example
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['Salary'], bins=10, kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
```
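The box plot and scatter plot mentioned in the list can be sketched in the same way. This version uses plain matplotlib and rebuilds the sample DataFrame so that it runs on its own. The Agg backend and the output file name eda_plots.png are assumptions made so the script also works without a display; in a notebook, both can be dropped in favour of plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 29, 35, 29, 30, 22],
    "Salary": [50000, 54000, 58000, 62000, 59000, 57000, 50000],
})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(12, 5))

# Box plot: median, quartiles, whiskers, and any points beyond the whiskers
ax_box.boxplot(df["Salary"].to_numpy())
ax_box.set_title("Salary Box Plot")
ax_box.set_ylabel("Salary")

# Scatter plot: one point per row, to inspect the Age/Salary relationship
ax_scatter.scatter(df["Age"], df["Salary"])
ax_scatter.set_title("Age vs. Salary")
ax_scatter.set_xlabel("Age")
ax_scatter.set_ylabel("Salary")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file name
```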
3. Correlation Analysis
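Correlation analysis measures how strongly pairs of numerical variables move together. The Pearson correlation coefficient, which pandas computes by default, ranges from -1 to +1: the sign gives the direction of the linear relationship and the magnitude gives its strength, with values near 0 indicating little or no linear relationship.

Example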
```python
correlation = df.corr()
print(correlation)  # Displays correlation matrix

sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
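To make the coefficient itself less of a black box, the sketch below computes the Pearson correlation between Age and Salary directly from its definition (the covariance divided by the product of the standard deviations) and compares it with the pandas result. The variable names are illustrative, and the sample DataFrame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 29, 35, 29, 30, 22],
    "Salary": [50000, 54000, 58000, 62000, 59000, 57000, 50000],
})

x = df["Age"].to_numpy(dtype=float)
y = df["Salary"].to_numpy(dtype=float)

# Pearson r = sum((x - mean_x) * (y - mean_y))
#             / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# Series.corr uses the Pearson method by default
r_pandas = df["Age"].corr(df["Salary"])

print(f"manual: {r_manual:.4f}")
print(f"pandas: {r_pandas:.4f}")
```

Both values should agree up to floating-point error. A coefficient close to +1 here only describes a linear association in this small sample and says nothing about causation.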
4. Handling Missing Values
Identifying and visualizing missing data is an essential aspect of EDA. Techniques include:

- Missing Value Heatmap: Visualizing missing data in a matrix form.
- Imputation Strategies: Filling missing values using mean, median, or mode, or even more complex methods like KNN.

Example
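```python
import numpy as np

# Creating missing values
# Cast to float first: NaN cannot be stored in an integer column
df['Salary'] = df['Salary'].astype(float)
df.loc[2, 'Salary'] = np.nan

# Heatmap of missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
```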
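The imputation strategies listed above can be illustrated with a minimal sketch of median imputation, which is one option among several and not necessarily the best choice for a given dataset. The column name Salary_imputed is hypothetical, and the DataFrame is rebuilt with one missing salary so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 29, 35, 29, 30, 22],
    "Salary": [50000, 54000, np.nan, 62000, 59000, 57000, 50000],
})

# Count missing values per column before deciding how to handle them
missing_counts = df.isnull().sum()
print(missing_counts)

# median() skips NaN by default, so it is computed from the observed values only
salary_median = df["Salary"].median()

# Keep the original column intact and write the filled values to a new one
df["Salary_imputed"] = df["Salary"].fillna(salary_median)
print(df)
```

Writing the result to a separate column keeps the original values available, which makes it easy to compare distributions before and after imputation.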