Exploratory Data Analysis (EDA): Unveiling Insights in Data Exploration
Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis, commonly known as EDA, is a crucial step in the data analysis process that involves uncovering patterns, trends, and insights in a dataset. It serves as the preliminary phase before diving into more complex analyses or machine learning models. The primary goal of EDA is to understand the structure and characteristics of the data, providing a foundation for informed decision-making and hypothesis generation.
The Essence of EDA
1. Data Understanding:
- Descriptive Statistics: EDA begins with basic statistical measures such as mean, median, mode, standard deviation, and quartiles. These metrics offer a quick overview of the central tendency and dispersion of the data.
- Visualization Techniques: Graphical representations, including histograms, box plots, and scatter plots, are powerful tools in EDA. Visualization helps in identifying patterns, outliers, and potential relationships between variables.
2. Handling Missing Data:
- Identification: EDA includes the detection of missing values in the dataset, understanding their distribution, and deciding on appropriate strategies for handling them.
- Imputation: Techniques like mean or median imputation, forward or backward filling, and more advanced methods are employed to fill missing values, ensuring a complete and reliable dataset.
3. Outlier Detection:
- Statistical Methods: EDA involves the use of statistical methods, such as the IQR (Interquartile Range), to identify outliers in the data. Outliers can significantly impact analyses and model performance.
- Visualization: Box plots and scatter plots are effective visual tools for spotting outliers that may require special attention or further investigation.
4. Bivariate and Multivariate Analysis:
- Correlation: Exploring relationships between numerical variables using correlation coefficients helps understand how changes in one variable relate to changes in another.
- Contingency Tables: For categorical variables, creating contingency tables allows for the examination of associations and dependencies between different categories.
5. Feature Engineering:
- Creating New Variables: EDA often involves the creation of new features based on existing ones. This can include transformations, categorizations, or the development of composite indices to enhance the dataset.
- Dimensionality Reduction: Techniques like PCA (Principal Component Analysis) may be considered to reduce the number of features while retaining essential information.
6. One-Hot Encoding and Categorical Analysis:
- Converting Categorical to Numeric: EDA includes strategies like one-hot encoding to convert categorical variables into a format suitable for machine learning models.
- Distribution Analysis: Understanding the distribution of categorical variables through frequency tables and bar charts aids in grasping the diversity within each category.
Dataset Used — Uncleaned Laptop Price dataset (kaggle.com)
Dataset after clean — laptop cleaned data (kaggle.com)
1. Understanding the Data
1.1 Overview
The initial step involves gaining a quick understanding of the dataset. Utilizing SQL queries, we examine the first and last few rows, as well as a random sample.
-- head, tail, and sample
SELECT * FROM laptops ORDER BY `index` LIMIT 5;
SELECT * FROM laptops ORDER BY `index` DESC LIMIT 5;
SELECT * FROM laptops ORDER BY RAND() LIMIT 5;
2. Univariate Analysis
2.1 Numerical Columns
For numerical columns like ‘price,’ we perform basic statistical analysis and determine quartiles.
-- basic statistics for 'price'
SELECT COUNT(price), MIN(price), MAX(price), AVG(price), STD(price) FROM laptops;
-- quartiles for 'price'
SELECT
COUNT(price),
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY price) OVER () AS 'Q1',
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY price) OVER () AS 'Median',
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY price) OVER () AS 'Q3'
FROM laptops;
2.2 Handling Null Values and Outliers
Identifying and handling null values and outliers is crucial for accurate analysis.
-- identifying null values in 'price'
SELECT price FROM laptops WHERE price IS NULL;
-- identifying outliers using quartiles
SELECT *
FROM (
SELECT *,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY price) OVER () AS 'Q1',
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY price) OVER () AS 'Median'
FROM laptops
) t
WHERE t.Price < t.Q1 - (1.5 * (t.Q3 - t1.Q1)) OR t.Price > t.Q3 + 1.5 * (t.Q3 - t1.Q1);
2.3 Histograms
Creating a histogram to visualize the distribution of prices in different buckets.
-- creating buckets and histogram
SELECT t.buckets, REPEAT('*', COUNT(*) / 5)
FROM (
SELECT price,
CASE
WHEN price BETWEEN 0 AND 25000 THEN '0-25k'
WHEN price BETWEEN 25001 AND 50000 THEN '25k-50k'
WHEN price BETWEEN 50001 AND 75000 THEN '50k-75k'
WHEN price BETWEEN 75001 AND 100000 THEN '75k-100k'
WHEN price > 100000 THEN '>100k'
END AS 'buckets'
FROM laptops
) t
GROUP BY t.buckets;
3. Bivariate Analysis
3.1 Numerical Columns and Numerical Columns
Explore relationships between numerical columns like ‘cpu_speed’ and ‘price.’
-- exploring relationship between 'cpu_speed' and 'price'
SELECT cpu_speed, price FROM laptops;
-- exploring relationship between 'primary_storage' and 'price'
SELECT primary_storage, price FROM laptops;
-- correlation between 'cpu_speed' and 'price'
SELECT CORR(cpu_speed, price) FROM laptops;
3.2 Categorical Columns
Understand the distribution of categorical data such as ‘company’ and ‘cpu_brand.’
-- distribution of 'company'
SELECT company, COUNT(company) FROM laptops GROUP BY Company;
-- distribution of 'cpu_brand'
SELECT cpu_brand, COUNT(cpu_brand) FROM laptops GROUP BY cpu_brand;
3.3 Categorical and Numerical Columns
Examine the relationship between categorical and numerical columns.
-- contingency table for 'company' and 'touchscreen'
SELECT Company,
SUM(CASE WHEN touchscreen = 1 THEN 1 ELSE 0 END) AS 'Touchscreen_yes',
SUM(CASE WHEN touchscreen = 0 THEN 1 ELSE 0 END) AS 'Touchscreen_no'
FROM laptops
GROUP BY Company;
-- statistics for 'price' based on 'company'
SELECT Company, MIN(price), MAX(price), AVG(price), STD(price)
FROM laptops
GROUP BY Company;
4. Dealing with Missing Values
Handling missing values is essential for ensuring the integrity of the dataset.
-- replace missing values in 'price' with mean of all prices
UPDATE laptops
SET price = (SELECT AVG(price) FROM laptops)
WHERE price IS NULL;
-- replace missing values in 'price' with mean price of corresponding company
UPDATE laptops l1
SET price = (SELECT AVG(price) FROM laptops l2 WHERE l2.company = l1.company)
WHERE price IS NULL;
-- replace missing values in 'price' with mean price of corresponding company and CPU
UPDATE laptops l1
SET price = (
SELECT AVG(price) FROM laptops l2
WHERE l2.company = l1.company AND l2.cpu_name = l1.cpu_name
)
WHERE price IS NULL;
5. Feature Engineering
Enhance the dataset by creating a new feature, ‘ppi’ (pixels per inch).
-- calculating 'ppi'
ALTER TABLE laptops ADD COLUMN ppi INTEGER;
UPDATE laptops SET ppi = SQRT(resolution_width * resolution_width + resolution_height * resolution_height) / Inches;
-- categorizing screen sizes
ALTER TABLE laptops ADD COLUMN screensize VARCHAR(255) AFTER Inches;
UPDATE laptops
SET screensize =
CASE
WHEN Inches < 14.0 THEN 'small'
WHEN inches >=14.0 AND inches < 17.0 THEN 'medium'
ELSE 'large'
END;
5.1 Analyzing the Engineered Features
-- analyzing 'screensize' and its impact on 'price'
SELECT screensize, AVG(price) FROM laptops GROUP BY screensize;
6. One-Hot Encoding
Transform categorical data into numerical format using one-hot encoding.
-- one-hot encoding for 'gpu_brand'
SELECT
gpu_brand,
CASE WHEN gpu_brand = 'Intel' THEN 1 ELSE 0 END AS 'intel',
CASE WHEN gpu_brand = 'AMD' THEN 1 ELSE 0 END AS 'amd',
CASE WHEN gpu_brand = 'nvidia' THEN 1 ELSE 0 END AS 'nvidia',
CASE WHEN gpu_brand = 'arm' THEN 1 ELSE 0 END AS 'arm'
FROM laptops;
Conclusion
Exploratory Data Analysis is a powerful tool for extracting meaningful insights from datasets. By leveraging SQL queries, we’ve navigated through basic statistics, handled missing values, performed feature engineering, and gained valuable insights into the relationships within the laptop dataset. This journey showcases the importance of EDA in understanding and preparing data for further analysis and modeling.
Previous Blog: Mastering Data Preprocessing and Cleaning
In our recent blog, we took a deep dive into the intricacies of data preprocessing and cleaning — a foundational step in the data analysis journey. From handling missing values to outlier detection, our exploration covered essential techniques, empowering you to transform raw data into a reliable, robust foundation for further analysis. If you haven’t already, delve into the insights shared in that blog to elevate your data preparation skills and ensure the integrity of your analytical endeavors. Discover the power of clean, well-prepared data as the cornerstone of impactful analyses.
data cleaning blog — data cleaning data handling data preprocessing laptop data | Medium