Beginner’s Guide to Exploratory Data Analysis (EDA)

My father gave me an Excel sheet with his work details and asked me to infer anything and everything that I could. This is what I did.

What do you mean by Exploratory Data Analysis?

Wikipedia says “EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods”. It is an approach that helps you get to know a new dataset before you model it. If you use it wisely, a large part of your analysis is already done.

For example: When you are buying something online, do you read the reviews? Do you see the price? Do you see the color? Do you swipe through the few images used to describe the object? — Yep, that is you performing an exploratory data analysis on the object you are interested in buying.

A few steps on EDA —

Understand your variables

df.head() : shows you the first five samples in your dataset.

  • It makes it easier for you to understand the data type of each feature, i.e. string-based, date-time-based, integer-based, etc.
  • Makes you familiar with the data you are using.
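To make this concrete, here is a minimal sketch of df.head() on a tiny made-up table. The column names (State, Size, Cost) echo the ones mentioned in this post, but the rows are hypothetical and are not the real data.

```python
import pandas as pd

# Hypothetical toy data: column names echo the article, values are made up.
df = pd.DataFrame({
    "State": ["Karnataka", "Maharashtra", "Uttar Pradesh", "Kerala", "Goa", "Punjab"],
    "Size": ["S", "M", "L", "M", "XL", "S"],
    "Cost": [425, 990, 1500, 2990, 750, 1200],
})

first_rows = df.head()    # first five rows by default
top_three = df.head(3)    # pass n to ask for a different number of rows

print(first_rows)
print(top_three)
```

Passing a number, as in df.head(3), is handy when five rows are too many or too few to get a feel for the columns.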

df.describe(): shows you a generic overview of all the quantitative variables. It summarizes the count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for each quantitative variable.

  • If you look closely, you can see that Discount has a max of Rs 1,04,761! Is it even possible to have a discount worth 1 lakh, especially when all the products range from Rs 425 to Rs 2990 (refer to cost)? Voila! We found our first outlier.
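The same kind of check can be sketched in code. The example below uses a small hypothetical table with Cost and Discount columns (the values are invented, apart from reusing the suspicious maximum quoted above) and flags any discount larger than the most expensive product. That rule is just one simple heuristic I am assuming here, not the only way to find outliers.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    "Cost": [425, 990, 1500, 2990, 750],
    "Discount": [50, 100, 104761, 200, 75],
})

summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# Heuristic (an assumption for this sketch): a discount bigger than the
# costliest product cannot be right, so treat it as an outlier.
suspicious = df[df["Discount"] > df["Cost"].max()]
print(suspicious)
```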

A few other initial functions to use are:

  1. df.shape: shows you the number of samples and features in your dataset. It is an attribute, so it is used without parentheses.
  2. df.dtypes: shows you the data type of all the features.
  3. df.columns: returns the names of all the columns in your dataset. This is also an attribute, not a method.
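Here is a short sketch of these three on a hypothetical two-row table. Note that in pandas shape, dtypes and columns are attributes, so calling them with parentheses raises an error.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "State": ["Karnataka", "Kerala"],
    "Size": ["S", "M"],
    "Cost": [425, 2990],
})

n_rows, n_cols = df.shape          # (number of samples, number of features)
column_names = list(df.columns)    # column labels as a plain Python list

print(n_rows, n_cols)
print(df.dtypes)                   # one data type per column
print(column_names)
```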

Explore your Dataset

df[variable_name].value_counts(): returns how many times each unique value occurs in variable_name, sorted from most to least frequent. For example, for State: Uttar Pradesh had the highest number of buyers (120), followed by Karnataka (93) and Maharashtra (90).
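A minimal sketch of value_counts() on a hypothetical State column is shown below. The counts are tiny and made up; they are not the 120/93/90 from the real sheet.

```python
import pandas as pd

# Hypothetical toy data; the real dataset is not reproduced here.
df = pd.DataFrame({
    "State": [
        "Uttar Pradesh", "Karnataka", "Uttar Pradesh",
        "Maharashtra", "Uttar Pradesh", "Karnataka",
    ]
})

# Counts per unique value, sorted in descending order by default.
state_counts = df["State"].value_counts()
print(state_counts)
```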

  • Clean your dataset whenever needed. Sometimes you don’t want any missing values (NaN) in your features. You can remove the samples with a missing value for a particular feature with df[~df['Size'].isna()]
  • To better understand your features and to extract insights quicker, try plotting your features. In the above plot, you can easily see that Sizes: S, M, L are sold the most.
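The two points above can be sketched together: drop the rows with a missing Size, then count what is left, which is the data a bar chart of sizes would show. The table is hypothetical, and dropna(subset=...) is included only as an equivalent alternative to the ~isna() filter.

```python
import pandas as pd

# Hypothetical toy data with two missing sizes.
df = pd.DataFrame({
    "Size": ["S", "M", None, "L", "M", None],
    "Cost": [425, 990, 1500, 2990, 750, 1200],
})

# Keep only the rows where Size is present.
cleaned = df[~df["Size"].isna()]

# Equivalent, and arguably easier to read.
same = df.dropna(subset=["Size"])

# Counts per size: the numbers behind a bar chart of sizes sold.
size_counts = cleaned["Size"].value_counts()
print(size_counts)
```

If matplotlib is installed, size_counts.plot(kind="bar") should draw these counts as a bar chart.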

I wanted to know the sizes that should be there in production beforehand for different stores in different states. As we can see from the above data, I need more 3XL’s for Andhra Pradesh than Arunachal Pradesh. Using this analysis, I can make my inventory management more efficient.

df.groupby(['State', 'Size']).size(): shows the number of samples for each unique combination of State and Size.
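As a rough sketch, this is what the grouping looks like on a handful of invented orders. The states and sizes are borrowed from the text above, but the counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy orders; counts are made up for illustration.
df = pd.DataFrame({
    "State": [
        "Andhra Pradesh", "Andhra Pradesh", "Andhra Pradesh",
        "Arunachal Pradesh", "Arunachal Pradesh",
    ],
    "Size": ["3XL", "3XL", "M", "3XL", "M"],
})

# One count per (State, Size) combination that appears in the data.
combo_counts = df.groupby(["State", "Size"]).size()
print(combo_counts)

# Optional: pivot sizes into columns for an easier side-by-side comparison.
table = combo_counts.unstack(fill_value=0)
print(table)
```

The unstacked table puts one state per row and one size per column, which makes it easier to compare, say, 3XL demand across states.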

Since I did not want any outliers in my selling price column, I am only keeping samples where the selling price is less than 10,000 INR, and checking that the filter worked by plotting a histogram.
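A minimal sketch of that filter is below, on hypothetical prices. The column name Selling Price is my assumption about how the sheet labels it. Instead of drawing the histogram, the sketch checks the maximum after filtering, which answers the same question.

```python
import pandas as pd

# Hypothetical toy data; "Selling Price" is an assumed column name.
df = pd.DataFrame({"Selling Price": [850, 1200, 2990, 104761, 4500, 9999]})

# Keep only the rows priced below 10,000 INR.
filtered = df[df["Selling Price"] < 10000]

# Sanity check without a plot: the largest remaining price.
max_price = filtered["Selling Price"].max()
print(filtered["Selling Price"].describe())
print(max_price)
```

With matplotlib installed, filtered["Selling Price"].plot(kind="hist") should give the visual version of the same check.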

Later, I want to compare which states are willing to buy more expensive dresses. Since Selling Price is a continuous variable, we separate it into bins, where the bin labeled 1 is the least expensive and the bin labeled 10 is the most expensive.
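One way to do this binning is pd.cut. The sketch below assumes ten fixed-width bins of Rs 1000 each between 0 and 10,000; that choice and the column name are my assumptions, since other binning schemes (for example quantile-based ones) would work too. The prices are hypothetical.

```python
import pandas as pd

# Hypothetical toy prices, all below the 10,000 INR cutoff.
prices = pd.DataFrame({"Selling Price": [500, 1500, 2500, 4800, 7200, 9900]})

# Assumed scheme: edges 0, 1000, ..., 10000 give ten equal-width bins.
edges = list(range(0, 10001, 1000))
labels = list(range(1, 11))   # 1 = least expensive, 10 = most expensive

# Intervals are right-inclusive by default: (0, 1000], (1000, 2000], ...
prices["binned"] = pd.cut(prices["Selling Price"], bins=edges, labels=labels)
print(prices)
```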

Now, we will groupby() on binned and state to get the name of the states based on their willingness to spend on my product.

As we can see, orders from Andhra Pradesh fall in bins 1 to 5, meaning the Selling Price ranges from Rs 1000 to Rs 5000, which will again help me with my inventory management.
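Putting the last two steps together, here is a rough sketch of grouping by state and price bin. It again assumes ten fixed-width Rs 1000 bins and uses made-up orders, so the numbers only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical toy orders; values are invented for illustration.
df = pd.DataFrame({
    "State": ["Andhra Pradesh"] * 3 + ["Karnataka"] * 2,
    "Selling Price": [900, 2500, 4800, 7200, 9900],
})

# Assumed binning: ten equal-width bins of Rs 1000, labeled 1 to 10.
edges = list(range(0, 10001, 1000))
labels = list(range(1, 11))
df["binned"] = pd.cut(df["Selling Price"], bins=edges, labels=labels).astype(int)

# Number of orders per (State, bin) combination.
orders_per_bin = df.groupby(["State", "binned"]).size()
print(orders_per_bin)

# Lowest and highest bin seen in each state.
bin_range = df.groupby("State")["binned"].agg(["min", "max"])
print(bin_range)
```

The min and max columns give, per state, the cheapest and the most expensive bin that actually received orders.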

There are several other types of visualizations that you can use depending on the dataset.

I hope y’all have a much better understanding of exploratory data analysis now.

tl;dr —

  1. Understand your features.
  2. Select various features to explore.
  3. Deal with unclean data.
  4. Plots and lots of plots!


Masters in Data Science @ NYU
