Data Wrangling in Python
CIT Club event blog & notebook. The aim of this blog/notebook is to introduce CIT Club members to data wrangling.
- Introductions
- 1. Prerequisites
- 2. Reading Data
- 3. Data Exploration
- 4. Reshaping Data
- 5. Advanced Data Wrangling
- 6. Exporting Data
- 7. More Resources
- About the Speaker
Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.
Hello and welcome to Data Wrangling in Python. This is the notebook/presentation/blog used in the Data Science CIT Club event; which one it is depends on how you've accessed it. Depending on how you got it, here is how to work with it and access the data sets and the entire project:
- Github: clone the entire project; instructions are in the README.md file
- Blog
Data Sets
- client_status: shows the client status and loan status at the end of the month in which their status remained the same
- kenya_subcounties: shows the county name and its sub-counties
Before starting data wrangling, we first need to install the necessary libraries.
- Pandas: the most popular library for data analysis and data transformation
- Numpy: a popular library for scientific computing
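If they are not yet installed, here is a minimal sketch for installing from inside a notebook (in a terminal, a plain `pip install pandas numpy` does the same):

#collapse-hide
# install the libraries from within the notebook (skip if already installed)
%pip install pandas numpy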
We must import these libraries first. By convention, pandas is aliased to `pd` and numpy to `np` during import.
#collapse-hide
import pandas as pd
import numpy as np
Some datasets have very long values or very many columns; for viewing purposes, we need to make sure pandas will show all columns in full.
#collapse-hide
pd.set_option('display.max_columns', None)   # show every column
pd.set_option('display.max_colwidth', None)  # show full column contents
Data comes in many formats, but for most data scientists the most popular is the CSV file, and we will use CSV files for our data. In pandas, we use `pd.read_csv` to create a dataframe from a CSV, passing the CSV location as a string (or a variable holding one) as the first argument, e.g. `pd.read_csv("file.csv")`.
#collapse-hide
status = pd.read_csv('data/client_status.csv')
EXERCISE
Create a dataframe known as `counties`; the CSV location is https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv
#collapse-hide
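# one possible solution to the exercise above, reading straight from the GitHub URL
counties = pd.read_csv('https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv')
counties.head()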
Before doing data wrangling or any data cleaning, it is important as a data scientist/analyst to explore your data. This will help you know what the data looks like and how many rows and columns it has. To check the first few rows we use `df.head(n)`, where `df` is the dataframe name and `n` is the number of rows you want returned. The same applies to `df.tail` and `df.sample`.
By default, without passing any argument, `head()` returns the first 5 rows.
#collapse-hide
status.head()
Passing the number of rows you want:
#collapse-hide
status.head(10)
The `tail` method returns the last rows of a dataframe; by default it returns the last 5 rows, but you can specify how many last rows you need returned.
#collapse-hide
status.tail()
The last 4 rows have metadata on when the data was generated. We will handle this situation in the Reshaping Data part.
#collapse-hide
status.tail(10)
`df.sample` is mainly used in sampling techniques; it returns a random sample of items from an axis of the object.
#collapse-hide
status.sample(5)
In pandas, there are various ways to know how many columns (variables) and rows (observations) a dataframe has:
- `len(df)` - returns the number of rows
- `df.shape` - returns the number of rows and columns as a tuple
- `df.info()` - prints dataframe info: the columns, their non-null counts, and the data type of each column
#collapse-hide
status.shape
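From the list above, `len(df)` gives just the row count:

#collapse-hide
# number of rows only
len(status)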
`df.info` prints a summary of the data frame: the number of rows and columns, each column's non-null count, and its dtype.
#collapse-hide
status.info()
We use `df.describe()` to get summary statistics; by default it returns the summary statistics of the numerical columns only.
#collapse-hide
status.describe()
To show summary statistics for all columns, including objects (strings), pass `include="all"` to `df.describe`.
#collapse-hide
status.describe(include='all')
From the tail we looked at above, we saw that the last 4 rows hold metadata on the generation of the data. That metadata is not relevant to our analysis, so we will remove it first.
Let's check some of the irrelevant data.
#collapse-hide
status.tail()
We can drop these last 4 rows by using `df.drop` and setting the `index` argument to a list of the index labels we want to drop. Similarly, `df.drop(columns=cols)`, where `cols` is a list of column names, drops columns.
#collapse-hide
status_2 = status.drop(index=[40020, 40021, 40022, 40023])
status_2.tail()
By default, `drop` returns a new dataframe. To modify the dataframe itself, set the `inplace` argument to `True`.
#collapse-hide
status.drop(index=[40020, 40021, 40022, 40023], inplace=True)
status.tail()
For our data frame, the `Client ID` column uniquely identifies rows. Since we won't be doing analysis on this column, we can set `Client ID` to be the index of our dataframe.
#collapse-hide
status.set_index('Client ID', inplace=True)
status.head()
Sorting is arranging the values/index in the data in either ascending or descending order. To sort by index we use `df.sort_index`; to sort values in columns we use `df.sort_values(cols)`, where `cols` is a column name or a list of column names whose values we want to sort by.
#collapse-hide
status.sort_index(inplace=True)
status.head()
#collapse-hide
status_date_sorted = status.sort_values('Status Date')
status_date_sorted.head()
By default, sorting is in ascending order; we can also sort in descending order by passing the `ascending=False` argument.
#collapse-hide
status.sort_values('Client Score', ascending=False)
Duplicates are rows that are identical to each other, i.e., they have the same values. To check for duplicates we use `df.duplicated`; this returns a boolean series we can use to filter out the repeating rows, and we can use `df.drop_duplicates` to remove any duplicates in the data frame.
First, let's reset the index so that `Client ID` is a column again, as it was before.
#collapse-hide
status.reset_index(inplace=True)
#collapse-hide
status.duplicated()
#collapse-hide
status[status.duplicated()]
`df.drop_duplicates` removes duplicates, keeping the first occurrence by default.
#collapse-hide
status.drop_duplicates(inplace=True)
status.shape
Handling missing values is a major part of data cleaning. Missing values are, well, values that are missing! In this session we won't go deeper into how to handle them. In pandas, to check for missing values we use `df.isnull()` or `df.isna`, and to remove missing values we use `df.dropna`.
#collapse-hide
status.isna()
That returns a dataframe of booleans showing whether each value is null.
#collapse-hide
status.isnull().sum()
`df.isnull().sum()` is the most widely used way of checking the number of missing values per column.
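As a quick sketch of the removal option mentioned above (`fillna`, shown filling `Region` with a placeholder, is a common alternative not covered in this session):

#collapse-hide
# drop rows with any missing value (returns a new frame unless inplace=True)
status_no_na = status.dropna()
# or fill missing Region values with a placeholder instead of dropping them
status_filled = status.fillna({'Region': 'Unknown'})
status_no_na.shape, status_filled.shape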
As a data scientist it is good to check the counts of unique values; this is mainly used to check for imbalance at later stages. To check the number of unique values, we use `df[col].value_counts()`, where `col` is the column name and `df[col]` selects that column, returning a series.
#collapse-hide
status['Region'].value_counts()
By default `.value_counts` drops missing values; we can include their count by setting `dropna=False`.
#collapse-hide
status['Region'].value_counts(dropna=False)
In pandas:
- we can use comparison operators to filter rows meeting a certain condition,
- to filter columns we can use `df[[col1, ..., coln]]`,
- to filter rows based on their index we can use `iloc[i]` or `loc[label]` for integer-based or label-based indexing respectively (a short sketch follows this list).
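Here is a small sketch of those selection styles, using columns we know exist in `status`:

#collapse-hide
# select two columns by name (column filtering)
status[['Region', 'Office']].head()
# first row, by integer position
status.iloc[0]
# row whose index label is 0 (we restored the default integer index earlier)
status.loc[0]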
#collapse-hide
status[status['Region'].isnull()]
We can use comparison operators for filtering values that meet a certain condition.
- `>` : greater than
- `<` : less than
- `>=` : greater than or equal to
- `<=` : less than or equal to
- `==` : equal to
- `!=` : not equal to
#collapse-hide
status['Client Score']>90
`df[col] > value` returns a boolean series; we can use the boolean series to filter out rows in the dataframe not meeting that condition.
#collapse-hide
status[status['Client Score']>90]
We can combine two or more expressions using logical operators. In pandas we can use the following:
- `&` : logical and
- `|` : logical or
- `~` : logical not

Important: parentheses are very important; enclose each expression in parentheses, `(...)`.
#collapse-hide
status[(status['Client Score']>90) & (status['Loan Status']=='P+I In Default')]
Grouping data is a critical step in data wrangling. In pandas we can group a DataFrame using a mapper or by a series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
#collapse-hide
status.groupby('Region')
`groupby` returns a GroupBy object; we can use that object to perform operations on the groups.
#collapse-hide
status.groupby('Region').agg('sum')
#collapse-hide
status.groupby('Region').agg({'Client Score':np.sum})
#collapse-hide
status.groupby('Region').agg('mean')
To check the size of each group, we can use the `size()` method of a groupby object.
#collapse-hide
status.groupby('Region').size()
#collapse-hide
status.groupby('Region')['Office'].nunique()
A pivot table (a concept borrowed from Excel) is an advanced pandas grouping method. Here you can decide on the index, columns, values, and the aggregate functions. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
The `aggfunc` is `mean` by default. You can pass string aggregates, e.g. `'mean'`, `'sum'`, etc., or aggregate functions such as `np.mean`, `np.sum`, etc.
#collapse-hide
status_pivot=pd.pivot_table(status, index='Office', columns='Region', values='Client Score', aggfunc='mean')
status_pivot.head()
Setting `margins=True` adds row and column aggregates (an `All` row and column).
#collapse-hide
pd.pivot_table(status, index='Status Date', columns='Region', values='Client Score', aggfunc=np.mean, margins=True)
There are various ways of combining data sets (a short concat sketch follows the list below):
- append - add rows to the end of the caller.
- concatenate
- merge
- inner join
- left join
- right join
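Before the merge details, here is a minimal `pd.concat` sketch; we split `status` in two purely for illustration:

#collapse-hide
# split the frame in two, then stack the pieces back together row-wise
first_half = status.iloc[:20000]
second_half = status.iloc[20000:]
pd.concat([first_half, second_half]).shape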
The `how` argument controls the type of merge to be performed:
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames, preserves the order of the left keys.
We use `df.merge`/`pd.merge` to bring in columns from other datasets.
#collapse-hide
counties = pd.read_csv('data/kenya_subcounty.csv')
counties.head()
We can merge `status` and `counties` using `Office` and `subCounty` (because that is the key present in both data frames). Since the column names differ between the two data frames, we have to use `left_on` and `right_on` to specify the columns.
If the key column has the same name in both data frames, we can use the `on` argument instead.
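For illustration only (this rename is not part of the original flow): if we make the key names match, a single `on` argument suffices:

#collapse-hide
# hypothetical: rename counties' key so both frames share an 'Office' column
counties_renamed = counties.rename(columns={'subCounty': 'Office'})
status.merge(counties_renamed, on='Office').head()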
#collapse-hide
combined_ = status.merge(counties, left_on='Office', right_on='subCounty')
combined_.head()
Another way is to use the top-level `pd.merge` function:
#collapse-hide
combined_2 = pd.merge(status, counties, left_on='Office', right_on='subCounty')
combined_2.head()
After finishing data wrangling, it is good to export the data to the right file format for further use. Let us do some final tweaks and export the final dataset as a CSV.
#collapse-hide
# We will use the newly combined dataset
# 1. Rename columns and remove unwanted columns
combined_2.rename(columns={'name':'county'}, inplace=True)
combined_2.drop(columns=['subCounty'], inplace=True)
combined_2.dropna(inplace=True)
combined_3 = pd.pivot_table(combined_2, index='county', columns='Loan Status', values='Client Score', aggfunc=np.median)
display(combined_3.head())
combined_3.to_csv('combined.csv')  # keep the county index as a column in the CSV
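pandas can export to other formats too; a quick sketch (note: `to_excel` needs openpyxl installed, and `to_parquet` needs pyarrow or fastparquet):

#collapse-hide
# other export options (require the optional dependencies noted above)
combined_3.to_excel('combined.xlsx')
combined_3.to_parquet('combined.parquet')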
Thank you for attending today's session.
- For more consultation, you can schedule a meeting via Calendly
- Send Email: info.victoromondi@gmail.com or victor.omondi@inukaafrica.com
- Call/Text/Whatsapp: +254797817059