Data Wrangling in Python

CIT Club event blog & notebook. The aim of this blog/notebook was to introduce CIT Club members to data wrangling.
data-wrangling
python
cit-club
Author

Victor Omondi

Published

June 11, 2021

Introduction

Wikipedia Definition

Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.

Hello and welcome to Data Wrangling in Python. This is a notebook/presentation/blog used in the Data Science CIT Club event; what you call it depends on how you've accessed it. Depending on how you got this notebook, here is how to work with it to access the data sets and the entire project:

  • GitHub - clone the entire project; instructions are in the README.md file
  • Blog - click on any of the banners shown at the top of the project
  • Open In Colab
  • Binder

Data Sets

  • client status: shows each client's status and loan status as at the end of the month in which their status remained the same
  • kenya_subcounties: shows the county names and their sub-counties

1. Prerequisites

Before starting data wrangling, we first need to install the necessary libraries:

  • pandas: the popular library used for data analysis and data transformation
  • numpy: a popular library for scientific computing
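
If they are not already installed, a minimal install sketch (run from a notebook cell; assumes pip is available):

Code
!pip install pandas numpy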

We must import these libraries first; by convention, pandas is aliased to pd and numpy to np during import.

Code
import pandas as pd
import numpy as np

For viewing purposes, some datasets may have very long values or very many columns, so we need to make sure pandas will be able to show all columns.

Note

Pandas usually truncates the displayed columns if there are many of them.


Code
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

2. Reading Data


Data comes in very many formats, but for most data scientists the most popular is the CSV file, and we will use a CSV file for our data. Below is how to read a CSV with pandas: we use pd.read_csv to create a pandas dataframe from a CSV, passing the CSV location (a path or URL, as a string or variable) as the first argument, e.g. pd.read_csv("file.csv").

Code
status = pd.read_csv('data/client_status.csv')

EXERCISE

Create a dataframe known as counties; the CSV location is https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv

3. Data Exploration


Before doing data wrangling or any data cleaning, it is important as a data scientist/analyst to explore your data. This will help you know what the data looks like and how many rows and columns it has. To check the first few rows we use df.head(n), where df is the dataframe name and n is the number of rows you want returned. The same applies to df.tail and df.sample.

3.1 First few rows

By default, without passing any argument, head() returns the first 5 rows.

Code
status.head()
Region Office Client ID Client Status Loan Status Client Score Status Date
0 South Rift AgriBiz Ndunyu Njeru 85119.0 New P+I Current NaN 2021-02-28
1 South Rift Region Nakuru 2005.0 Dormant Penalties 44.83 2018-09-30
2 Eastern Region Wote 69865.0 Dormant Penalties 50.65 2020-11-30
3 Western Region Bungoma 55339.0 Dormant Penalties 60.71 2020-07-31
4 South Rift Region Kericho 38049.0 Dormant Penalties 0.15 2019-06-30

You can also pass the number of rows you want:

Code
status.head(10)
Region Office Client ID Client Status Loan Status Client Score Status Date
0 South Rift AgriBiz Ndunyu Njeru 85119.0 New P+I Current NaN 2021-02-28
1 South Rift Region Nakuru 2005.0 Dormant Penalties 44.83 2018-09-30
2 Eastern Region Wote 69865.0 Dormant Penalties 50.65 2020-11-30
3 Western Region Bungoma 55339.0 Dormant Penalties 60.71 2020-07-31
4 South Rift Region Kericho 38049.0 Dormant Penalties 0.15 2019-06-30
5 South Rift Region Narok 33391.0 Dormant Penalties 0.94 2019-04-30
6 North Rift Region Iten 53675.0 Dormant Penalties 35.24 2020-05-31
7 North Rift Region Eldoret3 16893.0 Active Penalties 26.21 2020-12-31
8 Nairobi Region Kiambu 35681.0 Dormant Penalties 46.78 2019-05-31
9 North Rift Region Kitale 24807.0 Dormant Penalties 36.94 2019-02-28

EXERCISE

For the dataframe we created (counties), return the first 15 rows.

3.2 Last few rows

The tail method returns the last rows of a dataframe; by default it returns the last 5 rows, but you can specify the number of rows you need returned.

Code
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN
Tip

As you can see, the last 4 rows contain metadata on when the data was generated. We will handle this situation in the Reshaping Data part.

Code
status.tail(10)
Region Office Client ID Client Status Loan Status Client Score Status Date
40014 South Rift AgriBiz Maili Nne 48513.0 Active Penalties 56.17 2021-01-31
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN

EXERCISE

For the dataframe (counties) we created, return the last 15 rows.

3.3 Sample

df.sample is mainly used in sampling techniques; it returns a random sample of items from an axis of the object.

Code
status.sample(5)
Region Office Client ID Client Status Loan Status Client Score Status Date
24131 North Rift Region Nandi Hills 39393.0 Dormant Penalties 43.14 2020-01-31
2638 Western Region Bungoma 71754.0 Dormant Penalties 55.37 2021-02-28
3398 North Rift Region Eldoret1 32182.0 Dormant Penalties 4.35 2019-04-30
24100 Coast Region Voi 73956.0 Dormant Penalties 40.49 2021-02-28
31920 North Rift Region Kapsabet 6664.0 Dormant Penalties 34.83 2019-03-31

EXERCISE

For the dataframe (counties) we created, return a sample of 10 rows.

3.4. Number of Rows and Columns

In pandas, there are various ways to know how many columns (variables) and rows (observations) a dataframe has:

  1. len(df) - returns the number of rows (see the quick sketch below)
  2. df.shape - returns the number of rows and columns as a tuple
  3. df.info() - prints dataframe info: the columns, their non-null counts, and their data types

Code
status.shape
(40024, 7)
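
len gives the row count directly; a one-line sketch:

Code
len(status)
40024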

EXERCISE

For the dataframe (counties) we created, how many rows and columns does it have?

3.5 Info

df.info prints a summary of the data frame, i.e., the number of rows, the columns with their non-null counts, and their dtypes.

Code
status.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40024 entries, 0 to 40023
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Region         39944 non-null  object 
 1   Office         40020 non-null  object 
 2   Client ID      40020 non-null  float64
 3   Client Status  40020 non-null  object 
 4   Loan Status    40020 non-null  object 
 5   Client Score   36720 non-null  float64
 6   Status Date    40020 non-null  object 
dtypes: float64(2), object(5)
memory usage: 2.1+ MB

EXERCISE

For the dataframe (counties) we created, how many missing values are there in the name column?

3.6. Describe

We use df.describe() to get summary statistics; by default it returns the summary statistics of the numerical columns.

Code
status.describe()
Client ID Client Score
count 40020.000000 36720.000000
mean 43747.996652 35.593555
std 24860.118889 20.538354
min 2.000000 0.150000
25% 22291.750000 21.887500
50% 44105.000000 36.700000
75% 65170.500000 49.822500
max 86654.000000 92.540000

To show all summary statistics, including those for objects (strings), you can pass include="all" to df.describe.

Code
status.describe(include='all')
Region Office Client ID Client Status Loan Status Client Score Status Date
count 39944 40020 40020.000000 40020 40020 36720.000000 40020
unique 12 57 NaN 4 4 NaN 67
top Nairobi Region Embakasi NaN Dormant Penalties NaN 2021-02-28
freq 9860 1561 NaN 30535 28071 NaN 8430
mean NaN NaN 43747.996652 NaN NaN 35.593555 NaN
std NaN NaN 24860.118889 NaN NaN 20.538354 NaN
min NaN NaN 2.000000 NaN NaN 0.150000 NaN
25% NaN NaN 22291.750000 NaN NaN 21.887500 NaN
50% NaN NaN 44105.000000 NaN NaN 36.700000 NaN
75% NaN NaN 65170.500000 NaN NaN 49.822500 NaN
max NaN NaN 86654.000000 NaN NaN 92.540000 NaN

EXERCISE

For the dataframe (counties) we created, how many unique sub-counties are there in Kenya?

4. Reshaping Data


From the tail we looked at above, we saw that the last 4 rows contain metadata about the generation of the data. That metadata is not important to our analysis, so we will remove it first.

4.1. Dropping Irrelevant Data

Let’s check some of the irrelevant data.

Code
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN

We can drop these last 4 rows by using df.drop and assigning the index argument a list of the indices we want to drop.

Important

If you want to drop columns, use df.drop(columns=cols) where cols is a list of the column names you want to drop.
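
As a quick illustration (hypothetical here, since we keep all our columns), dropping a column would look like:

Code
# hypothetical: drop the Client Score column, returning a new frame
status_no_score = status.drop(columns=['Client Score'])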

Code
status_2 = status.drop(index=[40020, 40021, 40022, 40023])
status_2.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
Note

We can also drop in place, with no need to assign the result to a new variable, by setting the inplace argument to True.

Code
status.drop(index=[40020, 40021, 40022, 40023], inplace=True)
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30

4.2. Set Index

In our data frame, the Client ID column uniquely identifies rows. Since we won't be doing analysis on this column, we can set Client ID to be the index of our dataframe.

Code
status.set_index('Client ID', inplace=True)
status.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
85119.0 South Rift AgriBiz Ndunyu Njeru New P+I Current NaN 2021-02-28
2005.0 South Rift Region Nakuru Dormant Penalties 44.83 2018-09-30
69865.0 Eastern Region Wote Dormant Penalties 50.65 2020-11-30
55339.0 Western Region Bungoma Dormant Penalties 60.71 2020-07-31
38049.0 South Rift Region Kericho Dormant Penalties 0.15 2019-06-30

4.3. Sorting

Sorting is arranging the values/index of the data in either ascending or descending order. To sort by the index we use df.sort_index; to sort by column values we use df.sort_values(cols), where cols is a column name or a list of column names to sort by.

Code
status.sort_index(inplace=True)
status.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
2.0 Nairobi Region Ngong Road Dormant Penalties 37.98 2020-04-30
7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
10.0 South Rift Region Nakuru Active Penalties 42.19 2020-12-31
12.0 South Rift Region Nakuru Active P+I Current 65.34 2021-02-28
13.0 South Rift Region Nakuru Dormant Pending Renewal 57.59 2016-02-29
Code
status_date_sorted = status.sort_values('Status Date')
status_date_sorted.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
2598.0 Nairobi Region Rongai Inactive Pending Renewal NaN 2015-02-28
1201.0 Nairobi Region Ngong Road Inactive Pending Renewal NaN 2015-02-28
2859.0 South Rift Region Naivasha Inactive Pending Renewal NaN 2015-06-30
1058.0 Nairobi Region Kahawa Dormant Pending Renewal NaN 2015-10-31

By default, sorting is in ascending order; we can also sort in descending order by passing the ascending=False argument.

Code
status.sort_values('Client Score', ascending=False)
Region Office Client Status Loan Status Client Score Status Date
Client ID
28468.0 North Rift Region Kapenguria Active P+I Current 92.54 2021-02-28
74947.0 South Rift AgriBiz Ndunyu Njeru Active P+I Current 92.17 2021-02-28
64738.0 Eastern Region Mwingi Active P+I Current 92.02 2021-02-28
31851.0 Nairobi Region Rongai Active P+I Current 91.34 2021-02-28
49324.0 South Rift AgriBiz Maili Nne Active P+I Current 91.21 2021-02-28
... ... ... ... ... ... ...
86301.0 Nairobi Region Ngong Road New P+I Current NaN 2021-02-28
86303.0 Nairobi Region Kahawa New Pending Renewal NaN 2021-02-28
86308.0 Nairobi Region Kahawa New P+I Current NaN 2021-02-28
86313.0 North Rift Region Eldoret2 New Pending Renewal NaN 2021-02-28
86654.0 Western Region Bondo New Pending Renewal NaN 2021-02-28

40020 rows × 6 columns

EXERCISE

For the dataframe (counties) we created, sort the data frame as follows:

  1. name: descending
  2. subCounty: ascending

Tip

you can use ascending=[False, True].

4.4. Duplicates

Duplicates are rows that are identical to each other, i.e., they have the same values. To check for duplicates we use df.duplicated, which returns a boolean series; we can use the boolean series to filter out the repeating values, and we can use df.drop_duplicates to remove any duplicates in the data frame.

Important

The first occurrence won't be removed by default.

To check for duplicates, we will first reset the index; this will make Client ID a column again, as it was before.

Code
status.reset_index(inplace=True)
Code
status.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
40015    False
40016    False
40017    False
40018    False
40019    False
Length: 40020, dtype: bool
Code
status[status.duplicated()]
Client ID Region Office Client Status Loan Status Client Score Status Date
5051 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5052 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5053 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5054 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
9624 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9625 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9626 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9627 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
11194 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11195 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11196 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11197 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
22397 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22398 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22399 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22400 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
35401 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35402 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35403 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35404 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30

EXERCISE

For the dataframe (counties), are there any duplicates in that dataframe?

4.5. Drop Duplicates

df.drop_duplicates removes duplicates, keeping the first occurrence by default; it also accepts keep and subset arguments, as sketched below.
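
A minimal sketch of those options (illustration only; below we stick with the defaults):

Code
# keep the last occurrence of each duplicate instead of the first
status.drop_duplicates(keep='last')
# consider only some columns when looking for duplicates
status.drop_duplicates(subset=['Client ID', 'Status Date'])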

Code
status.drop_duplicates(inplace=True)
status.shape
(40000, 7)

4.6. Missing Values

Missing values are mostly dealt with during data cleaning. Missing values are, well, values that are missing! In this session we won't go deeper into how to handle missing values. In pandas, to check for missing values we use df.isnull() or df.isna, and to remove missing values we use df.dropna.

Code
status.isna()
Client ID Region Office Client Status Loan Status Client Score Status Date
0 False False False False False False False
1 False True False False False True False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
40015 False False False False False True False
40016 False False False False False True False
40017 False False False False False True False
40018 False False False False False True False
40019 False False False False False True False

40000 rows × 7 columns

That returns a dataframe of booleans showing whether each value is null.

Code
status.isnull().sum()
Client ID           0
Region             80
Office              0
Client Status       0
Loan Status         0
Client Score     3300
Status Date         0
dtype: int64

df.isnull().sum() is the most widely used way of checking the number of missing values per column.
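
A minimal sketch of acting on missing values (illustration only; we leave status unchanged here):

Code
# drop rows that are missing a Client Score
status.dropna(subset=['Client Score'])
# or fill missing scores with the column median
status['Client Score'].fillna(status['Client Score'].median())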

5. Advanced Data Wrangling


5.1. Value counts

As a data scientist it is good to check the counts of unique values; this is mainly used to check for imbalance at later stages. To check the counts of unique values, we use df[col].value_counts(), where col is the column name.

Note

df[col] selects a column and returns a series, where col is the column name

Code
status['Region'].value_counts()
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
Name: Region, dtype: int64

By default, .value_counts drops missing values; we can include the count of missing values by setting dropna=False

Code
status['Region'].value_counts(dropna=False)
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
NaN                     80
Name: Region, dtype: int64

EXERCISE

For the dataframe (counties), show the value counts of the name variable.

5.2. Data Filtration

In pandas:

  • we can use comparison operators to filter data meeting a certain condition,
  • to filter columns we can use df[[col1, ..., coln]],
  • to filter rows based on their index we can use iloc[i] or loc[label] for integer-based or label-based indexing respectively.
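
A quick sketch of column selection and index-based selection on our status frame:

Code
status[['Region', 'Office']]   # filter two columns
status.iloc[0]                 # first row, by integer position
status.loc[0, 'Region']        # row with index label 0, column 'Region'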

Code
status[status['Region'].isnull()]
Client ID Region Office Client Status Loan Status Client Score Status Date
1 7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
1993 4700.0 NaN Head Office Inactive Pending Renewal 15.15 2020-09-30
2156 5064.0 NaN Fraud Dormant Penalties 23.38 2020-08-31
2279 5322.0 NaN Fraud Dormant Penalties 23.96 2020-08-31
2407 5621.0 NaN Fraud Dormant Penalties 13.77 2020-08-31
... ... ... ... ... ... ... ...
38004 82121.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38100 82284.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38227 82560.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38296 82709.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38503 83150.0 NaN Fraud Active P+I In Default NaN 2020-12-31

80 rows × 7 columns

We can use comparison operators for filtering values that meet a certain condition:

  • > - greater than
  • < - less than
  • >= - greater than or equal to
  • <= - less than or equal to
  • == - equal to
  • != - not equal to

Code
status['Client Score']>90
0        False
1        False
2        False
3        False
4        False
         ...  
40015    False
40016    False
40017    False
40018    False
40019    False
Name: Client Score, Length: 40000, dtype: bool

df[col]>value returns a boolean series; we can use the boolean series to filter out the rows of the dataframe that do not meet the condition.

Code
status[status['Client Score']>90]
Client ID Region Office Client Status Loan Status Client Score Status Date
2641 6150.0 South Rift Region Kericho Active Pending Renewal 90.59 2021-02-28
8535 19100.0 South Rift Region Litein Active P+I Current 90.85 2021-02-28
12657 27981.0 South Rift AgriBiz Ndunyu Njeru Active Pending Renewal 90.58 2021-02-28
12876 28468.0 North Rift Region Kapenguria Active P+I Current 92.54 2021-02-28
14405 31851.0 Nairobi Region Rongai Active P+I Current 91.34 2021-02-28
22443 49324.0 South Rift AgriBiz Maili Nne Active P+I Current 91.21 2021-02-28
26438 57810.0 South Rift AgriBiz Maili Nne Active P+I In Default 90.92 2021-02-28
29804 64738.0 Eastern Region Mwingi Active P+I Current 92.02 2021-02-28
30564 66344.0 Central AgriBiz Githunguri Active P+I Current 90.33 2021-02-28
32576 70685.0 South Rift AgriBiz Londiani Active P+I Current 90.77 2021-02-28
34593 74947.0 South Rift AgriBiz Ndunyu Njeru Active P+I Current 92.17 2021-02-28
35239 76301.0 Nairobi Region Kiambu Active P+I Current 90.04 2021-02-28

5.2.1. Multiple Conditions Filtering

We can use logical operators to combine two or more expressions. In pandas we can use the following operators:

  • &: for logical and
  • |: for logical or
  • ~: for logical not
Important

Parentheses are very important: enclose each expression in parentheses (.....)
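
The | and ~ operators work the same way; a quick sketch on our data:

Code
# clients in either of two regions
status[(status['Region'] == 'Coast Region') | (status['Region'] == 'Eastern Region')]
# everything except Dormant clients
status[~(status['Client Status'] == 'Dormant')]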

Code
status[(status['Client Score']>90) & (status['Loan Status']=='P+I In Default')]
Client ID Region Office Client Status Loan Status Client Score Status Date
26438 57810.0 South Rift AgriBiz Maili Nne Active P+I In Default 90.92 2021-02-28

EXERCISE

For the dataframe (counties), filter the data to select the rows where name is equal to 'Nakuru' and subCounty is not equal to 'Naivasha'.

5.3. Grouping Data

Grouping data is a critical step in data wrangling. In pandas we can group a DataFrame using a mapper or by a series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Code
status.groupby('Region')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2757be5a00>

The groupby returns an object; we can use it to perform operations on the groups.

Code
status.groupby('Region').agg('sum')
Client ID Client Score
Region
Central AgriBiz 36163955.0 20355.84
Coast Region 223714866.0 157343.10
Eastern Region 151738957.0 61644.78
Nairobi Region 306914995.0 309115.63
North Rift Region 326377934.0 270228.72
South Rift AgriBiz 201143618.0 168818.01
South Rift Region 270779301.0 211088.41
Western Region 229155555.0 107269.71
Code
status.groupby('Region').agg({'Client Score':np.sum})
Client Score
Region
Central AgriBiz 20355.84
Coast Region 157343.10
Eastern Region 61644.78
Nairobi Region 309115.63
North Rift Region 270228.72
South Rift AgriBiz 168818.01
South Rift Region 211088.41
Western Region 107269.71
Code
status.groupby('Region').agg('mean')
Client ID Client Score
Region
Central AgriBiz 55127.980183 49.527591
Coast Region 47997.182150 35.792334
Eastern Region 67379.643428 31.244187
Nairobi Region 31139.914265 32.839226
North Rift Region 38370.319069 34.059582
South Rift AgriBiz 52272.250000 53.729475
South Rift Region 39932.060316 33.389499
Western Region 68201.058036 35.205025

To check the size of each group, we can use the size() method of a groupby object.

Code
status.groupby('Region').size()
Region
Central AgriBiz        656
Coast Region          4661
Eastern Region        2252
Nairobi Region        9856
North Rift Region     8506
South Rift AgriBiz    3848
South Rift Region     6781
Western Region        3360
dtype: int64
Code
status.groupby('Region')['Office'].nunique()
Region
Central AgriBiz       2
Coast Region          8
Eastern Region        7
Nairobi Region        8
North Rift Region     9
South Rift AgriBiz    6
South Rift Region     8
Western Region        7
Name: Office, dtype: int64

The pivot table (derived from Excel) is an advanced pandas grouping method. Here you can decide on the index, columns, values and the aggregate functions. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

The aggfunc by default is mean. You can pass string aggregates, e.g. 'mean', 'sum', etc., or aggregate functions such as np.mean, np.sum, etc.

Code
status_pivot=pd.pivot_table(status, index='Office', columns='Region', values='Client Score', aggfunc='mean')
status_pivot.head()
Region Central AgriBiz Coast Region Eastern Region Nairobi Region North Rift Region South Rift AgriBiz South Rift Region Western Region
Office
Bomet NaN NaN NaN NaN NaN NaN 33.836232 NaN
Bondo NaN NaN NaN NaN NaN NaN NaN 54.472500
Bungoma NaN NaN NaN NaN NaN NaN NaN 32.724249
Busia NaN NaN NaN NaN NaN NaN NaN 36.784428
Changamwe NaN 36.532172 NaN NaN NaN NaN NaN NaN

Setting margins=True will add a row aggregate and a column aggregate (the All row and column).

Code
pd.pivot_table(status, index='Status Date', columns='Region', values='Client Score', aggfunc=np.mean, margins=True)
Region Central AgriBiz Coast Region Eastern Region Nairobi Region North Rift Region South Rift AgriBiz South Rift Region Western Region All
Status Date
2016-01-31 NaN NaN NaN 44.733333 NaN NaN NaN NaN 44.733333
2016-02-29 NaN NaN NaN 35.430541 NaN NaN 37.207143 NaN 35.713182
2016-03-31 NaN NaN NaN 34.614533 NaN NaN 33.599565 NaN 34.376327
2016-04-30 NaN NaN NaN 30.816381 NaN NaN 33.833704 NaN 31.433561
2016-05-31 NaN NaN NaN 37.960885 NaN NaN 41.225000 NaN 38.681241
... ... ... ... ... ... ... ... ... ...
2020-11-30 47.642000 34.752153 31.791702 34.263287 38.249737 50.445581 33.603857 32.730902 34.959187
2020-12-31 43.470714 37.190167 30.627154 35.245165 39.696990 49.850078 34.649405 38.350320 37.861864
2021-01-31 50.570377 36.696154 34.157852 38.617857 41.534072 54.294533 36.930952 35.335667 41.915679
2021-02-28 55.504430 48.855651 41.270897 47.413198 49.778757 56.811797 47.444950 45.280237 49.599350
All 49.527591 35.792334 31.244187 32.839226 34.059582 53.729475 33.389499 35.205025 35.642344

63 rows × 9 columns

5.4. Combining Data

There are various ways of combining data sets:

  • append - adds rows to the end of the caller
  • concatenate - stacks data frames along an axis (see the pd.concat sketch below)
  • merge - inner join, left join, right join

The type of merge to be performed:

  • left: use only keys from the left frame, similar to a SQL left outer join; preserves key order.
  • right: use only keys from the right frame, similar to a SQL right outer join; preserves key order.
  • outer: use the union of keys from both frames, similar to a SQL full outer join; sorts keys lexicographically.
  • inner: use the intersection of keys from both frames, similar to a SQL inner join; preserves the order of the left keys.
  • cross: creates the cartesian product from both frames; preserves the order of the left keys.
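
Before merging, a minimal pd.concat sketch for the stacking case (illustration only):

Code
# split status into two pieces and stack them back together
part_a = status.iloc[:100]
part_b = status.iloc[100:200]
stacked = pd.concat([part_a, part_b])  # 200 rows, same columns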

We use df.merge/pd.merge to import columns from other datasets.

Code
counties = pd.read_csv('data/kenya_subcounty.csv')
counties.head()
name subCounty
0 Mombasa Changamwe
1 Mombasa Jomvu
2 Mombasa Kisauni
3 Mombasa Likoni
4 Mombasa Mvita

We can merge status and counties using Office and subCounty (this is because that is the key present in both data frames). Since we have different column names in the two data frames, we have to use left_on and right_on to specify the columns.

If the key column has the same name in both data frames, we can use the on argument instead.
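
A hypothetical sketch (df1 and df2 are made-up frames sharing a subCounty column, which is not the case for our data):

Code
# hypothetical frames that share the key column name
df1 = pd.DataFrame({'subCounty': ['Naivasha'], 'clients': [10]})
df2 = pd.DataFrame({'subCounty': ['Naivasha'], 'county': ['Nakuru']})
df1.merge(df2, on='subCounty')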

Code
combined_ = status.merge(counties, left_on='Office', right_on='subCounty')
combined_.head()
Client ID Region Office Client Status Loan Status Client Score Status Date name subCounty
0 33.0 South Rift Region Naivasha Dormant Penalties 40.00 2020-06-30 Nakuru Naivasha
1 34.0 South Rift Region Naivasha Dormant Pending Renewal 40.23 2016-03-31 Nakuru Naivasha
2 206.0 South Rift Region Naivasha Active P+I Current 76.87 2021-02-28 Nakuru Naivasha
3 615.0 South Rift Region Naivasha Active P+I Current 58.43 2021-02-28 Nakuru Naivasha
4 702.0 South Rift Region Naivasha Dormant Pending Renewal 43.73 2016-08-31 Nakuru Naivasha

Another method is using the pd.merge function:

Code
combined_2 = pd.merge(status, counties, left_on='Office', right_on='subCounty')
combined_2.head()
Client ID Region Office Client Status Loan Status Client Score Status Date name subCounty
0 33.0 South Rift Region Naivasha Dormant Penalties 40.00 2020-06-30 Nakuru Naivasha
1 34.0 South Rift Region Naivasha Dormant Pending Renewal 40.23 2016-03-31 Nakuru Naivasha
2 206.0 South Rift Region Naivasha Active P+I Current 76.87 2021-02-28 Nakuru Naivasha
3 615.0 South Rift Region Naivasha Active P+I Current 58.43 2021-02-28 Nakuru Naivasha
4 702.0 South Rift Region Naivasha Dormant Pending Renewal 43.73 2016-08-31 Nakuru Naivasha

EXERCISE

For the dataframe (counties), do a full outer join with the status dataframe. How many rows does it have?

6. Exporting Data

After finishing data wrangling, it is good to export the data to the right file format for further use. Let us do some final tweaks and export the final dataset as a CSV.

Code
# We will use the newly combined dataset

# 1. Rename columns and remove unwanted columns
combined_2.rename(columns={'name':'county'}, inplace=True)
combined_2.drop(columns=['subCounty'], inplace=True)
combined_2.dropna(inplace=True)
combined_3 = pd.pivot_table(combined_2, index='county', columns='Loan Status', values='Client Score', aggfunc=np.median)
display(combined_3.head())
combined_3.to_csv('combined.csv', index=False)
Loan Status P+I Current P+I In Default Penalties Pending Renewal
county
Busia 64.19 47.110 32.130 41.91
Kericho 64.92 61.170 45.910 62.42
Kiambu 60.76 48.570 32.665 50.66
Kilifi 67.01 54.220 36.005 52.41
Laikipia NaN 44.435 26.275 52.40
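
CSV is not the only option; a quick sketch of two other writers pandas provides (to_excel assumes an engine such as openpyxl is installed):

Code
combined_3.to_excel('combined.xlsx')  # Excel workbook
combined_3.to_json('combined.json')   # JSON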

7. More Resources

  1. https://pandas.pydata.org/pandas-docs/stable/index.html
  2. https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

About the Speaker

Victor

GitHub Twitter LinkedIn Sponsors

Thank you for attending today's session.