Data Wrangling in Python

CIT Club event blog & notebook. The aim of this blog/notebook was to introduce CIT Club members to data wrangling.
data-wrangling
python
cit-club
Author

Victor Omondi

Published

June 11, 2021

Introduction

Wikipedia Definition

Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.

Hello and welcome to Data Wrangling in Python. This is a notebook/presentation/blog used in the Data Science CIT Club event; what you call it depends on how you've accessed it. Depending on how you got this notebook, here is how to work with it to access the data sets and the entire project:

  • GitHub - clone the entire project; instructions are in the README.md file
  • Blog - click on any of the banners shown at the top of the project
  • Open In Colab
  • Binder

Data Sets

  • client status: shows each client's status and loan status as at the end of the month in which their status remained the same
  • kenya_subcounties: shows the county names and their sub-counties

1. Prerequisites

Before starting data wrangling, we first need to install the necessary libraries:

  • pandas: the popular library used for data analysis and data transformation
  • numpy: a popular library for scientific computing
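
If they are not already installed, a minimal install sketch (run from a notebook cell; assumes pip is available):

Code
!pip install pandas numpy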

We must import these libraries first; by convention, pandas is aliased to pd and numpy to np during import.

Code
import pandas as pd
import numpy as np

For viewing purposes, some datasets may have very long values or very many columns, so we need to make sure pandas will be able to show all columns.

Note

Pandas usually truncates the displayed columns if there are many of them.


Code
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

2. Reading Data


Data comes in very many formats, but for most data scientists the most popular is the CSV file, and we will use a CSV file for our data. Below is how to read a CSV with pandas: we use pd.read_csv to create a pandas dataframe from a CSV, passing the CSV location (a path or URL, as a string or variable) as the first argument, e.g. pd.read_csv("file.csv").

Code
status = pd.read_csv('data/client_status.csv')

EXERCISE

Create a dataframe known as counties; the CSV location is https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv

3. Data Exploration


Before doing data wrangling or any data cleaning, it is important as a data scientist/analyst to explore your data. This will help you know what the data looks like and how many rows and columns it has. To check the first few rows we use df.head(n), where df is the dataframe name and n is the number of rows you want returned. The same applies to df.tail and df.sample.

3.1 First few rows

By default, without passing any argument, head() returns the first 5 rows.

Code
status.head()
Region Office Client ID Client Status Loan Status Client Score Status Date
0 South Rift AgriBiz Ndunyu Njeru 85119.0 New P+I Current NaN 2021-02-28
1 South Rift Region Nakuru 2005.0 Dormant Penalties 44.83 2018-09-30
2 Eastern Region Wote 69865.0 Dormant Penalties 50.65 2020-11-30
3 Western Region Bungoma 55339.0 Dormant Penalties 60.71 2020-07-31
4 South Rift Region Kericho 38049.0 Dormant Penalties 0.15 2019-06-30

You can also pass the number of rows you want:

Code
status.head(10)
Region Office Client ID Client Status Loan Status Client Score Status Date
0 South Rift AgriBiz Ndunyu Njeru 85119.0 New P+I Current NaN 2021-02-28
1 South Rift Region Nakuru 2005.0 Dormant Penalties 44.83 2018-09-30
2 Eastern Region Wote 69865.0 Dormant Penalties 50.65 2020-11-30
3 Western Region Bungoma 55339.0 Dormant Penalties 60.71 2020-07-31
4 South Rift Region Kericho 38049.0 Dormant Penalties 0.15 2019-06-30
5 South Rift Region Narok 33391.0 Dormant Penalties 0.94 2019-04-30
6 North Rift Region Iten 53675.0 Dormant Penalties 35.24 2020-05-31
7 North Rift Region Eldoret3 16893.0 Active Penalties 26.21 2020-12-31
8 Nairobi Region Kiambu 35681.0 Dormant Penalties 46.78 2019-05-31
9 North Rift Region Kitale 24807.0 Dormant Penalties 36.94 2019-02-28

EXERCISE

For the dataframe we created (counties), return the first 15 rows.

3.2 Last few rows

The tail method returns the last rows of a dataframe; by default it returns the last 5 rows, but you can specify the number of rows you need returned.

Code
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN
Tip

As you can see, the last 4 rows contain metadata on when the data was generated. We will handle this situation in the Reshaping Data part.

Code
status.tail(10)
Region Office Client ID Client Status Loan Status Client Score Status Date
40014 South Rift AgriBiz Maili Nne 48513.0 Active Penalties 56.17 2021-01-31
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN

EXERCISE

For the dataframe (counties) we created, return the last 15 rows.

3.3 Sample

df.sample is mainly used in sampling techniques; it returns a random sample of items from an axis of the object.

Code
status.sample(5)
Region Office Client ID Client Status Loan Status Client Score Status Date
24131 North Rift Region Nandi Hills 39393.0 Dormant Penalties 43.14 2020-01-31
2638 Western Region Bungoma 71754.0 Dormant Penalties 55.37 2021-02-28
3398 North Rift Region Eldoret1 32182.0 Dormant Penalties 4.35 2019-04-30
24100 Coast Region Voi 73956.0 Dormant Penalties 40.49 2021-02-28
31920 North Rift Region Kapsabet 6664.0 Dormant Penalties 34.83 2019-03-31

EXERCISE

For the dataframe (counties) we created, return a sample of 10 rows.

3.4. Number of Rows and Columns

In pandas, there are various ways to know how many columns (variables) and rows (observations) a dataframe has:

  1. len(df) - returns the number of rows (see the quick sketch below)
  2. df.shape - returns the number of rows and columns as a tuple
  3. df.info() - prints dataframe info: the columns, their non-null counts, and their data types

Code
status.shape
(40024, 7)
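
len gives the row count directly; a one-line sketch:

Code
len(status)
40024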

EXERCISE

For the dataframe (counties) we created, how many rows and columns does it have?

3.5 Info

df.info prints a summary of the data frame, i.e., the number of rows, the columns with their non-null counts, and their dtypes.

Code
status.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40024 entries, 0 to 40023
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Region         39944 non-null  object 
 1   Office         40020 non-null  object 
 2   Client ID      40020 non-null  float64
 3   Client Status  40020 non-null  object 
 4   Loan Status    40020 non-null  object 
 5   Client Score   36720 non-null  float64
 6   Status Date    40020 non-null  object 
dtypes: float64(2), object(5)
memory usage: 2.1+ MB

EXERCISE

For the dataframe (counties) we created, how many missing values are there in the name column?

3.6. Describe

We use df.describe() to get summary statistics; by default it returns the summary statistics of the numerical columns.

Code
status.describe()
Client ID Client Score
count 40020.000000 36720.000000
mean 43747.996652 35.593555
std 24860.118889 20.538354
min 2.000000 0.150000
25% 22291.750000 21.887500
50% 44105.000000 36.700000
75% 65170.500000 49.822500
max 86654.000000 92.540000

To show all summary statistics, including those for objects (strings), you can pass include="all" to df.describe.

Code
status.describe(include='all')
Region Office Client ID Client Status Loan Status Client Score Status Date
count 39944 40020 40020.000000 40020 40020 36720.000000 40020
unique 12 57 NaN 4 4 NaN 67
top Nairobi Region Embakasi NaN Dormant Penalties NaN 2021-02-28
freq 9860 1561 NaN 30535 28071 NaN 8430
mean NaN NaN 43747.996652 NaN NaN 35.593555 NaN
std NaN NaN 24860.118889 NaN NaN 20.538354 NaN
min NaN NaN 2.000000 NaN NaN 0.150000 NaN
25% NaN NaN 22291.750000 NaN NaN 21.887500 NaN
50% NaN NaN 44105.000000 NaN NaN 36.700000 NaN
75% NaN NaN 65170.500000 NaN NaN 49.822500 NaN
max NaN NaN 86654.000000 NaN NaN 92.540000 NaN

EXERCISE

For the dataframe (counties) we created, how many unique sub-counties are there in Kenya?

4. Reshaping Data


From the tail we looked at above, we saw that the last 4 rows contain metadata about the generation of the data. That metadata is not important to our analysis, so we will remove it first.

4.1. Dropping Irrelevant Data

Let’s check some of the irrelevant data.

Code
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
40020 Report Title: Client Status (Monthly) NaN NaN NaN NaN NaN NaN
40021 Generated By: Victor Omondi NaN NaN NaN NaN NaN NaN
40022 Generated Time: 2021-03-26 02:44:32 NaN NaN NaN NaN NaN NaN
40023 Report Parameters:Office: Head Office month: 2 year: 2021 NaN NaN NaN NaN NaN NaN

We can drop these last 4 rows by using df.drop and assigning the index argument a list of the indices we want to drop.

Important

If you want to drop columns, use df.drop(columns=cols) where cols is a list of the column names you want to drop.
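
As a quick illustration (hypothetical here, since we keep all our columns), dropping a column would look like:

Code
# hypothetical: drop the Client Score column, returning a new frame
status_no_score = status.drop(columns=['Client Score'])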

Code
status_2 = status.drop(index=[40020, 40021, 40022, 40023])
status_2.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30
Note

We can also drop in place, with no need to assign the result to a new variable, by setting the inplace argument to True.

Code
status.drop(index=[40020, 40021, 40022, 40023], inplace=True)
status.tail()
Region Office Client ID Client Status Loan Status Client Score Status Date
40015 North Rift Region Iten 42221.0 Dormant Penalties 44.50 2019-08-31
40016 Western Region Mbale 54391.0 Active P+I In Default 56.98 2021-02-28
40017 Coast Region Mariakani 67944.0 Dormant P+I In Default 0.15 2020-10-31
40018 North Rift Region Kitale 6888.0 Dormant Penalties 32.42 2020-11-30
40019 Eastern Region Matuu 66991.0 Dormant Penalties 33.38 2020-09-30

4.2. Set Index

In our data frame, the Client ID column uniquely identifies rows. Since we won't be doing analysis on this column, we can set Client ID to be the index of our dataframe.

Code
status.set_index('Client ID', inplace=True)
status.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
85119.0 South Rift AgriBiz Ndunyu Njeru New P+I Current NaN 2021-02-28
2005.0 South Rift Region Nakuru Dormant Penalties 44.83 2018-09-30
69865.0 Eastern Region Wote Dormant Penalties 50.65 2020-11-30
55339.0 Western Region Bungoma Dormant Penalties 60.71 2020-07-31
38049.0 South Rift Region Kericho Dormant Penalties 0.15 2019-06-30

4.3. Sorting

Sorting is arranging the values/index of the data in either ascending or descending order. To sort by the index we use df.sort_index; to sort by column values we use df.sort_values(cols), where cols is a column name or a list of column names to sort by.

Code
status.sort_index(inplace=True)
status.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
2.0 Nairobi Region Ngong Road Dormant Penalties 37.98 2020-04-30
7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
10.0 South Rift Region Nakuru Active Penalties 42.19 2020-12-31
12.0 South Rift Region Nakuru Active P+I Current 65.34 2021-02-28
13.0 South Rift Region Nakuru Dormant Pending Renewal 57.59 2016-02-29
Code
status_date_sorted = status.sort_values('Status Date')
status_date_sorted.head()
Region Office Client Status Loan Status Client Score Status Date
Client ID
7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
2598.0 Nairobi Region Rongai Inactive Pending Renewal NaN 2015-02-28
1201.0 Nairobi Region Ngong Road Inactive Pending Renewal NaN 2015-02-28
2859.0 South Rift Region Naivasha Inactive Pending Renewal NaN 2015-06-30
1058.0 Nairobi Region Kahawa Dormant Pending Renewal NaN 2015-10-31

By default, sorting is in ascending order; we can also sort in descending order by passing the ascending=False argument.

Code
status.sort_values('Client Score', ascending=False)
Region Office Client Status Loan Status Client Score Status Date
Client ID
28468.0 North Rift Region Kapenguria Active P+I Current 92.54 2021-02-28
74947.0 South Rift AgriBiz Ndunyu Njeru Active P+I Current 92.17 2021-02-28
64738.0 Eastern Region Mwingi Active P+I Current 92.02 2021-02-28
31851.0 Nairobi Region Rongai Active P+I Current 91.34 2021-02-28
49324.0 South Rift AgriBiz Maili Nne Active P+I Current 91.21 2021-02-28
... ... ... ... ... ... ...
86301.0 Nairobi Region Ngong Road New P+I Current NaN 2021-02-28
86303.0 Nairobi Region Kahawa New Pending Renewal NaN 2021-02-28
86308.0 Nairobi Region Kahawa New P+I Current NaN 2021-02-28
86313.0 North Rift Region Eldoret2 New Pending Renewal NaN 2021-02-28
86654.0 Western Region Bondo New Pending Renewal NaN 2021-02-28

40020 rows × 6 columns

EXERCISE

For the dataframe (counties) we created, sort the data frame as follows:

  1. name: descending
  2. subCounty: ascending

Tip

you can use ascending=[False, True].

4.4. Duplicates

Duplicates are rows that are identical to each other, i.e., they have the same values. To check for duplicates we use df.duplicated, which returns a boolean series; we can use the boolean series to filter out the repeating values, and we can use df.drop_duplicates to remove any duplicates in the data frame.

Important

The first occurrence won't be removed by default.

To check for duplicates, we will first reset the index; this will make Client ID a column again, as it was before.

Code
status.reset_index(inplace=True)
Code
status.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
40015    False
40016    False
40017    False
40018    False
40019    False
Length: 40020, dtype: bool
Code
status[status.duplicated()]
Client ID Region Office Client Status Loan Status Client Score Status Date
5051 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5052 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5053 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
5054 11331.0 Nairobi Region Ngong Road Dormant Penalties 23.84 2018-02-28
9624 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9625 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9626 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
9627 21440.0 North Rift Region Eldoret1 Dormant Penalties 16.01 2018-08-31
11194 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11195 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11196 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
11197 24807.0 North Rift Region Kitale Dormant Penalties 36.94 2019-02-28
22397 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22398 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22399 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
22400 49227.0 Coast Region Voi Dormant Penalties 0.24 2020-02-29
35401 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35402 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35403 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30
35404 76644.0 Western Region Kisumu Active Penalties 48.92 2020-09-30

EXERCISE

For the dataframe (counties), are there any duplicates in that dataframe?

4.5. Drop Duplicates

df.drop_duplicates removes duplicates, keeping the first occurrence by default; it also accepts keep and subset arguments, as sketched below.
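
A minimal sketch of those options (illustration only; below we stick with the defaults):

Code
# keep the last occurrence of each duplicate instead of the first
status.drop_duplicates(keep='last')
# consider only some columns when looking for duplicates
status.drop_duplicates(subset=['Client ID', 'Status Date'])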

Code
status.drop_duplicates(inplace=True)
status.shape
(40000, 7)

4.6. Missing Values

Missing values are mostly dealt with during data cleaning. Missing values are, well, values that are missing! In this session we won't go deeper into how to handle missing values. In pandas, to check for missing values we use df.isnull() or df.isna, and to remove missing values we use df.dropna.

Code
status.isna()
Client ID Region Office Client Status Loan Status Client Score Status Date
0 False False False False False False False
1 False True False False False True False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
40015 False False False False False True False
40016 False False False False False True False
40017 False False False False False True False
40018 False False False False False True False
40019 False False False False False True False

40000 rows × 7 columns

That returns a dataframe of booleans showing whether each value is null.

Code
status.isnull().sum()
Client ID           0
Region             80
Office              0
Client Status       0
Loan Status         0
Client Score     3300
Status Date         0
dtype: int64

df.isnull().sum() is the most widely used way of checking the number of missing values per column.
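
A minimal sketch of acting on missing values (illustration only; we leave status unchanged here):

Code
# drop rows that are missing a Client Score
status.dropna(subset=['Client Score'])
# or fill missing scores with the column median
status['Client Score'].fillna(status['Client Score'].median())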

5. Advanced Data Wrangling


5.1. Value counts

As a data scientist it is good to check the counts of unique values; this is mainly used to check for imbalance at later stages. To check the counts of unique values, we use df[col].value_counts(), where col is the column name.

Note

df[col] selects a column and returns a series, where col is the column name

Code
status['Region'].value_counts()
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
Name: Region, dtype: int64

By default, .value_counts drops missing values; we can include the count of missing values by setting dropna=False

Code
status['Region'].value_counts(dropna=False)
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
NaN                     80
Name: Region, dtype: int64

EXERCISE

For the dataframe (counties), show the value counts of the name variable.

5.2. Data Filtration

In pandas:

  • we can use comparison operators to filter data meeting a certain condition,
  • to filter columns we can use df[[col1, ..., coln]],
  • to filter rows based on their index we can use iloc[i] or loc[label] for integer-based or label-based indexing respectively.
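
A quick sketch of column selection and index-based selection on our status frame:

Code
status[['Region', 'Office']]   # filter two columns
status.iloc[0]                 # first row, by integer position
status.loc[0, 'Region']        # row with index label 0, column 'Region'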

Code
status[status['Region'].isnull()]
Client ID Region Office Client Status Loan Status Client Score Status Date
1 7.0 NaN Head Office Inactive Pending Renewal NaN 2015-02-28
1993 4700.0 NaN Head Office Inactive Pending Renewal 15.15 2020-09-30
2156 5064.0 NaN Fraud Dormant Penalties 23.38 2020-08-31
2279 5322.0 NaN Fraud Dormant Penalties 23.96 2020-08-31
2407 5621.0 NaN Fraud Dormant Penalties 13.77 2020-08-31
... ... ... ... ... ... ... ...
38004 82121.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38100 82284.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38227 82560.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38296 82709.0 NaN Fraud Active P+I In Default NaN 2020-12-31
38503 83150.0 NaN Fraud Active P+I In Default NaN 2020-12-31

80 rows × 7 columns

We can use comparison operators for filtering values that meet a certain condition:

  • > - greater than
  • < - less than
  • >= - greater than or equal to
  • <= - less than or equal to
  • == - equal to
  • != - not equal to

Code
status['Client Score']>90
0        False
1        False
2        False
3        False
4        False
         ...  
40015    False
40016    False
40017    False
40018    False
40019    False
Name: Client Score, Length: 40000, dtype: bool

df[col]>value returns a boolean series; we can use the boolean series to filter out the rows of the dataframe that do not meet the condition.

Code
status[status['Client Score']>90]
Client ID Region Office Client Status Loan Status Client Score Status Date
2641 6150.0 South Rift Region Kericho Active Pending Renewal 90.59 2021-02-28
8535 19100.0 South Rift Region Litein Active P+I Current 90.85 2021-02-28
12657 27981.0 South Rift AgriBiz Ndunyu Njeru Active Pending Renewal 90.58 2021-02-28
12876 28468.0 North Rift Region Kapenguria Active P+I Current 92.54 2021-02-28
14405 31851.0 Nairobi Region Rongai Active P+I Current 91.34 2021-02-28
22443 49324.0 South Rift AgriBiz Maili Nne Active P+I Current 91.21 2021-02-28
26438 57810.0 South Rift AgriBiz Maili Nne Active P+I In Default 90.92 2021-02-28
29804 64738.0 Eastern Region Mwingi Active P+I Current 92.02 2021-02-28
30564 66344.0 Central AgriBiz Githunguri Active P+I Current 90.33 2021-02-28
32576 70685.0 South Rift AgriBiz Londiani Active P+I Current 90.77 2021-02-28
34593 74947.0 South Rift AgriBiz Ndunyu Njeru Active P+I Current 92.17 2021-02-28
35239 76301.0 Nairobi Region Kiambu Active P+I Current 90.04 2021-02-28

5.2.1. Multiple Conditions Filtering

We can use logical operators to combine two or more expressions. In pandas we can use the following operators:

  • &: for logical and
  • |: for logical or
  • ~: for logical not
Important

Parentheses are very important: enclose each expression in parentheses (.....)
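
The | and ~ operators work the same way; a quick sketch on our data:

Code
# clients in either of two regions
status[(status['Region'] == 'Coast Region') | (status['Region'] == 'Eastern Region')]
# everything except Dormant clients
status[~(status['Client Status'] == 'Dormant')]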

Code
status[(status['Client Score']>90) & (status['Loan Status']=='P+I In Default')]
Client ID Region Office Client Status Loan Status Client Score Status Date
26438 57810.0 South Rift AgriBiz Maili Nne Active P+I In Default 90.92 2021-02-28

EXERCISE

For the dataframe (counties), filter the data to select the rows where name is equal to 'Nakuru' and subCounty is not equal to 'Naivasha'.

5.3. Grouping Data

Grouping data is a critical step in data wrangling. In pandas we can group a DataFrame using a mapper or by a series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Code
status.groupby('Region')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2757be5a00>

The groupby returns an object; we can use it to perform operations on the groups.

Code
status.groupby('Region').agg('sum')
Client ID Client Score
Region
Central AgriBiz 36163955.0 20355.84
Coast Region 223714866.0 157343.10
Eastern Region 151738957.0 61644.78
Nairobi Region 306914995.0 309115.63
North Rift Region 326377934.0 270228.72
South Rift AgriBiz 201143618.0 168818.01
South Rift Region 270779301.0 211088.41
Western Region 229155555.0 107269.71
Code
status.groupby('Region').agg({'Client Score':np.sum})
Client Score
Region
Central AgriBiz 20355.84
Coast Region 157343.10
Eastern Region 61644.78
Nairobi Region 309115.63
North Rift Region 270228.72
South Rift AgriBiz 168818.01
South Rift Region 211088.41
Western Region 107269.71
Code
status.groupby('Region').agg('mean')
Client ID Client Score
Region
Central AgriBiz 55127.980183 49.527591
Coast Region 47997.182150 35.792334
Eastern Region 67379.643428 31.244187
Nairobi Region 31139.914265 32.839226
North Rift Region 38370.319069 34.059582
South Rift AgriBiz 52272.250000 53.729475
South Rift Region 39932.060316 33.389499
Western Region 68201.058036 35.205025

To check the size of each group, we can use the size() method of a groupby object.

Code
status.groupby('Region').size()
Region
Central AgriBiz        656
Coast Region          4661
Eastern Region        2252
Nairobi Region        9856
North Rift Region     8506
South Rift AgriBiz    3848
South Rift Region     6781
Western Region        3360
dtype: int64
Code
status.groupby('Region')['Office'].nunique()
Region
Central AgriBiz       2
Coast Region          8
Eastern Region        7
Nairobi Region        8
North Rift Region     9
South Rift AgriBiz    6
South Rift Region     8
Western Region        7
Name: Office, dtype: int64

The pivot table (derived from Excel) is an advanced pandas grouping method. Here you can decide on the index, columns, values and the aggregate functions. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

The aggfunc by default is mean. You can pass string aggregates, e.g. 'mean', 'sum', etc., or aggregate functions such as np.mean, np.sum, etc.

Code
status_pivot=pd.pivot_table(status, index='Office', columns='Region', values='Client Score', aggfunc='mean')
status_pivot.head()
Region Central AgriBiz Coast Region Eastern Region Nairobi Region North Rift Region South Rift AgriBiz South Rift Region Western Region
Office
Bomet NaN NaN NaN NaN NaN NaN 33.836232 NaN
Bondo NaN NaN NaN NaN NaN NaN NaN 54.472500
Bungoma NaN NaN NaN NaN NaN NaN NaN 32.724249
Busia NaN NaN NaN NaN NaN NaN NaN 36.784428
Changamwe NaN 36.532172 NaN NaN NaN NaN NaN NaN

Setting margins=True will add a row aggregate and a column aggregate (the All row and column).

Code
pd.pivot_table(status, index='Status Date', columns='Region', values='Client Score', aggfunc=np.mean, margins=True)
Region Central AgriBiz Coast Region Eastern Region Nairobi Region North Rift Region South Rift AgriBiz South Rift Region Western Region All
Status Date
2016-01-31 NaN NaN NaN 44.733333 NaN NaN NaN NaN 44.733333
2016-02-29 NaN NaN NaN 35.430541 NaN NaN 37.207143 NaN 35.713182
2016-03-31 NaN NaN NaN 34.614533 NaN NaN 33.599565 NaN 34.376327
2016-04-30 NaN NaN NaN 30.816381 NaN NaN 33.833704 NaN 31.433561
2016-05-31 NaN NaN NaN 37.960885 NaN NaN 41.225000 NaN 38.681241
... ... ... ... ... ... ... ... ... ...
2020-11-30 47.642000 34.752153 31.791702 34.263287 38.249737 50.445581 33.603857 32.730902 34.959187
2020-12-31 43.470714 37.190167 30.627154 35.245165 39.696990 49.850078 34.649405 38.350320 37.861864
2021-01-31 50.570377 36.696154 34.157852 38.617857 41.534072 54.294533 36.930952 35.335667 41.915679
2021-02-28 55.504430 48.855651 41.270897 47.413198 49.778757 56.811797 47.444950 45.280237 49.599350
All 49.527591 35.792334 31.244187 32.839226 34.059582 53.729475 33.389499 35.205025 35.642344

63 rows × 9 columns

5.4. Combining Data

There are various ways of combining data sets:

  • append - adds rows to the end of the caller
  • concatenate - stacks data frames along an axis (see the pd.concat sketch below)
  • merge - inner join, left join, right join

The type of merge to be performed:

  • left: use only keys from the left frame, similar to a SQL left outer join; preserves key order.
  • right: use only keys from the right frame, similar to a SQL right outer join; preserves key order.
  • outer: use the union of keys from both frames, similar to a SQL full outer join; sorts keys lexicographically.
  • inner: use the intersection of keys from both frames, similar to a SQL inner join; preserves the order of the left keys.
  • cross: creates the cartesian product from both frames; preserves the order of the left keys.
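
Before merging, a minimal pd.concat sketch for the stacking case (illustration only):

Code
# split status into two pieces and stack them back together
part_a = status.iloc[:100]
part_b = status.iloc[100:200]
stacked = pd.concat([part_a, part_b])  # 200 rows, same columns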

We use df.merge/pd.merge to import columns from other datasets.

Code
counties = pd.read_csv('data/kenya_subcounty.csv')
counties.head()
name subCounty
0 Mombasa Changamwe
1 Mombasa Jomvu
2 Mombasa Kisauni
3 Mombasa Likoni
4 Mombasa Mvita

We can merge status and counties using Office and subCounty (this is because that is the key present in both data frames). Since we have different column names in the two data frames, we have to use left_on and right_on to specify the columns.

If the key column has the same name in both data frames, we can use the on argument instead.
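
A hypothetical sketch (df1 and df2 are made-up frames sharing a subCounty column, which is not the case for our data):

Code
# hypothetical frames that share the key column name
df1 = pd.DataFrame({'subCounty': ['Naivasha'], 'clients': [10]})
df2 = pd.DataFrame({'subCounty': ['Naivasha'], 'county': ['Nakuru']})
df1.merge(df2, on='subCounty')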

Code
combined_ = status.merge(counties, left_on='Office', right_on='subCounty')
combined_.head()
Client ID Region Office Client Status Loan Status Client Score Status Date name subCounty
0 33.0 South Rift Region Naivasha Dormant Penalties 40.00 2020-06-30 Nakuru Naivasha
1 34.0 South Rift Region Naivasha Dormant Pending Renewal 40.23 2016-03-31 Nakuru Naivasha
2 206.0 South Rift Region Naivasha Active P+I Current 76.87 2021-02-28 Nakuru Naivasha
3 615.0 South Rift Region Naivasha Active P+I Current 58.43 2021-02-28 Nakuru Naivasha
4 702.0 South Rift Region Naivasha Dormant Pending Renewal 43.73 2016-08-31 Nakuru Naivasha

Another method is using the pd.merge function:

Code
combined_2 = pd.merge(status, counties, left_on='Office', right_on='subCounty')
combined_2.head()
Client ID Region Office Client Status Loan Status Client Score Status Date name subCounty
0 33.0 South Rift Region Naivasha Dormant Penalties 40.00 2020-06-30 Nakuru Naivasha
1 34.0 South Rift Region Naivasha Dormant Pending Renewal 40.23 2016-03-31 Nakuru Naivasha
2 206.0 South Rift Region Naivasha Active P+I Current 76.87 2021-02-28 Nakuru Naivasha
3 615.0 South Rift Region Naivasha Active P+I Current 58.43 2021-02-28 Nakuru Naivasha
4 702.0 South Rift Region Naivasha Dormant Pending Renewal 43.73 2016-08-31 Nakuru Naivasha

EXERCISE

For the dataframe (counties), do a full outer join with the status dataframe. How many rows does it have?

6. Exporting Data

After finishing data wrangling, it is good to export the data to the right file format for further use. Let us do some final tweaks and export the final dataset as a CSV.

Code
# We will use the newly combined dataset

# 1. Rename columns and remove unwanted columns
combined_2.rename(columns={'name':'county'}, inplace=True)
combined_2.drop(columns=['subCounty'], inplace=True)
combined_2.dropna(inplace=True)
combined_3 = pd.pivot_table(combined_2, index='county', columns='Loan Status', values='Client Score', aggfunc=np.median)
display(combined_3.head())
combined_3.to_csv('combined.csv', index=False)
Loan Status P+I Current P+I In Default Penalties Pending Renewal
county
Busia 64.19 47.110 32.130 41.91
Kericho 64.92 61.170 45.910 62.42
Kiambu 60.76 48.570 32.665 50.66
Kilifi 67.01 54.220 36.005 52.41
Laikipia NaN 44.435 26.275 52.40
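
CSV is not the only option; a quick sketch of two other writers pandas provides (to_excel assumes an engine such as openpyxl is installed):

Code
combined_3.to_excel('combined.xlsx')  # Excel workbook
combined_3.to_json('combined.json')   # JSON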

7. More Resources

  1. https://pandas.pydata.org/pandas-docs/stable/index.html
  2. https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

About the Speaker

Victor

GitHub Twitter LinkedIn Sponsors

Thank you for attending today's session.