```python
import pandas as pd
import numpy as np
```
Victor Omondi
June 11, 2021
Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, data wrangling reportedly costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling.
Hello and welcome to Data Wrangling in Python. Depending on how you accessed it, this is a notebook, presentation, or blog post used in a Data Science CIT club event. Here is how to work with it and get to the datasets and the entire project:

- Github: clone the entire project; instructions are in the README.md file.
- Blog: click on any of the banners shown at the top of the project.

We will work with two datasets:

- client status: shows the client status and loan status at the end of the month their status remained the same.
- kenya_subcounties: shows the county name and its sub-counties.

Before starting data wrangling, we first need to install the necessary libraries:

- Pandas: the popular library used for data analysis and data transformation.
- numpy: a popular library for scientific computing.
We must import these libraries first; by convention, pandas is aliased to `pd` and numpy to `np` during importing.
For viewing purposes, some datasets may have very long or very many columns, so we need to make sure pandas will show all of them.
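A common setup for this (the exact options used in the original session are assumed here) is:

```python
# Show every column instead of truncating wide dataframes.
pd.set_option("display.max_columns", None)
```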
Data comes in very many formats, but CSV files are the most popular among data scientists, and we will use CSV files for our data. In pandas, we use `pd.read_csv` to create a dataframe from a CSV file: you pass the CSV location (a path or URL, as a string or a variable) as the first argument, e.g. `pd.read_csv("file.csv")`.
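For example, to load the client status data into a dataframe called `status` (the file path below is a placeholder; point it at the client status CSV in the project's data folder):

```python
# Placeholder path: substitute the actual location of the client status CSV.
status = pd.read_csv("data/client_status.csv")
```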
Create a dataframe known as `counties`; the CSV location is https://raw.githubusercontent.com/VictorOmondi1997/data_wrangling_cit/master/data/kenya_subcounty.csv
Before doing data wrangling or any data cleaning, it is important as a data scientist/analyst to explore your data. This will help you see what the data looks like and how many rows and columns it has. To check the first few rows we use `df.head(n)`, where `df` is the dataframe name and `n` is the number of rows you want returned. The same applies to `df.tail` and `df.sample`.
By default, `head()` without any argument returns the first 5 rows.
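Calling it on our `status` dataframe:

```python
status.head()
```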
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
0 | South Rift AgriBiz | Ndunyu Njeru | 85119.0 | New | P+I Current | NaN | 2021-02-28 |
1 | South Rift Region | Nakuru | 2005.0 | Dormant | Penalties | 44.83 | 2018-09-30 |
2 | Eastern Region | Wote | 69865.0 | Dormant | Penalties | 50.65 | 2020-11-30 |
3 | Western Region | Bungoma | 55339.0 | Dormant | Penalties | 60.71 | 2020-07-31 |
4 | South Rift Region | Kericho | 38049.0 | Dormant | Penalties | 0.15 | 2019-06-30 |
We can also pass the number of rows we want.
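Judging from the ten rows shown below:

```python
status.head(10)
```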
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
0 | South Rift AgriBiz | Ndunyu Njeru | 85119.0 | New | P+I Current | NaN | 2021-02-28 |
1 | South Rift Region | Nakuru | 2005.0 | Dormant | Penalties | 44.83 | 2018-09-30 |
2 | Eastern Region | Wote | 69865.0 | Dormant | Penalties | 50.65 | 2020-11-30 |
3 | Western Region | Bungoma | 55339.0 | Dormant | Penalties | 60.71 | 2020-07-31 |
4 | South Rift Region | Kericho | 38049.0 | Dormant | Penalties | 0.15 | 2019-06-30 |
5 | South Rift Region | Narok | 33391.0 | Dormant | Penalties | 0.94 | 2019-04-30 |
6 | North Rift Region | Iten | 53675.0 | Dormant | Penalties | 35.24 | 2020-05-31 |
7 | North Rift Region | Eldoret3 | 16893.0 | Active | Penalties | 26.21 | 2020-12-31 |
8 | Nairobi Region | Kiambu | 35681.0 | Dormant | Penalties | 46.78 | 2019-05-31 |
9 | North Rift Region | Kitale | 24807.0 | Dormant | Penalties | 36.94 | 2019-02-28 |
For the dataframe we created (`counties`), return the first 15 rows.
The `tail` method returns the last rows of a dataframe. By default it returns the last 5 rows, but you can specify the number of last rows you need returned.
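With no argument:

```python
status.tail()
```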
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
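And with an explicit count of 10:

```python
status.tail(10)
```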
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
40014 | South Rift AgriBiz | Maili Nne | 48513.0 | Active | Penalties | 56.17 | 2021-01-31 |
40015 | North Rift Region | Iten | 42221.0 | Dormant | Penalties | 44.50 | 2019-08-31 |
40016 | Western Region | Mbale | 54391.0 | Active | P+I In Default | 56.98 | 2021-02-28 |
40017 | Coast Region | Mariakani | 67944.0 | Dormant | P+I In Default | 0.15 | 2020-10-31 |
40018 | North Rift Region | Kitale | 6888.0 | Dormant | Penalties | 32.42 | 2020-11-30 |
40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
For the dataframe (`counties`) we created, return the last 15 rows.
`df.sample` is mainly used for sampling techniques; it returns a random sample of items from an axis of the object.
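Judging from the five rows shown below, the call here was something like:

```python
status.sample(5)
```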
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
24131 | North Rift Region | Nandi Hills | 39393.0 | Dormant | Penalties | 43.14 | 2020-01-31 |
2638 | Western Region | Bungoma | 71754.0 | Dormant | Penalties | 55.37 | 2021-02-28 |
3398 | North Rift Region | Eldoret1 | 32182.0 | Dormant | Penalties | 4.35 | 2019-04-30 |
24100 | Coast Region | Voi | 73956.0 | Dormant | Penalties | 40.49 | 2021-02-28 |
31920 | North Rift Region | Kapsabet | 6664.0 | Dormant | Penalties | 34.83 | 2019-03-31 |
For the dataframe (`counties`) we created, return a sample of 10 rows.
In pandas, there are various ways to find out how many columns (variables) and rows (observations) a dataframe has:

1. `len(df)` - returns the number of rows
2. `df.shape` - returns the number of rows and columns as a tuple
3. `df.info()` - prints a summary of the dataframe: its columns, their non-null counts, and their data types
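A quick sketch on our `status` dataframe:

```python
len(status)   # number of rows
status.shape  # (number of rows, number of columns)
```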
For the dataframe (`counties`) we created, how many rows and columns does it have?
`df.info` prints a summary of the data frame, i.e., the number of rows, the columns with their non-null counts, and their dtypes.
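For our data:

```python
status.info()
```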
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40024 entries, 0 to 40023
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Region         39944 non-null  object
 1   Office         40020 non-null  object
 2   Client ID      40020 non-null  float64
 3   Client Status  40020 non-null  object
 4   Loan Status    40020 non-null  object
 5   Client Score   36720 non-null  float64
 6   Status Date    40020 non-null  object
dtypes: float64(2), object(5)
memory usage: 2.1+ MB
```
For the dataframe (`counties`) we created, how many missing values are there in the `name` column?
We use `df.describe()` to get summary statistics; by default it returns the summary statistics of the numerical columns.
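On our data:

```python
status.describe()
```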
 | Client ID | Client Score
---|---|---
count | 40020.000000 | 36720.000000 |
mean | 43747.996652 | 35.593555 |
std | 24860.118889 | 20.538354 |
min | 2.000000 | 0.150000 |
25% | 22291.750000 | 21.887500 |
50% | 44105.000000 | 36.700000 |
75% | 65170.500000 | 49.822500 |
max | 86654.000000 | 92.540000 |
To show all summary statistics, including those for objects (strings), you can pass `include="all"` to `df.describe`.
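For example:

```python
status.describe(include="all")
```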
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
count | 39944 | 40020 | 40020.000000 | 40020 | 40020 | 36720.000000 | 40020 |
unique | 12 | 57 | NaN | 4 | 4 | NaN | 67 |
top | Nairobi Region | Embakasi | NaN | Dormant | Penalties | NaN | 2021-02-28 |
freq | 9860 | 1561 | NaN | 30535 | 28071 | NaN | 8430 |
mean | NaN | NaN | 43747.996652 | NaN | NaN | 35.593555 | NaN |
std | NaN | NaN | 24860.118889 | NaN | NaN | 20.538354 | NaN |
min | NaN | NaN | 2.000000 | NaN | NaN | 0.150000 | NaN |
25% | NaN | NaN | 22291.750000 | NaN | NaN | 21.887500 | NaN |
50% | NaN | NaN | 44105.000000 | NaN | NaN | 36.700000 | NaN |
75% | NaN | NaN | 65170.500000 | NaN | NaN | 49.822500 | NaN |
max | NaN | NaN | 86654.000000 | NaN | NaN | 92.540000 | NaN |
For the dataframe (`counties`) we created, how many unique sub-counties are there in Kenya?
From the tail we looked at above, we saw that the last 4 rows hold metadata about how the data was generated. That metadata is not important for our analysis, so we will remove it first.
Let’s check some of the irrelevant data.
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
40020 | Report Title: Client Status (Monthly) | NaN | NaN | NaN | NaN | NaN | NaN |
40021 | Generated By: Victor Omondi | NaN | NaN | NaN | NaN | NaN | NaN |
40022 | Generated Time: 2021-03-26 02:44:32 | NaN | NaN | NaN | NaN | NaN | NaN |
40023 | Report Parameters:Office: Head Office month: 2 year: 2021 | NaN | NaN | NaN | NaN | NaN | NaN |
We can drop these last 4 rows by using `df.drop` and passing the `index` argument a list of the indices we want to drop.
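A sketch, using the index labels visible in the tail above:

```python
# Drop the four metadata rows at the bottom of the dataframe.
status = status.drop(index=[40020, 40021, 40022, 40023])
status.tail()
```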
 | Region | Office | Client ID | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
40015 | North Rift Region | Iten | 42221.0 | Dormant | Penalties | 44.50 | 2019-08-31 |
40016 | Western Region | Mbale | 54391.0 | Active | P+I In Default | 56.98 | 2021-02-28 |
40017 | Coast Region | Mariakani | 67944.0 | Dormant | P+I In Default | 0.15 | 2020-10-31 |
40018 | North Rift Region | Kitale | 6888.0 | Dormant | Penalties | 32.42 | 2020-11-30 |
40019 | Eastern Region | Matuu | 66991.0 | Dormant | Penalties | 33.38 | 2020-09-30 |
In our data frame, the `Client ID` column uniquely identifies rows. Since we won't be doing any analysis on this column, we can set `Client ID` to be the index of our dataframe.
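A sketch:

```python
status = status.set_index("Client ID")
status.head()
```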
Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---
85119.0 | South Rift AgriBiz | Ndunyu Njeru | New | P+I Current | NaN | 2021-02-28 |
2005.0 | South Rift Region | Nakuru | Dormant | Penalties | 44.83 | 2018-09-30 |
69865.0 | Eastern Region | Wote | Dormant | Penalties | 50.65 | 2020-11-30 |
55339.0 | Western Region | Bungoma | Dormant | Penalties | 60.71 | 2020-07-31 |
38049.0 | South Rift Region | Kericho | Dormant | Penalties | 0.15 | 2019-06-30 |
Sorting is arranging the values/index of the data in either ascending or descending order. To sort by index we use `df.sort_index`; to sort values in columns we use `df.sort_values(cols)`, where `cols` is a column name or a list of column names whose values we want to sort by.
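Sorting by the `Client ID` index:

```python
status.sort_index().head()
```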
Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---
2.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 37.98 | 2020-04-30 |
7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
10.0 | South Rift Region | Nakuru | Active | Penalties | 42.19 | 2020-12-31 |
12.0 | South Rift Region | Nakuru | Active | P+I Current | 65.34 | 2021-02-28 |
13.0 | South Rift Region | Nakuru | Dormant | Pending Renewal | 57.59 | 2016-02-29 |
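The next output appears to be sorted by `Status Date`:

```python
status.sort_values("Status Date").head()
```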
Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---
7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
2598.0 | Nairobi Region | Rongai | Inactive | Pending Renewal | NaN | 2015-02-28 |
1201.0 | Nairobi Region | Ngong Road | Inactive | Pending Renewal | NaN | 2015-02-28 |
2859.0 | South Rift Region | Naivasha | Inactive | Pending Renewal | NaN | 2015-06-30 |
1058.0 | Nairobi Region | Kahawa | Dormant | Pending Renewal | NaN | 2015-10-31 |
By default the sorting is in ascending order; we can also sort in descending order by passing the `ascending=False` argument.
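For example, sorting by `Client Score` in descending order (assumed from the output below):

```python
status.sort_values("Client Score", ascending=False)
```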
Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---
28468.0 | North Rift Region | Kapenguria | Active | P+I Current | 92.54 | 2021-02-28 |
74947.0 | South Rift AgriBiz | Ndunyu Njeru | Active | P+I Current | 92.17 | 2021-02-28 |
64738.0 | Eastern Region | Mwingi | Active | P+I Current | 92.02 | 2021-02-28 |
31851.0 | Nairobi Region | Rongai | Active | P+I Current | 91.34 | 2021-02-28 |
49324.0 | South Rift AgriBiz | Maili Nne | Active | P+I Current | 91.21 | 2021-02-28 |
... | ... | ... | ... | ... | ... | ... |
86301.0 | Nairobi Region | Ngong Road | New | P+I Current | NaN | 2021-02-28 |
86303.0 | Nairobi Region | Kahawa | New | Pending Renewal | NaN | 2021-02-28 |
86308.0 | Nairobi Region | Kahawa | New | P+I Current | NaN | 2021-02-28 |
86313.0 | North Rift Region | Eldoret2 | New | Pending Renewal | NaN | 2021-02-28 |
86654.0 | Western Region | Bondo | New | Pending Renewal | NaN | 2021-02-28 |
40020 rows × 6 columns
For the dataframe (`counties`) we created, sort the data frame as follows:

1. `name`: descending
2. `subCounty`: ascending
Duplicates are rows that are identical to each other, i.e., they have the same values. To check for duplicates we use `df.duplicated`; this returns a boolean series, which we can use to filter out the repeating rows. We can then use `df.drop_duplicates` to remove any duplicates in the data frame.
To check for duplicates, we will first reset the index; this makes `Client ID` a column again, as it was before.
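A sketch:

```python
status = status.reset_index()  # Client ID becomes a column again
status.duplicated()
```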
```
0        False
1        False
2        False
3        False
4        False
         ...
40015    False
40016    False
40017    False
40018    False
40019    False
Length: 40020, dtype: bool
```
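Passing `keep=False` marks every copy of a repeated row, which lets us display all of them (assumed from the output below):

```python
status[status.duplicated(keep=False)]
```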
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
5051 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
5052 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
5053 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
5054 | 11331.0 | Nairobi Region | Ngong Road | Dormant | Penalties | 23.84 | 2018-02-28 |
9624 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
9625 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
9626 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
9627 | 21440.0 | North Rift Region | Eldoret1 | Dormant | Penalties | 16.01 | 2018-08-31 |
11194 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
11195 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
11196 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
11197 | 24807.0 | North Rift Region | Kitale | Dormant | Penalties | 36.94 | 2019-02-28 |
22397 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
22398 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
22399 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
22400 | 49227.0 | Coast Region | Voi | Dormant | Penalties | 0.24 | 2020-02-29 |
35401 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
35402 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
35403 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
35404 | 76644.0 | Western Region | Kisumu | Active | Penalties | 48.92 | 2020-09-30 |
For the dataframe (`counties`), are there any duplicates?
`df.drop_duplicates` removes duplicates, keeping the first occurrence by default.
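Applied to our data:

```python
status = status.drop_duplicates()
```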
Handling missing values is a major part of data cleaning. Missing values are values that are missing, alaaa! In this session we won't go deeper into how to handle them. In pandas, to check for missing values we use `df.isnull()` or `df.isna`; to remove missing values we use `df.dropna`.
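For example:

```python
status.isnull()
```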
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
0 | False | False | False | False | False | False | False |
1 | False | True | False | False | False | True | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
40015 | False | False | False | False | False | True | False |
40016 | False | False | False | False | False | True | False |
40017 | False | False | False | False | False | True | False |
40018 | False | False | False | False | False | True | False |
40019 | False | False | False | False | False | True | False |
40000 rows × 7 columns
That returns a dataframe of booleans showing whether each value is null.
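Chaining `.sum()` counts the missing values per column:

```python
status.isnull().sum()
```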
```
Client ID           0
Region             80
Office              0
Client Status       0
Loan Status         0
Client Score     3300
Status Date         0
dtype: int64
```
`df.isnull().sum()` is the most widely used way of checking the number of missing values per column.
As a data scientist it is good to check the counts of unique values; this is mainly used to check for imbalance at later stages. To check the number of unique values, we use `df[col].value_counts()`, where `col` is the column name.
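For the `Region` column:

```python
status["Region"].value_counts()
```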
```
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
Name: Region, dtype: int64
```
By default, `.value_counts` drops missing values; we can include their count by setting `dropna=False`.
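For example:

```python
status["Region"].value_counts(dropna=False)
```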
```
Nairobi Region        9856
North Rift Region     8506
South Rift Region     6781
Coast Region          4661
South Rift AgriBiz    3848
Western Region        3360
Eastern Region        2252
Central AgriBiz        656
NaN                     80
Name: Region, dtype: int64
```
For the dataframe (`counties`), show the value counts of the `name` variable.
In pandas:

- we can use comparison operators to filter data meeting a certain condition,
- to filter columns we can use `df[[col1, ..., coln]]`,
- to filter rows based on their index we can use `iloc[i]` or `loc[label]` for integer-based or label-based indexing respectively.
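For instance, the 80-row output below appears to select the rows whose `Region` is missing:

```python
status[status["Region"].isnull()]
```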
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
1 | 7.0 | NaN | Head Office | Inactive | Pending Renewal | NaN | 2015-02-28 |
1993 | 4700.0 | NaN | Head Office | Inactive | Pending Renewal | 15.15 | 2020-09-30 |
2156 | 5064.0 | NaN | Fraud | Dormant | Penalties | 23.38 | 2020-08-31 |
2279 | 5322.0 | NaN | Fraud | Dormant | Penalties | 23.96 | 2020-08-31 |
2407 | 5621.0 | NaN | Fraud | Dormant | Penalties | 13.77 | 2020-08-31 |
... | ... | ... | ... | ... | ... | ... | ... |
38004 | 82121.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
38100 | 82284.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
38227 | 82560.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
38296 | 82709.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
38503 | 83150.0 | NaN | Fraud | Active | P+I In Default | NaN | 2020-12-31 |
80 rows × 7 columns
We can use comparison operators for filtering values that meet a certain condition:

- `>` - greater than
- `<` - less than
- `>=` - greater than or equal to
- `<=` - less than or equal to
- `==` - equal to
- `!=` - not equal to
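Judging from the boolean series below, the comparison here was against a `Client Score` of 90:

```python
status["Client Score"] > 90
```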
```
0        False
1        False
2        False
3        False
4        False
         ...
40015    False
40016    False
40017    False
40018    False
40019    False
Name: Client Score, Length: 40000, dtype: bool
```
`df[col] > value` returns a boolean series; we can use that boolean series to filter out the rows in the dataframe that do not meet the condition.
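For example:

```python
status[status["Client Score"] > 90]
```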
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date
---|---|---|---|---|---|---|---
2641 | 6150.0 | South Rift Region | Kericho | Active | Pending Renewal | 90.59 | 2021-02-28 |
8535 | 19100.0 | South Rift Region | Litein | Active | P+I Current | 90.85 | 2021-02-28 |
12657 | 27981.0 | South Rift AgriBiz | Ndunyu Njeru | Active | Pending Renewal | 90.58 | 2021-02-28 |
12876 | 28468.0 | North Rift Region | Kapenguria | Active | P+I Current | 92.54 | 2021-02-28 |
14405 | 31851.0 | Nairobi Region | Rongai | Active | P+I Current | 91.34 | 2021-02-28 |
22443 | 49324.0 | South Rift AgriBiz | Maili Nne | Active | P+I Current | 91.21 | 2021-02-28 |
26438 | 57810.0 | South Rift AgriBiz | Maili Nne | Active | P+I In Default | 90.92 | 2021-02-28 |
29804 | 64738.0 | Eastern Region | Mwingi | Active | P+I Current | 92.02 | 2021-02-28 |
30564 | 66344.0 | Central AgriBiz | Githunguri | Active | P+I Current | 90.33 | 2021-02-28 |
32576 | 70685.0 | South Rift AgriBiz | Londiani | Active | P+I Current | 90.77 | 2021-02-28 |
34593 | 74947.0 | South Rift AgriBiz | Ndunyu Njeru | Active | P+I Current | 92.17 | 2021-02-28 |
35239 | 76301.0 | Nairobi Region | Kiambu | Active | P+I Current | 90.04 | 2021-02-28 |
We can use logical operators to combine two or more conditions. In pandas we can use the following operators:

- `&` : logical and
- `|` : logical or
- `~` : logical not

For the dataframe (`counties`), filter the data to select the rows where `name` is equal to `'Nakuru'` and `subCounty` is not equal to `'Naivasha'`.
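One way to express that filter (a sketch):

```python
counties[(counties["name"] == "Nakuru") & ~(counties["subCounty"] == "Naivasha")]
```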
Grouping data is a critical step in data wrangling. In pandas we can group a DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
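Grouping the status data by `Region` (the variable name `region_group` is an assumption here):

```python
region_group = status.groupby("Region")
region_group
```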
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2757be5a00>
```
The groupby returns an object; we can use this object to perform operations on the groups.
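The three outputs below look like per-region sums and means; a sketch (`numeric_only` keeps newer pandas versions from trying to aggregate the text columns):

```python
region_group.sum(numeric_only=True)   # sums of the numeric columns
region_group[["Client Score"]].sum()  # sum of a single column
region_group.mean(numeric_only=True)  # per-region means
```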
Region | Client ID | Client Score
---|---|---
Central AgriBiz | 36163955.0 | 20355.84 |
Coast Region | 223714866.0 | 157343.10 |
Eastern Region | 151738957.0 | 61644.78 |
Nairobi Region | 306914995.0 | 309115.63 |
North Rift Region | 326377934.0 | 270228.72 |
South Rift AgriBiz | 201143618.0 | 168818.01 |
South Rift Region | 270779301.0 | 211088.41 |
Western Region | 229155555.0 | 107269.71 |
Region | Client Score
---|---
Central AgriBiz | 20355.84 |
Coast Region | 157343.10 |
Eastern Region | 61644.78 |
Nairobi Region | 309115.63 |
North Rift Region | 270228.72 |
South Rift AgriBiz | 168818.01 |
South Rift Region | 211088.41 |
Western Region | 107269.71 |
Region | Client ID | Client Score
---|---|---
Central AgriBiz | 55127.980183 | 49.527591 |
Coast Region | 47997.182150 | 35.792334 |
Eastern Region | 67379.643428 | 31.244187 |
Nairobi Region | 31139.914265 | 32.839226 |
North Rift Region | 38370.319069 | 34.059582 |
South Rift AgriBiz | 52272.250000 | 53.729475 |
South Rift Region | 39932.060316 | 33.389499 |
Western Region | 68201.058036 | 35.205025 |
To check the size of each group, we can use the `size()` method of the groupby object.
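For example:

```python
region_group.size()
```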
```
Region
Central AgriBiz        656
Coast Region          4661
Eastern Region        2252
Nairobi Region        9856
North Rift Region     8506
South Rift AgriBiz    3848
South Rift Region     6781
Western Region        3360
dtype: int64
```
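The next output appears to count the distinct offices per region:

```python
region_group["Office"].nunique()
```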
```
Region
Central AgriBiz       2
Coast Region          8
Eastern Region        7
Nairobi Region        8
North Rift Region     9
South Rift AgriBiz    6
South Rift Region     8
Western Region        7
Name: Office, dtype: int64
```
The pivot table (borrowed from Excel) is an advanced pandas grouping method. Here you can decide on the index, the columns, the values, and the aggregate functions. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame. The aggfunc is mean by default; you can pass string aggregates, e.g. `'mean'`, `'sum'`, etc., or aggregate functions such as `np.mean`, `np.sum`.
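The pivot below looks like the mean `Client Score` by `Office` and `Region`:

```python
pd.pivot_table(status, index="Office", columns="Region", values="Client Score")
```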
Office | Central AgriBiz | Coast Region | Eastern Region | Nairobi Region | North Rift Region | South Rift AgriBiz | South Rift Region | Western Region
---|---|---|---|---|---|---|---|---
Bomet | NaN | NaN | NaN | NaN | NaN | NaN | 33.836232 | NaN |
Bondo | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 54.472500 |
Bungoma | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.724249 |
Busia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36.784428 |
Changamwe | NaN | 36.532172 | NaN | NaN | NaN | NaN | NaN | NaN |
Setting `margins=True` adds a row aggregate and a column aggregate.
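For example (assumed from the output below):

```python
pd.pivot_table(status, index="Status Date", columns="Region",
               values="Client Score", margins=True)
```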
Status Date | Central AgriBiz | Coast Region | Eastern Region | Nairobi Region | North Rift Region | South Rift AgriBiz | South Rift Region | Western Region | All
---|---|---|---|---|---|---|---|---|---
2016-01-31 | NaN | NaN | NaN | 44.733333 | NaN | NaN | NaN | NaN | 44.733333 |
2016-02-29 | NaN | NaN | NaN | 35.430541 | NaN | NaN | 37.207143 | NaN | 35.713182 |
2016-03-31 | NaN | NaN | NaN | 34.614533 | NaN | NaN | 33.599565 | NaN | 34.376327 |
2016-04-30 | NaN | NaN | NaN | 30.816381 | NaN | NaN | 33.833704 | NaN | 31.433561 |
2016-05-31 | NaN | NaN | NaN | 37.960885 | NaN | NaN | 41.225000 | NaN | 38.681241 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-11-30 | 47.642000 | 34.752153 | 31.791702 | 34.263287 | 38.249737 | 50.445581 | 33.603857 | 32.730902 | 34.959187 |
2020-12-31 | 43.470714 | 37.190167 | 30.627154 | 35.245165 | 39.696990 | 49.850078 | 34.649405 | 38.350320 | 37.861864 |
2021-01-31 | 50.570377 | 36.696154 | 34.157852 | 38.617857 | 41.534072 | 54.294533 | 36.930952 | 35.335667 | 41.915679 |
2021-02-28 | 55.504430 | 48.855651 | 41.270897 | 47.413198 | 49.778757 | 56.811797 | 47.444950 | 45.280237 | 49.599350 |
All | 49.527591 | 35.792334 | 31.244187 | 32.839226 | 34.059582 | 53.729475 | 33.389499 | 35.205025 | 35.642344 |
63 rows × 9 columns
There are various ways of combining data sets:

- append - adds rows to the end of the caller
- concatenate
- merge
  - inner join
  - left join
  - right join
Type of merge to be performed:

- left: use only keys from the left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from the right frame, similar to a SQL right outer join; preserve key order.
- outer: use the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames; preserves the order of the left keys.
We use `df.merge` / `pd.merge` to bring in columns from other datasets.
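A quick look at the counties dataframe:

```python
counties.head()
```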
 | name | subCounty
---|---|---
0 | Mombasa | Changamwe |
1 | Mombasa | Jomvu |
2 | Mombasa | Kisauni |
3 | Mombasa | Likoni |
4 | Mombasa | Mvita |
We can merge `status` and `counties` using `Office` and `subCounty`, because those are the keys present in the two data frames. Since the column names differ between the data frames, we have to use `left_on` and `right_on` to specify the columns.
If the key column has the same name in both data frames, we can use the `on` argument instead.
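A sketch (the result name `combined` is an assumption):

```python
combined = status.merge(counties, left_on="Office", right_on="subCounty")
combined.head()
```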
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date | name | subCounty
---|---|---|---|---|---|---|---|---|---
0 | 33.0 | South Rift Region | Naivasha | Dormant | Penalties | 40.00 | 2020-06-30 | Nakuru | Naivasha |
1 | 34.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 40.23 | 2016-03-31 | Nakuru | Naivasha |
2 | 206.0 | South Rift Region | Naivasha | Active | P+I Current | 76.87 | 2021-02-28 | Nakuru | Naivasha |
3 | 615.0 | South Rift Region | Naivasha | Active | P+I Current | 58.43 | 2021-02-28 | Nakuru | Naivasha |
4 | 702.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 43.73 | 2016-08-31 | Nakuru | Naivasha |
Another method is using `pd.merge`:
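This produces the dataframe `combined_2` used in the export step below:

```python
combined_2 = pd.merge(status, counties, left_on="Office", right_on="subCounty")
combined_2.head()
```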
 | Client ID | Region | Office | Client Status | Loan Status | Client Score | Status Date | name | subCounty
---|---|---|---|---|---|---|---|---|---
0 | 33.0 | South Rift Region | Naivasha | Dormant | Penalties | 40.00 | 2020-06-30 | Nakuru | Naivasha |
1 | 34.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 40.23 | 2016-03-31 | Nakuru | Naivasha |
2 | 206.0 | South Rift Region | Naivasha | Active | P+I Current | 76.87 | 2021-02-28 | Nakuru | Naivasha |
3 | 615.0 | South Rift Region | Naivasha | Active | P+I Current | 58.43 | 2021-02-28 | Nakuru | Naivasha |
4 | 702.0 | South Rift Region | Naivasha | Dormant | Pending Renewal | 43.73 | 2016-08-31 | Nakuru | Naivasha |
For the dataframe (`counties`), do a full outer join with the `status` dataframe. How many rows does the result have?
After finishing data wrangling, it is good to export the data to the right file format for further use. Let us do some final tweaks and export the final dataset as a CSV.
```python
# We will use the newly combined dataset.
# 1. Rename columns and remove unwanted columns.
combined_2.rename(columns={'name':'county'}, inplace=True)
combined_2.drop(columns=['subCounty'], inplace=True)
combined_2.dropna(inplace=True)
# 2. Pivot the median Client Score by county and loan status, then export.
combined_3 = pd.pivot_table(combined_2, index='county', columns='Loan Status', values='Client Score', aggfunc=np.median)
display(combined_3.head())
combined_3.to_csv('combined.csv', index=False)
```
county | P+I Current | P+I In Default | Penalties | Pending Renewal
---|---|---|---|---
Busia | 64.19 | 47.110 | 32.130 | 41.91 |
Kericho | 64.92 | 61.170 | 45.910 | 62.42 |
Kiambu | 60.76 | 48.570 | 32.665 | 50.66 |
Kilifi | 67.01 | 54.220 | 36.005 | 52.41 |
Laikipia | NaN | 44.435 | 26.275 | 52.40 |
Thank you for attending today's session.