# Introduction and Exploring Raw Data

An overview of the process of data cleaning with R and the basics of exploring raw data.

## The Essential Parts of a Data Cleaning Process

1. Exploring raw data
2. Tidying data
3. Preparing data for analysis

No one likes missing data, but it is dangerous to assume that it can simply be removed or replaced. Sometimes missing data tells us something important about whatever it is that we're measuring (i.e. the value of the missing variable may be related to the reason it is missing). Such data is called Missing Not at Random, or MNAR.

We are presented with a messy, real-world dataset containing an entire year's worth of weather data from Boston, USA. Among other things, we'll be presented with variables that contain column names, column names that should be values, numbers coded as character strings, and values that are missing, extreme, and downright erroneous!

## Exploring raw data

### Getting a feel for your data

```r
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(installr)
```



```r
weather <- readRDS(gzcon(url('https://assets.datacamp.com/production/repositories/34/datasets/b3c1036d9a60a9dfe0f99051d2474a54f76055ea/weather.rds')))

class(weather)
# [1] "data.frame"

head(weather)
```

```
  X year month           measure X1 X2 X3 X4 X5 X6 ... X29 X30 X31
1 1 2014    12  Max.TemperatureF 64 42 51 43 42 45 ...  41  30  30
2 2 2014    12 Mean.TemperatureF 52 38 44 37 34 42 ...  36  26  25
3 3 2014    12  Min.TemperatureF 39 33 37 30 26 38 ...  30  22  20
4 4 2014    12    Max.Dew.PointF 46 40 49 24 37 45 ...  26  10   8
5 5 2014    12    MeanDew.PointF 40 27 42 21 25 40 ...  20   4   5
6 6 2014    12     Min.DewpointF 26 17 24 13 12 36 ...  10  -6   1
```
```r
tail(weather)
```

```
      X year month            measure   X1 X2 ... X31
281 281 2015    12 Mean.Wind.SpeedMPH    6 NA ...  NA
282 282 2015    12  Max.Gust.SpeedMPH   17 NA ...  NA
283 283 2015    12    PrecipitationIn 0.14 NA ...  NA
284 284 2015    12         CloudCover    7 NA ...  NA
285 285 2015    12             Events Rain NA ...  NA
286 286 2015    12     WindDirDegrees  109 NA ...  NA
```
```r
str(weather)
```

```
'data.frame':	286 obs. of  35 variables:
 $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year   : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
 $ month  : int  12 12 12 12 12 12 12 12 12 12 ...
 $ measure: chr  "Max.TemperatureF" "Mean.TemperatureF" "Min.TemperatureF" "Max.Dew.PointF" ...
 $ X1     : chr  "64" "52" "39" "46" ...
 $ X2     : chr  "42" "38" "33" "40" ...
 ...
 $ X30    : chr  "30" "26" "22" "10" ...
 $ X31    : chr  "30" "25" "20" "8" ...
```

```r
dim(weather)
# [1] 286  35
```

For comparison, we also load a clean dataset of country-level BMI values:

```r
bmi <- read_csv('https://assets.datacamp.com/production/repositories/34/datasets/a0a569ebbb34500d11979eba95360125127e6434/bmi_clean.csv')

class(bmi)
# [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

dim(bmi)
# [1] 199  30

head(bmi)
```

```
  Country             Y1980    Y1981    Y1982    ... Y2007    Y2008
1 Afghanistan         21.48678 21.46552 21.45145 ... 20.60246 20.62058
2 Albania             25.22533 25.23981 25.25636 ... 26.32753 26.44657
3 Algeria             22.25703 22.34745 22.43647 ... 24.48846 24.59620
4 Andorra             25.66652 25.70868 25.74681 ... 27.53363 27.63048
5 Angola              20.94876 20.94371 20.93754 ... 22.08962 22.25083
6 Antigua and Barbuda 23.31424 23.39054 23.45883 ... 25.64247 25.76602
```

```r
tail(bmi)
```

```
    Country            Y1980    Y1981    ... Y2007    Y2008
194 Venezuela          24.58052 24.69666 ... 27.30849 27.44500
195 Vietnam            19.01394 19.03902 ... 20.80189 20.91630
196 West Bank and Gaza 24.31624 24.40192 ... 26.55460 26.57750
197 Yemen, Rep.        22.90384 22.96813 ... 24.37949 24.44157
198 Zambia             19.66295 19.69512 ... 20.59770 20.68321
199 Zimbabwe           21.46989 21.48867 ... 21.97405 22.02660
```

str() and glimpse() show that Country is a character column and Y1980 through Y2008 are all doubles:

```r
str(bmi)
glimpse(bmi)
```

```
tibble [199 x 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country: chr [1:199] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Y1980  : num [1:199] 21.5 25.2 22.3 25.7 20.9 ...
 ...
 $ Y2008  : num [1:199] 20.6 26.4 24.6 27.6 22.3 ...

Rows: 199
Columns: 30
$ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", ...
$ Y1980   <dbl> 21.48678, 25.22533, 22.25703, 25.66652, 20.94876, 23.31424, ...
...
$ Y2008   <dbl> 20.62058, 26.44657, 24.59620, 27.63048, 22.25083, 25.76602, ...
```

head() and tail() also accept an n argument to control how many rows are shown:

```r
# View the first 6 rows
head(bmi, n = 6)

# View the first 15 rows
head(bmi, n = 15)

# View the last 6 rows
tail(bmi, n = 6)

# View the last 10 rows
tail(bmi, n = 10)
```

### Visualizing data

```r
# Histogram of BMIs from 2008
hist(bmi$Y2008)

# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)
```


## Tidying data

### Spreading key-value pairs into columns

The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.

The easiest way to visualize the effect of spread() is that it makes long datasets wide. Running the following command will make long_df wide:

```r
spread(long_df, my_key, my_val)
```
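As a minimal sketch, here is spread() applied to a hypothetical toy data frame (the names long_df, my_key, and my_val are illustrative stand-ins, not course data):

```r
library(tidyr)

# Hypothetical long-format toy data
long_df <- data.frame(
  id     = c(1, 1, 2, 2),
  my_key = c("height", "weight", "height", "weight"),
  my_val = c(170, 65, 180, 80)
)

# Each unique value of my_key becomes its own column
wide_df <- spread(long_df, my_key, my_val)
wide_df
#   id height weight
# 1  1    170     65
# 2  2    180     80
```

In recent versions of tidyr, pivot_wider() supersedes spread(), but spread() still works.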


### Separating columns

The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify the separator explicitly with the sep argument.

The treatments dataset obeys the principles of tidy data, but we'd like to split the treatment dates into two separate columns: year and month. This can be accomplished with the following:

```r
separate(treatments, year_mo, c("year", "month"))
```
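A minimal sketch with a hypothetical stand-in for the treatments data (the course dataset isn't loaded here, so a two-row toy version is constructed for illustration):

```r
library(tidyr)

# Hypothetical stand-in for the course's treatments data
treatments <- data.frame(
  patient = c("X", "Y"),
  year_mo = c("2018-06", "2019-11")
)

# The default separator matches any non-alphanumeric character, so the dash works
treatments_sep <- separate(treatments, year_mo, c("year", "month"))
treatments_sep
#   patient year month
# 1       X 2018    06
# 2       Y 2019    11
```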


### Uniting columns

The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.

Suppose treatments now has the year_mo column separated into year and month. The original column can be recreated by putting year and month back together:

```r
unite(treatments, year_mo, year, month)
```
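A minimal sketch on a hypothetical version of treatments with the columns already separated, showing both the default underscore separator and sep used to restore a dash:

```r
library(tidyr)

# Hypothetical treatments data with year and month already separated
treatments <- data.frame(
  patient = c("X", "Y"),
  year    = c("2018", "2019"),
  month   = c("06", "11")
)

# Default separator is an underscore
unite(treatments, year_mo, year, month)
#   patient year_mo
# 1       X 2018_06
# 2       Y 2019_11

# sep = "-" restores the original dash-separated format
treatments_united <- unite(treatments, year_mo, year, month, sep = "-")
treatments_united
#   patient year_mo
# 1       X 2018-06
# 2       Y 2019-11
```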


### Column headers are values, not variable names

As we saw earlier, we sometimes come across datasets where column names are actually values of a variable (e.g. months of the year). This is often the case when working with repeated measures data, where measurements are taken on subjects of interest on multiple occasions over time. The gather() function is helpful in these situations.
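A minimal sketch of gather() on a hypothetical wide data frame (all names here are made up for illustration):

```r
library(tidyr)

# Hypothetical wide data: month names stored as column headers
temps <- data.frame(city = "Boston", Jan = 30, Feb = 32, Mar = 41)

# gather(data, key, value, columns): everything except city is gathered
temps_long <- gather(temps, month, avg_temp, -city)
temps_long
#     city month avg_temp
# 1 Boston   Jan       30
# 2 Boston   Feb       32
# 3 Boston   Mar       41
```

In recent versions of tidyr, pivot_longer() supersedes gather(), but gather() still works.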

### Variables are stored in both rows and columns

Sometimes you'll run into situations where variables are stored in both rows and columns. Although it may not be immediately obvious, if we treat the values in the type column as variables and create a separate column for each of them, we can set things straight. To do this, we use the spread() function.
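A minimal sketch of this situation, using a hypothetical data frame where a type column holds what should be variable names:

```r
library(tidyr)

# Hypothetical data: 'type' holds variable names, 'value' holds the readings
readings <- data.frame(
  day   = c(1, 1, 2, 2),
  type  = c("min", "max", "min", "max"),
  value = c(22, 39, 20, 30)
)

# Each unique value of 'type' becomes its own column
readings_wide <- spread(readings, type, value)
readings_wide
#   day max min
# 1   1  39  22
# 2   2  30  20
```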

### Multiple values are stored in one column

It's also fairly common that you will find two variables stored in a single column of data. These variables may be joined by a separator like a dash, underscore, space, or forward slash.

The separate() function comes in handy in these situations. Keep in mind that the into argument, which specifies the names of the two new columns being formed, must be given as a character vector (e.g. c("column1", "column2")).
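A minimal sketch with a hypothetical column holding two variables joined by a forward slash:

```r
library(tidyr)

# Hypothetical data: sex and age joined in one column
patients <- data.frame(id = 1:2, sex_age = c("M/34", "F/29"))

# 'into' names the new columns; 'sep' gives the separator explicitly
patients_sep <- separate(patients, sex_age, into = c("sex", "age"), sep = "/")
patients_sep
#   id sex age
# 1  1   M  34
# 2  2   F  29
```

Note that the new columns are character by default; add convert = TRUE if you want age converted to a number.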

## Preparing data for analysis

### Type conversions

Loosely speaking, the class() function tells you what type of object you're working with. (There are subtle differences between the class, type, and mode of an object, but class() is enough for our purposes here.)

```r
# Make this evaluate to "character"
class(as.character(TRUE))
# [1] "character"

# Make this evaluate to "numeric"
class(as.numeric("8484.00"))
# [1] "numeric"

# Make this evaluate to "integer"
class(as.integer(99L))
# [1] "integer"

# Make this evaluate to "factor"
class(as.factor("factor"))
# [1] "factor"

# Make this evaluate to "logical"
class(as.logical("FALSE"))
# [1] "logical"
```

### Common type conversions

It is often necessary to change, or coerce, the way that variables in a dataset are stored.

This could be because of the way they were read into R (with read.csv(), for example) or perhaps the function you are using to analyze the data requires variables to be coded a certain way.

Only certain coercions are allowed, but the rules for what works are generally pretty intuitive. For example, trying to convert a character string to a number doesn't work: as.numeric("some text") returns NA with a warning.

There are a few less intuitive results. For example, under the hood, the logical values TRUE and FALSE are coded as 1 and 0, respectively. Therefore, as.logical(1) returns TRUE and as.numeric(TRUE) returns 1.
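These rules are easy to check at the console:

```r
# Coercing non-numeric text does not throw an error; it yields NA with a warning
as.numeric("some text")
# [1] NA
# Warning message: NAs introduced by coercion

# TRUE and FALSE are coded as 1 and 0 under the hood
as.logical(1)
# [1] TRUE

as.numeric(TRUE)
# [1] 1
```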

### Working with dates

Dates can be a challenge to work with in any programming language, but thanks to the lubridate package, working with dates in R isn't so bad. Since this project is about cleaning data, we only cover the most basic lubridate functions for standardizing the format of dates and times in our data. These functions combine the letters y, m, d, h, m, s, which stand for year, month, day, hour, minute, and second, respectively. The order of the letters in a function's name should match the order of the components in the date/time you are attempting to read in, although not all combinations are valid. Note that the functions are "smart" in that they are capable of parsing multiple formats.

```r
students <- read_csv('https://assets.datacamp.com/production/repositories/34/datasets/f75a87dbbdf2cf79e2286f97b2af22146cb717b1/students_with_dates.csv')
```

```
Warning message:
"Missing column names filled in: 'X1' [1]"
Parsed with column specification:
cols(
  .default = col_character(),
  X1 = col_double(),
  dob = col_date(format = ""),
  Medu = col_double(),
  Fedu = col_double(),
  traveltime = col_double(),
  studytime = col_double(),
  failures = col_double(),
  famrel = col_double(),
  freetime = col_double(),
  goout = col_double(),
  Dalc = col_double(),
  Walc = col_double(),
  health = col_double(),
  nurse_visit = col_datetime(format = ""),
  absences = col_double()
)
See spec(...) for full column specifications.
```

```r
head(students)
```

```
  X1 school sex        dob address famsize ... health         nurse_visit absences   Grades
1  1     GP   F 2000-06-05       U     GT3 ...      3 2014-04-10 14:59:54        6    5/6/6
2  2     GP   F 1999-11-25       U     GT3 ...      3 2015-03-12 14:59:54        4    5/5/6
3  3     GP   F 1998-02-02       U     LE3 ...      3 2015-09-21 14:59:54       10   7/8/10
4  4     GP   F 1997-12-20       U     GT3 ...      5 2015-09-03 14:59:54        2 15/14/15
5  5     GP   F 1998-10-04       U     GT3 ...      5 2015-04-07 14:59:54        4  6/10/10
6  6     GP   M 1999-06-16       U     LE3 ...      5 2013-11-15 14:59:54       10 15/15/15
```
```r
str(students)
```

```
tibble [395 x 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ X1         : num [1:395] 1 2 3 4 5 6 7 8 9 10 ...
 $ school     : chr [1:395] "GP" "GP" "GP" "GP" ...
 $ sex        : chr [1:395] "F" "F" "F" "F" ...
 $ dob        : Date[1:395], format: "2000-06-05" "1999-11-25" ...
 $ address    : chr [1:395] "U" "U" "U" "U" ...
 $ famsize    : chr [1:395] "GT3" "GT3" "LE3" "GT3" ...
 $ Pstatus    : chr [1:395] "A" "T" "T" "T" ...
 $ Medu       : num [1:395] 4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu       : num [1:395] 4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob       : chr [1:395] "at_home" "at_home" "at_home" "health" ...
 $ Fjob       : chr [1:395] "teacher" "other" "other" "services" ...
 $ reason     : chr [1:395] "course" "course" "other" "home" ...
 $ guardian   : chr [1:395] "mother" "father" "mother" "mother" ...
 ...
 $ nurse_visit: POSIXct[1:395], format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
 $ absences   : num [1:395] 6 4 10 2 4 10 0 6 0 0 ...
 $ Grades     : chr [1:395] "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
```

```r
# Coerce Grades to character
students$Grades <- as.character(students$Grades)

# Coerce Medu to factor
students$Medu <- as.factor(students$Medu)

# Coerce Fedu to factor
students$Fedu <- as.factor(students$Fedu)

# Look at students once more with str()
str(students)
```

Medu and Fedu are now factors with five levels:

```
tibble [395 x 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ X1         : num [1:395] 1 2 3 4 5 6 7 8 9 10 ...
 ...
 $ Medu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
 $ Fedu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
 ...
 $ Grades     : chr [1:395] "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
```

```r
# Preview students with str()
str(students)

# Parse as date
dmy("17 Sep 2015")

# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")

# Coerce dob to a date (with no time)
students$dob <- ymd(students$dob)

# Coerce nurse_visit to a date and time
students$nurse_visit <- ymd_hms(students$nurse_visit)

# Look at students once more with str()
str(students)

tibble [395 x 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1         : num [1:395] 1 2 3 4 5 6 7 8 9 10 ...
$ school     : chr [1:395] "GP" "GP" "GP" "GP" ...
$ sex        : chr [1:395] "F" "F" "F" "F" ...
$ dob        : Date[1:395], format: "2000-06-05" "1999-11-25" ...
$ address    : chr [1:395] "U" "U" "U" "U" ...
$ famsize    : chr [1:395] "GT3" "GT3" "LE3" "GT3" ...
$ Pstatus    : chr [1:395] "A" "T" "T" "T" ...
$ Medu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
$ Fedu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
$ Mjob       : chr [1:395] "at_home" "at_home" "at_home" "health" ...
$ Fjob       : chr [1:395] "teacher" "other" "other" "services" ...
$ reason     : chr [1:395] "course" "course" "other" "home" ...
$ guardian   : chr [1:395] "mother" "father" "mother" "mother" ...
$ traveltime : num [1:395] 2 1 1 1 1 1 1 2 1 1 ...
$ studytime  : num [1:395] 2 2 2 3 2 2 2 2 2 2 ...
$ failures   : num [1:395] 0 0 3 0 0 0 0 0 0 0 ...
$ schoolsup  : chr [1:395] "yes" "no" "yes" "no" ...
$ famsup     : chr [1:395] "no" "yes" "no" "yes" ...
$ paid       : chr [1:395] "no" "no" "yes" "yes" ...
$ activities : chr [1:395] "no" "no" "no" "yes" ...
$ nursery    : chr [1:395] "yes" "no" "yes" "yes" ...
$ higher     : chr [1:395] "yes" "yes" "yes" "yes" ...
$ internet   : chr [1:395] "no" "yes" "yes" "yes" ...
$ romantic   : chr [1:395] "no" "no" "no" "yes" ...
$ famrel     : num [1:395] 4 5 4 3 4 5 4 4 4 5 ...
$ freetime   : num [1:395] 3 3 3 2 3 4 4 1 2 5 ...
$ goout      : num [1:395] 4 3 2 2 2 2 4 4 2 1 ...
$ Dalc       : num [1:395] 1 1 2 1 1 1 1 1 1 1 ...
$ Walc       : num [1:395] 1 1 3 1 2 2 1 1 1 1 ...
$ health     : num [1:395] 3 3 3 5 5 5 3 1 1 5 ...
$ nurse_visit: POSIXct[1:395], format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
$ absences   : num [1:395] 6 4 10 2 4 10 0 6 0 0 ...
$ Grades     : chr [1:395] "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
- attr(*, "spec")=
.. cols(
..   X1 = col_double(),
..   school = col_character(),
..   sex = col_character(),
..   dob = col_date(format = ""),
..   address = col_character(),
..   famsize = col_character(),
..   Pstatus = col_character(),
..   Medu = col_double(),
..   Fedu = col_double(),
..   Mjob = col_character(),
..   Fjob = col_character(),
..   reason = col_character(),
..   guardian = col_character(),
..   traveltime = col_double(),
..   studytime = col_double(),
..   failures = col_double(),
..   schoolsup = col_character(),
..   famsup = col_character(),
..   paid = col_character(),
..   activities = col_character(),
..   nursery = col_character(),
..   higher = col_character(),
..   internet = col_character(),
..   romantic = col_character(),
..   famrel = col_double(),
..   freetime = col_double(),
..   goout = col_double(),
..   Dalc = col_double(),
..   Walc = col_double(),
..   health = col_double(),
..   nurse_visit = col_datetime(format = ""),
..   absences = col_double(),
..   Grades = col_character()
.. )

[1] "2015-09-17"
[1] "2012-07-15 12:56:00 UTC"

Note that str() prints the same structure before and after the ymd()/ymd_hms() coercions: readr had already parsed dob as a Date and nurse_visit as a datetime on import.


## String manipulation

One common issue that comes up when cleaning data is the need to remove leading and/or trailing white space. The str_trim() function from stringr makes it easy to do this while leaving intact the part of the string that you actually want.

> str_trim("  this is a test     ")
[1] "this is a test"


A similar issue arises when you need to pad strings to make them a certain number of characters wide. For example, suppose you have a set of employee ID numbers, some of which begin with one or more zeros. When reading the data in, you find that the leading zeros have been dropped somewhere along the way (probably because the variable was treated as numeric, in which case leading zeros are unnecessary).

> str_pad("24493", width = 7, side = "left", pad = "0")
[1] "0024493"

# Load the stringr package
library(stringr)

# Trim all leading and trailing whitespace
str_trim(c("   Filip ", "Nick  ", " Jonathan"))


[1] "Filip"    "Nick"     "Jonathan"

# Pad employee IDs to 9 characters with leading zeros
# (the original call was lost in conversion; inputs reconstructed from the output below)
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")

[1] "00023485W" "08823453Q" "00000994Z"

Examples like this come up often in practice. The str_pad() function, for instance, is useful when importing a dataset with US zip codes: R sometimes drops the leading 0 of a zip code after reading the column as numeric.
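A minimal sketch of that zip-code fix (the values below are made up for illustration, not taken from any dataset here):

# Hypothetical zip codes that lost their leading zeros on import
zips <- c(2115, 60614, 501)
str_pad(as.character(zips), width = 5, side = "left", pad = "0")

[1] "02115" "60614" "00501"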

### Upper and lower case

In addition to trimming and padding strings, you may need to adjust their case from time to time. Making strings uppercase or lowercase is very straightforward in (base) R thanks to toupper() and tolower(). Each function takes exactly one argument: the character string (or vector/column of strings) to be converted to the desired case.

states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga", "hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj", "nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx", "ut", "vt", "va", "wa", "wv", "wi", "wy")
# Print state abbreviations
states

# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)
states_upper

# Make states_upper all lowercase again
tolower(states_upper)

 [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in" "ia"
[16] "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nh" "nj"
[31] "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "vt"
[46] "va" "wa" "wv" "wi" "wy"

 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
[16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
[31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
[46] "VA" "WA" "WV" "WI" "WY"

 [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in" "ia"
[16] "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nh" "nj"
[31] "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "vt"
[46] "va" "wa" "wv" "wi" "wy"

### Finding and replacing strings

The stringr package provides two functions that are very useful for finding and/or replacing patterns in strings: str_detect() and str_replace().

Like all functions in stringr, the first argument of each is the string of interest. The second argument of each is the pattern of interest. In the case of str_detect(), this is the pattern we are searching for. In the case of str_replace(), this is the pattern we want to replace. Finally, str_replace() has a third argument, which is the string to replace with.

> str_detect(c("banana", "kiwi"), "a")
[1]  TRUE FALSE
> str_replace(c("banana", "kiwi"), "a", "o")
[1] "bonana" "kiwi"
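
Note that str_replace() replaces only the first match in each string. To replace every match, stringr also provides str_replace_all():

> str_replace_all(c("banana", "kiwi"), "a", "o")
[1] "bonono" "kiwi"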

# Copy of students: students3
students3 <- students

# Look at the head of students3
head(students3)

# Detect all dates of birth (dob) in 1997
str_detect(students3$dob, "1997")

# In the sex column, replace "F" with "Female"...
students3$sex <- str_replace(students3$sex, "F", "Female")

# ...and "M" with "Male"
students3$sex <- str_replace(students3$sex, "M", "Male")

# View the head of students3
head(students3)

  X1 school sex dob        address famsize Pstatus Medu Fedu Mjob     ... romantic famrel freetime goout Dalc Walc health nurse_visit         absences Grades
1  1 GP     F   2000-06-05 U       GT3     A       4    4    at_home  ... no       4      3        4     1    1    3      2014-04-10 14:59:54        6 5/6/6
2  2 GP     F   1999-11-25 U       GT3     T       1    1    at_home  ... no       5      3        3     1    1    3      2015-03-12 14:59:54        4 5/5/6
3  3 GP     F   1998-02-02 U       LE3     T       1    1    at_home  ... no       4      3        2     2    3    3      2015-09-21 14:59:54       10 7/8/10
4  4 GP     F   1997-12-20 U       GT3     T       4    2    health   ... yes      3      2        2     1    1    5      2015-09-03 14:59:54        2 15/14/15
5  5 GP     F   1998-10-04 U       GT3     T       3    3    other    ... no       4      3        2     1    2    5      2015-04-07 14:59:54        4 6/10/10
6  6 GP     M   1999-06-16 U       LE3     T       4    3    services ... no       5      4        2     1    2    5      2013-11-15 14:59:54       10 15/15/15

[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE ...
(logical vector of length 395; TRUE marks students born in 1997)

  X1 school sex    dob        address famsize Pstatus Medu Fedu Mjob     ... romantic famrel freetime goout Dalc Walc health nurse_visit         absences Grades
1  1 GP     Female 2000-06-05 U       GT3     A       4    4    at_home  ... no       4      3        4     1    1    3      2014-04-10 14:59:54        6 5/6/6
2  2 GP     Female 1999-11-25 U       GT3     T       1    1    at_home  ... no       5      3        3     1    1    3      2015-03-12 14:59:54        4 5/5/6
3  3 GP     Female 1998-02-02 U       LE3     T       1    1    at_home  ... no       4      3        2     2    3    3      2015-09-21 14:59:54       10 7/8/10
4  4 GP     Female 1997-12-20 U       GT3     T       4    2    health   ... yes      3      2        2     1    1    5      2015-09-03 14:59:54        2 15/14/15
5  5 GP     Female 1998-10-04 U       GT3     T       3    3    other    ... no       4      3        2     1    2    5      2015-04-07 14:59:54        4 6/10/10
6  6 GP     Male   1999-06-16 U       LE3     T       4    3    services ... no       5      4        2     1    2    5      2013-11-15 14:59:54       10 15/15/15

## Missing and special values

### Finding missing values

social_df = data.frame(name = c('Sarah', 'Tom', 'David', 'Alice'),
                       n_friends = c(244, NA, 145, 43),
                       status = c('going out', "", 'Movie Night', ""))

# Call is.na() on the full social_df to spot all NAs
is.na(social_df)

# Use the any() function to ask whether there are any NAs in the data
any(is.na(social_df))

# View a summary() of the dataset
summary(social_df)

# Call table() on the status column
table(social_df$status)

      name n_friends status
[1,] FALSE     FALSE  FALSE
[2,] FALSE      TRUE  FALSE
[3,] FALSE     FALSE  FALSE
[4,] FALSE     FALSE  FALSE

[1] TRUE
    name    n_friends              status
 Alice:1   Min.   : 43.0              :2
 David:1   1st Qu.: 94.0   going out  :1
 Sarah:1   Median :145.0   Movie Night:1
 Tom  :1   Mean   :144.0
           3rd Qu.:194.5
           Max.   :244.0
           NA's   :1

            going out Movie Night
          2         1           1

### Dealing with missing values

Missing values can be a rather complex subject, but here we'll only look at the simple case of normalizing and/or removing all missing values from your data. For more information on why this is not always the best strategy, search online for "missing not at random."

Looking at the social_df dataset again, we asked around a bit and figured out what's causing the missing values we saw in the last exercise. Tom doesn't have a social media account on this particular platform, which explains why his number of friends and current status are missing (although the two are coded in different ways). Alice is on the platform but is a passive user who never sets her status, which is why it's missing for her.

# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA

# Print social_df to the console
social_df

# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)

# Use na.omit() to remove all rows with any missing values
na.omit(social_df)

   name n_friends      status
1 Sarah       244   going out
2   Tom        NA          NA
3 David       145 Movie Night
4 Alice        43          NA

[1]  TRUE FALSE  TRUE FALSE

   name n_friends      status
1 Sarah       244   going out
3 David       145 Movie Night

## Outliers and obvious errors

### Dealing with outliers and obvious errors

When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.
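
A simple first pass is a range check: flag the rows whose values fall outside what is plausible for the variable. The cutoff below (more than 40 absences in a school year) is only an illustrative assumption, not a rule from this dataset:

# Flag rows with an implausibly high absences value (threshold chosen for illustration)
extreme <- which(students3$absences > 40)
students3[extreme, c("dob", "absences")]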

# Look at a summary() of students3
summary(students3)

# View a histogram of the studytime variable
hist(students3$studytime) # View a histogram of the failures variable hist(students3$failures)

# View a histogram of absences, but force zeros to be bucketed to the right of zero