R tidyverse: tidyr.

Usually when you work with data it is common that it is not ready to analize it, the information could be available but the functions in require a precise format.

library(tidyverse)
library(nycflights13)

Tidydata

There are different ways to organize a dataframe, in community the most extended is the “tidydata” concept, with this format each variable has a column and each row represent an observation.

Next table is a example of tidydata

(#tab:unnamed-chunk-3)
countryyearcasespopulationrate
Afghanistan1999745199870713.73e-05
Afghanistan20002666205953600.000129
Brazil1999377371720063620.000219
Brazil2000804881745048980.000461
China199921225812729152720.000167
China200021376612804285830.000167

When you organize data “tidy” most of fuctions works smoothly, but you should be aware and know that “tidydata” is not the only correct way to organize data, for example there are some field with well establish formats even when there are not practical to work directly.

If you have you data organized as “tidydata” you can easily create fancy plots with one o two code lines, like the following.

Spread and gathering

the tibble “tabla4a” has not sorted their values as tidydata, specifically the values of “year” are put in two different columns.

table4a
(#tab:unnamed-chunk-6)
country19992000
Afghanistan7452666
Brazil3773780488
China212258213766

To solve this problem we will gather the columns 1999 and 2000 in only a column named year,and isolate the old values in other column named value.

table4a %>% gather("1999", "2000", key = "year", value = "cases")
# we use "" in this columns because they are nonsyctactic names
# rememeber that you should avoid use numbers and symbol as colum names
(#tab:unnamed-chunk-8)
countryyearcases
Afghanistan1999745
Brazil199937737
China1999212258
Afghanistan20002666
Brazil200080488
China2000213766

Another example with table4b

table4b %>% gather("1999", "2000", key = "year", value = "population")
(#tab:unnamed-chunk-10)
countryyearpopulation
Afghanistan199919987071
Brazil1999172006362
China19991272915272
Afghanistan200020595360
Brazil2000174504898
China20001280428583

Now we can combine (if we want it) two dataframes with left_join() from dplyr package, It is already loaded because it is part of tidyverse

table4a <- table4a %>% gather("1999", "2000", key = "year", value = "cases")

table4b <- table4b %>% gather("1999", "2000", key = "year", value = "population")

joined_table4 <- left_join(table4a,table4b)
joined_table4
(#tab:unnamed-chunk-12)
countryyearcasespopulation
Afghanistan199974519987071
Brazil199937737172006362
China19992122581272915272
Afghanistan2000266620595360
Brazil200080488174504898
China20002137661280428583

There are other fuctions to gather dataframes like right_join(), full_join(), semi_joint(), inner_joint() and more.

Spreading

Other times the observations are distributed wrong, for example the column type in the next table:

table2
(#tab:unnamed-chunk-14)
countryyeartypecount
Afghanistan1999cases745
Afghanistan1999population19987071
Afghanistan2000cases2666
Afghanistan2000population20595360
Brazil1999cases37737
Brazil1999population172006362
Brazil2000cases80488
Brazil2000population174504898
China1999cases212258
China1999population1272915272
China2000cases213766
China2000population1280428583

To solve this problem we will use spred()

table2 %>% spread(key = type, value = count)
(#tab:unnamed-chunk-16)
countryyearcasespopulation
Afghanistan199974519987071
Afghanistan2000266620595360
Brazil199937737172006362
Brazil200080488174504898
China19992122581272915272
China20002137661280428583

Separating and pulling

Next you will see a column with a awful problem, we will use sepate() to solve it, this function take any non-alphanumeric character to break a column, to change the separator use the parameter sep; another important thing is that separate() can convert the new colums as character, to convert use the argument convert = TRUE

table3
#column rate have 2 observations sepatated by a slash
(#tab:unnamed-chunk-18)
countryyearrate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583
table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = T)
table3
(#tab:unnamed-chunk-20)
countryyearcasespopulation
Afghanistan199974519987071
Afghanistan2000266620595360
Brazil199937737172006362
Brazil200080488174504898
China19992122581272915272
China20002137661280428583

Also, You can break a column by the position of each character if there are not a common characters to use sep argument, for example:

table3 %>% separate(year, into = c("century", "year"), sep = 2)
(#tab:unnamed-chunk-22)
countrycenturyyearrate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583

unite with unite()

The less frequently problem is that a observation are divided in two separates columns, for example in the table5 each observation of years are separate in century and year.

table5
(#tab:unnamed-chunk-24)
countrycenturyyearrate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583
table5 %>% unite(new_date,century,year ,sep = "")
#we use sep = argument to put (or not) a tag in the joint
(#tab:unnamed-chunk-26)
countrynew_daterate
Afghanistan1999745/19987071
Afghanistan20002666/20595360
Brazil199937737/172006362
Brazil200080488/174504898
China1999212258/1272915272
China2000213766/1280428583

Tidying missing data

When you change the disposition of a dataframe or tibble sometimes appear missing values (NA) and them can be explicit or implicit, let’s see some exercises.

stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA , 0.92, 0.17, 2.66))
# year 2015 - qtr 4 - return is a explicit NA
# year 2016 -qrt 1 is a impocit NA
stocks
(#tab:unnamed-chunk-28)
yearqtrreturn
2.02e+0311.88
2.02e+0320.59
2.02e+0330.35
2.02e+034   
2.02e+0320.92
2.02e+0330.17
2.02e+0342.66

Implicit missing values can be converted to explicit with spread() when we change the column values.

stocks %>% spread(year, return) 
stocks %>% spread(year, return) %>% as_huxtable() %>% theme_blue()
(#tab:unnamed-chunk-30)
qtr20152016
11.88   
20.590.92
30.350.17
4   2.66
(#tab:unnamed-chunk-31)
qtryearreturn
120151.88
220150.59
320150.35
42015   
12016   
220160.92
320160.17
420162.66
(#tab:unnamed-chunk-31)
qtryearreturn
120151.88
220150.59
320150.35
220160.92
320160.17
420162.66
stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016")
stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016", na.rm = T) #leave it as it was.

Another way to make explicit missing values is using complet(), this function generate NAs automatically.

stocks %>% complete(year, qtr)
(#tab:unnamed-chunk-34)
yearqtrreturn
2.02e+0311.88
2.02e+0320.59
2.02e+0330.35
2.02e+034   
2.02e+031   
2.02e+0320.92
2.02e+0330.17
2.02e+0342.66

Relational data with dplyr

Mutating joints

This tools allows you o combine two tables that have observations in common,in the next tibble you can see the variable carrier, on the other hand the tibble airlines have the same variable, then we can use the left_joint() to unite them.

flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)
(#tab:unnamed-chunk-36)
yearmonthdayhourorigindesttailnumcarrier
2013115EWRIAHN14228UA
2013115LGAIAHN24211UA
2013115JFKMIAN619AAAA
2013115JFKBQNN804JBB6
2013116LGAATLN668DNDL
2013115EWRORDN39463UA
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
(#tab:unnamed-chunk-39)
carriername
9EEndeavor Air Inc.
AAAmerican Airlines Inc.
ASAlaska Airlines Inc.
B6JetBlue Airways
DLDelta Air Lines Inc.
EVExpressJet Airlines Inc.
F9Frontier Airlines Inc.
FLAirTran Airways Corporation
HAHawaiian Airlines Inc.
MQEnvoy Air
OOSkyWest Airlines Inc.
UAUnited Air Lines Inc.
USUS Airways Inc.
VXVirgin America
WNSouthwest Airlines Co.
YVMesa Airlines Inc.
(#tab:unnamed-chunk-39)
carriernameyearmonthhourtailnum
UAUnited Air Lines Inc.201315N14228
UAUnited Air Lines Inc.201315N24211
AAAmerican Airlines Inc.201315N619AA
B6JetBlue Airways201315N804JB
DLDelta Air Lines Inc.201316N668DN
UAUnited Air Lines Inc.201315N39463

Now, we have added the information from the tibble airlines in a new tibble only.

There are type of joints like inner joints (exclude the mismatches using the inner_joint() function) and outer joints who conserved the mismatched fulling the gaps with NAs.

  • A left join keeps all observations in x. left_joint_()
  • A right join keeps all observations in y. right_joint_()
  • A full join keeps all observations in x and y. full_joint_()

Defining keys

If you do not use the argument by then by default dplyr will search for columns in common.

flights2 %>%
left_join(weather) %>% head()
(#tab:unnamed-chunk-41)
yearmonthdayhourorigindesttailnumcarriertempdewphumidwind_dirwind_speedwind_gustprecippressurevisibtime_hour
2013115EWRIAHN14228UA39  2864.426012.7  01.01e+03102013-01-01 05:00:00
2013115LGAIAHN24211UA39.92554.825015  21.901.01e+03102013-01-01 05:00:00
2013115JFKMIAN619AAAA39  2761.626015    01.01e+03102013-01-01 05:00:00
2013115JFKBQNN804JBB639  2761.626015    01.01e+03102013-01-01 05:00:00
2013116LGAATLN668DNDL39.92554.826016.123  01.01e+03102013-01-01 06:00:00
2013115EWRORDN39463UA39  2864.426012.7  01.01e+03102013-01-01 05:00:00

Filter joins

Instead match variables like in regular joints this kind of joints match observations.

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
top_dest <- flights %>% 
        count(dest, sort = TRUE) %>% head(10)
#select top 10 destines

flights %>%
        semi_join(top_dest) %>%
        head()
#take from flights just the colums that belong to top_dest 
(#tab:unnamed-chunk-43)
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
201311542540292385033AA1141N619AAJFKMIA1601.09e+035402013-01-01 05:00:00
201311554600-6812837-25DL461N668DNLGAATL116762       602013-01-01 06:00:00
201311554558-474072812UA1696N39463EWRORD150719       5582013-01-01 05:00:00
201311555600-591385419B6507N516JBEWRFLL1581.06e+03602013-01-01 06:00:00
201311557600-3838846-8B679N593JBJFKMCO140944       602013-01-01 06:00:00
201311558600-27537458AA301N3ALAALGAORD138733       602013-01-01 06:00:00
Diego Sierra Ramírez
Diego Sierra Ramírez
Msc. in Biological Science / Data analyst

Related