R tidyverse: tidyr.

Last updated on Mar 13, 2022

Usually when you work with data it is common that it is not ready to analize it, the information could be available but the functions in require a precise format.

library(tidyverse)
library(nycflights13)

Tidydata

There are different ways to organize a dataframe, in community the most extended is the “tidydata” concept, with this format each variable has a column and each row represent an observation.

Next table is a example of tidydata

(#tab:unnamed-chunk-3)
country	year	cases	population	rate
Afghanistan	1999	745	19987071	3.73e-05
Afghanistan	2000	2666	20595360	0.000129
Brazil	1999	37737	172006362	0.000219
Brazil	2000	80488	174504898	0.000461
China	1999	212258	1272915272	0.000167
China	2000	213766	1280428583	0.000167

When you organize data “tidy” most of fuctions works smoothly, but you should be aware and know that “tidydata” is not the only correct way to organize data, for example there are some field with well establish formats even when there are not practical to work directly.

If you have you data organized as “tidydata” you can easily create fancy plots with one o two code lines, like the following.

Spread and gathering

the tibble “tabla4a” has not sorted their values as tidydata, specifically the values of “year” are put in two different columns.

table4a

(#tab:unnamed-chunk-6)
country	1999	2000
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

To solve this problem we will gather the columns 1999 and 2000 in only a column named year,and isolate the old values in other column named value.

table4a %>% gather("1999", "2000", key = "year", value = "cases")
# we use "" in this columns because they are nonsyctactic names
# rememeber that you should avoid use numbers and symbol as colum names

(#tab:unnamed-chunk-8)
country	year	cases
Afghanistan	1999	745
Brazil	1999	37737
China	1999	212258
Afghanistan	2000	2666
Brazil	2000	80488
China	2000	213766

Another example with table4b

table4b %>% gather("1999", "2000", key = "year", value = "population")

(#tab:unnamed-chunk-10)
country	year	population
Afghanistan	1999	19987071
Brazil	1999	172006362
China	1999	1272915272
Afghanistan	2000	20595360
Brazil	2000	174504898
China	2000	1280428583

Now we can combine (if we want it) two dataframes with left_join() from dplyr package, It is already loaded because it is part of tidyverse

table4a <- table4a %>% gather("1999", "2000", key = "year", value = "cases")

table4b <- table4b %>% gather("1999", "2000", key = "year", value = "population")

joined_table4 <- left_join(table4a,table4b)
joined_table4

(#tab:unnamed-chunk-12)
country	year	cases	population
Afghanistan	1999	745	19987071
Brazil	1999	37737	172006362
China	1999	212258	1272915272
Afghanistan	2000	2666	20595360
Brazil	2000	80488	174504898
China	2000	213766	1280428583

There are other fuctions to gather dataframes like right_join(), full_join(), semi_joint(), inner_joint() and more.

Spreading

Other times the observations are distributed wrong, for example the column type in the next table:

table2

(#tab:unnamed-chunk-14)
country	year	type	count
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

To solve this problem we will use spred()

table2 %>% spread(key = type, value = count)

(#tab:unnamed-chunk-16)
country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Separating and pulling

Next you will see a column with a awful problem, we will use sepate() to solve it, this function take any non-alphanumeric character to break a column, to change the separator use the parameter sep; another important thing is that separate() can convert the new colums as character, to convert use the argument convert = TRUE

table3
#column rate have 2 observations sepatated by a slash

(#tab:unnamed-chunk-18)
country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = T)
table3

(#tab:unnamed-chunk-20)
country	year	cases	population
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

Also, You can break a column by the position of each character if there are not a common characters to use sep argument, for example:

table3 %>% separate(year, into = c("century", "year"), sep = 2)

(#tab:unnamed-chunk-22)
country	century	year	rate
Afghanistan	19	99	745/19987071
Afghanistan	20	00	2666/20595360
Brazil	19	99	37737/172006362
Brazil	20	00	80488/174504898
China	19	99	212258/1272915272
China	20	00	213766/1280428583

unite with unite()

The less frequently problem is that a observation are divided in two separates columns, for example in the table5 each observation of years are separate in century and year.

table5

(#tab:unnamed-chunk-24)
country	century	year	rate
Afghanistan	19	99	745/19987071
Afghanistan	20	00	2666/20595360
Brazil	19	99	37737/172006362
Brazil	20	00	80488/174504898
China	19	99	212258/1272915272
China	20	00	213766/1280428583

table5 %>% unite(new_date,century,year ,sep = "")
#we use sep = argument to put (or not) a tag in the joint

(#tab:unnamed-chunk-26)
country	new_date	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

Tidying missing data

When you change the disposition of a dataframe or tibble sometimes appear missing values (NA) and them can be explicit or implicit, let’s see some exercises.

stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA , 0.92, 0.17, 2.66))
# year 2015 - qtr 4 - return is a explicit NA
# year 2016 -qrt 1 is a impocit NA
stocks

(#tab:unnamed-chunk-28)
year	qtr	return
2.02e+03	1	1.88
2.02e+03	2	0.59
2.02e+03	3	0.35
2.02e+03	4
2.02e+03	2	0.92
2.02e+03	3	0.17
2.02e+03	4	2.66

Implicit missing values can be converted to explicit with spread() when we change the column values.

stocks %>% spread(year, return)

stocks %>% spread(year, return) %>% as_huxtable() %>% theme_blue()

(#tab:unnamed-chunk-30)
qtr	2015	2016
1	1.88
2	0.59	0.92
3	0.35	0.17
4		2.66

(#tab:unnamed-chunk-31)
qtr	year	return
1	2015	1.88
2	2015	0.59
3	2015	0.35
4	2015
1	2016
2	2016	0.92
3	2016	0.17
4	2016	2.66

(#tab:unnamed-chunk-31)
qtr	year	return
1	2015	1.88
2	2015	0.59
3	2015	0.35
2	2016	0.92
3	2016	0.17
4	2016	2.66

stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016")
stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016", na.rm = T) #leave it as it was.

Another way to make explicit missing values is using complet(), this function generate NAs automatically.

stocks %>% complete(year, qtr)

(#tab:unnamed-chunk-34)
year	qtr	return
2.02e+03	1	1.88
2.02e+03	2	0.59
2.02e+03	3	0.35
2.02e+03	4
2.02e+03	1
2.02e+03	2	0.92
2.02e+03	3	0.17
2.02e+03	4	2.66

Relational data with dplyr

Mutating joints

This tools allows you o combine two tables that have observations in common,in the next tibble you can see the variable carrier, on the other hand the tibble airlines have the same variable, then we can use the left_joint() to unite them.

flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)

(#tab:unnamed-chunk-36)
year	month	day	hour	origin	dest	tailnum	carrier
2013	1	1	5	EWR	IAH	N14228	UA
2013	1	1	5	LGA	IAH	N24211	UA
2013	1	1	5	JFK	MIA	N619AA	AA
2013	1	1	5	JFK	BQN	N804JB	B6
2013	1	1	6	LGA	ATL	N668DN	DL
2013	1	1	5	EWR	ORD	N39463	UA

airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()

airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()

(#tab:unnamed-chunk-39)
carrier	name
9E	Endeavor Air Inc.
AA	American Airlines Inc.
AS	Alaska Airlines Inc.
B6	JetBlue Airways
DL	Delta Air Lines Inc.
EV	ExpressJet Airlines Inc.
F9	Frontier Airlines Inc.
FL	AirTran Airways Corporation
HA	Hawaiian Airlines Inc.
MQ	Envoy Air
OO	SkyWest Airlines Inc.
UA	United Air Lines Inc.
US	US Airways Inc.
VX	Virgin America
WN	Southwest Airlines Co.
YV	Mesa Airlines Inc.

(#tab:unnamed-chunk-39)
carrier	name	year	month	hour	tailnum
UA	United Air Lines Inc.	2013	1	5	N14228
UA	United Air Lines Inc.	2013	1	5	N24211
AA	American Airlines Inc.	2013	1	5	N619AA
B6	JetBlue Airways	2013	1	5	N804JB
DL	Delta Air Lines Inc.	2013	1	6	N668DN
UA	United Air Lines Inc.	2013	1	5	N39463

Now, we have added the information from the tibble airlines in a new tibble only.

There are type of joints like inner joints (exclude the mismatches using the inner_joint() function) and outer joints who conserved the mismatched fulling the gaps with NAs.

A left join keeps all observations in x. left_joint_()
A right join keeps all observations in y. right_joint_()
A full join keeps all observations in x and y. full_joint_()

Defining keys

If you do not use the argument by then by default dplyr will search for columns in common.

flights2 %>%
left_join(weather) %>% head()

(#tab:unnamed-chunk-41)
year	month	day	hour	origin	dest	tailnum	carrier	temp	dewp	humid	wind_dir	wind_speed	wind_gust	pressure	visib	time_hour
2013	1	1	5	EWR	IAH	N14228	UA	39	28	64.4	260	12.7		1.01e+03	10	2013-01-01 05:00:00
2013	1	1	5	LGA	IAH	N24211	UA	39.9	25	54.8	250	15	21.9	1.01e+03	10	2013-01-01 05:00:00
2013	1	1	5	JFK	MIA	N619AA	AA	39	27	61.6	260	15		1.01e+03	10	2013-01-01 05:00:00
2013	1	1	5	JFK	BQN	N804JB	B6	39	27	61.6	260	15		1.01e+03	10	2013-01-01 05:00:00
2013	1	1	6	LGA	ATL	N668DN	DL	39.9	25	54.8	260	16.1	23	1.01e+03	10	2013-01-01 06:00:00
2013	1	1	5	EWR	ORD	N39463	UA	39	28	64.4	260	12.7		1.01e+03	10	2013-01-01 05:00:00

Filter joins

Instead match variables like in regular joints this kind of joints match observations.

semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.

top_dest <- flights %>% 
        count(dest, sort = TRUE) %>% head(10)
#select top 10 destines

flights %>%
        semi_join(top_dest) %>%
        head()
#take from flights just the colums that belong to top_dest

(#tab:unnamed-chunk-43)
year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1.09e+03	5	40	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00
2013	1	1	555	600	-5	913	854	19	B6	507	N516JB	EWR	FLL	158	1.06e+03	6	0	2013-01-01 06:00:00
2013	1	1	557	600	-3	838	846	-8	B6	79	N593JB	JFK	MCO	140	944	6	0	2013-01-01 06:00:00
2013	1	1	558	600	-2	753	745	8	AA	301	N3ALAA	LGA	ORD	138	733	6	0	2013-01-01 06:00:00