R tidyverse: tidyr.
Usually when you work with data it is common that it is not ready to analize it, the information could be available but the functions in require a precise format.
library(tidyverse)
library(nycflights13)
Tidydata
There are different ways to organize a dataframe, in community the most extended is the “tidydata” concept, with this format each variable has a column and each row represent an observation.
Next table is a example of tidydata
country | year | cases | population | rate |
---|---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 | 3.73e-05 |
Afghanistan | 2000 | 2666 | 20595360 | 0.000129 |
Brazil | 1999 | 37737 | 172006362 | 0.000219 |
Brazil | 2000 | 80488 | 174504898 | 0.000461 |
China | 1999 | 212258 | 1272915272 | 0.000167 |
China | 2000 | 213766 | 1280428583 | 0.000167 |
When you organize data “tidy” most of fuctions works smoothly, but you should be aware and know that “tidydata” is not the only correct way to organize data, for example there are some field with well establish formats even when there are not practical to work directly.
If you have you data organized as “tidydata” you can easily create fancy plots with one o two code lines, like the following.
Spread and gathering
the tibble “tabla4a” has not sorted their values as tidydata, specifically the values of “year” are put in two different columns.
table4a
country | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 2666 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
To solve this problem we will gather the columns 1999 and 2000 in only a column named year,and isolate the old values in other column named value.
table4a %>% gather("1999", "2000", key = "year", value = "cases")
# we use "" in this columns because they are nonsyctactic names
# rememeber that you should avoid use numbers and symbol as colum names
country | year | cases |
---|---|---|
Afghanistan | 1999 | 745 |
Brazil | 1999 | 37737 |
China | 1999 | 212258 |
Afghanistan | 2000 | 2666 |
Brazil | 2000 | 80488 |
China | 2000 | 213766 |
Another example with table4b
table4b %>% gather("1999", "2000", key = "year", value = "population")
country | year | population |
---|---|---|
Afghanistan | 1999 | 19987071 |
Brazil | 1999 | 172006362 |
China | 1999 | 1272915272 |
Afghanistan | 2000 | 20595360 |
Brazil | 2000 | 174504898 |
China | 2000 | 1280428583 |
Now we can combine (if we want it) two dataframes with left_join() from dplyr
package, It is already loaded because it is part of tidyverse
table4a <- table4a %>% gather("1999", "2000", key = "year", value = "cases")
table4b <- table4b %>% gather("1999", "2000", key = "year", value = "population")
joined_table4 <- left_join(table4a,table4b)
joined_table4
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Brazil | 1999 | 37737 | 172006362 |
China | 1999 | 212258 | 1272915272 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 2000 | 80488 | 174504898 |
China | 2000 | 213766 | 1280428583 |
There are other fuctions to gather dataframes like right_join(), full_join(), semi_joint(), inner_joint() and more.
Spreading
Other times the observations are distributed wrong, for example the column type in the next table:
table2
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
To solve this problem we will use spred()
table2 %>% spread(key = type, value = count)
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
Separating and pulling
Next you will see a column with a awful problem, we will use sepate() to solve it, this function take any non-alphanumeric character to break a column, to change the separator use the parameter sep; another important thing is that separate() can convert the new colums as character, to convert use the argument convert = TRUE
table3
#column rate have 2 observations sepatated by a slash
country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = T)
table3
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
Also, You can break a column by the position of each character if there are not a common characters to use sep argument, for example:
table3 %>% separate(year, into = c("century", "year"), sep = 2)
country | century | year | rate |
---|---|---|---|
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
unite with unite()
The less frequently problem is that a observation are divided in two separates columns, for example in the table5 each observation of years are separate in century and year.
table5
country | century | year | rate |
---|---|---|---|
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
table5 %>% unite(new_date,century,year ,sep = "")
#we use sep = argument to put (or not) a tag in the joint
country | new_date | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
Tidying missing data
When you change the disposition of a dataframe or tibble sometimes appear missing values (NA) and them can be explicit or implicit, let’s see some exercises.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA , 0.92, 0.17, 2.66))
# year 2015 - qtr 4 - return is a explicit NA
# year 2016 -qrt 1 is a impocit NA
stocks
year | qtr | return |
---|---|---|
2.02e+03 | 1 | 1.88 |
2.02e+03 | 2 | 0.59 |
2.02e+03 | 3 | 0.35 |
2.02e+03 | 4 | |
2.02e+03 | 2 | 0.92 |
2.02e+03 | 3 | 0.17 |
2.02e+03 | 4 | 2.66 |
Implicit missing values can be converted to explicit with spread() when we change the column values.
stocks %>% spread(year, return)
stocks %>% spread(year, return) %>% as_huxtable() %>% theme_blue()
qtr | 2015 | 2016 |
---|---|---|
1 | 1.88 | |
2 | 0.59 | 0.92 |
3 | 0.35 | 0.17 |
4 | 2.66 |
qtr | year | return |
---|---|---|
1 | 2015 | 1.88 |
2 | 2015 | 0.59 |
3 | 2015 | 0.35 |
4 | 2015 | |
1 | 2016 | |
2 | 2016 | 0.92 |
3 | 2016 | 0.17 |
4 | 2016 | 2.66 |
qtr | year | return |
---|---|---|
1 | 2015 | 1.88 |
2 | 2015 | 0.59 |
3 | 2015 | 0.35 |
2 | 2016 | 0.92 |
3 | 2016 | 0.17 |
4 | 2016 | 2.66 |
stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016")
stocks %>% spread(year, return) %>% gather(year,return, "2015":"2016", na.rm = T) #leave it as it was.
Another way to make explicit missing values is using complet(), this function generate NAs automatically.
stocks %>% complete(year, qtr)
year | qtr | return |
---|---|---|
2.02e+03 | 1 | 1.88 |
2.02e+03 | 2 | 0.59 |
2.02e+03 | 3 | 0.35 |
2.02e+03 | 4 | |
2.02e+03 | 1 | |
2.02e+03 | 2 | 0.92 |
2.02e+03 | 3 | 0.17 |
2.02e+03 | 4 | 2.66 |
Relational data with dplyr
Mutating joints
This tools allows you o combine two tables that have observations in common,in the next tibble you can see the variable carrier, on the other hand the tibble airlines have the same variable, then we can use the left_joint() to unite them.
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)
year | month | day | hour | origin | dest | tailnum | carrier |
---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA |
2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA |
2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA |
2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 |
2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL |
2013 | 1 | 1 | 5 | EWR | ORD | N39463 | UA |
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
carrier | name |
---|---|
9E | Endeavor Air Inc. |
AA | American Airlines Inc. |
AS | Alaska Airlines Inc. |
B6 | JetBlue Airways |
DL | Delta Air Lines Inc. |
EV | ExpressJet Airlines Inc. |
F9 | Frontier Airlines Inc. |
FL | AirTran Airways Corporation |
HA | Hawaiian Airlines Inc. |
MQ | Envoy Air |
OO | SkyWest Airlines Inc. |
UA | United Air Lines Inc. |
US | US Airways Inc. |
VX | Virgin America |
WN | Southwest Airlines Co. |
YV | Mesa Airlines Inc. |
carrier | name | year | month | hour | tailnum |
---|---|---|---|---|---|
UA | United Air Lines Inc. | 2013 | 1 | 5 | N14228 |
UA | United Air Lines Inc. | 2013 | 1 | 5 | N24211 |
AA | American Airlines Inc. | 2013 | 1 | 5 | N619AA |
B6 | JetBlue Airways | 2013 | 1 | 5 | N804JB |
DL | Delta Air Lines Inc. | 2013 | 1 | 6 | N668DN |
UA | United Air Lines Inc. | 2013 | 1 | 5 | N39463 |
Now, we have added the information from the tibble airlines in a new tibble only.
There are type of joints like inner joints (exclude the mismatches using the inner_joint() function) and outer joints who conserved the mismatched fulling the gaps with NAs.
- A left join keeps all observations in x. left_joint_()
- A right join keeps all observations in y. right_joint_()
- A full join keeps all observations in x and y. full_joint_()
Defining keys
If you do not use the argument by then by default dplyr will search for columns in common.
flights2 %>%
left_join(weather) %>% head()
year | month | day | hour | origin | dest | tailnum | carrier | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | 39 | 28 | 64.4 | 260 | 12.7 | 0 | 1.01e+03 | 10 | 2013-01-01 05:00:00 | |
2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | 39.9 | 25 | 54.8 | 250 | 15 | 21.9 | 0 | 1.01e+03 | 10 | 2013-01-01 05:00:00 |
2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | 39 | 27 | 61.6 | 260 | 15 | 0 | 1.01e+03 | 10 | 2013-01-01 05:00:00 | |
2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | 39 | 27 | 61.6 | 260 | 15 | 0 | 1.01e+03 | 10 | 2013-01-01 05:00:00 | |
2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | 39.9 | 25 | 54.8 | 260 | 16.1 | 23 | 0 | 1.01e+03 | 10 | 2013-01-01 06:00:00 |
2013 | 1 | 1 | 5 | EWR | ORD | N39463 | UA | 39 | 28 | 64.4 | 260 | 12.7 | 0 | 1.01e+03 | 10 | 2013-01-01 05:00:00 |
Filter joins
Instead match variables like in regular joints this kind of joints match observations.
- semi_join(x, y) keeps all observations in x that have a match in y.
- anti_join(x, y) drops all observations in x that have a match in y.
top_dest <- flights %>%
count(dest, sort = TRUE) %>% head(10)
#select top 10 destines
flights %>%
semi_join(top_dest) %>%
head()
#take from flights just the colums that belong to top_dest
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1.09e+03 | 5 | 40 | 2013-01-01 05:00:00 |
2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 2013-01-01 06:00:00 |
2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | 719 | 5 | 58 | 2013-01-01 05:00:00 |
2013 | 1 | 1 | 555 | 600 | -5 | 913 | 854 | 19 | B6 | 507 | N516JB | EWR | FLL | 158 | 1.06e+03 | 6 | 0 | 2013-01-01 06:00:00 |
2013 | 1 | 1 | 557 | 600 | -3 | 838 | 846 | -8 | B6 | 79 | N593JB | JFK | MCO | 140 | 944 | 6 | 0 | 2013-01-01 06:00:00 |
2013 | 1 | 1 | 558 | 600 | -2 | 753 | 745 | 8 | AA | 301 | N3ALAA | LGA | ORD | 138 | 733 | 6 | 0 | 2013-01-01 06:00:00 |