R tidyverse: dplyr.

dplyr

dplyr is a package from tidyverse and its purpose is sorting and configuring your data as you please and prepare the data before analysis.

In this tutorial we will need first load the tidyverse package,(there is no need to load dplyr it is one of the tidyverse cores) and then we will load nycfights13 package just to get the dataset flights to work with this.

# Install packages if it is your firts time.
install.packages("tidyverse")
install.packages("nycfights13")

# load packages
library(tidyverse)
library(nycflights13)

Filter rows with filter()

filter() function allows you to subset a data if some condition is fulfilled.

Example subsetting the day 01 from dataset flights, but first we will see how is our data:

dim(flights)
## 336776 19

dim() function show us that our data has 336.776 rows and 19 columns.

Now we are going to use filter() function.

Day_01 <- filter(flights, month ==1 & day == 1 )
dim(Day_01)
## [1] 842  19

You can add an “OR” condition using “|”, in th next example we are going to filter all flights in January OR February

Jan_Feb <- filter(flights, month == 1 | month == 2)
#or use %in% to set a shortcut
Jan_Feb <- filter(flights,  month %in% c(1,2))

Also, you can add an AND condition with “&”, we are going to filter all flights made in January and by the plane with tailnumber “N14228”.

Jan_by_N14228 <- filter(flights, month == 1 & tailnum == "N14228")
dim(Jan_by_N14228)
## [1] 15 19

There were just 15 flights made in January by the flight “N14228”.

Exclude data

If you need to exclude part of your data when a condition is fulfilled you should use !.

without_Jan <- filter(flights, !(month == 1))

Arrange Rows with arrange()

To arrange data in your dataset you must indicate the column to arrange and dplyr will do it.

dataset_sorted <- arrange(flights, month, day, year)

Reordering by a column in descending order

(#tab:unnamed-chunk-11)
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
201311517515283081911UA1545N14228EWRIAH2271.4e+03 5152013-01-01 05:00:00
201311533529485083020UA1714N24211LGAIAH2271.42e+035292013-01-01 05:00:00
201311542540292385033AA1141N619AAJFKMIA1601.09e+035402013-01-01 05:00:00
201311544545-110041022-18B6725N804JBJFKBQN1831.58e+035452013-01-01 05:00:00
201311554600-6812837-25DL461N668DNLGAATL116762       602013-01-01 06:00:00
descending <- arrange(flights, desc(year))
head(descending, 5) #head() function displys just the firts 5 rows

Selecting columns with select()

This function is especially useful to subset dataframes selecting only the specified columns.

subset_columns <- select(flights, year, month,)
head(subset_columns,5)
(#tab:unnamed-chunk-14)
yearmonth
20131
20131
20131
20131
20131

Also you can select a range within the dataset and extract it.

subset_range <- select(flights, month:dep_delay)
head(subset_range)
(#tab:unnamed-chunk-16)
monthdaydep_timesched_dep_timedep_delay
115175152
115335294
115425402
11544545-1
11554600-6
11554558-4

Or exclude some colums with -

subset_excluded <- select(flights, -(month:distance))
head(subset_excluded)
(#tab:unnamed-chunk-18)
yearhourminutetime_hour
20135152013-01-01 05:00:00
20135292013-01-01 05:00:00
20135402013-01-01 05:00:00
20135452013-01-01 05:00:00
2013602013-01-01 06:00:00
20135582013-01-01 05:00:00

Improve the selection

There are some other functions that can be used with select() to improve the results, like start_with(), ends_with() and contains(), for example:

subset_by_start <- select(flights, starts_with("dep"))
head(subset_by_start)
(#tab:unnamed-chunk-20)
dep_timedep_delay
5172
5334
5422
544-1
554-6
554-4

If you want to change the column order in a dataframe, you can do it adding the function everything() clarifying which columns you want first.

subset_firts_day.month <- select(flights, day, month, everything())
head(subset_firts_day.month)
(#tab:unnamed-chunk-22)
daymonthyeardep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
112013517515283081911UA1545N14228EWRIAH2271.4e+03 5152013-01-01 05:00:00
112013533529485083020UA1714N24211LGAIAH2271.42e+035292013-01-01 05:00:00
112013542540292385033AA1141N619AAJFKMIA1601.09e+035402013-01-01 05:00:00
112013544545-110041022-18B6725N804JBJFKBQN1831.58e+035452013-01-01 05:00:00
112013554600-6812837-25DL461N668DNLGAATL116762       602013-01-01 06:00:00
112013554558-474072812UA1696N39463EWRORD150719       5582013-01-01 05:00:00

Add new columns with mutate()

There are moments when you need add a new variable in a dataframe, and this variable is the result of other variables in the original dataframe, namely, you can create a new column from others without delete the originals.

mutated_dataset <- mutate(flights, gain = arr_delay - dep_delay)
mutated_variable <- select(mutated_dataset, gain, everything())
head(mutated_dataset) #see by yourself

Replace values with mutate()

head(iris)
my_iris <- iris %>% mutate(Species = str_replace(Species, "setosa", "SETOSA"))
head(my_iris)
(#tab:unnamed-chunk-25)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.13.51.40.2setosa
4.93  1.40.2setosa
4.73.21.30.2setosa
4.63.11.50.2setosa
5  3.61.40.2setosa
5.43.91.70.4setosa
(#tab:unnamed-chunk-25)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.13.51.40.2SETOSA
4.93  1.40.2SETOSA
4.73.21.30.2SETOSA
4.63.11.50.2SETOSA
5  3.61.40.2SETOSA
5.43.91.70.4SETOSA

Add new colums with transmutate

When you want create new columns but delete the originals, then the best options is use trasnmutate() function.

subset_transmutate <- transmute(flights, speed = (distance / air_time) / 60, year, month)
head(subset_transmutate)
(#tab:unnamed-chunk-27)
speedyearmonth
0.103 20131
0.104 20131
0.113 20131
0.144 20131
0.109 20131
0.079920131

Note that all the original variables have been eliminated and only remain which were enunciated explicitly.

Grouping summaries with sumarize()

This function can collapses a whole data frame in a row and it is very useful to extract summaries, for example you can calculate what is the busiest day throughout all the flies.

busiest_day <- summarize(flights, busiest_day = median(day, na.rm = T))
busiest_day
(#tab:unnamed-chunk-28)
busiest_day
16

Grouping with group_by()

If you want extract a summary for some categories in a variable then the function group_by() is what you need.

summary_by_month <- group_by(flights, month)
summarize(summary_by_month, mean_air_time = mean(air_time, na.rm = T)) %>% transmute(month.name, mean_air_time)
(#tab:unnamed-chunk-30)
month.namemean_air_time
January154
February151
March149
April153
May146
June150
July147
August148
September143
October149
November155
December163

Combining multiple operation with pipe

There are different ways to write code in but you should avoid redundancy as much as you can to make your code more readable, one step on this route is use pipes (%>%), they can bind multiple functions and then process your object.

This is some code without %>%

by_Specie <- group_by(iris, Species)
counts <- summarize(by_Specie,
count = n())
counts_arranged <- arrange(counts, desc(Species))

The same expression can be done using %>%

counts_arranged <-  iris %>% 
  group_by(., Species) %>%
  summarize(.,count = n()) %>% 
  arrange(.,desc(Species))

#the dots are optional, they just represent the last object processed

Dealing with missin values

Any math operation against with a NA will generate another NA, they are infectious and tends to propagate themselves, so is useful delete them with na.rm argument when we are preparing the data.

mean_dep_delay_by_day <- flights %>% group_by(month, year, day) %>% summarize(mean = mean(dep_delay))
head(mean_dep_delay_by_day)
(#tab:unnamed-chunk-34)
monthyeardaymean
120131
120132
120133
120134
120135
120136

Solution:

mean_dep_delay_by_day <- flights %>% group_by(month, year, day) %>% summarize(mean = mean(dep_delay, na.rm = T))

Counts

As part of the example, first we will filter some data, and then create a frequency polygon removing NAs

not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
delays <- not_cancelled %>%
        group_by(tailnum) %>%
        summarize(delay = mean(arr_delay, na.rm = T), n= n())

ggplot(delays, aes(n, delay))+ geom_point(alpha = 1/10)
# Graphic made including counts with n()

n() function is very useful but it can not take arguments, just count a variable associated with another function, another option is use count() or table().

flights %>% count(month) 

# counting with table, the information on the output is the same.

table(flights$month)
(#tab:unnamed-chunk-39)
monthn
127004
224951
328834
428330
528796
628243
729425
829327
927574
1028889
1127268
1228135
(#tab:unnamed-chunk-39)
Var1Freq
127004
224951
328834
428330
528796
628243
729425
829327
927574
1028889
1127268
1228135

Relational data with dplyr

Mutating joints

This tools allows you o combine two tables that have observations in common,in the next tibble you can see the variable carrier, on the other hand the tibble airlines have the same variable, then we can use the left_joint() to unite them

flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)
(#tab:unnamed-chunk-41)
yearmonthdayhourorigindesttailnumcarrier
2013115EWRIAHN14228UA
2013115LGAIAHN24211UA
2013115JFKMIAN619AAAA
2013115JFKBQNN804JBB6
2013116LGAATLN668DNDL
2013115EWRORDN39463UA
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
airlines
flights2 %>%
select(-origin, -dest, - day) %>%
left_join(airlines, by = "carrier") %>% select(carrier, name , everything()) %>% head()
(#tab:unnamed-chunk-44)
carriername
9EEndeavor Air Inc.
AAAmerican Airlines Inc.
ASAlaska Airlines Inc.
B6JetBlue Airways
DLDelta Air Lines Inc.
EVExpressJet Airlines Inc.
F9Frontier Airlines Inc.
FLAirTran Airways Corporation
HAHawaiian Airlines Inc.
MQEnvoy Air
OOSkyWest Airlines Inc.
UAUnited Air Lines Inc.
USUS Airways Inc.
VXVirgin America
WNSouthwest Airlines Co.
YVMesa Airlines Inc.
(#tab:unnamed-chunk-44)
carriernameyearmonthhourtailnum
UAUnited Air Lines Inc.201315N14228
UAUnited Air Lines Inc.201315N24211
AAAmerican Airlines Inc.201315N619AA
B6JetBlue Airways201315N804JB
DLDelta Air Lines Inc.201316N668DN
UAUnited Air Lines Inc.201315N39463

Now, we have added the information from the tibble airlines in a new tibble only.

There are some type of joints like inner joints (exclude the mismatches using the inner_joint() function) and outer joints who conserved the mismatched fulling the gaps with NAs.

  • A left join keeps all observations in x. left_joint_()
  • A right join keeps all observations in y. right_joint_()
  • A full join keeps all observations in x and y. full_joint_()

Defining keys

If you do not use the argument by then by default dplyr will search for columns in common.

flights2 %>%
left_join(weather) %>% head()
(#tab:unnamed-chunk-46)
yearmonthdayhourorigindesttailnumcarriertempdewphumidwind_dirwind_speedwind_gustprecippressurevisibtime_hour
2013115EWRIAHN14228UA39  2864.426012.7  01.01e+03102013-01-01 05:00:00
2013115LGAIAHN24211UA39.92554.825015  21.901.01e+03102013-01-01 05:00:00
2013115JFKMIAN619AAAA39  2761.626015    01.01e+03102013-01-01 05:00:00
2013115JFKBQNN804JBB639  2761.626015    01.01e+03102013-01-01 05:00:00
2013116LGAATLN668DNDL39.92554.826016.123  01.01e+03102013-01-01 06:00:00
2013115EWRORDN39463UA39  2864.426012.7  01.01e+03102013-01-01 05:00:00

Filter joins

Instead match variables like in regular joints this kind of joints match observations

  • semi_join(x, y) keeps all observations in x that have a match in y.

  • anti_join(x, y) drops all observations in x that have a match in y.

top_dest <- flights %>% 
        count(dest, sort = TRUE) %>% head(10)
#select top 10 destines

flights %>%
        semi_join(top_dest) %>%
        head()
#take from flights just the colums that belong to top_dest 
(#tab:unnamed-chunk-48)
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
201311542540292385033AA1141N619AAJFKMIA1601.09e+035402013-01-01 05:00:00
201311554600-6812837-25DL461N668DNLGAATL116762       602013-01-01 06:00:00
201311554558-474072812UA1696N39463EWRORD150719       5582013-01-01 05:00:00
201311555600-591385419B6507N516JBEWRFLL1581.06e+03602013-01-01 06:00:00
201311557600-3838846-8B679N593JBJFKMCO140944       602013-01-01 06:00:00
201311558600-27537458AA301N3ALAALGAORD138733       602013-01-01 06:00:00

If you want more information visit R for data science, probably, It’s the best book to learn .

Diego Sierra Ramírez
Diego Sierra Ramírez
Msc. in Biological Science / Data analyst

Related