You are reading the article R Select(), Filter(), Arrange(), Pipeline With Example updated in October 2023 on the website Nhunghuounewzealand.com.
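In this tutorial, you will learn how to use the dplyr verbs select(), filter() and arrange(), and how to chain them with the pipeline operator.
The examples use a dataset of car journeys that includes, among others, the following variables: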
DayOfWeek: Identify the day of the week the driver uses his car
Distance: The total distance of the journey
MaxSpeed: The maximum speed of the journey
TotalTime: The length in minutes of the journey
The dataset has around 200 observations, and the rides occurred between Monday and Friday.
First of all, you need to:
load the dataset
check the structure of the data.
One handy feature with dplyr is the glimpse() function. This is an improvement over str(). We can use glimpse() to see the structure of the dataset and decide what manipulation is required.
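As a rough sketch of the comparison, the two functions can be called side by side. The data frame `trips` below is a hypothetical toy example, not the tutorial's dataset; its column names only mirror the variables described above, and the values are made up.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the tutorial's dataset,
# but the values are invented for illustration.
trips <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Wednesday"),
  Distance  = c(51.3, 48.9, 50.6),
  MaxSpeed  = c(127.4, 130.3, 121.2),
  stringsAsFactors = FALSE
)

# Base R view of the structure
str(trips)

# dplyr view: one line per column, with its type and first values
glimpse(trips)
```

Both calls print the column names, types and first values, so either can be used to decide which columns need attention.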
library(dplyr)
df <- read.csv(PATH)
glimpse(df)
Output:
## Observations: 205
## Variables: 14
It is clear that the variable Comments needs further diagnosis. The first observations of the Comments variable are only missing values.
sum(df$Comments == "")
Code Explanation
sum(df$Comments == ""): Count the observations whose Comments value is an empty string
Output:
## [1] 181
We have 181 missing observations, almost 90 percent of the dataset. If you decide to exclude them, you won’t be able to carry on the analysis.
The other possibility is to drop the variable Comments with the select() verb.
We will begin with the select() verb. We don’t necessarily need all the variables, and a good practice is to select only the variables you find relevant.
We can select variables in different ways with select(). Note that the first argument is the dataset.
- `select(df, A, B, C)`: Select the variables A, B and C from the df dataset.
- `select(df, A:C)`: Select all variables from A to C from the df dataset.
- `select(df, -C)`: Exclude C from the df dataset.
You can use the third way to exclude the Comments variable.
step_1_df <- select(df, -Comments)
dim(df)
Output:
## [1] 205 14
dim(step_1_df)
Output:
## [1] 205 13
The original dataset has 14 features while step_1_df has 13.
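The three forms of select() listed above can be tried on a small data frame. This is a minimal sketch: `toy` and its columns A to D are hypothetical and stand in for the tutorial's data.

```r
library(dplyr)

# Hypothetical data frame with four columns
toy <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

by_name  <- select(toy, A, B, C)  # keep A, B and C
by_range <- select(toy, A:C)      # keep every column from A through C
without  <- select(toy, -C)       # keep everything except C

names(by_name)   # "A" "B" "C"
names(by_range)  # "A" "B" "C"
names(without)   # "A" "B" "D"
dim(without)     # 3 3
```

Note that select() only changes the columns; the number of rows stays the same in all three results.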
The filter() verb keeps the observations that meet a criterion. filter() works like select(): you pass the data frame first and then a condition, separated by a comma:
filter(df, condition)
arguments:
- df: dataset used to filter the data
- condition: Condition used to filter the data
One criterion
First of all, you can count the number of observations within each level of a factor variable.
table(step_1_df$GoingTo)
Code Explanation
table(): Count the number of observations by level. Note that the input must be a factor or a vector that can be interpreted as one, such as a character vector
table(step_1_df$GoingTo): Count the number of trips toward each final destination.
Output:
##
##  GSK Home
##  105  100
The function table() indicates 105 rides are going to GSK and 100 to Home.
We can filter the data to return one dataset with 105 observations and another one with 100 observations.
# Select observations if GoingTo == Home
select_home <- filter(df, GoingTo == "Home")
dim(select_home)
Output:
## [1] 100 14
# Select observations if GoingTo == GSK
select_work <- filter(df, GoingTo == "GSK")
dim(select_work)
Output:
## [1] 105 14
Multiple criteria
We can filter a dataset with more than one criterion. For instance, you can extract the observations where the destination is Home and the ride occurred on a Wednesday.
select_home_wed <- filter(df, GoingTo == "Home" & DayOfWeek == "Wednesday")
dim(select_home_wed)
Output:
## [1] 23 14
23 observations matched these criteria.
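The same filters can be reproduced on a small data frame to see how conditions combine. This is a minimal sketch: `trips` is a hypothetical five-row example, not the tutorial's dataset. It also shows that conditions passed as separate arguments to filter() are combined with a logical AND, so they give the same result as `&`.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
trips <- data.frame(
  GoingTo   = c("Home", "GSK", "Home", "GSK", "Home"),
  DayOfWeek = c("Wednesday", "Wednesday", "Monday", "Friday", "Wednesday"),
  stringsAsFactors = FALSE
)

# One criterion
home <- filter(trips, GoingTo == "Home")

# Two criteria combined explicitly with &
home_wed_and <- filter(trips, GoingTo == "Home" & DayOfWeek == "Wednesday")

# The same two criteria passed as separate arguments
home_wed_comma <- filter(trips, GoingTo == "Home", DayOfWeek == "Wednesday")

nrow(home)                               # 3
nrow(home_wed_and)                       # 2
identical(home_wed_and, home_wed_comma)  # TRUE
```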
Pipeline
The creation of a dataset requires a lot of operations, such as:
importing
merging
selecting
filtering
and so on
The dplyr library provides the `%>%` operator, called the pipeline. It passes the result of one step directly to the next, so you do not have to save the intermediate results as separate objects. Going back to the example above, you can select the variables of interest and then filter the observations. We have three steps:
Step 1: Import data: Import the gps data
Step 2: Select data: Select GoingTo and DayOfWeek
Step 3: Filter data: Return only Home and Wednesday
We can use the hard way to do it:
# Step 1
step_1 <- read.csv(PATH)
# Step 2
step_2 <- select(step_1, GoingTo, DayOfWeek)
# Step 3
step_3 <- filter(step_2, GoingTo == "Home", DayOfWeek == "Wednesday")
head(step_3)
Output:
##   GoingTo DayOfWeek
## 1    Home Wednesday
## 2    Home Wednesday
## 3    Home Wednesday
## 4    Home Wednesday
## 5    Home Wednesday
## 6    Home Wednesday
That is not a convenient way to perform many operations, especially in a situation with lots of steps. The environment ends up with a lot of objects stored.
Basic syntax of pipeline
New_df <- df %>%
  step 1 %>%
  step 2 %>%
  …
arguments
- New_df: Name of the new data frame
- df: Data frame used to compute the step
- step: Instruction for each step
- Note: The last instruction does not need the pipe operator `%>%`, because there are no further instructions to pipe into
Note: Creating a new variable is optional. If it is not included, the output will be displayed in the console.
You can create your first pipe following the steps enumerated above.
# Create the data frame filter_home_wed. It will be the object returned at the end of the pipeline
filter_home_wed <-
  # Step 1
  read.csv(PATH) %>%
  # Step 2
  select(GoingTo, DayOfWeek) %>%
  # Step 3
  filter(GoingTo == "Home", DayOfWeek == "Wednesday")
identical(step_3, filter_home_wed)
Output:
## [1] TRUE
We are ready to create a stunning dataset with the pipeline operator.
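Since the file behind PATH is not part of this tutorial, the comparison can also be sketched on a small in-memory data frame. The data frame `trips` and the names `step_a`, `step_b` and `piped` are hypothetical. The sketch relies on the fact that `x %>% f(y)` is evaluated as `f(x, y)`.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
trips <- data.frame(
  GoingTo   = c("Home", "GSK", "Home", "Home"),
  DayOfWeek = c("Wednesday", "Wednesday", "Monday", "Wednesday"),
  Distance  = c(51.2, 49.0, 50.7, 52.1),
  stringsAsFactors = FALSE
)

# Version with intermediate objects
step_a <- select(trips, GoingTo, DayOfWeek)
step_b <- filter(step_a, GoingTo == "Home", DayOfWeek == "Wednesday")

# Version with the pipe: each result becomes the first argument of the next call
piped <- trips %>%
  select(GoingTo, DayOfWeek) %>%
  filter(GoingTo == "Home", DayOfWeek == "Wednesday")

identical(step_b, piped)  # TRUE
nrow(piped)               # 2
```

The piped version leaves a single object in the environment, while the first version also keeps `step_a`.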
In the previous tutorial, you learned how to sort values with the function sort(). The dplyr library has its own sorting function, which works well with the pipeline. The arrange() verb reorders the rows by one or many variables, in either ascending (default) or descending order.
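A minimal sketch of ascending, multi-column and descending sorts follows. The data frame `trips` is a hypothetical four-row example with made-up values, and the result names are only for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
trips <- data.frame(
  GoingTo  = c("Home", "GSK", "Home", "GSK"),
  Distance = c(51.2, 52.0, 50.7, 48.3),
  stringsAsFactors = FALSE
)

# Ascending sort on one variable
asc <- trips %>% arrange(Distance)

# Ascending sort on GoingTo, then on Distance within each destination
by_two <- trips %>% arrange(GoingTo, Distance)

# Descending sort on GoingTo, ascending sort on Distance
mixed <- trips %>% arrange(desc(GoingTo), Distance)

asc$Distance     # 48.3 50.7 51.2 52.0
by_two$Distance  # 48.3 52.0 50.7 51.2
mixed$Distance   # 50.7 51.2 48.3 52.0
```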
- `arrange(A)`: Ascending sort of variable A
- `arrange(A, B)`: Ascending sort of variable A and B
- `arrange(desc(A), B)`: Descending sort of variable A and ascending sort of B
We can sort the distance by destination.
# Sort by destination and distance
step_2_df <- step_1_df %>%
  arrange(GoingTo, Distance)
head(step_2_df)
Output:
##     X       Date StartTime DayOfWeek GoingTo Distance MaxSpeed AvgSpeed
## 1 193  7/25/2011     08:06    Monday     GSK    48.32    121.2     63.4
## 2 196  7/21/2011     07:59  Thursday     GSK    48.35    129.3     81.5
## 3 198  7/20/2011     08:24 Wednesday     GSK    48.50    125.8     75.7
## 4 189  7/27/2011     08:15 Wednesday     GSK    48.82    124.5     70.4
## 5  95 10/11/2011     08:25   Tuesday     GSK    48.94    130.8     85.7
## 6 171  8/10/2011     08:13 Wednesday     GSK    48.98    124.8     72.8
##   AvgMovingSpeed FuelEconomy TotalTime MovingTime Take407All
## 1           78.4        8.45      45.7       37.0         No
## 2           89.0        8.28      35.6       32.6        Yes
## 3           87.3        7.89      38.5       33.3        Yes
## 4           77.8        8.45      41.6       37.6         No
## 5           93.2        7.81      34.3       31.5        Yes
## 6           78.8        8.54      40.4       37.3         No
Summary
The table below summarizes all the operations you learned during the tutorial.
| Verb | Objective | Code | Explanation |
|---|---|---|---|
| glimpse() | Check the structure of a df | `glimpse(df)` | Similar to str() |
| select() | Select/exclude the variables | `select(df, A, B, C)` | Select the variables A, B and C |
| | | `select(df, A:C)` | Select all variables from A to C |
| | | `select(df, -C)` | Exclude C |
| filter() | Filter the df based on one or many conditions | `filter(df, condition1)` | One condition |
| | | `filter(df, condition1 & condition2)` | Several conditions that must all hold |
| arrange() | Sort the dataset with one or many variables | `arrange(A)` | Ascending sort of variable A |
| | | `arrange(A, B)` | Ascending sort of variable A and B |
| | | `arrange(desc(A), B)` | Descending sort of variable A and ascending sort of B |
| %>% | Create a pipeline between each step | | |