Up to now we have been manipulating vectors by reordering and subsetting them through indexing. However, once we start more advanced analyses, the preferred unit for data storage is not the vector but the data frame. In this chapter we learn to work directly with data frames, which greatly facilitate the organization of information. We will be using data frames for the majority of this book. We will focus on a specific data format referred to as tidy and on a specific collection of packages, referred to as the tidyverse, that are particularly helpful for working with tidy data.
We can load all the tidyverse packages at once by installing and loading the tidyverse package, for instance with library(tidyverse).

We will learn how to implement the tidyverse approach throughout the book, but before delving into the details, in this chapter we introduce some of the most widely used tidyverse functionality, starting with the dplyr package for manipulating data frames and the purrr package for working with functions. Note that the tidyverse also includes a graphing package, ggplot2, which we introduce later in Chapter 8 in the Data Visualization part of the book; the readr package discussed in Chapter 5; and many others. In this chapter, we first introduce the concept of tidy data and then demonstrate how we use the tidyverse to work with data frames in this format.

Tidy data

We say that a data table is in tidy format if each row represents one observation and columns represent the different variables available for each of these observations. The murders dataset is an example of a tidy data frame.
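As a quick check of this description, the sketch below loads the murders data and prints its first rows. It assumes the dslabs package, which provides the dataset, is installed.

```r
# Assumes the dslabs package is installed; it provides the murders dataset.
library(dslabs)
data(murders)

# Each row is one state (one observation); each column is one variable.
head(murders)
names(murders)
```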
Each row represents a state, with each of the five columns providing a different variable related to these states: name, abbreviation, region, population, and total murders.

To see how the same information can be provided in different formats, consider the following example:
This tidy dataset provides fertility rates for two countries across the years. This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate. However, this dataset originally came in another format and was reshaped for the dslabs package. Originally, the data was in the following format:
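To make the contrast between the two layouts concrete, here is a small sketch that builds one table in each format by hand. The country names and fertility values are invented for illustration and are not the dslabs data; the object names `tidy_data` and `wide_data` are likewise only placeholders.

```r
# Hypothetical values, for illustration only (not the dslabs data).
# Tidy layout: one row per country-year observation, one column per variable.
tidy_data <- data.frame(
  country   = rep(c("Country A", "Country B"), each = 3),
  year      = rep(1960:1962, times = 2),
  fertility = c(2.4, 2.5, 2.6, 6.1, 6.0, 5.8)
)

# Wide layout: one row per country, with the years stored in the header.
# check.names = FALSE keeps the column names as "1960", "1961", "1962".
wide_data <- data.frame(
  country = c("Country A", "Country B"),
  `1960`  = c(2.4, 6.1),
  `1961`  = c(2.5, 6.0),
  `1962`  = c(2.6, 5.8),
  check.names = FALSE
)

tidy_data
wide_data
```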
The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header. For the tidyverse packages to be optimally used, data need to be reshaped into tidy format.

Although not immediately obvious, as you go through the book you will start to appreciate the advantages of working in a framework in which functions use tidy formats for both inputs and outputs. You will see how this permits the data analyst to focus on more important aspects of the analysis rather than the format of the data.

Exercises

1. Examine the built-in dataset
2. Examine the built-in dataset
3. Examine the built-in dataset
4. Which of the following built-in datasets is tidy (you can pick more than one):
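Manipulating data frames

The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. To subset the data by selecting specific columns, we use select.

Adding a column with mutate

We want all the necessary information for our analysis to be included in the data table. So the first task is to add the murder rates to our murders data frame. The function mutate takes the data frame as its first argument and the name and values of the new variable as its second argument, using the convention name = values.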
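A minimal sketch of this step follows. It assumes the dplyr and dslabs packages are installed. The per-100,000 scaling is an assumption, chosen because it puts the rates on the same scale as the 0.71 cutoff used later in this chapter.

```r
# Assumes the dplyr and dslabs packages are installed.
library(dplyr)
library(dslabs)
data(murders)

# Add a column named rate: murders per 100,000 people (assumed scaling).
# total and population are columns of murders, referenced unquoted.
murders <- mutate(murders, rate = total / population * 100000)
```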
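Notice that inside the call to mutate we referred to total and population, which are objects that are not defined in our workspace. We do not get an error because of one of dplyr’s main features: functions in this package, such as mutate, know to look for variables in the data frame provided in the first argument. In a call to mutate on murders, total will have the values in murders$total. This approach makes the code much more readable.

We can see that the new column is added: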
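One way to inspect the result is to print the first rows after the call to mutate. This sketch repeats the setup so that it runs on its own, and again assumes the per-100,000 scaling for the rate.

```r
# Assumes the dplyr and dslabs packages are installed.
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 100000)

# The rate column now appears after the five original columns.
head(murders)
```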
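Although we have overwritten the original murders object, this does not change the dataset provided by the dslabs package. If we load the murders data again with data(murders), the original will overwrite our mutated version.

Subsetting with filter

Now suppose that we want to filter the data table to only show the entries for which the murder rate is lower than 0.71. To do this we use the filter function, which takes the data table as the first argument and the conditional statement as the second. As with mutate, we can use the unquoted variable names from murders inside the function, and it will know we mean the columns and not objects in the workspace.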
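The filtering step can be sketched as follows. The setup lines are repeated so the block runs on its own; the name `low_rate` is a placeholder introduced here, and the rate is again assumed to be per 100,000 people.

```r
# Assumes the dplyr and dslabs packages are installed.
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 100000)

# Keep only the rows (states) whose murder rate is lower than 0.71.
low_rate <- filter(murders, rate < 0.71)
low_rate
```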
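Selecting columns with select

Although our data table only has six columns, some data tables include hundreds. If we want to view just a few, we can use the dplyr select function. For example, we can select three columns, assign the result to a new object, and then filter the new object.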
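A sketch of selecting and then filtering is shown below. The object names `new_table` and `low_rate` are illustrative choices, and the setup lines (including the assumed per-100,000 rate) are repeated so the block is self-contained.

```r
# Assumes the dplyr and dslabs packages are installed.
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 100000)

# Keep three columns and store the result in an intermediate object.
new_table <- select(murders, state, region, rate)

# Filter the smaller table to states with a murder rate lower than 0.71.
low_rate <- filter(new_table, rate < 0.71)
low_rate
```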
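In the call to select, the first argument, murders, is an object, but state, region, and rate are variable names.

Exercises

1. Load the dplyr package and the murders dataset.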
You can add columns using the dplyr function mutate.

We can write Use the function

2. If

3. With dplyr, we can use

Use

4. The dplyr function

You can use other logical vectors to filter rows. Use

5. We can remove rows using the

Create a new data frame called

6. We can also use

Create a new data frame called

7. Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter.
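Make sure

The pipe: |> or %>%

In R we can perform a series of operations, for example select and then filter, by sending the results of one function to another using the pipe operator. The pipe is written |> in base R (version 4.1.0 and later) and %>% in the magrittr package used by the tidyverse.

We wrote code above to show three variables (state, region, rate) for states that have murder rates below 0.71. To do this, we defined an intermediate object. In dplyr we can write code that looks more like a description of what we want to do, without intermediate objects:

\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]

For such an operation, we can use the pipe |>.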
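A sketch of such a pipeline follows. Only the last statement is the pipeline itself; the lines before it prepare the data, with the rate assumed to be per 100,000 people, and the name `low_rate` is a placeholder.

```r
# Assumes the dplyr and dslabs packages are installed,
# and R 4.1.0 or later for the native pipe |>.
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 100000)

# original data -> select -> filter, with no intermediate object.
low_rate <- murders |> select(state, region, rate) |> filter(rate < 0.71)
low_rate
```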
This line of code is equivalent to the two lines of code above. What is going on here?

In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. As a very simple example, x |> f() is the same as f(x). We can continue to pipe values along:
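The following sketch shows a single pipe and then a chain of two. The value 16 and the names `root` and `chained` are arbitrary choices for illustration.

```r
# Requires R 4.1.0 or later for the native pipe |>.
# The value 16 is an arbitrary choice for illustration.
root <- 16 |> sqrt()               # same as sqrt(16)
chained <- 16 |> sqrt() |> log2()  # same as log2(sqrt(16))

root     # 4
chained  # 2
```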
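The above statement is equivalent to nesting the calls, with the innermost function applied first: x |> f() |> g() is the same as g(f(x)).

Remember that the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined: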
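As a sketch of this point, the call below supplies the base argument of log explicitly while the pipe fills in the first argument. The value 16 and the name `result` are arbitrary.

```r
# Requires R 4.1.0 or later for the native pipe |>.
# The pipe supplies the first argument of log(); base is given by name.
result <- 16 |> sqrt() |> log(base = 2)  # same as log(sqrt(16), base = 2)
result  # 2
```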
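Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument. In the pipeline we wrote to select and then filter, murders is the first argument of the select function, and the data frame that select returns is the first argument of the filter function.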
Note that the pipe works well with functions where the first argument is the input data. Functions in tidyverse packages like dplyr have this format and can be used easily with the pipe.

Exercises

1. The pipe
In the solution to the previous exercise, we did the following:
The pipe
Notice that

Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate, and rank columns. Use a pipe

2. Reset
Summarizing data

An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new dplyr verbs that make these
computations easier: |