Is the process of making inferences about a population based on information obtained from a sample?

We will look at an example using data from Inside Airbnb. Airbnb is an online marketplace for arranging vacation rentals and places to stay. The data set contains listings for Vancouver, Canada, in September 2020. Our data includes an ID number, neighborhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night.

Table of contents

Is the process of making inference about the population based on information obtained from a sample?
What is the process of statistical inference?
Which helps to make inferences about a population?
What is sample inference?

library(tidyverse)

set.seed(123)

airbnb <- read_csv("data/listings.csv")
airbnb

## # A tibble: 4,594 × 8
##       id neighbourhood   room_type   accommodates bathrooms bedrooms  beds price
##    <dbl> <chr>           <chr>              <dbl> <chr>        <dbl> <dbl> <dbl>
##  1     1 Downtown        Entire hom…            5 2 baths          2     2   150
##  2     2 Downtown Easts… Entire hom…            4 2 baths          2     2   132
##  3     3 West End        Entire hom…            2 1 bath           1     1    85
##  4     4 Kensington-Ced… Entire hom…            2 1 bath           1     0   146
##  5     5 Kensington-Ced… Entire hom…            4 1 bath           1     2   110
##  6     6 Hastings-Sunri… Entire hom…            4 1 bath           2     3   195
##  7     7 Renfrew-Collin… Entire hom…            8 3 baths          4     5   130
##  8     8 Mount Pleasant  Entire hom…            2 1 bath           1     1    94
##  9     9 Grandview-Wood… Private ro…            2 1 privat…        1     1    79
## 10    10 West End        Private ro…            2 1 privat…        1     1    75
## # … with 4,584 more rows

Suppose the city of Vancouver wants information about Airbnb rentals to help plan city bylaws, and they want to know how many Airbnb places are listed as entire homes and apartments (rather than as private or shared rooms). Therefore they may want to estimate the true proportion of all Airbnb listings where the “type of place” is listed as “entire home or apartment.” Of course, we usually do not have access to the true population, but here let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, Canada. We can find the proportion of listings where room_type == "Entire home/apt".

airbnb |>
  summarize(
    n =  sum(room_type == "Entire home/apt"),
    proportion = sum(room_type == "Entire home/apt") / nrow(airbnb)
  )

## # A tibble: 1 × 2
##       n proportion
##   <int>      <dbl>
## 1  3434      0.747

We can see that the proportion of Entire home/apt listings in the data set is 0.747. This value, 0.747, is the population parameter. Remember, this parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population.

Instead, perhaps we can approximate it with a small subset of data! To investigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random sample of size 40 from our population), and computing the proportion for that sample. We will use the rep_sample_n function from the infer package to take the sample. The arguments of rep_sample_n are (1) the data frame to sample from, and (2) the size of the sample to take.

library(infer)

sample_1 <- rep_sample_n(tbl = airbnb, size = 40)

airbnb_sample_1 <- summarize(sample_1,
  n = sum(room_type == "Entire home/apt"),
  prop = sum(room_type == "Entire home/apt") / 40
)

airbnb_sample_1

## # A tibble: 1 × 3
##   replicate     n  prop
##       <int> <int> <dbl>
## 1         1    28   0.7

Here we see that the proportion of entire home/apartment listings in this random sample is 0.7. Wow—that’s close to our true population value! But remember, we computed the proportion using a random sample of size 40. This has two consequences. First, this value is only an estimate, i.e., our best guess of our population parameter using this sample. Given that we are estimating a single value here, we often refer to it as a point estimate. Second, since the sample was random, if we were to take another random sample of size 40 and compute the proportion for that sample, we would not get the same answer:

sample_2 <- rep_sample_n(airbnb, size = 40)

airbnb_sample_2 <- summarize(sample_2,
  n = sum(room_type == "Entire home/apt"),
  prop = sum(room_type == "Entire home/apt") / 40
)

airbnb_sample_2

## # A tibble: 1 × 3
##   replicate     n  prop
##       <int> <int> <dbl>
## 1         1    35 0.875

Confirmed! We get a different value for our estimate this time. That means that our point estimate might be unreliable. Indeed, estimates vary from sample to sample due to sampling variability. But just how much should we expect the estimates of our random samples to vary? Or in other words, how much can we really trust our point estimate based on a single sample?

To understand this, we will simulate many samples (many more than just two) of size 40 from our population of listings and calculate the proportion of entire home/apartment listings in each sample. This simulation will create many sample proportions, which we can visualize using a histogram. The distribution of the estimate for all possible samples of a given size (which we commonly refer to as \(n\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40.

We again use rep_sample_n to take samples of size 40 from our population of Airbnb listings. But this time we set the reps argument to 20,000 to specify that we want to take 20,000 samples of size 40.

samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
samples

## # A tibble: 800,000 × 9
## # Groups:   replicate [20,000]
##    replicate    id neighbourhood room_type accommodates bathrooms bedrooms  beds
##        <int> <dbl> <chr>         <chr>            <dbl> <chr>        <dbl> <dbl>
##  1         1  4403 Downtown      Entire h…            2 1 bath           1     1
##  2         1   902 Kensington-C… Private …            2 1 shared…        1     1
##  3         1  3808 Hastings-Sun… Entire h…            6 1.5 baths        1     3
##  4         1   561 Kensington-C… Entire h…            6 1 bath           2     2
##  5         1  3385 Mount Pleasa… Entire h…            4 1 bath           1     1
##  6         1  4232 Shaughnessy   Entire h…            6 1.5 baths        2     2
##  7         1  1169 Downtown      Entire h…            3 1 bath           1     1
##  8         1   959 Kitsilano     Private …            1 1.5 shar…        1     1
##  9         1  2171 Downtown      Entire h…            2 1 bath           1     1
## 10         1  1258 Dunbar South… Entire h…            4 1 bath           2     2
## # … with 799,990 more rows, and 1 more variable: price

Notice that the column replicate indicates the replicate, or sample, to which each listing belongs. Above, since by default R only prints the first few rows, it looks like all of the listings have replicate set to 1. But you can check the last few entries using the tail() function to verify that we indeed created 20,000 samples (or replicates).
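
As a minimal sketch of that check, the snippet below builds a small made-up population (population_demo, a hypothetical stand-in for airbnb), draws 200 rather than 20,000 samples to keep it quick, and then inspects the end of the result. The object names and the smaller number of replicates are assumptions for illustration only.

```r
library(tibble)
library(dplyr)
library(infer)

set.seed(2020)

# Hypothetical stand-in for the airbnb data frame: 1,000 made-up listings.
population_demo <- tibble(
  id = 1:1000,
  room_type = rep(c("Entire home/apt", "Private room"), times = c(750, 250))
)

# 200 samples of size 40 (the text draws 20,000; fewer keeps the sketch fast).
samples_demo <- rep_sample_n(population_demo, size = 40, reps = 200)

# Print the last few rows to see the replicate labels at the end of the data.
tail(samples_demo)

# Counting the distinct replicate labels confirms how many samples were drawn.
n_replicates <- n_distinct(samples_demo$replicate)
n_replicates
```

Counting distinct replicate labels is an extra check beyond tail(); it gives the number of samples directly instead of relying on the last printed rows.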

Now that we have obtained the samples, we need to compute the proportion of entire home/apartment listings in each sample. We first group the data by the replicate variable, to group the set of listings in each sample together, and then use summarize to compute the proportion in each sample. Printing the first and last few entries of the resulting data frame shows that we end up with 20,000 point estimates, one for each of the 20,000 samples.
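
The group-then-summarize step could look roughly like the sketch below. It runs on a made-up population (population_demo) with 200 replicates so that it is self-contained; the names population_demo, samples_demo, sample_estimates_demo and sample_proportion are assumptions for illustration, not necessarily the names used for the real data.

```r
library(tibble)
library(dplyr)
library(infer)

set.seed(2020)

# Hypothetical stand-in population with a known proportion of 0.75.
population_demo <- tibble(
  id = 1:1000,
  room_type = rep(c("Entire home/apt", "Private room"), times = c(750, 250))
)

samples_demo <- rep_sample_n(population_demo, size = 40, reps = 200)

# One row per replicate: the share of entire home/apartment listings in it.
sample_estimates_demo <- samples_demo |>
  group_by(replicate) |>
  summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)

# First and last few point estimates.
head(sample_estimates_demo)
tail(sample_estimates_demo)
```

With the real data, the same pipeline applied to samples would yield one point estimate per replicate, 20,000 in total.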

We can now visualize the sampling distribution of sample proportions for samples of size 40 using a histogram in Figure 10.2. Keep in mind: in the real world, we don’t have access to the full population. So we can’t take many samples and can’t actually construct or visualize the sampling distribution. We have created this particular example such that we do have access to the full population, which lets us visualize the sampling distribution directly for learning purposes.
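
A histogram of this kind can be drawn with ggplot2 along the lines of the sketch below. It uses a made-up population and 2,000 replicates so that it runs on its own; the object names, the number of bins, and the axis labels are assumptions and may differ from the settings behind the actual figure.

```r
library(tibble)
library(dplyr)
library(infer)
library(ggplot2)

set.seed(2020)

# Hypothetical stand-in population with a known proportion of 0.75.
population_demo <- tibble(
  id = 1:1000,
  room_type = rep(c("Entire home/apt", "Private room"), times = c(750, 250))
)

# Sample proportions for 2,000 samples of size 40.
sample_estimates_demo <- population_demo |>
  rep_sample_n(size = 40, reps = 2000) |>
  group_by(replicate) |>
  summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)

# Histogram of the simulated sampling distribution.
sampling_distribution_demo <- ggplot(
  sample_estimates_demo,
  aes(x = sample_proportion)
) +
  geom_histogram(bins = 12, fill = "dodgerblue3", color = "lightgrey") +
  labs(x = "Sample proportions", y = "Count")

sampling_distribution_demo
```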

Figure 10.2: Sampling distribution of the sample proportion for sample size 40.

The sampling distribution in Figure 10.2 appears to be bell-shaped, is roughly symmetric, and has one peak. It is centered around 0.7 and the sample proportions range from about 0.4 to about 1. In fact, we can calculate the mean of the sample proportions.
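
The mean of the sample proportions can be computed with one more summarize call, as in the following sketch. It again relies on a made-up population whose true proportion is 0.75, so the printed mean should land near 0.75 here rather than near the Airbnb value; all object names are assumptions for illustration.

```r
library(tibble)
library(dplyr)
library(infer)

set.seed(2020)

# Hypothetical stand-in population with a known proportion of 0.75.
population_demo <- tibble(
  id = 1:1000,
  room_type = rep(c("Entire home/apt", "Private room"), times = c(750, 250))
)

# Sample proportions for 2,000 samples of size 40.
sample_estimates_demo <- population_demo |>
  rep_sample_n(size = 40, reps = 2000) |>
  group_by(replicate) |>
  summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)

# Average of all the point estimates.
mean_estimate_demo <- sample_estimates_demo |>
  summarize(mean_proportion = mean(sample_proportion))

mean_estimate_demo
```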

We notice that the sample proportions are centered around the population proportion value, 0.747! In general, the mean of the sampling distribution should be equal to the population proportion. This is great news because it means that, on average, the sample proportion is neither an overestimate nor an underestimate of the population proportion. In other words, if you were to take many samples as we did above, there is no tendency towards over- or underestimating the population proportion. In a real data analysis setting where you just have access to your single sample, this implies that you would suspect that your sample point estimate is roughly equally likely to be above or below the true population proportion.
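
One way to see why there is no systematic bias is the short derivation below. It is a sketch that assumes simple random sampling, and the indicator notation \(X_i\) is introduced here only for illustration; \(p\) denotes the population proportion, \(\hat{p}\) the sample proportion, and \(n\) the sample size.

```latex
\[
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
X_i =
\begin{cases}
1 & \text{if the } i\text{th sampled listing is an entire home/apartment},\\
0 & \text{otherwise},
\end{cases}
\]
\[
E[\hat{p}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\,p = p .
\]
```

Each sampled listing is equally likely to be any listing in the population, so \(E[X_i] = p\). With \(p = 0.747\), the expected sample proportion is 0.747 for any sample size, including \(n = 40\).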

Is the process of making inference about the population based on information obtained from a sample?

Statistical inference is the process of drawing conclusions about a population based on information obtained from a sample. With statistical inference, researchers can use statistics obtained from a sample to estimate parameters of a population.

What is the process of statistical inference?

Statistical inference is the process of analysing results and drawing conclusions from data subject to random variation. It is also called inferential statistics. Hypothesis testing and confidence intervals are two common applications of statistical inference.

Which helps to make inferences about a population?

Inferential statistics is a way of making inferences about populations based on samples.

What is sample inference?

The use of randomization in sampling allows for the analysis of results using the methods of statistical inference. Statistical inference is based on the laws of probability, and allows analysts to infer conclusions about a given population based on results observed through random sampling.