We will look at an example using data from Inside Airbnb. Airbnb is an online marketplace for arranging vacation rentals and places to stay. The data set contains listings for Vancouver, Canada, in September 2020. Our data includes an ID number, neighborhood, type of room, the number of people the rental accommodates, number of bathrooms, bedrooms, beds, and the price per night.
Suppose the city of Vancouver wants information about Airbnb rentals to help plan city bylaws, and they want to know how many Airbnb places are listed as entire homes and apartments (rather than as private or shared rooms). They may therefore want to estimate the true proportion of all Airbnb listings where the “type of place” is listed as “entire home or apartment.” Of course, we usually do not have access to the true population, but here let’s imagine (for learning purposes) that our data set represents the population of all Airbnb rental listings in Vancouver, Canada. We can find the proportion of listings where the type of place is an entire home or apartment.
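The computation just described can be sketched in R as follows. The real listings are not reproduced here, so this sketch builds a hypothetical stand-in population: the data frame name `airbnb`, the column name `room_type`, the category labels, and the population size of 4,000 are all assumptions, chosen so that the proportion of entire homes/apartments comes out to 0.747.

```r
library(dplyr)

# Hypothetical stand-in for the Airbnb data (NOT the real listings):
# 4,000 listings, 2,988 of which are entire homes/apartments.
airbnb <- tibble(
  id = 1:4000,
  room_type = c(rep("Entire home/apt", 2988), rep("Private room", 1012))
)

# Count the entire home/apartment listings and compute their proportion.
population_summary <- airbnb |>
  summarize(
    n = sum(room_type == "Entire home/apt"),
    proportion = mean(room_type == "Entire home/apt")
  )

population_summary
```

Taking the mean of a logical vector gives the fraction of `TRUE` values, which is why `mean(room_type == "Entire home/apt")` yields a proportion.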
We can see that the proportion of entire home/apartment listings in the data set is 0.747. This value, 0.747, is the population parameter. Remember, this parameter value is usually unknown in real data analysis problems, as it is typically not possible to make measurements for an entire population.
Instead, perhaps we can approximate it with a small subset of data! To investigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random sample of size 40 from our population), and computing the proportion for that sample. We will use a sampling function to take the sample. Its arguments are (1) the data frame to sample from, and (2) the size of the sample to take.
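A minimal sketch of taking one random sample of size 40 is given here. The name of the sampling function did not survive in this copy of the text; `rep_sample_n` from the `infer` package takes a data frame and a sample size, so it is used here as an assumption. The `airbnb` data frame is again a hypothetical stand-in population, and the seed is arbitrary.

```r
library(dplyr)
library(infer)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in population (NOT the real listings); proportion = 0.747.
airbnb <- tibble(
  id = 1:4000,
  room_type = c(rep("Entire home/apt", 2988), rep("Private room", 1012))
)

# Draw one random sample of 40 listings (without replacement by default).
sample_1 <- rep_sample_n(airbnb, size = 40)

# Point estimate: the proportion of entire homes/apartments in this sample.
sample_1_estimate <- sample_1 |>
  summarize(sample_proportion = mean(room_type == "Entire home/apt"))

sample_1_estimate
```

The printed proportion depends on the seed and on the synthetic data, so it will not necessarily match the value reported in the text.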
Here we see that the proportion of entire home/apartment listings in this random sample is 0.7. Wow—that’s close to our true population value! But remember, we computed the proportion using a random sample of size 40. This has two consequences. First, this value is only an estimate, i.e., our best guess of our population parameter using this sample. Given that we are estimating a single value here, we often refer to it as a point estimate. Second, since the sample was random, if we were to take another random sample of size 40 and compute the proportion for that sample, we would not get the same answer:
Confirmed! We get a different value for our estimate this time. That means that our point estimate might be unreliable. Indeed, estimates vary from sample to sample due to sampling variability. But just how much should we expect the estimates of our random samples to vary? Or in other words, how much can we really trust our point estimate based on a single sample?
To understand this, we will simulate many samples (many more than just two) of size 40 from our population of listings and calculate the proportion of entire home/apartment listings in each sample. This simulation will create many sample proportions, which we can visualize using a histogram. The distribution of the estimate for all possible samples of a given size (which we commonly refer to as \(n\)) from a population is called a sampling distribution. The sampling distribution will help us see how much we would expect our sample proportions from this population to vary for samples of size 40.
We again use the same sampling function to take samples of size 40 from our population of Airbnb listings. But this time we set the argument that controls the number of samples to 20,000, to specify that we want to take 20,000 samples of size 40.
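Repeated sampling of this kind might look like the sketch below. It assumes the sampling function is `rep_sample_n` from the `infer` package, whose `reps` argument sets the number of samples; the original names were lost from this copy of the text. The `airbnb` data frame is a hypothetical stand-in population, not the real listings.

```r
library(dplyr)
library(infer)

set.seed(2020)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in population (NOT the real listings); proportion = 0.747.
airbnb <- tibble(
  id = 1:4000,
  room_type = c(rep("Entire home/apt", 2988), rep("Private room", 1012))
)

# Take 20,000 samples, each of size 40. The result stacks all samples into
# one data frame of 20,000 * 40 rows, with a `replicate` column that labels
# the sample each row belongs to.
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)

samples
```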
Notice that one column indicates the replicate, or sample, to which each listing belongs. Above, since by default R only prints the first few rows, it looks like all of the listings belong to replicate 1. But you can check the last few entries of the data frame to verify that we indeed created 20,000 samples (or replicates).
Now that we have obtained the samples, we need to compute the proportion of entire home/apartment listings in each sample. We first group the data by the replicate variable, so that the listings in each sample are grouped together, and then summarize each group to compute the proportion in each sample. Printing both the first and last few entries of the resulting data frame shows that we end up with 20,000 point estimates, one for each of the 20,000 samples.
We can now visualize the sampling distribution of sample proportions for samples of size 40 using a histogram in Figure 10.2. Keep in mind: in the real world, we don’t have access to the full population. So we can’t take many samples and can’t actually construct or visualize the sampling distribution. We have created this particular example such that we do have access to the full population, which lets us visualize the sampling distribution directly for learning purposes.
Figure 10.2: Sampling distribution of the sample proportion for sample size 40.
The sampling distribution in Figure 10.2 appears to be bell-shaped, is roughly symmetric, and has one peak. It is centered around 0.7 and the sample proportions range from about 0.4 to about 1. In fact, we can calculate the mean of the sample proportions.
We notice that the sample proportions are centered around the population proportion value, 0.747! In general, the mean of the sampling distribution should be equal to the population proportion. This is great news because it means that the sample proportion is neither an overestimate nor an underestimate of the population proportion.
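The steps above (checking the last rows, computing one proportion per sample, drawing the histogram, and taking the mean of the estimates) can be sketched together as follows. As before, `rep_sample_n` from `infer`, the `airbnb` stand-in population, and the column names are assumptions rather than the text's original code, and the plot styling is illustrative only.

```r
library(dplyr)
library(ggplot2)
library(infer)

set.seed(2020)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in population (NOT the real listings); proportion = 0.747.
airbnb <- tibble(
  id = 1:4000,
  room_type = c(rep("Entire home/apt", 2988), rep("Private room", 1012))
)

samples <- rep_sample_n(airbnb, size = 40, reps = 20000)

# The last rows belong to the final replicate, confirming 20,000 samples.
tail(samples)

# One point estimate (sample proportion) per replicate.
sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_proportion = mean(room_type == "Entire home/apt"))

head(sample_estimates)
tail(sample_estimates)

# Histogram of the 20,000 sample proportions. With n = 40 the proportions
# are multiples of 1/40 = 0.025, so that is used as the bin width.
sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
  geom_histogram(binwidth = 0.025, color = "lightgrey") +
  labs(x = "Sample proportions", y = "Count")

sampling_distribution

# Mean of the sample proportions; it should land close to 0.747.
sample_estimates |>
  summarize(mean_proportion = mean(sample_proportion))
```

Because the population here is synthetic, the exact shape and range of the histogram may differ somewhat from Figure 10.2, but the mean of the estimates should still sit near the population proportion.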
In other words, if you were to take many samples as we did above, there is no tendency towards over- or underestimating the population proportion. In a real data analysis setting where you just have access to your single sample, this implies that you would suspect that your sample point estimate is roughly equally likely to be above or below the true population proportion.
What is the process of making inferences about a population based on information obtained from a sample?
Statistical inference is the process of drawing conclusions about a population based on information obtained from a sample. With statistical inference, researchers can use statistics obtained from a sample to estimate parameters of a population.
What is the process of statistical inference?
Statistical inference is the process of analyzing results and drawing conclusions from data subject to random variation. It is also called inferential statistics. Hypothesis testing and confidence intervals are two applications of statistical inference.
Which helps to make inferences about a population?
Inferential statistics is a way of making inferences about populations based on samples.
What is sample inference?
The use of randomization in sampling allows for the analysis of results using the methods of statistical inference. Statistical inference is based on the laws of probability, and allows analysts to infer conclusions about a given population based on results observed through random sampling.