Lab 2: Introduction to data

This worksheet is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This worksheet was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics; it was extended for the University of York by Gustav Delius, and subsequently by Stephen Connor.

Some define Statistics as the field that focuses on turning information into knowledge. This worksheet is designed to give you more practice with summarising and visualising the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Remember!

As always, you should start the lab by creating a script file (with a sensible name), and then adding each line of code to this file as you go, so that you can easily re-run it later if necessary. Add your own comments to remind you what each chunk of code does!

The Behavioral Risk Factor Surveillance System

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. Loading the data set may take a few seconds, so be patient. Use the following command to load the data:

source("http://www.openintro.org/stat/data/cdc.R")

Once loaded, the data set cdc shows up in your Environment panel. It is in a format that R calls a data frame. It is a table with each row representing a case and each column representing a variable. We can have a look at the first few entries (rows) of our data with the command

head(cdc)

and similarly we can look at the last few by typing

tail(cdc)

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here: we know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
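
Both head and tail accept an argument n that controls how many rows are shown. The sketch below uses a small made-up data frame called toy (a stand-in for cdc, not real BRFSS data) so that it runs without downloading anything.

```r
# Toy stand-in for cdc: 8 made-up rows, not real BRFSS data.
toy <- data.frame(
  age    = c(23, 35, 41, 52, 29, 64, 38, 47),
  weight = c(150, 180, 165, 200, 135, 170, 190, 160)
)

head(toy, n = 3)   # first three rows
tail(toy, n = 2)   # last two rows
nrow(toy)          # number of rows, without printing any of them
```

The same calls should work on cdc itself, for example head(cdc, n = 3).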

Types of variables

You already know from the Intro Lab that to view the names of the variables in our data set you can type the command

names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health cover plan (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

Variables come in different types. It is important to distinguish between different types of variables since methods for viewing and summarising data are dependent on variable type. A variable is either quantitative or qualitative.

A variable that is quantitative (numeric) may be either discrete or continuous. A discrete variable is a numerical variable that can assume a finite number or at most a countably infinite number of values, for example, the number of students in a class. A continuous variable is a numerical variable that can assume an uncountable number of values associated with subsets of the real number line, for example, the height of a tree.

When a variable is qualitative, it is essentially defining groups or categories. Qualitative variables are therefore also often referred to as categorical variables. When the categories have no ordering the variable is called nominal. For example, a variable “music preference” could have values such as “classical,” “jazz,” “rock,” or “other.” When the categories have a distinct ordering, the variable is called ordinal. Such a variable might be educational level with values GCSEs, A-levels, Bachelors degree, Masters degree, PhD.

The distinction between the different types is not always as clear cut as one would like. Consider for example the variable height that represents the respondents’ height in inches. Even though this is always rounded to integer values in the data set, it is still a continuous variable, because non-integer values would make sense, even though they may not be used in the data set.

Note that even categorical variables can take numerical values, because the categories could be labelled by numbers. We see this for example in the variable exerany that takes the values 0 and 1, with 1 representing that the respondent has exercised in the last month and 0 that they have not. This is a categorical variable. It is less clear whether it is ordinal or nominal, but luckily for a variable that takes on only two possible values the distinction is of no consequence. Only once there are at least three values will the statistical techniques differ between ordinal and nominal variables.
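
One way R can record the nominal/ordinal distinction is through factors. The following is a minimal sketch using the music preference and educational level examples from above; the responses are made up, and the object names music and education are chosen just for this illustration.

```r
# Nominal: categories with no natural order (made-up responses).
music <- factor(c("jazz", "rock", "classical", "rock", "other"))

# Ordinal: 'levels' fixes the order, 'ordered = TRUE' records that it matters.
education <- factor(
  c("A-levels", "PhD", "GCSEs", "Bachelors degree", "A-levels"),
  levels  = c("GCSEs", "A-levels", "Bachelors degree", "Masters degree", "PhD"),
  ordered = TRUE
)

is.ordered(music)        # FALSE
is.ordered(education)    # TRUE

# Comparisons only make sense for the ordinal variable.
education > "A-levels"   # FALSE TRUE FALSE TRUE FALSE
table(education)         # counts in level order, including the empty level
```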

Your turn

Look at the variables in this data set. For each variable, identify its data type. How many of the variables are quantitative? How many are categorical?
Answer quiz question 1.

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distil all of that information into a few summary statistics and graphics.

As a simple example, the function summary() returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is

summary(cdc$weight)

We will look more closely at the meaning of these summary statistics later.
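
To see where the six numbers come from, the sketch below computes each one separately for a short made-up vector w (illustrative values, not taken from cdc) and compares them with the output of summary().

```r
# Made-up weights in pounds, for illustration only.
w <- c(120, 150, 165, 180, 210)

summary(w)

# The same six numbers, computed one at a time.
min(w)              # 120
quantile(w, 0.25)   # first quartile: 150
median(w)           # 165
mean(w)             # 165
quantile(w, 0.75)   # third quartile: 180
max(w)              # 210
```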

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table() does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we have already observed; when we multiplied or divided a vector by a number, R applied that action across all entries in the vector. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot() command.

barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

smoke <- table(cdc$smoke100)
barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it as the input for barplot.
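
The relative frequency calculation earlier typed the sample size out by hand. A sketch of two alternatives that avoid this is given below, using a made-up vector of 0/1 answers in place of cdc$smoke100; prop.table() is a base R function that divides a table by its total.

```r
# Made-up 0/1 answers standing in for cdc$smoke100.
smoke100 <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0)

counts <- table(smoke100)
counts                       # 6 zeros and 4 ones

counts / length(smoke100)    # divide by the number of responses: 0.6 and 0.4
prop.table(counts)           # same result: divides by the table total
```

For a data frame such as cdc, nrow(cdc) plays the role of length(smoke100) here.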

Your turn

Create numerical summaries for height and age. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
Answer quiz question 2.

The table command can be used to tabulate any number of variables that you provide. For example, to see how smoking status breaks down by gender, we could use the following.

table(cdc$gender, cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender, cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).
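
A two-way table of counts can also be turned into proportions or given totals. The sketch below uses made-up responses coded in the same way as cdc ("m"/"f" and 0/1); prop.table() and addmargins() are base R functions, and the object name tab is just for this example.

```r
# Made-up responses; the coding mirrors the one used in cdc.
gender   <- c("m", "f", "f", "m", "f", "m", "f", "f")
smoke100 <- c(1, 0, 1, 1, 0, 0, 0, 1)

tab <- table(gender, smoke100)
tab

prop.table(tab, margin = 1)   # proportions within each gender (each row sums to 1)
addmargins(tab)               # counts with row and column totals added
```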

We can also use a barplot to show how respondents’ general health differs by gender:

barplot(table(cdc$genhlth, cdc$gender),
        beside = FALSE,
        legend.text = TRUE,
        xlab = "Gender",
        ylab = "Frequency",
        main = "General health by gender")

Your turn

Try changing beside = FALSE to beside = TRUE and see what changes. Which do you find more informative?

Tip

Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the Plots pane of RStudio, just above the plots.

Interlude: how R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany, and so on). We can see the size of the data frame next to the object name in the workspace or we can type

dim(cdc)

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

cdc[567, 6] 

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names:

names(cdc)[6]

To see the weights for the first 10 respondents we can type

cdc[1:10, 6]

In this expression, we have asked just for rows in the range 1 through 10. We’ve already seen that R uses the : notation to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10, ] 

By leaving out an index or a range (we didn’t type anything between the comma and the closing square bracket), we get all the columns. When starting out in R, this can be a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen

cdc[ , 6]

R recognises that it is not very useful to put so many numbers on the screen, so stops after 1,000 entries.

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. We have already seen an alternative method to access the weight data by referring to the name. We can use any of the variable names to select items in our data set, for example

cdc$weight

The dollar-sign $ tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing

cdc$weight[567]

Similarly, for just the first 10 respondents

cdc$weight[1:10] 

The command above returns the same result as the cdc[1:10, 6] command.
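
The equivalence of the two notations can be checked directly. Here is a small sketch on a made-up data frame toy (hypothetical values, with weight as its second column rather than its sixth); it also shows that a column can be picked out by name inside the square brackets.

```r
# Toy data frame with made-up values; the column order is chosen for this example.
toy <- data.frame(
  height = c(70, 64, 60, 66, 61),
  weight = c(175, 125, 105, 132, 150)
)

toy[3, 2]           # row 3, column 2: 105
toy[3, "weight"]    # same element, column given by name
toy$weight[3]       # same element, dollar-sign notation

identical(toy[1:3, 2], toy$weight[1:3])   # TRUE
```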

Tip

Both row-and-column notation and dollar-sign notation are widely used: which one you choose to use depends on your personal preference.

Your turn

Answer quiz question 3.

A little more on subsetting

It’s often useful to extract all observations (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

cdc$gender == "m" 

or

cdc$age > 30

As we saw in Lab 1, these commands produce vectors of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).
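
As a small illustration of what can be done with such a TRUE/FALSE vector, the sketch below uses a made-up vector of ages (not taken from cdc). Because R counts TRUE as 1 and FALSE as 0, sum() gives the number of cases satisfying the condition, and the logical vector can itself be used as an index.

```r
# Made-up ages, for illustration only.
age <- c(25, 31, 30, 47, 19, 62)

over30 <- age > 30
over30          # FALSE TRUE FALSE TRUE FALSE TRUE

sum(over30)     # number of TRUE values: 3
age[over30]     # the ages for which the condition holds: 31 47 62
```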

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset() to do that for us. For example, the command

mdata <- subset(cdc, cdc$gender == "m") 

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As we saw in Lab 1, we can combine several of these conditions with & and |. The & is read “and”, so that

m_and_over30 <- subset(cdc, cdc$gender == "m" & cdc$age > 30)

will give you the data for men over the age of 30. The | character is read “or”, so that

m_or_over30 <- subset(cdc, cdc$gender == "m" | cdc$age > 30) 

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
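
To see the difference between “and” and “or” on a data set small enough to check by eye, here is a sketch with a six-row made-up data frame toy (hypothetical values). The last line shows that putting the condition in the row position of the square brackets selects the same rows as subset().

```r
# Toy stand-in for cdc with made-up values.
toy <- data.frame(
  gender = c("m", "f", "m", "f", "m", "f"),
  age    = c(25, 45, 52, 28, 31, 60)
)

both   <- subset(toy, toy$gender == "m" & toy$age > 30)   # rows 3 and 5
either <- subset(toy, toy$gender == "m" | toy$age > 30)   # all rows except row 4

nrow(both)     # 2
nrow(either)   # 5

# A logical condition in the row position of [ , ] picks out the same rows.
all.equal(both, toy[toy$gender == "m" & toy$age > 30, ])   # TRUE
```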

Your turn

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked at least 100 cigarettes in their lifetime. Use the summary command to see the summary statistics for the weight variable in this smaller data set.
Answer quiz question 4.

Creating new variables from old

Sometimes we wish to use variables in our dataset to create new measurements of interest. We’ve seen that each variable in our dataset is stored as a column in the cdc data frame: each column can be easily accessed using either row-and-column or dollar-sign notation, and then manipulated as we would a vector. This makes it straightforward to perform simple algebraic operations on variables to create new ones.

For example, suppose that we wish to create a new variable, weight_centred, which measures the difference between a person’s weight and the mean weight of the entire sample. We can do this by typing

weight_centred <- cdc$weight - mean(cdc$weight)

We call such a variable centred because it has been shifted so as to have zero mean:

summary(weight_centred)

(Note that if you type mean(weight_centred) then R returns the value \(-5.2492 \times 10^{-15}\) instead of zero: this is just an artefact caused by rounding.)
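
Because of this rounding, testing whether a computed mean is exactly zero with == can be unreliable. The sketch below, on a made-up vector w, uses the base R function all.equal(), which compares numbers up to a small tolerance; the object names are chosen for this example only.

```r
# Made-up weights in pounds.
w <- c(150.2, 180.7, 165.1, 200.4, 135.9)

w_centred <- w - mean(w)

mean(w_centred)                          # zero, or a tiny number very close to it
mean(w_centred) == 0                     # may be FALSE because of rounding
isTRUE(all.equal(mean(w_centred), 0))    # TRUE: equal up to a small tolerance
```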

Your turn

Create a new variable called male_height_centred that measures the difference between each male respondent’s height and the mean height of all male respondents. What fraction of male respondents are taller than the mean height of all male respondents?
Answer quiz question 5.

Now let’s consider a new variable: the difference between current weight (weight) and desired weight (wtdesire). Create this new variable by subtracting one column of the data frame from the other and assigning the result to a new object called wdiff.

wdiff <- cdc$weight - cdc$wtdesire

We could then count how many people currently weigh more than their desired weight:

sum(wdiff > 0)

Your turn

What proportion of female respondents have a current weight which is exactly the same as their desired weight?
Answer quiz question 6.

Finally, let’s consider another new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight to height ratio and can be calculated as

\[\text{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \cdot 703\]

where 703 is the approximate factor that converts the result from imperial units (pounds and inches) to the metric units (kilograms and metres) in which BMI is usually expressed.
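
As a check on the factor 703, recall that BMI in metric units is weight in kilograms divided by the square of height in metres. Using the exact definitions 1 lb = 0.45359237 kg and 1 in = 0.0254 m, we get

\[\text{BMI} = \frac{0.45359237 \times \text{weight (lb)}}{\left(0.0254 \times \text{height (in)}\right)^2} = \frac{0.45359237}{0.0254^2} \cdot \frac{\text{weight (lb)}}{\text{height (in)}^2} \approx 703.07 \cdot \frac{\text{weight (lb)}}{\text{height (in)}^2}\,,\]

which rounds to the factor 703 used in this lab.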

Your turn

Create a variable bmi which gives the BMI of each respondent in the dataset. (Hint: to square each element of a vector x in R you can type x^2.) Check that the mean BMI value of the cdc respondents is 26.30693.
Answer quiz question 7.

Suppose that we now choose one of the respondents in the cdc dataset at random: let

\[A = \{\text{the BMI of our randomly chosen respondent is greater than 34}\}\,.\]

What is \(\mathbb{P}\left(A\right)\)? Since each person in the dataset is equally likely to be chosen, we can calculate this probability by counting how many respondents have a BMI greater than 34, and dividing by the total number of respondents:

sum(bmi > 34)/20000
#> [1] 0.0756
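
The same counting argument can be written without typing the sample size by hand. The sketch below uses a made-up vector bmi_toy (illustrative values, not the BMIs from cdc); since TRUE counts as 1 and FALSE as 0, taking the mean of a logical vector gives the proportion of TRUE values.

```r
# Made-up BMI values, for illustration only.
bmi_toy <- c(22.1, 35.4, 27.8, 31.0, 36.2, 24.5, 29.9, 19.8)

sum(bmi_toy > 34) / length(bmi_toy)   # count divided by sample size: 0.25
mean(bmi_toy > 34)                    # same proportion in one step: 0.25
```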

Your final exercise for this lab involves calculating a conditional probability. Recall that we already saw that the mean BMI value is 26.30693. Define the event \(B\) by

\[B=\{\text{the BMI of our randomly chosen respondent is greater than the mean value}\}\,.\]

Your turn

Answer quiz question 8.