2  Presenting a single variable

2.1 Quantative variables

For this demonstration, I’ll be using white wine data from the wine quality dataset from the UCI Machine Learning repository, accessible at winequality-white.csv. It contains information related to red and white wine samples from northern Portugal, and the study aimed to model wine quality based on the chemical properties of the wines.

2.1.1 Exploring the dataset

We don’t always know what our data looks like. It is usually a good idea to explore it using head() or View(). Here, we can see the columns present in the winequality dataset.

head(winequality.white)
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.0 0.27 0.36 20.7 0.045 45 170 1.0010 3.00 0.45 8.8 6
6.3 0.30 0.34 1.6 0.049 14 132 0.9940 3.30 0.49 9.5 6
8.1 0.28 0.40 6.9 0.050 30 97 0.9951 3.26 0.44 10.1 6
7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.40 9.9 6
7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.40 9.9 6
8.1 0.28 0.40 6.9 0.050 30 97 0.9951 3.26 0.44 10.1 6

2.1.2 Histograms and boxplots

There are plenty of ways to visualize a single quantitative variable, but the two most common methods are using a histogram and a boxplot.

We can build a histogram using ggplot2, our plotting utility of choice. Let’s use it to visualize the distribution of citric acid concentration in white wines.

ggplot(winequality.white, aes(citric.acid)) +
  geom_histogram() +
  labs(
    title = 'Citric Acid Levels in White Wine',
    x = 'Citric Acid concentration (g/dm³)',
    y = 'Frequency',
    caption = 'Source: Cortez et al.'
  )

It looks like our dataset is roughly positively skewed, and has a few outliers to the right.

A boxplot, or a box-and-whisker plot, makes it easier to visualize outliers. We can draw them either vertically or horizontally by placing them on the \(x\) or \(y\)-axis within our aesthetic mappings.

ggplot(winequality.white, aes(x = citric.acid)) +
  geom_boxplot() +
  labs(
    title = 'Citric Acid Levels in White Wine',
    x = 'Citric Acid concentration (g/dm³)'
  )

Although individual variables aren’t typically presented in the form of a boxplot, it can be useful for looking at its distribution.

2.1.3 Cumulative frequency distributions

The median and quartiles are examples of percentiles/quantiles. A percentile is a measure of what percent of the dataset is below a value. In other words, the nth percentile of a distribution describes value below which n% of your dataset lies. We can visualize this using a cumulative frequency distribution.

ggplot(winequality.white, aes(citric.acid)) +
  stat_ecdf() +
  labs(
    title = 'Citric Acid Levels in White Wine',
    x = 'Citric Acid concentration (g/dm³)',
    y = 'Cumulative relative frequency'
  )

While a variable is rarely presented using a CDF plot, it can be a useful tool to gain a better understanding of your data. [TODO: Explain how to interpret CDFs]

2.2 Categorical variables

For this demonstration, I’ll be using the tv-shows.csv data, which contains information on the top-rated TV shows around the world from The Movie Database. You can import it using the read.csv() method as demonstrated in previous pages.

2.2.1 Exploring the dataset

As usual, we can use head() or View() to explore the dataset and see what columns are available to us.

head(tv.shows)
first_air_date country language name popularity vote_average vote_count
2012-05-14 Argentina es Violetta 35.821 8.2 230
2004-03-15 Argentina es Floricienta 70.664 8.1 133
2015-01-13 Argentina es Airport Security 17.261 7.6 100
2002-05-27 Argentina es Rebelde Way 128.706 8.0 312
2021-10-29 Argentina es Maradona, Blessed Dream 56.841 7.5 613
2017-06-19 Argentina es Once 277.186 8.7 1181

We could use this data to visualize the number of top-rated TV shows from each country.

2.2.2 Relative frequency table

We can count the number of TV shows by country and present the relative frequency for each country, or what proportion of shows in the dataset are from that particular country. We’ll also cap the data to only show countries with at least 10 shows in the list.

tv.shows.table <- tv.shows %>%
  count(country, sort = TRUE) %>%
  filter(n >= 10) %>%
  mutate(relative_frequency = n / sum(n))
tv.shows.table
country n relative_frequency
United States 1408 0.558
Japan 396 0.157
United Kingdom 179 0.071
Mexico 154 0.061
Korea, Republic of 98 0.039
Canada 65 0.026
Colombia 59 0.023
Spain 51 0.020
France 22 0.009
Brazil 21 0.008
Turkey 20 0.008
Italy 15 0.006
Germany 14 0.006
Argentina 13 0.005
China 10 0.004

2.2.3 Bar chart

A bar chart is the preferred way to visualize univariate, categorical data such as the information above. Since we’ve already saved the relative frequency table to tv.shows.table, we’ll be able to use this dataframe in conjunction with ggplot to create a bar chart.

ggplot(tv.shows.table, aes(x = reorder(country, n), y = n)) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Countries With the Most Hit TV Shows',
    subtitle = 'Which countries have the most top-rated shows?',
    x = 'Country',
    y = 'Number of TV shows',
    caption = 'Source: The Movie Database'
  ) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1500), n.breaks = 7)