Understanding Data
This guide explains the why behind every statistical idea — not just what to memorise, but how to think about data. Work through each section in order, try every checkpoint question, then reveal the answer to check yourself.
Before you can analyse data, you need to know what kind of data you have. The type determines which tools are valid. You wouldn't calculate the average eye colour — that makes no sense. Understanding data types prevents that kind of error.
Imagine a survey collecting eye colour (brown, blue, green) and heights (160 cm, 172 cm, 155 cm). You can compute the average height — it's meaningful. But "average eye colour" is meaningless. The type of data tells you which calculations are valid and which graphs make sense. Skipping this step leads to conclusions that look numerical but mean nothing.
The four data types
| Type | Description | Examples |
|---|---|---|
| Qualitative | Describes a category or quality — not a number | Eye colour, favourite sport, type of pet |
| Quantitative | Expressed as a number — can be measured or counted | Height, temperature, number of siblings |
| Discrete | Countable; takes only specific, separate values (usually whole numbers) | Number of students in a class: 0, 1, 2, 3… |
| Continuous | Can take any value in a range — measured, not counted | Height: 162.3 cm; time: 4.72 s; temperature: 21.5 °C |
Ask yourself two questions in order:
1. "Is it a number?" — if no, it's qualitative.
2. "Can I measure it with a ruler, scale, or stopwatch?" If yes, it's continuous. If you can only count it in separate values, usually whole numbers, it's discrete.
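The two questions can be written as a small decision function. This is only a sketch: the function name `classify_data_type` and its two yes/no parameters are illustrative assumptions, not part of the guide.

```python
def classify_data_type(is_number: bool, is_measured: bool = False) -> str:
    """Apply the two questions in order and return the data type."""
    # Question 1: is it a number? If not, it is qualitative.
    if not is_number:
        return "qualitative"
    # Question 2: is it measured with a ruler, scale, or stopwatch?
    if is_measured:
        return "quantitative, continuous"
    # Otherwise it is counted, so it is discrete.
    return "quantitative, discrete"


# Examples taken from the table of data types.
examples = {
    "eye colour": classify_data_type(is_number=False),
    "height": classify_data_type(is_number=True, is_measured=True),
    "number of siblings": classify_data_type(is_number=True, is_measured=False),
}
for name, kind in examples.items():
    print(f"{name}: {kind}")
```

The order of the questions matters: the second question is only asked once the first has been answered "yes".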
Population vs Sample
Suppose you want to know the average height of every 14-year-old in Quebec. There are hundreds of thousands of them. Measuring every single person would take years and cost a fortune. Instead, you measure a carefully chosen group of, say, 200 students. All the 14-year-olds together are the population; the 200 students you actually measure are the sample. If the sample is random and representative, its statistics are likely to be close to the true population values.
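A short simulation can illustrate the idea. The population below is synthetic: 100 000 made-up heights generated by the program, not real Quebec data, and the centre (162 cm) and spread (8 cm) are arbitrary assumptions chosen only for the demonstration.

```python
import random
import statistics

random.seed(14)  # fixed seed so the sketch is repeatable

# Synthetic population: 100,000 made-up heights in cm (not real data).
population = [random.gauss(162, 8) for _ in range(100_000)]

# Simple random sample of 200 people, drawn without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f} cm")
print(f"Sample mean:     {sample_mean:.1f} cm")
print(f"Difference:      {abs(sample_mean - population_mean):.1f} cm")
```

Changing the seed gives a different sample and a slightly different sample mean, which is a reminder that a sample estimates the population value rather than reproducing it exactly.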
(b) Blood type (A, B, AB, O) → category, not a number → Qualitative
(c) Time to run 100 m → measured with a stopwatch, any decimal → Quantitative, Continuous
(d) Number of cars → countable whole numbers → Quantitative, Discrete
(e) Mass of a watermelon → measured on a scale, any decimal → Quantitative, Continuous
b) Give one example each of qualitative data and quantitative data collected from your school day.
c) A researcher surveys 50 randomly chosen households to estimate recycling habits across an entire city. What is the population? What is the sample?
a) Shoe sizes are discrete — they only come in specific values (6, 6.5, 7, 7.5…). Even though half-sizes exist, the values are fixed and countable, not continuously measured.
b) Many valid answers. Example: qualitative — subject names (Math, French, Science). Quantitative — number of minutes in each class, or a test score.
c) Population = all households in the city. Sample = the 50 randomly chosen households surveyed.
Raw data is a list of numbers thrown at you without order. Before you can find averages, spot patterns, or draw graphs, you need to organise it. Frequency tables and stem-and-leaf plots are two of the most powerful organising tools.
Imagine someone hands you: 72, 85, 91, 68, 72, 85, 91, 72, 60, 85. At a glance, you can't tell what's typical, what repeats, or what the spread is. Once you organise it into a frequency table, you immediately see that 72 and 85 each appear three times, and the range runs from 60 to 91. Organisation turns noise into insight.
Building a frequency table step by step
Here is the raw data from a class quiz (scores out of 100): 60, 70, 80, 70, 90, 70, 80, 60, 80, 70, 90, 80, 70, 80, 90, 80, 70, 60, 90, 80
60: appears 3 times
70: appears 6 times
80: appears 7 times
90: appears 4 times
Score 60: (3 / 20) × 100% = 15%
Score 70: (6 / 20) × 100% = 30%
Score 80: (7 / 20) × 100% = 35%
Score 90: (4 / 20) × 100% = 20%
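The tally and the relative frequencies for the quiz scores can be reproduced with a few lines of Python. This is one possible sketch using the standard library's `Counter`; the variable names are illustrative.

```python
from collections import Counter

# Raw quiz scores from the example above.
scores = [60, 70, 80, 70, 90, 70, 80, 60, 80, 70,
          90, 80, 70, 80, 90, 80, 70, 60, 90, 80]

frequency = Counter(scores)  # how many times each score appears
total = len(scores)          # total number of data values

# Relative frequency = (frequency / total) x 100%
relative = {score: count * 100 / total for score, count in frequency.items()}

print("Score | Frequency | Relative frequency")
for score in sorted(frequency):
    print(f"{score:>5} | {frequency[score]:>9} | {relative[score]:.0f}%")
```

A useful self-check is that the frequencies add up to the number of data values and the percentages add up to 100%.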
Stem-and-Leaf plots
A stem-and-leaf plot displays data in sorted order while preserving the original values. The stem is the leading digit (the tens), and the leaf is the last digit (the ones). Each row in the plot is one stem, with all matching leaves listed to the right.
1. Identify the stems (leading digits) and write them in a column.
2. For each data value, write its last digit (leaf) beside the correct stem.
3. Re-order the leaves in each row from smallest to largest.
4. Write a key, e.g. "2 | 3 means 23."
3 | 1 4 8
4 | 2
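The four steps can be sketched in code. The helper `stem_and_leaf` below is an illustrative assumption, written for two-digit whole numbers only; it is applied to the ten values from the earlier raw-data example (72, 85, 91, …).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: sorted leaves} for two-digit whole numbers."""
    plot = defaultdict(list)
    for value in sorted(values):        # sorting first keeps leaves in order
        stem, leaf = divmod(value, 10)  # tens digit, ones digit
        plot[stem].append(leaf)
    return dict(plot)


data = [72, 85, 91, 68, 72, 85, 91, 72, 60, 85]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
print("Key: 6 | 0 means 60")
```

Note that this sketch only prints stems that contain at least one value; a hand-drawn plot would normally still list an empty stem so that gaps in the data stay visible.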
b) Build a stem-and-leaf plot for: 14, 17, 21, 23, 26, 29, 31
c) A frequency table has four rows with frequencies 5, 8, 12, and x. The total is 30. What is x?
a) Walking: (10/25) × 100% = 40% | Bus: (8/25) × 100% = 32% | Driven: (7/25) × 100% = 28%. Check: 40 + 32 + 28 = 100% ✓
b) Stems 1, 2, 3:
1 | 4 7
2 | 1 3 6 9
3 | 1
Key: 1 | 4 means 14
c) 5 + 8 + 12 + x = 30 → 25 + x = 30 → x = 5
When someone says "on average," they could mean three different things — the mean, the median, or the mode. Each one tells a different story about a dataset. Choosing the wrong one leads to misleading conclusions.
Suppose five employees earn $30 000, $32 000, $31 000, $29 000, and $500 000 (the CEO). The mean is $124 400, but four out of five people earn far less than that. The median is $31 000, a much more honest picture of what a typical employee earns. The mode doesn't apply here (all values are different). Each measure captures a different aspect of "typical."
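The salary example can be checked with Python's built-in `statistics` module. This is a minimal sketch; the `has_mode` test is an assumption about how to express "no mode" in code, namely that no value appears more often than the others.

```python
import statistics

# The five salaries from the example, CEO included.
salaries = [30_000, 32_000, 31_000, 29_000, 500_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# multimode returns every value tied for most frequent. If that is
# every distinct value, no value stands out, so there is no mode.
modes = statistics.multimode(salaries)
has_mode = len(modes) < len(set(salaries))

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   {modes if has_mode else 'none'}")
```

Removing the CEO's salary from the list and running it again shows how strongly a single outlier moves the mean while barely moving the median.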
Finding the median — odd vs even count
- Odd number of values: sort the data; the median is the exact middle value. For 7 values, it's the 4th.
- Even number of values: sort the data; the median is the mean of the two middle values. For 6 values, average the 3rd and 4th.
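The odd and even cases can be written as one short function. This is a sketch that assumes a non-empty list of numbers; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)  # always sort first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the exact middle value.
        return ordered[middle]
    # Even count: the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Odd count (5 values): the salaries from the earlier example.
print(median([30_000, 32_000, 31_000, 29_000, 500_000]))

# Even count (10 values): the raw scores from the frequency-table section.
print(median([72, 85, 91, 68, 72, 85, 91, 72, 60, 85]))
```

Because list positions in Python start at 0, `ordered[middle]` is the 4th value when there are 7 values, and `ordered[middle - 1]` and `ordered[middle]` are the 3rd and 4th values when there are 6.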
Choosing the right measure
| Measure | Best used when | Weakness |
|---|---|---|
| Mean | Data is symmetric with no extreme outliers | Heavily affected by outliers (very high or very low values) |
| Median | Data has outliers or is skewed in one direction | Ignores the actual size of most values |
| Mode | Data is qualitative, or you want the most common value | May not exist, or there may be multiple modes |
The mean (≈ 24.9) is pulled far above the typical value because of the outlier 95. Six out of seven values are between 11 and 15. The median (14) accurately reflects the typical value. In this case, the median is a better measure of centre.
Use the mean when data is balanced and symmetric (e.g., heights of students — no extreme outliers).
Use the median when data is skewed or contains outliers (e.g., incomes, house prices).
Use the mode when you want the most popular value, especially for qualitative data (e.g., "most common shoe size ordered").
b) The ages of six people at a party are: 14, 15, 14, 16, 13, 72. Which measure of central tendency best describes a "typical" age? Why?
c) A dataset has no mode. What does that tell you about the data?
a) Sorted: 3, 3, 3, 5, 6, 7, 8, 9 (8 values)
Mean = (3+3+3+5+6+7+8+9) / 8 = 44/8 = 5.5
Median = average of 4th and 5th values = (5+6)/2 = 5.5
Mode = 3 (appears 3 times)
b) The median is best. The 72-year-old is an outlier that pulls the mean far from the typical age of the young people. Sorted: 13, 14, 14, 15, 16, 72 → median = (14+15)/2 = 14.5, which accurately reflects the typical partygoer.
c) If there is no mode, every value in the dataset appears exactly the same number of times (usually once each). No single value is more common than any other.
Knowing the "centre" of data is only half the picture. You also need to know how spread out the values are. Two classes could have the same average test score but one class might have everyone clustered near that average, while the other class is all over the map.
Class A scores: {58, 59, 60, 61, 62} — mean = 60.
Class B scores: {10, 30, 60, 90, 110} — mean = 60.
Both classes have exactly the same mean. But Class A is tightly grouped (a reliable, consistent class), while Class B is all over the place. A teacher treating them identically based on the mean alone would be making a serious mistake. You need spread to tell the full story.
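The comparison can be reproduced in a few lines. The helper name `data_range` is an illustrative choice; it simply computes maximum minus minimum.

```python
import statistics

class_a = [58, 59, 60, 61, 62]
class_b = [10, 30, 60, 90, 110]


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


for name, scores in (("Class A", class_a), ("Class B", class_b)):
    mean = statistics.mean(scores)
    spread = data_range(scores)
    print(f"{name}: mean = {mean}, range = {spread}")
```

The means match, but the ranges (4 for Class A and 100 for Class B) show how differently the two classes are spread.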
A complete statistical description of a dataset includes at least one measure of centre (mean, median, or mode) and at least one measure of spread (such as the range). Reporting only the mean is like describing a person by their age alone: technically true, but incomplete.
Range B = 18 − 2 = 16
Both datasets have a mean of 10, but Dataset A has a range of only 4 (values are tightly clustered around the mean), while Dataset B has a range of 16 (values are widely spread). The mean alone is misleading — spread is essential for a complete picture.
b) Two basketball players each scored an average of 18 points per game over 5 games. Player A's scores: {17, 18, 19, 18, 18}. Player B's scores: {5, 10, 18, 30, 27}. Who is more consistent? Justify using the range.
c) Can the range of a dataset ever be zero? Give an example.
a) Maximum = 31, minimum = 4. Range = 31 − 4 = 27
b) Player A: range = 19 − 17 = 2 (very consistent). Player B: range = 30 − 5 = 25 (highly variable). Player A is far more consistent — a coach can predict Player A's performance reliably.
c) Yes. If every value in the dataset is identical, the range is zero. Example: {5, 5, 5, 5} → range = 5 − 5 = 0.
A graph is only useful if it matches the type of data and the question you're trying to answer. Using a pie chart for data measured over time, or a histogram for categories, doesn't just look wrong — it actively misleads the reader.
Suppose you track a city's temperature every month for a year. If you draw a pie chart, each month becomes a slice of a circle — but a circle implies that all months together make up some "whole," which is meaningless for temperature. A line graph, on the other hand, shows the rise and fall across the year clearly. The right graph makes the pattern instantly obvious; the wrong graph hides it.
Graph selection guide
| Graph type | Best for | Key feature |
|---|---|---|
| Bar graph | Comparing categories (qualitative or discrete data) | Bars do NOT touch; height = frequency or value |
| Histogram | Continuous data grouped into intervals | Bars DO touch; no gaps — data is continuous |
| Line graph | Data over time (showing trends and changes) | Points connected by lines to show progression |
| Pie chart | Parts of a whole — showing relative proportions | Sector angle = (freq / total) × 360° |
| Stem-and-leaf | Showing distribution of small datasets | Preserves original values; shows shape of data |
In a bar graph, the bars have gaps between them because the categories are separate and distinct (e.g., "cats," "dogs," "birds" — there is nothing between them).
In a histogram, the bars touch with no gaps because the data is continuous — values flow from one interval directly into the next (e.g., 0–10 cm is immediately followed by 10–20 cm). The touching bars represent this continuity.
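To see why histogram intervals leave no gaps, it can help to sort continuous values into intervals by hand. The sketch below uses hypothetical heights made up for illustration, with an assumed interval width of 10 cm starting at 150 cm. Each interval includes its lower boundary and excludes its upper one, so a value of exactly 160.0 cm belongs to 160–170 and to no other interval.

```python
# Hypothetical heights in cm, made up for illustration only.
heights = [151.2, 155.0, 158.9, 160.0, 162.3,
           164.5, 167.8, 171.0, 172.0, 178.4]

start = 150      # assumed lower edge of the first interval
bin_width = 10   # assumed width of each interval

counts = {}
for h in heights:
    # Find the lower edge of the interval that contains h.
    lower = start + int((h - start) // bin_width) * bin_width
    interval = (lower, lower + bin_width)
    counts[interval] = counts.get(interval, 0) + 1

# A text "histogram": one row per interval, one # per value.
for (low, high), count in sorted(counts.items()):
    print(f"{low}-{high} cm: {'#' * count} ({count})")
```

Every possible height lands in exactly one interval, which is what the touching bars of a histogram represent.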
Sector angles for pie charts
French: (6/30) × 360° = 0.20 × 360° = 72°
Science: (9/30) × 360° = 0.30 × 360° = 108°
History: (6/30) × 360° = 0.20 × 360° = 72°
Best graph: Line graph. The data is collected at regular time intervals (monthly) and you want to show trends and seasonal changes over time. A line graph connects the data points to reveal the pattern of change clearly.
Best graph: Pie chart. You have four distinct categories and you want to show how each one contributes to the whole (all students together = 100%). A pie chart makes relative proportions visually obvious. A bar chart works too, but the question specifically asks about proportions of a whole.
Best graph: Histogram. Height is continuous data grouped into class intervals (e.g., 150–160 cm, 160–170 cm). Since the intervals are adjacent and continuous, the bars must touch — that is the defining feature of a histogram. A bar graph would be wrong here because it implies the groups are separate categories.
b) A student draws a bar graph to display the distribution of exam scores that were recorded to the nearest whole number in intervals of 10 (50–59, 60–69, etc.). A classmate says this should be a histogram. Who is correct? Why?
c) Name the one graph type that preserves every original data value while also showing the shape of the distribution.
a) Total = 40.
Spring: (12/40) × 360° = 108°
Summer: (16/40) × 360° = 144°
Autumn: (8/40) × 360° = 72°
Winter: (4/40) × 360° = 36°
Check: 108 + 144 + 72 + 36 = 360° ✓
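The same sector-angle calculation can be sketched in code, using the season frequencies from the answer above. The function name `sector_angles` is an illustrative assumption.

```python
def sector_angles(frequencies):
    """Sector angle = (frequency / total) x 360 degrees."""
    total = sum(frequencies.values())
    return {label: freq * 360 / total for label, freq in frequencies.items()}


seasons = {"Spring": 12, "Summer": 16, "Autumn": 8, "Winter": 4}
angles = sector_angles(seasons)

for label, angle in angles.items():
    print(f"{label}: {angle:.0f}°")

# The angles of a complete pie chart should always add up to 360°.
print(f"Check: {sum(angles.values()):.0f}°")
```

If the check does not come out to 360°, either a category is missing or one of the frequencies was entered incorrectly.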
b) The classmate is correct. Exam scores grouped into intervals (50–59, 60–69…) are grouped numerical data: the intervals sit side by side on a number scale, and every possible score falls into exactly one of them. A histogram (bars touching) is the appropriate graph. A bar graph would incorrectly imply that the score groups are separate categories with no connection to one another.
c) A stem-and-leaf plot preserves every original value (you can reconstruct the full dataset from it) while its shape reveals the distribution visually.