Statistics — Study Guide

1

Types of Data

Why classifying data matters before analysing it

Before you can analyse data, you need to know what kind of data you have. The type determines which tools are valid. You wouldn't calculate the average eye colour — that makes no sense. Understanding data types prevents that kind of error.

💭

Why does data type determine which tool you use?
Imagine a survey collecting eye colour (brown, blue, green) and heights (160 cm, 172 cm, 155 cm). You can compute the average height — it's meaningful. But "average eye colour" is meaningless. The type of data tells you which calculations are valid and which graphs make sense. Skipping this step leads to conclusions that look numerical but mean nothing.

The four data types

Type	Description	Examples
Qualitative	Describes a category or quality — not a number	Eye colour, favourite sport, type of pet
Quantitative	Expressed as a number — can be measured or counted	Height, temperature, number of siblings
Discrete	Quantitative data that is counted; takes only specific, separate values (usually whole numbers)	Number of students in a class: 0, 1, 2, 3…
Continuous	Quantitative data that is measured; can take any value in a range	Height: 162.3 cm; time: 4.72 s; temperature: 21.5 °C

🔑

Quick classification test:
Ask yourself two questions in order:
1. "Is it a number?" If no, it's qualitative.
2. "Can I measure it with a ruler, scale, or stopwatch?" If yes, it's continuous. If it can only be counted, it's discrete.
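
The two questions can be written as a tiny decision function. This is only a sketch: the function name `classify` and its two yes/no parameters are made up for illustration, and you still have to answer the questions yourself for each variable.

```python
def classify(is_number: bool, is_measured: bool = False) -> str:
    """Apply the two-question test in order and return the data type."""
    # Question 1: "Is it a number?" If not, it is qualitative.
    if not is_number:
        return "qualitative"
    # Question 2: is it measured (ruler, scale, stopwatch) or only counted?
    if is_measured:
        return "quantitative, continuous"
    return "quantitative, discrete"


print(classify(is_number=False))                    # eye colour
print(classify(is_number=True, is_measured=True))   # height
print(classify(is_number=True, is_measured=False))  # number of siblings
```

The order matters: the second question is only asked once the first has been answered "yes".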

Population vs Sample

💭

Why do we sample instead of studying the whole population?
Suppose you want to know the average height of every 14-year-old in Quebec. There are hundreds of thousands of them. Measuring every single person would take years and cost a fortune. Instead, you measure a carefully chosen group of, say, 200 students. If the sample is representative and random, its statistics will very likely be close to the true population values.

Population

Every individual in the group of interest. (All Sec 2 students in Quebec.)

Sample

A smaller subset selected from the population. To be representative, it must be chosen randomly and without bias.
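
To see why a random sample of 200 can stand in for a much larger population, here is a small simulation sketch. The population is made up (simulated heights with an assumed mean of 160 cm and spread of 8 cm), so the numbers illustrate the idea only and say nothing about real students.

```python
import random
import statistics

random.seed(14)  # fixed seed so the sketch gives the same result each run

# Made-up population: 100 000 simulated heights in cm (not real data).
population = [random.gauss(160, 8) for _ in range(100_000)]

# Simple random sample of 200 individuals, drawn without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f} cm")
print(f"sample mean:     {sample_mean:.1f} cm")
```

The two means should land close together even though the sample covers only 0.2% of the population. A sample chosen with bias (for example, only basketball players) would not behave this way.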

★ Easy

Classifying data

Classify each item: (a) number of books on a shelf, (b) a person's blood type, (c) the time to run 100 m, (d) number of cars in a parking lot, (e) the mass of a watermelon

1

Apply the classification test to each item

(a) Number of books → countable whole numbers → Quantitative, Discrete
(b) Blood type (A, B, AB, O) → category, not a number → Qualitative
(c) Time to run 100 m → measured with a stopwatch, any decimal → Quantitative, Continuous
(d) Number of cars → countable whole numbers → Quantitative, Discrete
(e) Mass of a watermelon → measured on a scale, any decimal → Quantitative, Continuous

Key insight: Discrete = counted (separate values, usually whole numbers). Continuous = measured (any value in a range).

Checkpoint 1

a) A student records the shoe sizes of classmates. Is this discrete or continuous? Explain.
b) Give one example each of qualitative data and quantitative data collected from your school day.
c) A researcher surveys 50 randomly chosen households to estimate recycling habits across an entire city. What is the population? What is the sample?

a) Shoe sizes are discrete — they only come in specific values (6, 6.5, 7, 7.5…). Even though half-sizes exist, the values are fixed and countable, not continuously measured.

b) Many valid answers. Example: qualitative — subject names (Math, French, Science). Quantitative — number of minutes in each class, or a test score.

c) Population = all households in the city. Sample = the 50 randomly chosen households surveyed.

2

Frequency Tables & Stem-and-Leaf

Organising raw data so patterns become visible

Raw data is just an unordered list of numbers. Before you can find averages, spot patterns, or draw graphs, you need to organise it. Frequency tables and stem-and-leaf plots are two standard tools for doing that.

💭

Why organise raw data at all?
Imagine someone hands you: 72, 85, 91, 68, 72, 85, 91, 72, 60, 85. At a glance, you can't tell what's typical, what repeats, or what the spread is. Once you organise it into a frequency table, you immediately see that 72 and 85 each appear three times, and the range runs from 60 to 91. Organisation turns noise into insight.

Building a frequency table step by step

Here is the raw data from a class quiz (scores out of 100): 60, 70, 80, 70, 90, 70, 80, 60, 80, 70, 90, 80, 70, 80, 90, 80, 70, 60, 90, 80

★★ Medium

Building a frequency table with relative frequencies

Organise the quiz scores above into a frequency table showing frequency and relative frequency (as a percent).

1

List all unique values and tally

60: appears 3 times
70: appears 6 times
80: appears 7 times
90: appears 4 times

2

Confirm the total

3 + 6 + 7 + 4 = 20 values total ✓

3

Compute relative frequency for each row

Score 60: (3 / 20) × 100% = 15%
Score 70: (6 / 20) × 100% = 30%
Score 80: (7 / 20) × 100% = 35%
Score 90: (4 / 20) × 100% = 20%

4

Verify relative frequencies sum to 100%

15% + 30% + 35% + 20% = 100% ✓

Completed table: Score 60 → f=3 (15%) | 70 → f=6 (30%) | 80 → f=7 (35%) | 90 → f=4 (20%) | Total: 20 (100%)
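
The same tally can be checked with a few lines of Python. This is a minimal sketch: the helper name `frequency_table` is made up for illustration, and it uses `collections.Counter` from the standard library to do the counting.

```python
from collections import Counter

scores = [60, 70, 80, 70, 90, 70, 80, 60, 80, 70,
          90, 80, 70, 80, 90, 80, 70, 60, 90, 80]


def frequency_table(values):
    """Return rows of (value, frequency, relative frequency in percent)."""
    counts = Counter(values)  # tally how often each value appears
    total = len(values)
    return [(value, counts[value], counts[value] * 100 / total)
            for value in sorted(counts)]


table = frequency_table(scores)
for value, freq, rel in table:
    print(f"{value}: f={freq} ({rel:.0f}%)")
print("Total:", sum(freq for _, freq, _ in table))
```

Summing the frequency column at the end plays the same role as the "confirm the total" step in the solution.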

Relative frequency

relative frequency = (frequency / total) × 100%

All relative frequencies in a table should sum to 100%. If the total is off by more than a small rounding difference (for example, 99.9%), recheck your counts.

Stem-and-Leaf plots

A stem-and-leaf plot displays data in sorted order while preserving the original values. For two-digit values, the stem is the leading digit (the tens) and the leaf is the last digit (the ones). Each row in the plot is one stem, with all matching leaves listed to the right.

💡

How to build a stem-and-leaf plot:
1. Identify the stems (leading digits) and write them in a column.
2. For each data value, write its last digit (leaf) beside the correct stem.
3. Re-order the leaves in each row from smallest to largest.
4. Write a key, e.g. "2 | 3 means 23."

★ Easy

Building a stem-and-leaf plot

Given the data: 23, 25, 31, 34, 38, 42 — build a stem-and-leaf plot.

1

Identify the stems (tens digits)

Values range from 23 to 42, so stems are: 2, 3, 4

2

Place each leaf (ones digit) beside its stem

2 | 3 5
3 | 1 4 8
4 | 2

3

Write the key

Key: 2 | 3 means 23

Reading the plot: 2|3 5 = {23, 25} · 3|1 4 8 = {31, 34, 38} · 4|2 = {42}
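
The building steps can also be sketched in code. The function name `stem_and_leaf` is made up for illustration, and the sketch assumes whole numbers below 100, so that the tens digit is the stem and the ones digit is the leaf. Stems with no data are simply left out.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return plot rows such as '2 | 3 5' for whole numbers below 100."""
    rows = defaultdict(list)
    for value in sorted(values):        # sorting first keeps leaves in order
        stem, leaf = divmod(value, 10)  # tens digit, ones digit
        rows[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in rows[stem])}"
            for stem in sorted(rows)]


for row in stem_and_leaf([23, 25, 31, 34, 38, 42]):
    print(row)
print("Key: 2 | 3 means 23")
```

Because every leaf is kept, the original dataset can be rebuilt from the rows, which is exactly what the "Reading the plot" line does by hand.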

Checkpoint 2

a) A survey of 25 students found that 10 walk to school, 8 take the bus, and 7 are driven. Complete the relative frequency for each: walking = ?, bus = ?, driven = ?
b) Build a stem-and-leaf plot for: 14, 17, 21, 23, 26, 29, 31
c) A frequency table has four rows with frequencies 5, 8, 12, and x. The total is 30. What is x?

a) Walking: (10/25) × 100% = 40% | Bus: (8/25) × 100% = 32% | Driven: (7/25) × 100% = 28%. Check: 40 + 32 + 28 = 100% ✓

b) Stems 1, 2, 3:
1 | 4 7
2 | 1 3 6 9
3 | 1
Key: 1 | 4 means 14

c) 5 + 8 + 12 + x = 30 → 25 + x = 30 → x = 5

3

Measures of Central Tendency

What does "average" really mean?

When someone says "on average," they could mean three different things — the mean, the median, or the mode. Each one tells a different story about a dataset. Choosing the wrong one leads to misleading conclusions.

💭

Why are there three kinds of "average"?
Suppose five employees earn $30 000, $32 000, $31 000, $29 000, and $500 000 (the CEO). The mean is about $124 400 — but four out of five people earn far less than that. The median is $31 000 — a much more honest picture of what a typical employee earns. The mode doesn't apply here (all values are different). Each measure captures a different aspect of "typical."

Mean

mean = (sum of all values) / (number of values)

Median

middle value when data is sorted in ascending order

Mode

value(s) that appear most often in the dataset

Finding the median — odd vs even count

Odd number of values: sort the data; the median is the exact middle value. For 7 values, it's the 4th.
Even number of values: sort the data; the median is the mean of the two middle values. For 6 values, average the 3rd and 4th.
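
The odd and even cases translate directly into code. This is a hand-written sketch for illustration (the function name `median` is our own). Note that Python counts positions from 0, so the "4th value" of 7 sits at index 3.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[middle]
    # Even count: the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([4, 8, 6, 5, 3, 9, 6]))     # 7 values: the 4th sorted value
print(median([7, 3, 9, 3, 5, 8, 3, 6]))  # 8 values: average of 4th and 5th
```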

Choosing the right measure

Measure	Best used when	Weakness
Mean	Data is symmetric with no extreme outliers	Heavily affected by outliers (very high or very low values)
Median	Data has outliers or is skewed in one direction	Ignores the actual size of most values
Mode	Data is qualitative, or you want the most common value	May not exist, or there may be multiple modes

★ Easy

Finding mean, median, and mode

Find the mean, median, and mode for: {4, 8, 6, 5, 3, 9, 6}

1

Sort the data in ascending order

3, 4, 5, 6, 6, 8, 9 (7 values)

2

Calculate the mean

mean = (3 + 4 + 5 + 6 + 6 + 8 + 9) / 7 = 41 / 7 ≈ 5.86

3

Find the median (7 values → 4th value)

3, 4, 5, [6], 6, 8, 9 → median = 6

4

Find the mode (most frequent value)

6 appears twice; all other values appear once. → mode = 6

Answer: Mean ≈ 5.86 · Median = 6 · Mode = 6
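
If you want to check answers like this one, Python's standard `statistics` module offers all three measures. A minimal sketch, assuming Python 3.8 or later (needed for `multimode`):

```python
import statistics

data = [4, 8, 6, 5, 3, 9, 6]

mean_value = statistics.mean(data)
median_value = statistics.median(data)
modes = statistics.multimode(data)  # list of every most-frequent value

print(f"mean ≈ {mean_value:.2f}, median = {median_value}, mode(s) = {modes}")
```

`multimode` returns a list because a dataset can have more than one mode. When every value appears once, it returns all of them, which corresponds to "no mode" in this guide.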

★★ Medium

Outlier effect on the mean

Compare the mean and median for: {12, 14, 13, 15, 11, 14, 95}. What does this tell you?

1

Sort the data

11, 12, 13, 14, 14, 15, 95 (7 values)

2

Calculate the mean

mean = (11 + 12 + 13 + 14 + 14 + 15 + 95) / 7 = 174 / 7 ≈ 24.9

3

Find the median (4th value)

11, 12, 13, [14], 14, 15, 95 → median = 14

4

Interpret

The mean (≈ 24.9) is pulled far above the typical value because of the outlier 95. Six out of seven values are between 11 and 15. The median (14) accurately reflects the typical value. In this case, the median is a better measure of centre.

Answer: Mean ≈ 24.9 (distorted by outlier 95) · Median = 14 (stable, representative)
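
A quick way to see how much one value distorts the mean is to compute both measures with and without it. In this sketch the outlier is dropped only to make the comparison visible. It is not a recommendation to delete outliers from real data.

```python
import statistics

data = [12, 14, 13, 15, 11, 14, 95]
without_outlier = [x for x in data if x != 95]  # for comparison only

for label, values in [("with 95", data), ("without 95", without_outlier)]:
    print(f"{label}: mean = {statistics.mean(values):.1f}, "
          f"median = {statistics.median(values)}")
```

The mean shifts by more than 11 points while the median moves by only half a point, which is why the median is called the more stable measure here.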

🧠

When to use which measure:
Use the mean when data is balanced and symmetric (e.g., heights of students — no extreme outliers).
Use the median when data is skewed or contains outliers (e.g., incomes, house prices).
Use the mode when you want the most popular value, especially for qualitative data (e.g., the most common eye colour) or for discrete choices (e.g., the most common shoe size ordered).

Checkpoint 3

a) Find the mean, median, and mode for: {7, 3, 9, 3, 5, 8, 3, 6}
b) The ages of six people at a party are: 14, 15, 14, 16, 13, 72. Which measure of central tendency best describes a "typical" age? Why?
c) A dataset has no mode. What does that tell you about the data?

a) Sorted: 3, 3, 3, 5, 6, 7, 8, 9 (8 values)
Mean = (3+3+3+5+6+7+8+9) / 8 = 44/8 = 5.5
Median = average of 4th and 5th values = (5+6)/2 = 5.5
Mode = 3 (appears 3 times)

b) The median is best. The 72-year-old is an outlier that pulls the mean far from the typical age of the young people. Sorted: 13, 14, 14, 15, 16, 72 → median = (14+15)/2 = 14.5, which accurately reflects the typical partygoer.

c) If there is no mode, every value in the dataset appears exactly the same number of times (usually once each). No single value is more common than any other.

4

Spread & Range

Two datasets can have the same mean but look very different

Knowing the "centre" of data is only half the picture. You also need to know how spread out the values are. Two classes could have the same average test score but one class might have everyone clustered near that average, while the other class is all over the map.

💭

Why isn't central tendency enough on its own?
Class A scores: {58, 59, 60, 61, 62} — mean = 60.
Class B scores: {10, 30, 60, 90, 110} — mean = 60.
Both classes have exactly the same mean. But Class A is tightly grouped (a reliable, consistent class), while Class B is all over the place. A teacher treating them identically based on the mean alone would be making a serious mistake. You need spread to tell the full story.

Range

range = maximum − minimum

Range tells you the total span of the data. A large range means high variability; a small range means consistent data.

💡

Always report both central tendency AND spread.
A complete statistical description of a dataset includes at least one measure of centre (mean, median, or mode) and at least one measure of spread (range). Reporting only the mean is like describing a person by their age alone — technically true, but incomplete.

★ Easy

Same mean, different spread

Dataset A: {8, 9, 10, 11, 12}. Dataset B: {2, 6, 10, 14, 18}. Find the mean and range for each. What does this show?

1

Calculate mean for Dataset A

mean A = (8 + 9 + 10 + 11 + 12) / 5 = 50 / 5 = 10

2

Calculate mean for Dataset B

mean B = (2 + 6 + 10 + 14 + 18) / 5 = 50 / 5 = 10

3

Calculate range for each

Range A = 12 − 8 = 4
Range B = 18 − 2 = 16

4

Interpret

Both datasets have a mean of 10, but Dataset A has a range of only 4 (values are tightly clustered around the mean), while Dataset B has a range of 16 (values are widely spread). The mean alone is misleading — spread is essential for a complete picture.

Answer: Mean A = Mean B = 10 · Range A = 4 · Range B = 16 — same centre, very different spread
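
The comparison can be reproduced with two short helper functions. The names `mean` and `value_range` are made up for this sketch.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)


datasets = {"A": [8, 9, 10, 11, 12], "B": [2, 6, 10, 14, 18]}
for name, values in datasets.items():
    print(f"Dataset {name}: mean = {mean(values)}, range = {value_range(values)}")
```

Reporting the two numbers side by side is what separates the datasets. The mean column alone would be identical for both.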

Checkpoint 4

a) Find the range for: {17, 4, 22, 9, 31, 8}
b) Two basketball players each scored an average of 18 points per game over 5 games. Player A's scores: {17, 18, 19, 18, 18}. Player B's scores: {5, 10, 18, 30, 27}. Who is more consistent? Justify using the range.
c) Can the range of a dataset ever be zero? Give an example.

a) Maximum = 31, minimum = 4. Range = 31 − 4 = 27

b) Player A: range = 19 − 17 = 2 (very consistent). Player B: range = 30 − 5 = 25 (highly variable). Player A is far more consistent — a coach can predict Player A's performance reliably.

c) Yes. If every value in the dataset is identical, the range is zero. Example: {5, 5, 5, 5} → range = 5 − 5 = 0.

5

Graphs

Choosing the right visual for your data

A graph is only useful if it matches the type of data and the question you're trying to answer. Using a pie chart for data measured over time, or a histogram for categories, doesn't just look wrong — it actively misleads the reader.

💭

Why does choosing the wrong graph obscure the message?
Suppose you track a city's temperature every month for a year. If you draw a pie chart, each month becomes a slice of a circle — but a circle implies that all months together make up some "whole," which is meaningless for temperature. A line graph, on the other hand, shows the rise and fall across the year clearly. The right graph makes the pattern instantly obvious; the wrong graph hides it.

Graph selection guide

Graph type	Best for	Key feature
Bar graph	Comparing categories (qualitative or discrete data)	Bars do NOT touch; height = frequency or value
Histogram	Continuous data grouped into intervals	Bars DO touch; no gaps — data is continuous
Line graph	Data over time (showing trends and changes)	Points connected by lines to show progression
Pie chart	Parts of a whole — showing relative proportions	Sector angle = (freq / total) × 360°
Stem-and-leaf	Showing distribution of small datasets	Preserves original values; shows shape of data

⚠️

Histogram vs Bar graph — the most common mix-up:
In a bar graph, the bars have gaps between them because the categories are separate and distinct (e.g., "cats," "dogs," "birds" — there is nothing between them).
In a histogram, the bars touch with no gaps because the data is continuous — values flow from one interval directly into the next (e.g., 0–10 cm is immediately followed by 10–20 cm). The touching bars represent this continuity.
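
Grouping continuous data into adjacent intervals can be sketched as follows. The function name `bin_counts` and the heights are made up for illustration. The sketch also assumes one common convention that this guide does not state: a value exactly on a boundary (such as 160.0) is counted in the upper interval.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in adjacent intervals [start, start + width), and so on."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:  # values outside the covered range are skipped
            counts[index] += 1
    return counts


# Made-up heights in cm, for illustration only.
heights = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 170.0, 174.5]
counts = bin_counts(heights, start=150, width=10, n_bins=3)

for i, count in enumerate(counts):
    low = 150 + 10 * i
    print(f"{low}-{low + 10} cm | {'#' * count}")
```

Every height falls into exactly one interval, and each interval begins where the previous one ends. That is the continuity the touching bars of a histogram represent.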

Sector angles for pie charts

Pie chart sector angle

sector angle = (frequency / total) × 360°

All sector angles should sum to 360° (allowing for small rounding differences). Always verify this at the end.

★ Easy

Calculating pie chart sector angles

30 students chose a favourite subject. Results: Math=9, French=6, Science=9, History=6. Calculate the sector angle for each subject.

1

Confirm the total

9 + 6 + 9 + 6 = 30 ✓

2

Calculate each sector angle using (freq / total) × 360°

Math: (9/30) × 360° = 0.30 × 360° = 108°
French: (6/30) × 360° = 0.20 × 360° = 72°
Science: (9/30) × 360° = 0.30 × 360° = 108°
History: (6/30) × 360° = 0.20 × 360° = 72°

3

Verify the total equals 360°

108° + 72° + 108° + 72° = 360° ✓

Answer: Math = 108° · French = 72° · Science = 108° · History = 72°
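
The sector-angle formula is easy to automate. A minimal sketch (the function name `sector_angles` is our own) that takes a dictionary of frequencies and returns the angle for each category:

```python
def sector_angles(frequencies):
    """Map each category to its pie-chart sector angle in degrees."""
    total = sum(frequencies.values())
    return {label: freq * 360 / total for label, freq in frequencies.items()}


subjects = {"Math": 9, "French": 6, "Science": 9, "History": 6}
angles = sector_angles(subjects)

for label, angle in angles.items():
    print(f"{label}: {angle:.0f}°")
print(f"Total: {sum(angles.values()):.0f}°")
```

Printing the total at the end mirrors the verification step: if it is not 360°, something went wrong in the counts.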

★★ Medium

Choosing the right graph type

For each scenario, state the best graph type and justify your choice. (a) Monthly rainfall totals for one year. (b) The proportion of students choosing each of 4 lunch options. (c) The heights (in cm) of 50 students, grouped into intervals of 10 cm.

1

Scenario (a): Monthly rainfall over a year

Best graph: Line graph. The data is collected at regular time intervals (monthly) and you want to show trends and seasonal changes over time. A line graph connects the data points to reveal the pattern of change clearly.

2

Scenario (b): Proportion of students choosing lunch options

Best graph: Pie chart. You have four distinct categories and you want to show how each one contributes to the whole (all students together = 100%). A pie chart makes relative proportions visually obvious. A bar graph would also work, but the question specifically asks about proportions of a whole.

3

Scenario (c): Heights grouped into 10 cm intervals

Best graph: Histogram. Height is continuous data grouped into class intervals (e.g., 150–160 cm, 160–170 cm). Since the intervals are adjacent and continuous, the bars must touch — that is the defining feature of a histogram. A bar graph would be wrong here because it implies the groups are separate categories.

Summary: (a) Line graph — time trends · (b) Pie chart — parts of a whole · (c) Histogram — continuous grouped data

Checkpoint 5

a) 40 people were surveyed about their favourite season. Results: Spring=12, Summer=16, Autumn=8, Winter=4. Calculate the sector angle for each season in a pie chart.
b) Exam scores were recorded to the nearest whole number and grouped into intervals of 10 (50–59, 60–69, etc.). A student draws a bar graph to display the distribution. A classmate says this should be a histogram. Who is correct? Why?
c) Name the one graph type that preserves every original data value while also showing the shape of the distribution.

a) Total = 40.
Spring: (12/40) × 360° = 108°
Summer: (16/40) × 360° = 144°
Autumn: (8/40) × 360° = 72°
Winter: (4/40) × 360° = 36°
Check: 108 + 144 + 72 + 36 = 360° ✓

b) The classmate is correct. The scores are numerical data grouped into adjacent intervals (50–59, 60–69…) along a continuous scale, so each interval leads directly into the next with no gaps. A histogram is required (bars touching). A bar graph would incorrectly imply that the score groups are separate categories with no connection to one another.

c) A stem-and-leaf plot preserves every original value (you can reconstruct the full dataset from it) while its shape reveals the distribution visually.
