IB HL

Statistics and Probability


How to Use This Guide

Statistics and Probability is Topic 4 of the IB Math AA HL syllabus and reliably accounts for a substantial block of marks across Paper 2 and Paper 3. This guide follows the IB AA HL syllabus point by point: probability rules, conditional probability and Bayes’ theorem, discrete random variables (binomial and Poisson), continuous random variables (normal distribution), hypothesis testing (chi-squared, t-tests), and correlation and regression. Every section includes fully worked examples drawn from past IB papers, exam alerts for the mistakes that cost marks most often, and a complete formula reference at the end.

How to approach Statistics on exams: The IB rewards structured method. For any probability question, always define your events and state the formula before you substitute values. For hypothesis tests, always write H_0 and H_1 explicitly, state the significance level, calculate or identify the test statistic, compare to the critical value or p-value, and write a conclusion in context. For distributions, always name the distribution and its parameters — for example, “Let X \sim B(12, 0.3)” — before calculating any probabilities. Your GDC can compute binomial, Poisson, and normal probabilities directly; know how to use these functions and show the setup clearly in your working.

What is and is not in the formula booklet: The booklet gives: P(A \cup B), P(A \mid B), E(X), \text{Var}(X), binomial mean and variance, the Poisson formula and mean/variance, the normal standardisation Z = \frac{X-\mu}{\sigma}, the chi-squared statistic, and the Pearson correlation coefficient formula. NOT given: Bayes’ theorem (you must derive it from first principles or recognise the pattern), the method for constructing tree diagrams, the decision rule for hypothesis tests, and how to interpret the regression coefficient b.


Section 1: Probability (4.5–4.7)

Probability is a measure of how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain). An event is a subset of the sample space S, the set of all possible outcomes.

Notation:

  • P(A) — probability that event A occurs
  • A' or \bar{A} — the complement of A (event A does NOT occur)
  • A \cup B — A or B (or both)
  • A \cap B — A and B simultaneously

1.1 Combined Probability

The addition rule gives the probability that at least one of two events occurs:

P(A \cup B) = P(A) + P(B) - P(A \cap B)

The subtraction removes the double-counting of outcomes in both A and B.

Special case — mutually exclusive events: If A and B cannot both occur, P(A \cap B) = 0, so:

P(A \cup B) = P(A) + P(B) \qquad \text{(mutually exclusive only)}

Complement rule:

P(A') = 1 - P(A)

Combined Probability — Addition Rule

In a class of 30 students, 18 study French, 12 study Spanish, and 6 study both. A student is chosen at random. Find the probability they study French or Spanish.

Define events: Let F = studies French, S = studies Spanish.

P(F) = \frac{18}{30}, \quad P(S) = \frac{12}{30}, \quad P(F \cap S) = \frac{6}{30}

Apply the addition rule:

P(F \cup S) = \frac{18}{30} + \frac{12}{30} - \frac{6}{30} = \frac{24}{30} = \frac{4}{5}

Venn Diagram — Finding Unknown Probabilities

Given P(A) = 0.5, P(B) = 0.4, and P(A \cup B) = 0.7, find P(A \cap B).

Rearrange the addition rule:

P(A \cap B) = P(A) + P(B) - P(A \cup B) = 0.5 + 0.4 - 0.7 = 0.2

Hence find P(A only) — the probability of A but not B.

P(A \cap B') = P(A) - P(A \cap B) = 0.5 - 0.2 = 0.3

1.2 Conditional Probability

The conditional probability of A given that B has occurred is:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0

Rearranging gives the multiplication rule:

P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)

Confusing P(A \mid B) with P(B \mid A) is one of the most costly errors in IB Statistics. These are generally different. P(\text{has disease} \mid \text{test positive}) is NOT the same as P(\text{test positive} \mid \text{has disease}). Always ask: “which event is the condition (after the bar)?”

Conditional Probability — Two-Way Table

A survey records whether 100 students exercise regularly and whether they sleep more than 7 hours per night.

|  | Sleep ≥ 7 h | Sleep < 7 h | Total |
| Exercises regularly | 42 | 18 | 60 |
| Does not exercise | 23 | 17 | 40 |
| Total | 65 | 35 | 100 |

(a) Find the probability a randomly selected student exercises regularly given they sleep at least 7 hours.

(b) Find the probability a student sleeps fewer than 7 hours given they do not exercise regularly.

Part (a): Let E = exercises, G = sleeps ≥ 7 h.

P(E \mid G) = \frac{P(E \cap G)}{P(G)} = \frac{42/100}{65/100} = \frac{42}{65} \approx 0.646

Part (b): Let L = sleeps < 7 h, E' = does not exercise.

P(L \mid E') = \frac{17/100}{40/100} = \frac{17}{40} = 0.425
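Conditional probabilities from a two-way table reduce to ratios of counts, since the division by the grand total cancels. A quick sketch checking the values above (counts hard-coded from the table; variable names are mine):

```python
from fractions import Fraction

# Counts from the two-way table (exercise vs. sleep), 100 students in total
n_E_and_G = 42     # exercises and sleeps >= 7 h
n_G = 65           # sleeps >= 7 h (column total)
n_L_and_notE = 17  # sleeps < 7 h and does not exercise
n_notE = 40        # does not exercise (row total)

# P(E | G) = P(E and G) / P(G): the /100 cancels, leaving a ratio of counts
p_E_given_G = Fraction(n_E_and_G, n_G)
p_L_given_notE = Fraction(n_L_and_notE, n_notE)

print(p_E_given_G)     # 42/65
print(p_L_given_notE)  # 17/40
```

Using `Fraction` keeps the answers exact, which matches how the IB expects conditional probabilities to be quoted before any decimal approximation.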

1.3 Independence

Two events A and B are independent if knowing one occurred gives no information about the other:

A \text{ and } B \text{ are independent} \iff P(A \cap B) = P(A) \cdot P(B)

Equivalently, P(A \mid B) = P(A) and P(B \mid A) = P(B).

Independence ≠ mutual exclusivity. Mutually exclusive events (cannot both occur) with non-zero probabilities are actually the most extreme form of dependence — if A happens, B definitely cannot, so P(B \mid A) = 0 \neq P(B). Students frequently confuse these two concepts.

Testing Independence

P(A) = 0.4, P(B) = 0.5, P(A \cap B) = 0.2. Are A and B independent?

Check: P(A) \times P(B) = 0.4 \times 0.5 = 0.20

Since P(A \cap B) = 0.20 = P(A) \cdot P(B), the events are independent.

Also verify: P(A \mid B) = \dfrac{0.2}{0.5} = 0.4 = P(A). Confirmed.

1.4 Tree Diagrams

Tree diagrams organise sequential probability problems. Each branch carries a conditional probability, and probabilities along a path are multiplied (multiplication rule). Probabilities on branches from the same node must sum to 1.

Tree Diagram — Two-Stage Problem

A box contains 4 red and 6 blue balls. Two balls are drawn without replacement. Find the probability that both balls are the same colour.

Stage 1 branch probabilities:

  • P(\text{Red}_1) = \frac{4}{10}, P(\text{Blue}_1) = \frac{6}{10}

Stage 2 branch probabilities (conditional on Stage 1):

  • After Red: P(\text{Red}_2 \mid \text{Red}_1) = \frac{3}{9}, P(\text{Blue}_2 \mid \text{Red}_1) = \frac{6}{9}
  • After Blue: P(\text{Red}_2 \mid \text{Blue}_1) = \frac{4}{9}, P(\text{Blue}_2 \mid \text{Blue}_1) = \frac{5}{9}

Path probabilities (same colour):

P(\text{RR}) = \frac{4}{10} \cdot \frac{3}{9} = \frac{12}{90}

P(\text{BB}) = \frac{6}{10} \cdot \frac{5}{9} = \frac{30}{90}

P(\text{same colour}) = \frac{12}{90} + \frac{30}{90} = \frac{42}{90} = \frac{7}{15}
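Because the ball draws are equally likely, the tree-diagram answer can be verified by brute-force enumeration of all ordered pairs of draws — a useful sanity check when a tree gets complicated (this enumeration approach is mine, not an IB method):

```python
from fractions import Fraction
from itertools import permutations

# 4 red and 6 blue balls; draw two without replacement
balls = ["R"] * 4 + ["B"] * 6

# All equally likely ordered draws of two distinct balls (10 * 9 = 90 outcomes)
pairs = list(permutations(range(10), 2))
same = sum(1 for i, j in pairs if balls[i] == balls[j])

p_same = Fraction(same, len(pairs))
print(p_same)  # 7/15, matching 12/90 + 30/90
```

Enumeration scales badly, but for two-stage problems it confirms the multiplication-then-addition logic of the tree exactly.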

1.5 Bayes’ Theorem HL

Bayes’ theorem allows you to reverse conditional probabilities — to find the probability of a cause given an observed effect. It arises naturally from the multiplication rule.

For two complementary events A and A':

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

where the total probability P(B) is expanded as:

P(B) = P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')

Combining these:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')}

Bayes’ theorem is not in the formula booklet. You are expected to derive it or apply it directly using a tree diagram. Drawing the tree and labelling all branches is the safest approach — read off the answer by dividing the target path probability by the total probability of the observed outcome.

Bayes’ Theorem — Medical Test

A disease affects 1% of a population. A diagnostic test has a 95% sensitivity (true positive rate) and a 2% false positive rate. A randomly selected person tests positive. Find the probability they actually have the disease.

Define events:

  • D = has the disease, D' = does not have the disease
  • T^+ = tests positive

Given: P(D) = 0.01, P(D') = 0.99, P(T^+ \mid D) = 0.95, P(T^+ \mid D') = 0.02

Total probability of testing positive:

P(T^+) = P(T^+ \mid D) \cdot P(D) + P(T^+ \mid D') \cdot P(D')

= (0.95)(0.01) + (0.02)(0.99) = 0.0095 + 0.0198 = 0.0293

Apply Bayes:

P(D \mid T^+) = \frac{P(T^+ \mid D) \cdot P(D)}{P(T^+)} = \frac{0.0095}{0.0293} \approx 0.324

Despite a positive test, there is only a 32.4% chance the person actually has the disease. This counter-intuitive result arises because the disease is rare — most positive tests come from the large pool of healthy people with false positives.

In Bayes problems, the denominator is always P(\text{observed outcome}), computed via the law of total probability over all mutually exclusive “causes.” The most common error is forgetting to include all branches in this denominator, or using P(D) in place of P(T^+).
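The medical-test calculation above is a three-line computation once the branches are labelled. A sketch (variable names are mine):

```python
# Bayes' theorem for the medical-test example: P(D | T+) = P(T+|D) P(D) / P(T+)
p_D = 0.01               # prevalence
p_pos_given_D = 0.95     # sensitivity (true positive rate)
p_pos_given_notD = 0.02  # false positive rate

# Denominator: total probability of testing positive, over BOTH branches
p_pos = p_pos_given_D * p_D + p_pos_given_notD * (1 - p_D)
p_D_given_pos = p_pos_given_D * p_D / p_pos

print(round(p_pos, 4))          # 0.0293
print(round(p_D_given_pos, 3))  # 0.324
```

Note that the code computes the denominator over both branches explicitly — exactly the step the exam alert above warns is most often skipped.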

Bayes’ Theorem — Factory Quality Control

Machine A produces 60% of a factory’s output; machine B produces the remaining 40%. Machine A has a 3% defect rate; machine B has a 5% defect rate. A randomly selected item is found to be defective. What is the probability it came from machine A?

Setup: P(A) = 0.6, P(B) = 0.4, P(D \mid A) = 0.03, P(D \mid B) = 0.05

P(D) = (0.03)(0.6) + (0.05)(0.4) = 0.018 + 0.020 = 0.038

P(A \mid D) = \frac{(0.03)(0.6)}{0.038} = \frac{0.018}{0.038} \approx 0.474

There is about a 47.4% chance the defective item came from machine A.

Quick Recall — Section 1

Try to answer without scrolling up:

  1. State Bayes’ theorem in words.
  2. If P(A \cap B) = P(A) \cdot P(B), what does this tell you about A and B?
  3. What is the formula for conditional probability P(A \mid B)?

Answers:

  1. Bayes’ theorem lets you “reverse” a conditional probability — given the outcome, find the probability of the cause.
  2. A and B are independent.
  3. P(A \mid B) = \frac{P(A \cap B)}{P(B)}.

Section 2: Discrete Random Variables (4.8–4.9)

A discrete random variable (DRV) X takes a countable set of values, each with a defined probability. The probability distribution of X is a complete list of all possible values x_i and their probabilities P(X = x_i).

Requirements for a valid probability distribution:

  1. 0 \leq P(X = x_i) \leq 1 for all i
  2. \sum_i P(X = x_i) = 1

2.1 Expected Value and Variance

The expected value (mean) of X is the long-run average outcome:

E(X) = \sum x \cdot P(X = x)

The variance measures the spread around the mean:

\text{Var}(X) = E(X^2) - [E(X)]^2

where E(X^2) = \sum x^2 \cdot P(X = x).

The standard deviation is \text{SD}(X) = \sqrt{\text{Var}(X)}.

Linear transformations: For Y = aX + b:

E(aX + b) = a \cdot E(X) + b

\text{Var}(aX + b) = a^2 \cdot \text{Var}(X)

\text{Var}(aX + b) = a^2 \text{Var}(X), not a^2 \text{Var}(X) + b. The constant b shifts the distribution but does not affect spread. Students frequently add b or b^2 to the variance.

Expected Value and Variance

A biased die has the following distribution:

| x | 1 | 2 | 3 | 4 |
| P(X=x) | 0.1 | 0.3 | 0.4 | 0.2 |

(a) Find E(X). (b) Find \text{Var}(X). (c) Find E(3X - 2) and \text{Var}(3X - 2).

Part (a):

E(X) = 1(0.1) + 2(0.3) + 3(0.4) + 4(0.2) = 0.1 + 0.6 + 1.2 + 0.8 = 2.7

Part (b): First compute E(X^2):

E(X^2) = 1^2(0.1) + 2^2(0.3) + 3^2(0.4) + 4^2(0.2) = 0.1 + 1.2 + 3.6 + 3.2 = 8.1

\text{Var}(X) = 8.1 - (2.7)^2 = 8.1 - 7.29 = 0.81

Part (c):

E(3X - 2) = 3(2.7) - 2 = 8.1 - 2 = 6.1

\text{Var}(3X - 2) = 3^2 \cdot \text{Var}(X) = 9 \times 0.81 = 7.29
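The whole worked example can be reproduced from the distribution table alone, which is a good way to check GDC list calculations (a sketch; the dict-based layout is my own):

```python
# Biased die distribution from the table: value -> probability
dist = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}
assert abs(sum(dist.values()) - 1) < 1e-12  # valid distribution: probabilities sum to 1

mean = sum(x * p for x, p in dist.items())          # E(X)
e_x2 = sum(x**2 * p for x, p in dist.items())       # E(X^2)
var = e_x2 - mean**2                                # Var(X) = E(X^2) - [E(X)]^2

print(round(mean, 2))     # 2.7
print(round(var, 2))      # 0.81
print(round(3 * mean - 2, 1))  # 6.1  -- E(3X - 2)
print(round(9 * var, 2))       # 7.29 -- Var(3X - 2) = 3^2 Var(X)
```

The last two lines use the linear-transformation rules directly, confirming that the shift by -2 leaves the variance untouched.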

2.2 Binomial Distribution

The binomial distribution models the number of successes in n independent trials, each with probability p of success.

Conditions for a binomial model:

  1. Fixed number of trials n
  2. Each trial has exactly two outcomes (success / failure)
  3. Constant probability p of success on each trial
  4. Trials are independent

If X \sim B(n, p), then:

P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n

E(X) = np \qquad \text{Var}(X) = np(1-p)

Binomial Quick Reference

| Quantity | Formula |
| Distribution | X \sim B(n, p) |
| P.M.F. | P(X=x) = \binom{n}{x}p^x(1-p)^{n-x} |
| Mean | E(X) = np |
| Variance | \text{Var}(X) = np(1-p) |
| GDC (Casio) | BinomPD(x, n, p) for exact; BinomCD(x, n, p) for P(X \leq x) |

Binomial — Exact Probability

A fair coin is tossed 8 times. Find the probability of getting exactly 5 heads.

State the distribution: Let X = number of heads. Since trials are independent with p = 0.5 and n = 8: X \sim B(8, 0.5).

P(X = 5) = \binom{8}{5}(0.5)^5(0.5)^3 = 56 \times (0.5)^8 = \frac{56}{256} = \frac{7}{32} \approx 0.219

Binomial — Cumulative and Complement

In a production line, 15% of items are defective. A batch of 20 items is selected. Find: (a) P(X \leq 3), (b) P(X \geq 4), (c) P(2 \leq X \leq 5).

State: X \sim B(20, 0.15)

Part (a): Use GDC cumulative binomial: P(X \leq 3) \approx 0.6477

Part (b): Complement rule:

P(X \geq 4) = 1 - P(X \leq 3) \approx 1 - 0.6477 = 0.3523

Part (c):

P(2 \leq X \leq 5) = P(X \leq 5) - P(X \leq 1) \approx 0.9327 - 0.1756 = 0.7571
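The GDC values quoted above can be reproduced with Python's standard library — a sketch of what BinomPD/BinomCD compute (the function names here are mine, not GDC syntax):

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x): sum of the pmf from 0 to x (what BinomCD returns)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# Fair coin, 8 tosses, exactly 5 heads
print(round(binom_pmf(5, 8, 0.5), 3))        # 0.219 (= 7/32)

# X ~ B(20, 0.15): the three cumulative parts above
print(round(binom_cdf(3, 20, 0.15), 4))      # 0.6477
print(round(1 - binom_cdf(3, 20, 0.15), 4))  # 0.3523
print(round(binom_cdf(5, 20, 0.15) - binom_cdf(1, 20, 0.15), 4))  # 0.7571
```

Note the off-by-one pattern in part (c): P(2 \leq X \leq 5) subtracts the cdf at 1, not at 2 — the same subtlety applies on the GDC.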

When NOT to use binomial: The binomial requires (1) fixed n, (2) constant p, (3) independence. Drawing without replacement from a small population violates independence — use the hypergeometric distribution or direct probability instead. “Selecting cards from a deck” without replacement is not binomial.

Binomial — Finding Parameters from Mean and Variance

A random variable X \sim B(n, p) has mean 6 and variance 4.2. Find n and p.

Set up equations:

np = 6 \qquad np(1-p) = 4.2

Divide the second by the first:

1 - p = \frac{4.2}{6} = 0.7 \implies p = 0.3

Substitute back:

n(0.3) = 6 \implies n = 20

2.3 Poisson Distribution HL

The Poisson distribution models the number of occurrences of a random event in a fixed interval of time or space, when events occur independently at a constant average rate.

Conditions for a Poisson model:

  1. Events occur independently
  2. Events occur at a constant average rate m per unit interval
  3. Two events cannot occur simultaneously

If X \sim \text{Po}(m), then:

P(X = x) = \frac{e^{-m} m^x}{x!}, \qquad x = 0, 1, 2, \ldots

E(X) = \text{Var}(X) = m

The equality of mean and variance is a diagnostic property — if a dataset has mean ≈ variance, a Poisson model is plausible.

Poisson Quick Reference

| Quantity | Formula |
| Distribution | X \sim \text{Po}(m) |
| P.M.F. | P(X=x) = \dfrac{e^{-m}m^x}{x!} |
| Mean | E(X) = m |
| Variance | \text{Var}(X) = m |
| GDC (Casio) | PoissonPD(x, m) for exact; PoissonCD(x, m) for P(X \leq x) |

Poisson — Direct Calculation

Calls arrive at a helpdesk at an average rate of 3 per hour. Find the probability that: (a) exactly 2 calls arrive in one hour, (b) fewer than 4 calls arrive in one hour.

State: X \sim \text{Po}(3)

Part (a):

P(X = 2) = \frac{e^{-3} \cdot 3^2}{2!} = \frac{e^{-3} \cdot 9}{2} \approx \frac{9}{2} \times 0.04979 \approx 0.224

Part (b):

P(X < 4) = P(X \leq 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3)

= e^{-3}\left(1 + 3 + \frac{9}{2} + \frac{27}{6}\right) = e^{-3} \cdot 13 \approx 0.04979 \times 13 \approx 0.647
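Both parts follow directly from the pmf formula; a short stdlib sketch of what PoissonPD/PoissonCD compute (function names are mine):

```python
import math

def poisson_pmf(x, m):
    """P(X = x) for X ~ Po(m)."""
    return math.exp(-m) * m**x / math.factorial(x)

def poisson_cdf(x, m):
    """P(X <= x): sum of the pmf from 0 to x (what PoissonCD returns)."""
    return sum(poisson_pmf(k, m) for k in range(x + 1))

# Calls at an average rate of 3 per hour: X ~ Po(3)
print(round(poisson_pmf(2, 3), 3))  # 0.224
print(round(poisson_cdf(3, 3), 3))  # 0.647 -- note P(X < 4) = P(X <= 3)
```

The translation P(X < 4) = P(X \leq 3) only holds because X is discrete; for continuous variables strict and non-strict inequalities coincide for a different reason (single points have probability zero).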

Poisson — Changing the Interval

Emails arrive at an average rate of 6 per hour. Find the probability that: (a) no emails arrive in 20 minutes, (b) more than 2 emails arrive in 30 minutes.

Key step: Rescale the rate to match the interval.

  • In 20 minutes: m = 6 \times \frac{20}{60} = 2, so X \sim \text{Po}(2)
  • In 30 minutes: m = 6 \times \frac{30}{60} = 3, so Y \sim \text{Po}(3)

Part (a):

P(X = 0) = \frac{e^{-2} \cdot 2^0}{0!} = e^{-2} \approx 0.135

Part (b):

P(Y > 2) = 1 - P(Y \leq 2) = 1 - e^{-3}\left(1 + 3 + \frac{9}{2}\right) = 1 - e^{-3}(8.5) \approx 1 - 0.423 = 0.577

When the Poisson interval changes, multiply the rate proportionally. If the rate is “m per hour” and you want a 15-minute interval, use m_{\text{new}} = m \times \frac{1}{4}. Forgetting to rescale is one of the most common Poisson errors.

Poisson as an approximation to binomial HL: When n is large and p is small (rule of thumb: n \geq 50, p \leq 0.1), B(n, p) \approx \text{Po}(np).

Poisson Approximation to Binomial HL

A rare genetic mutation occurs in 1 in 500 births. In a sample of 400 births, find the approximate probability that at most 2 have the mutation.

Check conditions: n = 400 (large), p = 0.002 (small). Approximate with \text{Po}(m) where m = np = 400 \times 0.002 = 0.8.

P(X \leq 2) = e^{-0.8}\left(1 + 0.8 + \frac{0.64}{2}\right) = e^{-0.8}(2.12) \approx 0.4493 \times 2.12 \approx 0.953
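It is instructive to compare the Poisson approximation against the exact binomial value, which the stdlib can compute directly (a sketch under the stated n, p; variable names are mine):

```python
import math

n, p = 400, 1 / 500  # 1-in-500 mutation, 400 births
m = n * p            # Poisson parameter m = np = 0.8

# Exact binomial P(X <= 2) versus the Poisson approximation
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
approx = sum(math.exp(-m) * m**k / math.factorial(k) for k in range(3))

print(round(exact, 4), round(approx, 4))  # both approximately 0.953
```

With n = 400 and p = 0.002 the two values agree to about three decimal places, which is why the rule of thumb (n large, p small) makes the approximation exam-safe here.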

Quick Recall — Section 2

Try to answer without scrolling up:

  1. Write the formula for the expected value E(X) of a discrete random variable.
  2. If X \sim B(n, p), what are E(X) and \text{Var}(X)?
  3. When can you approximate a binomial distribution with a Poisson distribution?

Answers:

  1. E(X) = \sum x_i \cdot P(X = x_i).
  2. E(X) = np and \text{Var}(X) = np(1-p).
  3. When n is large and p is small (close to 0), with m = np.

Section 3: Continuous Random Variables (4.10–4.12)

A continuous random variable (CRV) takes values in a continuous range. Unlike DRVs, we cannot assign probability to individual values — instead we work with a probability density function (pdf) f(x).

Properties of a pdf:

  1. f(x) \geq 0 for all x
  2. \int_{-\infty}^{\infty} f(x)\,dx = 1 (total area under the curve equals 1)
  3. P(a \leq X \leq b) = \int_a^b f(x)\,dx

For a CRV: P(X = x) = 0 for any single value. Therefore P(X \leq a) = P(X < a).

Mean and variance of a CRV:

E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \qquad \text{Var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(X)]^2

3.1 The Normal Distribution

The normal distribution is the most important continuous distribution. It models many natural phenomena — heights, measurement errors, exam scores — where data clusters symmetrically around a mean.

If X \sim N(\mu, \sigma^2), its bell-shaped pdf is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Key properties:

  • Symmetric about the mean \mu
  • Mean = Median = Mode = \mu
  • Inflection points at \mu \pm \sigma
  • P(\mu - \sigma < X < \mu + \sigma) \approx 0.683
  • P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.954
  • P(\mu - 3\sigma < X < \mu + 3\sigma) \approx 0.997

The 68-95-99.7 Rule

| Interval | Approximate probability |
| Within 1\sigma of \mu | 68.3% |
| Within 2\sigma of \mu | 95.4% |
| Within 3\sigma of \mu | 99.7% |

These are approximations. Use your GDC for exact values.

3.2 Standardisation and Z-Scores

The standard normal distribution Z \sim N(0, 1) has mean 0 and variance 1. Any normal random variable can be converted to Z via:

Z = \frac{X - \mu}{\sigma}

A z-score tells you how many standard deviations an observation is from the mean. Z = 1.5 means the value is 1.5 standard deviations above the mean.

Normal Distribution — Finding Probabilities

Heights of adults are normally distributed with mean 170 cm and standard deviation 8 cm. Find: (a) P(X > 180), (b) P(160 < X < 185).

State: X \sim N(170, 64)

Part (a): Z = \dfrac{180 - 170}{8} = 1.25

Using GDC or standard normal table: P(Z > 1.25) = 1 - P(Z \leq 1.25) \approx 1 - 0.8944 = 0.1056

Part (b): Lower bound: Z_1 = \dfrac{160-170}{8} = -1.25; Upper bound: Z_2 = \dfrac{185-170}{8} = 1.875

P(-1.25 < Z < 1.875) = P(Z < 1.875) - P(Z < -1.25) \approx 0.9696 - 0.1056 = 0.8640

When using GDC for normal probabilities, use normalcdf(lower, upper, μ, σ). For P(X > a), use a very large upper bound (e.g., 10^{99}), or compute 1 - P(X \leq a) using the complement. Show your GDC setup in working — write out which distribution, the parameters, and the bounds.

Normal Distribution — Symmetry Shortcut

X \sim N(50, 25). Find P(45 < X < 55) without a table.

The interval [45, 55] is symmetric about \mu = 50, each end \frac{50-45}{5} = 1 standard deviation away.

By the 68-95-99.7 rule: P(\mu - \sigma < X < \mu + \sigma) \approx 0.683.

For an exact answer: Z bounds are -1 and 1, so using GDC: P(-1 < Z < 1) = 0.6827.
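Normal probabilities can be checked without a GDC using the error function, since \Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}\left(z/\sqrt{2}\right)\right). A sketch covering the two examples above (the helper name is mine):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Heights ~ N(170, 8^2)
print(round(1 - normal_cdf(180, 170, 8), 4))                       # 0.1056
print(round(normal_cdf(185, 170, 8) - normal_cdf(160, 170, 8), 4)) # 0.8640

# Symmetry shortcut check: P(-1 < Z < 1) for Z ~ N(0, 1)
print(round(normal_cdf(1, 0, 1) - normal_cdf(-1, 0, 1), 4))        # 0.6827
```

Every probability is a difference of two cdf values, which is exactly what normalcdf(lower, upper, μ, σ) does internally.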

3.3 Inverse Normal

The inverse normal problem asks: given a probability p, find the value x such that P(X \leq x) = p.

On the GDC, use invNorm(p, μ, σ) to find x directly.

Inverse Normal — Finding a Threshold

Test scores are distributed as X \sim N(65, 100). The top 10% of students receive a distinction. Find the minimum score needed for a distinction.

We need x such that P(X > x) = 0.10, i.e., P(X \leq x) = 0.90.

Using GDC: invNorm(0.90, 65, 10) = 77.8

A student needs at least 77.8 (round up to 78) to receive a distinction.

Inverse Normal — Symmetric Interval

X \sim N(\mu, \sigma^2) with \mu = 40 and \sigma = 5. Find the value k such that P(\mu - k < X < \mu + k) = 0.90.

By symmetry, P(X < \mu - k) = 0.05.

Using GDC: invNorm(0.05, 40, 5) \approx 31.78, so k = 40 - 31.78 = 8.22.

Check: P(40 - 8.22 < X < 40 + 8.22) = P(31.78 < X < 48.22) = 0.90. Confirmed.

Finding μ\mu or σ\sigma from a Normal Probability

X \sim N(\mu, 16). It is given that P(X > 20) = 0.2. Find \mu.

Standardise: P\left(Z > \dfrac{20 - \mu}{4}\right) = 0.2

Find the z-score: P(Z > z) = 0.2 \Rightarrow P(Z \leq z) = 0.8 \Rightarrow z \approx 0.8416

Solve:

\frac{20 - \mu}{4} = 0.8416 \implies 20 - \mu = 3.366 \implies \mu \approx 16.6
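Since the normal cdf is strictly increasing, invNorm can be emulated by bisecting on the cdf — a sketch of how the z-score 0.8416 (and invNorm generally) can be obtained without tables (function names and the bisection approach are mine, not what the GDC documents internally):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def inv_norm(p, mu=0.0, sigma=1.0):
    """Solve P(Z <= z) = p by bisection on [-10, 10], then rescale to N(mu, sigma^2)."""
    lo, hi = -10.0, 10.0
    for _ in range(60):  # 60 halvings: precision far below 1e-10
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return mu + sigma * (lo + hi) / 2

z = inv_norm(0.8)
print(round(z, 4))           # 0.8416
print(round(20 - 4 * z, 1))  # 16.6 -- solving (20 - mu)/4 = z for mu
```

The same helper reproduces the earlier distinction threshold: inv_norm(0.90, 65, 10) gives approximately 77.8.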


Section 4: Hypothesis Testing (4.13–4.15)

Statistical hypothesis testing provides a formal framework for deciding whether observed data provide sufficient evidence against a default assumption. Every test follows the same logical structure.

The universal procedure:

  1. State H_0 and H_1 — the null and alternative hypotheses
  2. State the significance level \alpha (typically 0.05 or 0.01)
  3. Identify the test statistic and its distribution under H_0
  4. Calculate the test statistic (or use GDC)
  5. Find the p-value (or compare test statistic to critical value)
  6. Make a decision: reject H_0 if p-value < \alpha
  7. Write a conclusion in context

Never omit H_0 and H_1, and never write them in terms of sample statistics. Hypotheses are always about population parameters. Write H_0: \mu = 50, not “H_0: the sample mean is 50.” Losing marks for missing hypotheses is one of the most avoidable errors in IB exams.

4.1 The p-value

The p-value is the probability of obtaining a result at least as extreme as the observed data, assuming H_0 is true. A small p-value means the observed data would be very unlikely under H_0, providing evidence against it.

Decision rule: Reject H_0 if p-value < \alpha.

The p-value is NOT the probability that H_0 is true. It is the probability of the observed data (or more extreme) given H_0 is true. This distinction matters in written conclusions — say “there is sufficient evidence to reject H_0” rather than “we have proved H_0 is false.”

4.2 Chi-Squared Goodness-of-Fit Test

The chi-squared goodness-of-fit test assesses whether observed frequency data are consistent with a proposed theoretical distribution.

Test statistic:

\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}

where O is the observed frequency and E is the expected frequency under H_0.

Under H_0, \chi^2 \sim \chi^2_\nu where the degrees of freedom \nu = k - 1 - m, with k = number of categories and m = number of parameters estimated from the data.

Conditions: All expected frequencies E \geq 5. If some are too small, combine adjacent categories.

Chi-Squared GOF Procedure

| Step | Action |
| H_0 | The data follow the proposed distribution |
| H_1 | The data do not follow the proposed distribution |
| \nu | k - 1 (if no parameters estimated); k - 1 - m (if m estimated) |
| Reject H_0 if | \chi^2_{\text{calc}} > \chi^2_{\text{crit}} or p < \alpha |

Chi-Squared Goodness-of-Fit

A die is rolled 120 times. The observed frequencies are:

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
| Observed | 17 | 22 | 18 | 25 | 19 | 19 |

Test at the 5% significance level whether the die is fair.

Step 1 — Hypotheses:

  • H_0: The die is fair (each face has probability \frac{1}{6})
  • H_1: The die is not fair

Step 2 — Significance level: \alpha = 0.05

Step 3 — Expected frequencies: If fair, E = \frac{120}{6} = 20 for each face.

Step 4 — Test statistic:

\chi^2 = \frac{(17-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(19-20)^2}{20}

= \frac{9 + 4 + 4 + 25 + 1 + 1}{20} = \frac{44}{20} = 2.2

Step 5 — Degrees of freedom: \nu = 6 - 1 = 5

Step 6 — Critical value: \chi^2_{5, 0.05} = 11.07

Step 7 — Decision: \chi^2_{\text{calc}} = 2.2 < 11.07, so we fail to reject H_0.

Conclusion: At the 5% significance level, there is insufficient evidence to conclude the die is unfair.
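The chi-squared statistic is just a sum over categories, so it is easy to verify by hand or in code (a sketch; the list layout is mine):

```python
# Observed die frequencies and expected frequencies under H0 (fair die)
observed = [17, 22, 18, 25, 19, 19]
expected = [120 / 6] * 6  # E = 20 per face

# chi^2 = sum over cells of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 2.2 -- compare with the critical value 11.07: fail to reject H0
```

Checking that the observed frequencies sum to 120 before computing is a worthwhile habit: a transcription error in one cell silently changes the statistic.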

The chi-squared test requires all expected frequencies to be at least 5, not the observed frequencies. If any E < 5, merge that category with an adjacent one before calculating \chi^2.

4.3 Chi-Squared Test for Independence

The chi-squared test for independence tests whether two categorical variables are associated in a contingency table.

H_0: The two variables are independent
H_1: The two variables are not independent (there is an association)

Expected frequency for cell (i, j):

E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}

Degrees of freedom: \nu = (r-1)(c-1) where r = number of rows and c = number of columns.

Chi-Squared Test for Independence

Use the table from Section 1.2 (exercise vs. sleep). Test at 5% significance whether exercise and sleep are independent.

|  | Sleep ≥ 7 h | Sleep < 7 h | Total |
| Exercises | 42 | 18 | 60 |
| Does not exercise | 23 | 17 | 40 |
| Total | 65 | 35 | 100 |

Hypotheses: H_0: Exercise and sleep hours are independent. H_1: They are associated.

Expected frequencies: E_{11} = \frac{60 \times 65}{100} = 39, E_{12} = \frac{60 \times 35}{100} = 21, E_{21} = \frac{40 \times 65}{100} = 26, E_{22} = \frac{40 \times 35}{100} = 14

Test statistic:

\chi^2 = \frac{(42-39)^2}{39} + \frac{(18-21)^2}{21} + \frac{(23-26)^2}{26} + \frac{(17-14)^2}{14}

= \frac{9}{39} + \frac{9}{21} + \frac{9}{26} + \frac{9}{14} \approx 0.231 + 0.429 + 0.346 + 0.643 = 1.649

Degrees of freedom: \nu = (2-1)(2-1) = 1

Critical value: \chi^2_{1, 0.05} = 3.841

Decision: 1.649 < 3.841, fail to reject H_0.

Conclusion: At the 5% level, there is insufficient evidence of an association between exercise habits and sleep duration.
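The expected-frequency formula E_{ij} = (row total × column total) / grand total can be applied mechanically to any contingency table; a sketch for this 2×2 case (variable names are mine):

```python
# Observed 2x2 table: rows = exercises yes/no, cols = sleep >= 7 h / < 7 h
observed = [[42, 18], [23, 17]]
row_totals = [sum(row) for row in observed]        # [60, 40]
col_totals = [sum(col) for col in zip(*observed)]  # [65, 35]
grand = sum(row_totals)                            # 100

# E_ij = (row i total)(col j total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]
chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))

print(expected)          # [[39.0, 21.0], [26.0, 14.0]]
print(round(chi_sq, 3))  # 1.648 -- the worked value 1.649 reflects rounding each term
```

With \nu = 1 and \chi^2_{1, 0.05} = 3.841, the same conclusion follows: fail to reject H_0.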

4.4 t-Test for the Mean

The one-sample t-test tests whether the population mean \mu equals a specified value \mu_0, when the population standard deviation is unknown and the population is approximately normal.

Hypotheses:

  • Two-tailed: H_0: \mu = \mu_0 vs. H_1: \mu \neq \mu_0
  • One-tailed: H_0: \mu = \mu_0 vs. H_1: \mu > \mu_0 (or \mu < \mu_0)

Test statistic:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

where \bar{x} is the sample mean, s is the sample standard deviation, and n is the sample size. Under H_0, t \sim t_{n-1} (t-distribution with n-1 degrees of freedom).

One-Sample t-Test

A manufacturer claims their light bulbs last a mean of 1000 hours. A random sample of 15 bulbs gives a mean of 985 hours with a standard deviation of 30 hours. Test at the 5% level whether the mean lifetime is less than claimed.

Hypotheses: H_0: \mu = 1000; H_1: \mu < 1000 (one-tailed)

Significance level: \alpha = 0.05

Test statistic:

t = \frac{985 - 1000}{30/\sqrt{15}} = \frac{-15}{7.746} \approx -1.936

Degrees of freedom: \nu = 15 - 1 = 14

Critical value (lower tail): -t_{14, 0.05} = -1.761

Decision: t = -1.936 < -1.761, so we reject H_0.

Conclusion: At the 5% significance level, there is sufficient evidence to conclude the mean bulb lifetime is less than 1000 hours.

On a GDC (Casio), the t-test is under STAT → TEST → 1-Sample tTest. Enter \mu_0, \bar{x}, s_x, and n, then select the correct tail. The GDC outputs the t-statistic and p-value directly — always report both in your working.
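The t-statistic itself is one line of arithmetic, so the GDC output for the bulb example can be sanity-checked by hand (a sketch; variable names are mine):

```python
import math

# One-sample t-test for the bulb example: H0: mu = 1000, H1: mu < 1000
x_bar, mu0, s, n = 985, 1000, 30, 15

t = (x_bar - mu0) / (s / math.sqrt(n))  # t = (x_bar - mu0) / (s / sqrt(n))
print(round(t, 3))  # -1.936, with nu = n - 1 = 14 degrees of freedom
```

Comparing -1.936 with the lower-tail critical value -1.761 reproduces the decision to reject H_0; the p-value itself still needs the t-distribution (GDC or tables).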

4.5 Two-Sample t-Test and Paired t-Test HL

Two-sample t-test: Tests whether two independent populations have the same mean.

H_0: \mu_1 = \mu_2 \qquad \text{(equivalently, } \mu_1 - \mu_2 = 0\text{)}

Test statistic (assuming equal but unknown variances — pooled):

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}

with \nu = n_1 + n_2 - 2 degrees of freedom.

Paired t-test HL: Used when observations come in matched pairs (before/after, two measurements on the same subject).

Define the differences d_i = x_{1i} - x_{2i}. Then:

\bar{d} = \frac{1}{n}\sum d_i, \qquad s_d = \text{SD of the } d_i \text{ values}

t = \frac{\bar{d}}{s_d / \sqrt{n}} \sim t_{n-1}

Use a paired t-test when the same subject is measured twice (before/after). Use a two-sample t-test when comparing two different, independent groups. Applying the wrong test to paired data ignores the correlation between measurements and leads to incorrect inference.

Paired t-Test HL

Eight students take a memory test before and after a training programme. Their scores are:

| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Before | 62 | 71 | 58 | 80 | 65 | 73 | 55 | 70 |
| After | 68 | 76 | 60 | 84 | 67 | 78 | 61 | 74 |

Test at the 5% level whether the training improves scores.

Step 1 — Compute differences di=AfterBefored_i = \text{After} - \text{Before}:

| did_i | 6 | 5 | 2 | 4 | 2 | 5 | 6 | 4 |

Step 2 — Statistics of dd:

dˉ=6+5+2+4+2+5+6+48=348=4.25\bar{d} = \frac{6+5+2+4+2+5+6+4}{8} = \frac{34}{8} = 4.25

sd=(didˉ)2n11.5811s_d = \sqrt{\frac{\sum(d_i - \bar{d})^2}{n-1}} \approx 1.5811

Step 3 — Hypotheses: H0:μd=0H_0: \mu_d = 0; H1:μd>0H_1: \mu_d > 0 (one-tailed)

Step 4 — Test statistic:

t=4.251.5811/8=4.250.55907.60t = \frac{4.25}{1.5811/\sqrt{8}} = \frac{4.25}{0.5590} \approx 7.60

Step 5 — Critical value: t7,0.05=1.895t_{7, 0.05} = 1.895

Step 6 — Decision: t=7.601.895t = 7.60 \gg 1.895, reject H0H_0.

Conclusion: At the 5% significance level, there is sufficient evidence that the training programme improves scores.
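The worked values can be reproduced directly from the raw scores. A short sketch using Python's `statistics` module (off-syllabus, but useful for checking homework):

```python
import math
import statistics

# Paired t-test on the memory-score data from the worked example
before = [62, 71, 58, 80, 65, 73, 55, 70]
after  = [68, 76, 60, 84, 67, 78, 61, 74]

d = [a - b for a, b in zip(after, before)]   # differences, After - Before
d_bar = statistics.mean(d)                    # 4.25
s_d = statistics.stdev(d)                     # sample SD, about 1.581
t = d_bar / (s_d / math.sqrt(len(d)))         # d.f. = 7
print(round(t, 2))  # 7.6
```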


Section 5: Correlation and Regression (4.1–4.4)

Bivariate data involves two variables measured on the same individual. We explore whether changes in one variable are associated with changes in the other.

5.1 Scatter Diagrams

A scatter diagram plots pairs (xi,yi)(x_i, y_i) to visualise the relationship between two variables. Key features to comment on:

  • Direction: positive (up-right) or negative (down-right) trend
  • Strength: how closely do points follow a line?
  • Form: linear or non-linear?
  • Outliers: points far from the general pattern

5.2 Pearson’s Correlation Coefficient

The Pearson product-moment correlation coefficient (PMCC) rr measures the strength and direction of the linear relationship between two variables:

r=SxySxxSyyr = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}

where:

Sxy=(xixˉ)(yiyˉ)=xiyinxˉyˉS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}

Sxx=(xixˉ)2=xi2nxˉ2Syy=(yiyˉ)2=yi2nyˉ2S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 \qquad S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2

Properties of rr:

  • 1r1-1 \leq r \leq 1
  • r=1r = 1: perfect positive linear correlation
  • r=1r = -1: perfect negative linear correlation
  • r=0r = 0: no linear correlation (but there may be a non-linear relationship)

Interpreting rr — Approximate Guidelines

| \lvert r \rvert value | Interpretation |
|---|---|
| 0.00–0.20 | Very weak or no linear correlation |
| 0.20–0.40 | Weak linear correlation |
| 0.40–0.60 | Moderate linear correlation |
| 0.60–0.80 | Strong linear correlation |
| 0.80–1.00 | Very strong linear correlation |

These are guidelines, not rigid rules. Context matters.

r=0.9r = 0.9 does NOT mean “90% of the variation is explained.” The coefficient of determination r2=0.81r^2 = 0.81 tells you that 81% of the variation in yy is explained by the linear relationship with xx. Students regularly confuse rr with r2r^2.

Correlation does not imply causation. Even a very high value of rr does not mean that changes in xx cause changes in yy. There may be a lurking variable, or the relationship may be coincidental.

Computing PMCC

Five students’ hours of study (xx) and exam marks (yy) are:

| Student | x | y |
|---|---|---|
| A | 2 | 50 |
| B | 4 | 65 |
| C | 6 | 72 |
| D | 8 | 80 |
| E | 10 | 90 |

Calculate rr.

Step 1 — Means: xˉ=2+4+6+8+105=6\bar{x} = \frac{2+4+6+8+10}{5} = 6, yˉ=50+65+72+80+905=71.4\bar{y} = \frac{50+65+72+80+90}{5} = 71.4

Step 2 — Sums of squares:

| Student | x_i - \bar{x} | y_i - \bar{y} | (x-\bar{x})(y-\bar{y}) | (x-\bar{x})^2 | (y-\bar{y})^2 |
|---|---|---|---|---|---|
| A | -4 | -21.4 | 85.6 | 16 | 457.96 |
| B | -2 | -6.4 | 12.8 | 4 | 40.96 |
| C | 0 | 0.6 | 0 | 0 | 0.36 |
| D | 2 | 8.6 | 17.2 | 4 | 73.96 |
| E | 4 | 18.6 | 74.4 | 16 | 345.96 |
| Sum | | | 190 | 40 | 919.2 |

r=SxySxxSyy=19040×919.2=19036768=190191.750.991r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{190}{\sqrt{40 \times 919.2}} = \frac{190}{\sqrt{36768}} = \frac{190}{191.75} \approx 0.991

Very strong positive linear correlation.

5.3 Spearman’s Rank Correlation Coefficient HL

Spearman’s rank correlation coefficient rsr_s measures the strength and direction of a monotonic relationship between two variables. It is appropriate when:

  • Data are ordinal (ranked categories), or
  • The bivariate data are not normally distributed, or
  • The relationship may be monotonic but not necessarily linear

Formula:

rs=16di2n(n21)r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}

where did_i is the difference in ranks for the ii-th pair and nn is the number of pairs.

Tied ranks: If two or more values are equal, assign each the average of the ranks they would have occupied. For example, if the 3rd and 4th values are tied, both receive rank 3.53.5.

On Paper 3, your GDC can calculate Spearman’s rank correlation directly — but you must also be able to rank data by hand and apply the formula step by step. Expect to show the ranking process and the di2d_i^2 table in your written working.

Two common errors: (1) forgetting to rank the raw data before computing differences — did_i is the difference in ranks, not in raw values; (2) confusing Spearman’s rsr_s with Pearson’s rr. Pearson’s measures linear association and requires roughly bivariate-normal data; Spearman’s measures monotonic association and makes no distributional assumption.
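Average-rank tie handling and the r_s formula can be sketched in code. The data below are hypothetical, chosen to include a tie (the two 7s share rank 3.5); this is a checking tool, not an IB method:

```python
def avg_ranks(values):
    """Rank values (1 = smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's r_s via the d^2 formula (the IB formula; approximate when ties exist)."""
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data with one tie in y: ranks of y are [1, 2, 3.5, 5, 3.5]
print(round(spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]), 3))  # 0.825
```

Ranking with 1 = smallest instead of 1 = largest gives the same r_s, as long as both variables are ranked the same way.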

Spearman’s Rank — Step-by-Step

Six athletes are assessed by two judges. Their scores are:

| Athlete | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Judge 1 score | 85 | 70 | 92 | 78 | 65 | 88 |
| Judge 2 score | 80 | 75 | 90 | 74 | 68 | 85 |

Calculate Spearman's rank correlation coefficient.

Step 1 — Rank each judge's scores (rank 1 = highest):

| Athlete | Judge 1 score | Rank 1 | Judge 2 score | Rank 2 |
|---|---|---|---|---|
| A | 85 | 3 | 80 | 3 |
| B | 70 | 5 | 75 | 4 |
| C | 92 | 1 | 90 | 1 |
| D | 78 | 4 | 74 | 5 |
| E | 65 | 6 | 68 | 6 |
| F | 88 | 2 | 85 | 2 |

Step 2 — Compute d_i and d_i^2:

| Athlete | Rank 1 | Rank 2 | d_i | d_i^2 |
|---|---|---|---|---|
| A | 3 | 3 | 0 | 0 |
| B | 5 | 4 | 1 | 1 |
| C | 1 | 1 | 0 | 0 |
| D | 4 | 5 | -1 | 1 |
| E | 6 | 6 | 0 | 0 |
| F | 2 | 2 | 0 | 0 |
| Sum | | | | 2 |

Step 3 — Apply the formula (n=6n = 6, di2=2\sum d_i^2 = 2):

rs=16×26(361)=112210=10.05710.943r_s = 1 - \frac{6 \times 2}{6(36 - 1)} = 1 - \frac{12}{210} = 1 - 0.0571 \approx 0.943

Conclusion: rs0.943r_s \approx 0.943 indicates very strong positive agreement between the two judges’ rankings.

5.4 Linear Regression

The regression line of yy on xx (also called the least-squares regression line) minimises the sum of squared residuals. Its equation is:

y=ax+by = ax + b

where the slope and intercept are:

a=SxySxxb=yˉaxˉa = \frac{S_{xy}}{S_{xx}} \qquad b = \bar{y} - a\bar{x}

The regression line always passes through the point of means (xˉ,yˉ)(\bar{x}, \bar{y}).

In IB notation the regression line is written y=ax+by = ax + b (not y^=β0+β1x\hat{y} = \beta_0 + \beta_1 x). The GDC (Casio) gives aa and bb directly via STAT → REG → ax+b. Always write out the full equation with the numerical values of aa and bb, not just “the regression line.”

Finding and Using the Regression Line

Using the data from the PMCC example above, find the regression line y=ax+by = ax + b and estimate the exam mark for a student who studies for 7 hours.

Step 1 — Calculate aa:

a=SxySxx=19040=4.75a = \frac{S_{xy}}{S_{xx}} = \frac{190}{40} = 4.75

Step 2 — Calculate bb:

b=yˉaxˉ=71.44.75×6=71.428.5=42.9b = \bar{y} - a\bar{x} = 71.4 - 4.75 \times 6 = 71.4 - 28.5 = 42.9

Regression line: y=4.75x+42.9y = 4.75x + 42.9

Prediction at x=7x = 7:

y=4.75(7)+42.9=33.25+42.9=76.1576y = 4.75(7) + 42.9 = 33.25 + 42.9 = 76.15 \approx 76

Do not extrapolate beyond the data range. The regression line is only reliable for xx values within the range of the original data (here, 2x102 \leq x \leq 10). Predicting outside this range — for example, estimating the score for 20 hours of study — may give absurd or meaningless results. State this limitation explicitly if asked.
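The slope, intercept, and prediction above can be reproduced from the raw data. A standard-library sketch (off-syllabus, for checking working):

```python
# Study-hours data from the worked example
x = [2, 4, 6, 8, 10]          # hours of study
y = [50, 65, 72, 80, 90]      # exam marks
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar   # S_xy, should be 190
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2                  # S_xx, should be 40

a = s_xy / s_xx               # slope
b = y_bar - a * x_bar         # intercept
print(round(a, 2), round(b, 1), round(a * 7 + b, 2))  # 4.75 42.9 76.15
```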

Coefficient of Determination r2r^2

For the study data, r0.991r \approx 0.991. Interpret r2r^2.

r2(0.991)20.982r^2 \approx (0.991)^2 \approx 0.982

Interpretation: Approximately 98.2% of the variation in exam marks is explained by the linear relationship with hours of study. This suggests the linear model is an excellent fit for these data.

Regression with GDC — Full Workflow

The following data shows temperature (xx, in °C) and ice cream sales per day (yy, in units):

| x (°C) | 15 | 18 | 22 | 25 | 28 | 30 |
|---|---|---|---|---|---|---|
| y (units) | 80 | 110 | 145 | 170 | 200 | 220 |

(a) Find rr and the regression line. (b) Estimate sales when temperature is 20°C. (c) Comment on reliability.

Part (a) — Using GDC (Casio):

  1. Enter xx values in List 1, yy values in List 2
  2. STAT → CALC → 2VAR to obtain: \bar{x} = 23, \bar{y} \approx 154.17, r \approx 0.9996
  3. STAT → REG → ax+b: a \approx 9.20, b \approx -57.4

Regression line: y = 9.20x - 57.4

Part (b): y = 9.20(20) - 57.4 = 184 - 57.4 = 126.6 \approx 127 units

Part (c): r \approx 0.9996, which indicates a very strong positive linear correlation. Since x = 20 lies within the data range [15, 30], this is interpolation and the prediction is reliable. The model explains r^2 \approx 99.9\% of the variation in sales.
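The same workflow can be checked off-calculator. A standard-library sketch using the temperature and sales data above:

```python
import math

x = [15, 18, 22, 25, 28, 30]          # temperature, deg C
y = [80, 110, 145, 170, 200, 220]     # sales, units
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
s_yy = sum(yi * yi for yi in y) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)     # correlation coefficient
a = s_xy / s_xx                       # slope
b = y_bar - a * x_bar                 # intercept
print(round(r, 4), round(a, 2), round(b, 1), round(a * 20 + b))  # 0.9996 9.2 -57.4 127
```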


Section 6: Quick Reference

Probability Rules

| Rule | Formula |
|---|---|
| Addition | P(A \cup B) = P(A) + P(B) - P(A \cap B) |
| Complement | P(A') = 1 - P(A) |
| Conditional | P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} |
| Multiplication | P(A \cap B) = P(A \mid B) \cdot P(B) |
| Independence | P(A \cap B) = P(A) \cdot P(B) |
| Bayes' theorem | P(A \mid B) = \dfrac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')} |

Discrete Distributions

| Distribution | Parameters | P.M.F. | Mean | Variance |
|---|---|---|---|---|
| General DRV | | P(X = x) given | \sum x \cdot P(X=x) | E(X^2) - [E(X)]^2 |
| Binomial | n, p | \binom{n}{x}p^x(1-p)^{n-x} | np | np(1-p) |
| Poisson | m | \dfrac{e^{-m}m^x}{x!} | m | m |

Continuous Distributions

| Distribution | Parameters | Key formula | Mean | Variance |
|---|---|---|---|---|
| Normal | \mu, \sigma^2 | Z = \dfrac{X-\mu}{\sigma} | \mu | \sigma^2 |
| Standard Normal | 0, 1 | P(Z \leq z) from table/GDC | 0 | 1 |

68-95-99.7 rule: P(μkσ<X<μ+kσ)68.3%,95.4%,99.7%P(\mu - k\sigma < X < \mu + k\sigma) \approx 68.3\%, 95.4\%, 99.7\% for k=1,2,3k = 1, 2, 3.
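The rule can be verified with the standard normal CDF, which the `math` module reaches through the error function `erf`. A small off-syllabus sketch:

```python
import math

def phi(z):
    """Standard normal CDF via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    p = phi(k) - phi(-k)     # P(mu - k*sigma < X < mu + k*sigma)
    print(k, round(p, 4))    # 0.6827, 0.9545, 0.9973
```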

Hypothesis Testing Summary

| Test | H_0 | Test statistic | \nu | Conditions |
|---|---|---|---|---|
| \chi^2 GOF | Distribution fits | \chi^2 = \sum \dfrac{(O-E)^2}{E} | k-1-m | E \geq 5 in all cells |
| \chi^2 independence | Variables independent | \chi^2 = \sum \dfrac{(O-E)^2}{E} | (r-1)(c-1) | E \geq 5 in all cells |
| 1-sample t | \mu = \mu_0 | t = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}} | n-1 | Normal population or n large |
| 2-sample t | \mu_1 = \mu_2 | pooled t | n_1+n_2-2 | Independent samples |
| Paired t HL | \mu_d = 0 | t = \dfrac{\bar{d}}{s_d/\sqrt{n}} | n-1 | Matched pairs |

Correlation and Regression

| Quantity | Formula |
|---|---|
| PMCC | r = \dfrac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} |
| Spearman's rank HL | r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2-1)} |
| Regression slope | a = \dfrac{S_{xy}}{S_{xx}} |
| Regression intercept | b = \bar{y} - a\bar{x} |
| Regression line | y = ax + b, passes through (\bar{x}, \bar{y}) |
| Coefficient of determination | r^2 = proportion of variance in y explained by x |
| S_{xy} | \sum x_i y_i - n\bar{x}\bar{y} |
| S_{xx} | \sum x_i^2 - n\bar{x}^2 |
| S_{yy} | \sum y_i^2 - n\bar{y}^2 |

Linear Transformation Rules

| Property | Formula |
|---|---|
| E(aX + b) | a \cdot E(X) + b |
| \text{Var}(aX + b) | a^2 \cdot \text{Var}(X) |
| \text{SD}(aX + b) | \lvert a \rvert \cdot \text{SD}(X) |
| E(X + Y) | E(X) + E(Y) |
| \text{Var}(X + Y) (independent) | \text{Var}(X) + \text{Var}(Y) |
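These rules can be checked exactly on a small discrete random variable. The distribution below is made up for illustration:

```python
import math

# Hypothetical DRV: X takes 0, 1, 2 with probabilities 0.2, 0.5, 0.3
xs, ps = [0, 1, 2], [0.2, 0.5, 0.3]

E = sum(x * p for x, p in zip(xs, ps))                # E(X) = 1.1
Var = sum(x * x * p for x, p in zip(xs, ps)) - E**2   # E(X^2) - [E(X)]^2 = 0.49

a, b = 3, 2
E2 = sum((a * x + b) * p for x, p in zip(xs, ps))     # direct E(3X + 2)
Var2 = sum((a * x + b)**2 * p for x, p in zip(xs, ps)) - E2**2

print(E2, a * E + b)                              # both are about 5.3: E(aX+b) = aE(X) + b
print(Var2, a**2 * Var)                           # both are about 4.41: Var(aX+b) = a^2 Var(X)
print(math.sqrt(Var2), abs(a) * math.sqrt(Var))   # SD(aX+b) = |a| SD(X)
```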

Mixed Practice — Exam Style

How to use this section: Unlike topic-specific practice, these questions are interleaved — they mix all topics from this guide in random order. Before answering, identify which concept or topic area the question is testing. This is exactly the skill you need on Paper 2 and Paper 3, where you don’t know in advance which topic each question covers.

  1. [Normal Distribution] A continuous random variable XN(20,16)X \sim N(20, 16). Find P(16<X<28)P(16 < X < 28).

    A. 0.7745

    B. 0.8186

    C. 0.9772

    D. 0.6827

  2. [Conditional Probability] A bag contains 4 red and 6 blue balls. Two balls are drawn without replacement. Given that the first ball drawn is red, what is the probability the second is also red?

    A. 410\dfrac{4}{10}

    B. 16100\dfrac{16}{100}

    C. 39\dfrac{3}{9}

    D. 49\dfrac{4}{9}

  3. [Poisson Distribution] A radioactive source emits on average 3 particles per second. Find the probability that exactly 5 particles are emitted in a given second. Leave your answer in exact form.

    A. 35e35!\dfrac{3^5 e^{-3}}{5!}

    B. 53e53!\dfrac{5^3 e^{-5}}{3!}

    C. (53)(0.3)3(0.7)2\binom{5}{3}(0.3)^3(0.7)^2

    D. e3e^{-3}

  4. [Hypothesis Testing — Chi-Squared] A chi-squared test of independence at the 5% significance level gives χcalc2=6.12\chi^2_{\text{calc}} = 6.12 with 2 degrees of freedom. The critical value is 5.991. What is the correct conclusion?

    A. Accept H0H_0 — there is sufficient evidence of association

    B. Reject H0H_0 — there is sufficient evidence of association at the 5% level

    C. Reject H0H_0 — the variables are definitely not independent

    D. Accept H0H_0 — the test is inconclusive because the p-value is unknown

  5. [Bayes’ Theorem] A medical test for a disease has sensitivity 95% (probability of a positive result given disease) and specificity 90% (probability of a negative result given no disease). The disease prevalence is 2%. A patient tests positive. What is the approximate probability they have the disease?

    A. 95%

    B. 16%

    C. 50%

    D. 2%

  6. [Binomial Distribution] A fair coin is tossed 8 times. Find the probability of obtaining fewer than 3 heads.

    A. (82)(12)8\binom{8}{2}\left(\dfrac{1}{2}\right)^8

    B. k=02(8k)(12)8\displaystyle\sum_{k=0}^{2}\binom{8}{k}\left(\dfrac{1}{2}\right)^8

    C. 3(12)83\left(\dfrac{1}{2}\right)^8

    D. 1k=38(8k)(12)81 - \displaystyle\sum_{k=3}^{8}\binom{8}{k}\left(\dfrac{1}{2}\right)^8

  7. [Correlation and Regression] The regression line of yy on xx is y=2.4x1.8y = 2.4x - 1.8 and xˉ=5\bar{x} = 5. Find yˉ\bar{y}.

    A. 10.210.2

    B. 13.813.8

    C. 1.81.8

    D. 1212

  8. [Normal Distribution — Inverse] XN(μ,σ2)X \sim N(\mu, \sigma^2) with P(X>72)=0.10P(X > 72) = 0.10 and P(X<48)=0.05P(X < 48) = 0.05. Which system of equations is correct?

    A. 72μσ=1.282\dfrac{72 - \mu}{\sigma} = 1.282 and 48μσ=1.645\dfrac{48 - \mu}{\sigma} = -1.645

    B. 72μσ=0.10\dfrac{72 - \mu}{\sigma} = 0.10 and 48μσ=0.05\dfrac{48 - \mu}{\sigma} = -0.05

    C. σ=72μ\sigma = 72 - \mu and σ=μ48\sigma = \mu - 48

    D. 72μ=1.28272\mu = 1.282 and 48μ=1.64548\mu = 1.645

  9. [Conditional Probability — Independence] Events AA and BB satisfy P(A)=0.4P(A) = 0.4, P(B)=0.5P(B) = 0.5, and P(AB)=0.2P(A \cap B) = 0.2. Which statement is correct?

    A. AA and BB are mutually exclusive

    B. AA and BB are independent

    C. AA and BB are neither mutually exclusive nor independent

    D. AA and BB are both mutually exclusive and independent

  10. [Hypothesis Testing — Interpretation] A student performs a one-tailed tt-test and obtains p=0.032p = 0.032. At the 5% significance level, the correct interpretation is:

    A. There is a 3.2% probability that H0H_0 is true

    B. There is a 3.2% probability that the result occurred by chance if H0H_0 is true; reject H0H_0

    C. There is a 96.8% probability that H1H_1 is true

    D. The result is not statistically significant at the 5% level

Show Answers
  1. B — 0.8186. Standardise: σ=16=4\sigma = \sqrt{16} = 4, so Z1=16204=1Z_1 = \frac{16-20}{4} = -1 and Z2=28204=2Z_2 = \frac{28-20}{4} = 2. From standard normal tables: P(1<Z<2)0.8186P(-1 < Z < 2) \approx 0.8186. A (0.7745) corresponds to P(1<Z<1.5)P(-1 < Z < 1.5). C (0.9772) is P(Z<2)P(Z < 2), the one-tailed cumulative — a common error from reading the table without subtracting the lower tail. D (0.6827) is P(1<Z<1)P(-1 < Z < 1), ignoring the upper bound of Z=2Z = 2.

  2. C39=13\dfrac{3}{9} = \dfrac{1}{3}. After removing one red ball, 3 red remain out of 9 total. A uses the unconditional probability of red. D applies the wrong denominator. B treats draws as independent (with replacement).

  3. A35e35!\dfrac{3^5 e^{-3}}{5!}. Poisson formula: P(X=k)=λkeλk!P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} with λ=3\lambda = 3, k=5k = 5. B reverses λ\lambda and kk. C uses a binomial formula incorrectly applied to a Poisson context.

  4. B — Reject H0H_0. Since χcalc2=6.12>5.991\chi^2_{\text{calc}} = 6.12 > 5.991 (critical value), we reject H0H_0 and conclude there is sufficient evidence of association at the 5% significance level. C is incorrect — rejecting H0H_0 means we have statistical evidence, not certainty. D is incorrect — we do not need the exact p-value to compare with the critical value.

  5. B — Approximately 16%. Using Bayes’ theorem: P(diseasepositive)=0.95×0.020.95×0.02+0.10×0.980.0190.019+0.0980.163P(\text{disease}|\text{positive}) = \frac{0.95 \times 0.02}{0.95 \times 0.02 + 0.10 \times 0.98} \approx \frac{0.019}{0.019 + 0.098} \approx 0.163. This counterintuitive result (a “positive” test only gives 16% probability of disease) arises because the disease is rare — this is the base rate neglect fallacy.

  6. Bk=02(8k)(12)8\displaystyle\sum_{k=0}^{2}\binom{8}{k}\left(\dfrac{1}{2}\right)^8. “Fewer than 3” means k=0,1,2k = 0, 1, 2. D is equivalent (complement) and also correct. A gives only P(X=2)P(X=2). C has no probabilistic basis.

  7. Ayˉ=10.2\bar{y} = 10.2. The regression line always passes through (xˉ,yˉ)(\bar{x},\bar{y}): yˉ=2.4(5)1.8=121.8=10.2\bar{y} = 2.4(5) - 1.8 = 12 - 1.8 = 10.2.

  8. A — Correct system. P(X>72)=0.10P(X > 72) = 0.10 means z=1.282z = 1.282 (upper tail). P(X<48)=0.05P(X < 48) = 0.05 means z=1.645z = -1.645 (lower tail). B incorrectly uses the probabilities themselves as z-scores. This question requires knowing standard normal inverse values.

  9. BAA and BB are independent. Check: P(A)×P(B)=0.4×0.5=0.2=P(AB)P(A) \times P(B) = 0.4 \times 0.5 = 0.2 = P(A \cap B). Since the product of probabilities equals the joint probability, the events are independent. A requires P(AB)=0P(A \cap B) = 0, which is not the case here.

  10. B — Reject H0H_0; the p-value (0.032) is less than the significance level (0.05). A is a classic misinterpretation of p-values — the p-value is NOT the probability that H0H_0 is true. C is another common misinterpretation; the p-value says nothing directly about the probability of H1H_1.


IB Math IA Ideas — Statistics and Probability

Exploration topics from this chapter:

  • Does home advantage exist in football? — Collect win/draw/loss records for home and away matches across a season and use a chi-squared test of independence to determine whether venue is statistically associated with result. Extend by comparing leagues across different countries to investigate whether the effect size varies.

  • Modelling goal-scoring with the Poisson distribution — Goals per match in football (or other sports) often follow a Po(λ)\text{Po}(\lambda) distribution. Estimate λ\lambda from real data, perform a goodness-of-fit chi-squared test, and investigate whether the Poisson assumption holds equally well for high- and low-scoring teams.

  • The birthday problem and simulation — Derive analytically the probability that at least two people in a group of nn share a birthday, then verify with a Monte Carlo simulation. Extend to non-uniform birthday distributions using real birth-rate data (e.g., from national statistics offices) to see how much the real probability deviates from the uniform-distribution model.

  • Income inequality and the Gini coefficient — Obtain income-distribution data from the World Bank or OECD. Fit a log-normal distribution to model incomes, compute the theoretical Gini coefficient from the distribution parameters, and compare to the empirical value. Investigate how the Gini coefficient has changed over time for a country of your choice.

  • Regression analysis in sport or health — Choose two quantitative variables with a plausible causal link (e.g., hours of sleep and reaction time, training load and performance, or diet and cholesterol). Collect or source real data, compute the regression line and r2r^2, test the significance of the correlation, and critically evaluate confounding factors.

  • Bayesian updating and medical testing — Use Bayes’ theorem to model how the probability that a patient has a disease changes as successive independent tests come back positive. Investigate how sensitivity, specificity, and prevalence interact, and calculate the number of positive tests needed to exceed a 95% posterior probability of disease.

  • Does music tempo affect heart rate? A hypothesis test — Design a small experiment: measure resting heart rate, play fast and slow music, measure again. Use a paired tt-test to test whether tempo has a significant effect. Discuss Type I and Type II errors and how sample size affects the power of the test.

Tip: A strong IA has a clear personal engagement angle. Pick a topic that connects to something you genuinely find interesting — sport, health, economics, or psychology — and let the mathematics serve your question, not the other way around.


May 2026 Prediction Questions

These are NOT official IB questions. These are trend-based practice questions written to reflect the topic areas and question styles most likely to appear on the May 2026 IB Math AA HL Paper 2. Based on recent exam patterns (2022-2025), expect heavy weighting on: hypothesis testing (chi-squared and tt-tests), normal distribution calculations, Bayes’ theorem, and linear regression.


Question 1 [Hypothesis Testing] [~8 marks]

A factory claims that the mean mass of its cereal boxes is 500 g. A quality inspector takes a random sample of 12 boxes and records the following masses (in grams):

498,  502,  495,  501,  497,  503,  496,  499,  504,  498,  500,  497498, \; 502, \; 495, \; 501, \; 497, \; 503, \; 496, \; 499, \; 504, \; 498, \; 500, \; 497

(a) State appropriate null and alternative hypotheses for a two-tailed test.

(b) The sample mean is \bar{x} = 499.17 g and the sample standard deviation is s = 2.86 g. Calculate the tt-statistic for this test.

(c) The critical values for a two-tailed tt-test at the 5% significance level with 11 degrees of freedom are ±2.201\pm 2.201. State the conclusion of the test, justifying your answer.

(d) State one assumption required for this test to be valid.

Show Solution

Part (a) — Hypotheses

H0:μ=500H_0: \mu = 500

H1:μ500H_1: \mu \neq 500

(Two-tailed test because we are checking whether the mean differs from 500 in either direction.)

Part (b) — tt-statistic

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{499.17 - 500}{2.86 / \sqrt{12}}

= \frac{-0.83}{2.86 / 3.464} = \frac{-0.83}{0.8257} \approx -1.01

Part (c) — Conclusion

\lvert t \rvert = 1.01 < 2.201 (the critical value).

Since the test statistic does not fall in the rejection region (\lvert t \rvert < 2.201), we do not reject H_0.

There is insufficient evidence at the 5% significance level to conclude that the mean mass of cereal boxes differs from 500 g.

Part (d) — Assumption

The masses of the cereal boxes are normally distributed in the population. (This is required for a tt-test with a small sample size, n=12n = 12.)

Answer: t \approx -1.01; since \lvert t \rvert < 2.201, do not reject H_0. There is insufficient evidence that the mean mass differs from 500 g. The test assumes normality of the population.
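The summary statistics can be recomputed from the raw masses with a short standard-library sketch (off-syllabus, for checking working):

```python
import math
import statistics

masses = [498, 502, 495, 501, 497, 503, 496, 499, 504, 498, 500, 497]

x_bar = statistics.mean(masses)     # sample mean
s = statistics.stdev(masses)        # sample standard deviation
t = (x_bar - 500) / (s / math.sqrt(len(masses)))
print(round(x_bar, 2), round(s, 2), round(t, 2))  # 499.17 2.86 -1.01
```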


Question 2 [Bayes’ Theorem] [~7 marks]

A medical screening test for a disease has the following properties:

  • The probability of a positive result given the patient has the disease (sensitivity) is 0.95.
  • The probability of a negative result given the patient does not have the disease (specificity) is 0.90.
  • The prevalence of the disease in the population is 0.02.

(a) Construct a tree diagram or define the events and their probabilities.

(b) Find the probability that a randomly selected person tests positive.

(c) Find the probability that a person who tests positive actually has the disease.

(d) Comment on the practical implications of your answer to part (c).

Show Solution

Part (a) — Events and probabilities

Let DD = has disease, DD' = does not have disease, ++ = tests positive, - = tests negative.

P(D)=0.02,P(D)=0.98P(D) = 0.02, \quad P(D') = 0.98

P(+D)=0.95,P(D)=0.05P(+ \mid D) = 0.95, \quad P(- \mid D) = 0.05

P(+D)=0.10,P(D)=0.90P(+ \mid D') = 0.10, \quad P(- \mid D') = 0.90

Part (b) — P(+)P(+) using the law of total probability

P(+)=P(+D)P(D)+P(+D)P(D)P(+) = P(+ \mid D) \cdot P(D) + P(+ \mid D') \cdot P(D')

=(0.95)(0.02)+(0.10)(0.98)= (0.95)(0.02) + (0.10)(0.98)

=0.019+0.098=0.117= 0.019 + 0.098 = 0.117

Part (c) — Bayes’ theorem

P(D+)=P(+D)P(D)P(+)=(0.95)(0.02)0.117=0.0190.117=0.162P(D \mid +) = \frac{P(+ \mid D) \cdot P(D)}{P(+)} = \frac{(0.95)(0.02)}{0.117} = \frac{0.019}{0.117} = 0.162

Part (d) — Practical implications

The probability that a person who tests positive actually has the disease is only about 16.2%. This means approximately 84% of positive results are false positives.

This occurs because the disease prevalence is very low (2%). Even though the test is accurate (95% sensitivity, 90% specificity), the large number of healthy people tested generates many false positives that overwhelm the true positives.

Practically, a positive screening result should not be treated as a diagnosis — a confirmatory test should follow.

Answer: P(+)=0.117P(+) = 0.117; P(D+)=0.162P(D \mid +) = 0.162 (about 16.2%). Most positive results are false positives due to low disease prevalence, highlighting the need for confirmatory testing.
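The tree-diagram arithmetic can be confirmed in a few lines (a standard-library sketch using the figures from the question):

```python
# Screening-test figures from the question
p_d = 0.02            # prevalence, P(D)
sens = 0.95           # sensitivity, P(+ | D)
spec = 0.90           # specificity, P(- | D'), so P(+ | D') = 1 - spec

p_pos = sens * p_d + (1 - spec) * (1 - p_d)    # law of total probability
posterior = sens * p_d / p_pos                  # Bayes' theorem, P(D | +)
print(round(p_pos, 3), round(posterior, 3))     # 0.117 0.162
```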


Question 3 [Regression and Correlation] [~6 marks]

A researcher collects data on the number of hours spent studying (xx) and exam score (yy) for 8 students. The regression line of yy on xx is found to be:

y^=3.2x+42.5\hat{y} = 3.2x + 42.5

The Pearson correlation coefficient is r=0.87r = 0.87 and xˉ=6.5\bar{x} = 6.5.

(a) Interpret the value of the gradient (3.2) in context.

(b) Calculate the predicted exam score for a student who studies for 6.5 hours.

(c) Explain why it would be unreliable to use this model to predict the exam score for a student who studies for 20 hours.

(d) The coefficient of determination is r2r^2. Calculate r2r^2 and interpret it in context.

Show Solution

Part (a) — Interpretation of gradient

The gradient 3.2 means that for each additional hour of study, the predicted exam score increases by 3.2 marks on average.

Part (b) — Predicted score at xˉ\bar{x}

y^=3.2(6.5)+42.5=20.8+42.5=63.3\hat{y} = 3.2(6.5) + 42.5 = 20.8 + 42.5 = 63.3

Note: This also confirms that yˉ=63.3\bar{y} = 63.3, since the regression line always passes through (xˉ,yˉ)(\bar{x}, \bar{y}).

Part (c) — Extrapolation

Predicting for x=20x = 20 hours would be extrapolation — using the model outside the range of the observed data.

  • The data was collected for a sample of students whose study hours were centred around xˉ=6.5\bar{x} = 6.5. The linear relationship may not hold at x=20x = 20.
  • Diminishing returns on study time, fatigue, or a maximum possible score could mean the relationship is non-linear at extreme values.
  • The prediction would be unreliable because we have no data to support the model’s validity at 20 hours.

Part (d) — Coefficient of determination

r2=(0.87)2=0.757r^2 = (0.87)^2 = 0.757

Interpretation: Approximately 75.7% of the variation in exam scores can be explained by the linear relationship with hours spent studying. The remaining 24.3% is due to other factors (natural ability, exam technique, etc.).

Answer: Gradient means +3.2 marks per extra hour studied. Predicted score at 6.5 hours is 63.3. Extrapolation to 20 hours is unreliable. r2=0.757r^2 = 0.757, so 75.7% of score variation is explained by study hours.
