Statistics and Probability
How to Use This Guide
Statistics and Probability is Topic 4 of the IB Math AA HL syllabus and reliably accounts for a substantial block of marks across Paper 2 and Paper 3. This guide follows the IB AA HL syllabus point by point: probability rules, conditional probability and Bayes’ theorem, discrete random variables (binomial and Poisson), continuous random variables (normal distribution), hypothesis testing (chi-squared, t-tests), and correlation and regression. Every section includes fully worked examples drawn from past IB papers, exam alerts for the mistakes that cost marks most often, and a complete formula reference at the end.
How to approach Statistics on exams: The IB rewards structured method. For any probability question, always define your events and state the formula before you substitute values. For hypothesis tests, always write $H_0$ and $H_1$ explicitly, state the significance level, calculate or identify the test statistic, compare to the critical value or p-value, and write a conclusion in context. For distributions, always name the distribution and its parameters (for example, “Let $X \sim B(n, p)$”) before calculating any probabilities. Your GDC can compute binomial, Poisson, and normal probabilities directly; know how to use these functions and show the setup clearly in your working.
What is and is not in the formula booklet: The booklet gives: , , , , binomial mean and variance, Poisson formula and mean/variance, normal standardisation , the chi-squared statistic, and the Pearson correlation coefficient formula. NOT given: Bayes’ theorem (you must derive it from first principles or recognise the pattern), the method for constructing tree diagrams, the decision rule for hypothesis tests, and how to interpret the regression coefficient .
Section 1: Probability (4.5–4.7)
Probability is a measure of how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain). An event $A$ is a subset of the sample space $U$, the set of all possible outcomes.
Notation:
- $P(A)$: probability that event $A$ occurs
- $A'$ or $\bar{A}$: the complement of $A$ (event $A$ does NOT occur)
- $P(A \cup B)$: $A$ or $B$ (or both)
- $P(A \cap B)$: $A$ and $B$ simultaneously
1.1 Combined Probability
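The addition rule gives the probability that at least one of two events occurs:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
The subtraction removes the double-counting of outcomes in both $A$ and $B$.
Special case, mutually exclusive events: If $A$ and $B$ cannot both occur, $P(A \cap B) = 0$, so:
$$P(A \cup B) = P(A) + P(B)$$
Complement rule: $P(A') = 1 - P(A)$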
Combined Probability — Addition Rule
In a class of 30 students, 18 study French, 12 study Spanish, and 6 study both. A student is chosen at random. Find the probability they study French or Spanish.
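Define events: Let $F$ = studies French, $S$ = studies Spanish.
Apply the addition rule:
$$P(F \cup S) = P(F) + P(S) - P(F \cap S) = \frac{18}{30} + \frac{12}{30} - \frac{6}{30} = \frac{24}{30} = \frac{4}{5}$$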
Venn Diagram — Finding Unknown Probabilities
Given , , and , find .
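Rearrange the addition rule:
$$P(A \cap B) = P(A) + P(B) - P(A \cup B)$$
Hence find $P(A \cap B')$, the probability of $A$ but not $B$, using $P(A \cap B') = P(A) - P(A \cap B)$.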
1.2 Conditional Probability
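The conditional probability of $A$ given that $B$ has occurred is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
Rearranging gives the multiplication rule:
$$P(A \cap B) = P(A \mid B)\,P(B)$$
Confusing $P(A \mid B)$ with $P(B \mid A)$ is one of the most costly errors in IB Statistics. These are generally different: $P(A \mid B)$ is NOT the same as $P(B \mid A)$. Always ask: “which event is the condition (after the bar)?”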
Conditional Probability — Two-Way Table
A survey records whether 100 students exercise regularly and whether they sleep at least 7 hours per night.
| | Sleep $\geq 7$ h | Sleep $< 7$ h | Total |
|---|---|---|---|
| Exercises regularly | 42 | 18 | 60 |
| Does not exercise | 23 | 17 | 40 |
| Total | 65 | 35 | 100 |
(a) Find the probability a randomly selected student exercises regularly given they sleep at least 7 hours.
(b) Find the probability a student sleeps fewer than 7 hours given they do not exercise regularly.
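Part (a): Let $E$ = exercises, $S$ = sleeps $\geq 7$ h.
$$P(E \mid S) = \frac{n(E \cap S)}{n(S)} = \frac{42}{65} \approx 0.646$$
Part (b): Let $S'$ = sleeps $< 7$ h, $E'$ = does not exercise.
$$P(S' \mid E') = \frac{n(S' \cap E')}{n(E')} = \frac{17}{40} = 0.425$$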
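The same two answers can be read off the table by machine. The following is a minimal sketch using only Python's standard library; the dictionary layout, the labels, and the helper name `conditional` are illustrative assumptions, not IB or GDC notation.

```python
from fractions import Fraction

# Survey counts keyed by (exercise habit, sleep duration).
counts = {
    ("exercises", "sleep>=7"): 42,
    ("exercises", "sleep<7"): 18,
    ("no_exercise", "sleep>=7"): 23,
    ("no_exercise", "sleep<7"): 17,
}

def conditional(counts, event, given):
    """P(event | given); each argument is a predicate on a (row, column) key."""
    given_total = sum(n for key, n in counts.items() if given(key))
    joint = sum(n for key, n in counts.items() if given(key) and event(key))
    return Fraction(joint, given_total)

# (a) P(exercises | sleeps at least 7 h)
p_a = conditional(counts,
                  event=lambda k: k[0] == "exercises",
                  given=lambda k: k[1] == "sleep>=7")

# (b) P(sleeps fewer than 7 h | does not exercise)
p_b = conditional(counts,
                  event=lambda k: k[1] == "sleep<7",
                  given=lambda k: k[0] == "no_exercise")

print(p_a, p_b)  # 42/65 17/40
```

Note that the denominator is always the total for the conditioning event (the column total in part (a), the row total in part (b)), never the grand total of 100.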
1.3 Independence
Two events $A$ and $B$ are independent if knowing one occurred gives no information about the other:
$$P(A \mid B) = P(A)$$
Equivalently, $P(B \mid A) = P(B)$ and $P(A \cap B) = P(A)\,P(B)$.
Independence $\neq$ mutual exclusivity. Mutually exclusive events (cannot both occur) with non-zero probabilities are actually the most extreme form of dependence: if $A$ happens, $B$ definitely cannot, so $P(B \mid A) = 0 \neq P(B)$. Students frequently confuse these two concepts.
Testing Independence
, , . Are and independent?
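Check: compare the product $P(A)\,P(B)$ with $P(A \cap B)$.
Since $P(A)\,P(B) = P(A \cap B)$, the events are independent.
Also verify: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} = P(A)$. Confirmed.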
1.4 Tree Diagrams
Tree diagrams organise sequential probability problems. Each branch carries a conditional probability, and probabilities along a path are multiplied (multiplication rule). Probabilities on branches from the same node must sum to 1.
Tree Diagram — Two-Stage Problem
A box contains 4 red and 6 blue balls. Two balls are drawn without replacement. Find the probability that both balls are the same colour.
Stage 1 branch probabilities:
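- $P(R_1) = \frac{4}{10}$, $P(B_1) = \frac{6}{10}$
Stage 2 branch probabilities (conditional on Stage 1):
- After Red: $P(R_2 \mid R_1) = \frac{3}{9}$, $P(B_2 \mid R_1) = \frac{6}{9}$
- After Blue: $P(R_2 \mid B_1) = \frac{4}{9}$, $P(B_2 \mid B_1) = \frac{5}{9}$
Path probabilities (same colour):
$$P(\text{same colour}) = \frac{4}{10} \cdot \frac{3}{9} + \frac{6}{10} \cdot \frac{5}{9} = \frac{12}{90} + \frac{30}{90} = \frac{42}{90} = \frac{7}{15} \approx 0.467$$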
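The multiply-along-a-path, add-across-paths logic of the tree can be checked with exact fractions. This is a small sketch, assuming a hypothetical helper `same_colour_probability`; it is one way to verify the hand calculation, not a required method.

```python
from fractions import Fraction

def same_colour_probability(red, blue):
    """P(both balls are the same colour) for two draws without replacement."""
    total = red + blue
    # Path Red -> Red: first-draw probability times the conditional second draw.
    p_red_red = Fraction(red, total) * Fraction(red - 1, total - 1)
    # Path Blue -> Blue.
    p_blue_blue = Fraction(blue, total) * Fraction(blue - 1, total - 1)
    # The two paths are mutually exclusive, so their probabilities add.
    return p_red_red + p_blue_blue

p_same = same_colour_probability(4, 6)
print(p_same, float(p_same))
```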
1.5 Bayes’ Theorem HL
Bayes’ theorem allows you to reverse conditional probabilities — to find the probability of a cause given an observed effect. It arises naturally from the multiplication rule.
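For two complementary events $A$ and $A'$, and an observed event $B$:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where the total probability $P(B)$ is expanded as:
$$P(B) = P(B \mid A)\,P(A) + P(B \mid A')\,P(A')$$
Combining these:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A')\,P(A')}$$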
Bayes’ theorem is not in the formula booklet. You are expected to derive it or apply it directly using a tree diagram. Drawing the tree and labelling all branches is the safest approach — read off the answer by dividing the target path probability by the total probability of the observed outcome.
Bayes’ Theorem — Medical Test
A disease affects 1% of a population. A diagnostic test has a 95% sensitivity (true positive rate) and a 2% false positive rate. A randomly selected person tests positive. Find the probability they actually have the disease.
Define events:
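- $D$ = has the disease, $D'$ = does not have the disease
- $T$ = tests positive
Given: $P(D) = 0.01$, $P(D') = 0.99$, $P(T \mid D) = 0.95$, $P(T \mid D') = 0.02$
Total probability of testing positive:
$$P(T) = P(T \mid D)\,P(D) + P(T \mid D')\,P(D') = 0.95 \times 0.01 + 0.02 \times 0.99 = 0.0293$$
Apply Bayes:
$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T)} = \frac{0.0095}{0.0293} \approx 0.324$$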
Despite a positive test, there is only a 32.4% chance the person actually has the disease. This counter-intuitive result arises because the disease is rare — most positive tests come from the large pool of healthy people with false positives.
In Bayes problems, the denominator is always the total probability of the observed outcome, $P(B)$, computed via the law of total probability over all mutually exclusive “causes.” The most common error is forgetting to include all branches in this denominator, or using $P(B \mid A)$ in place of $P(A \mid B)$.
Bayes’ Theorem — Factory Quality Control
Machine A produces 60% of a factory’s output; machine B produces the remaining 40%. Machine A has a 3% defect rate; machine B has a 5% defect rate. A randomly selected item is found to be defective. What is the probability it came from machine A?
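Setup: $P(A) = 0.6$, $P(B) = 0.4$, $P(D \mid A) = 0.03$, $P(D \mid B) = 0.05$
$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B)} = \frac{0.018}{0.018 + 0.020} = \frac{0.018}{0.038} \approx 0.474$$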
There is about a 47.4% chance the defective item came from machine A.
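Both worked examples follow the same pattern: weight each cause by its prior, then divide the target path by the total. A minimal sketch of that pattern is given below; the function name `bayes` and the dictionary keys are illustrative assumptions.

```python
def bayes(priors, likelihoods, cause):
    """Posterior P(cause | evidence).

    priors[c]      = P(c) for each mutually exclusive cause c
    likelihoods[c] = P(evidence | c)
    """
    # Law of total probability: sum over every branch of the tree.
    total = sum(priors[c] * likelihoods[c] for c in priors)
    return priors[cause] * likelihoods[cause] / total

# Medical test: 1% prevalence, 95% sensitivity, 2% false positive rate.
p_disease = bayes({"D": 0.01, "not D": 0.99},
                  {"D": 0.95, "not D": 0.02}, "D")

# Factory: machine shares 60% / 40%, defect rates 3% / 5%.
p_machine_a = bayes({"A": 0.6, "B": 0.4},
                    {"A": 0.03, "B": 0.05}, "A")

print(round(p_disease, 3), round(p_machine_a, 3))  # 0.324 0.474
```

Because the denominator sums over every key in `priors`, forgetting a branch is not possible here, which is exactly the discipline the tree diagram enforces on paper.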
Quick Recall — Section 1
Try to answer without scrolling up:
- State Bayes’ theorem in words.
- If $P(A \cap B) = P(A)\,P(B)$, what does this tell you about $A$ and $B$?
- What is the formula for conditional probability $P(A \mid B)$?
Reveal answers
- Bayes’ theorem lets you “reverse” a conditional probability — given the outcome, find the probability of the cause.
- $A$ and $B$ are independent.
- $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$.
Section 2: Discrete Random Variables (4.8–4.9)
A discrete random variable (DRV) $X$ takes a countable set of values, each with a defined probability. The probability distribution of $X$ is a complete list of all possible values $x$ and their probabilities $P(X = x)$.
Requirements for a valid probability distribution:
- $0 \leq P(X = x) \leq 1$ for all $x$
- $\sum_x P(X = x) = 1$
2.1 Expected Value and Variance
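The expected value (mean) of $X$ is the long-run average outcome:
$$E(X) = \mu = \sum_x x\,P(X = x)$$
The variance measures the spread around the mean:
$$\text{Var}(X) = E(X^2) - [E(X)]^2$$
where $E(X^2) = \sum_x x^2\,P(X = x)$.
The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.
Linear transformations: For $Y = aX + b$:
$$E(aX + b) = a\,E(X) + b, \qquad \text{Var}(aX + b) = a^2\,\text{Var}(X)$$
$\text{Var}(aX + b) = a^2\,\text{Var}(X)$, not $a\,\text{Var}(X) + b$. The constant $b$ shifts the distribution but does not affect spread. Students frequently add $b$ or $b^2$ to the variance.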
Expected Value and Variance
A biased die has the following distribution:
| $x$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $P(X = x)$ | 0.1 | 0.3 | 0.4 | 0.2 |
(a) Find . (b) Find . (c) Find and .
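Part (a): $E(X) = 1(0.1) + 2(0.3) + 3(0.4) + 4(0.2) = 2.7$
Part (b): First compute $E(X^2)$: $E(X^2) = 1(0.1) + 4(0.3) + 9(0.4) + 16(0.2) = 8.1$, so $\text{Var}(X) = 8.1 - 2.7^2 = 0.81$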
Part (c):
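The mean and variance of the biased die can be verified in a few lines. The sketch below assumes the distribution is stored as a plain dictionary; the helper names `expectation` and `variance` are illustrative.

```python
# Biased die from the worked example: value -> probability.
dist = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

# A valid distribution must have probabilities summing to 1.
assert abs(sum(dist.values()) - 1) < 1e-12

def expectation(dist):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(dist)
    mean_of_squares = sum(x**2 * p for x, p in dist.items())
    return mean_of_squares - mean**2

def linear_transform(dist, a, b):
    """Mean and variance of aX + b: the shift b never enters the variance."""
    return a * expectation(dist) + b, a**2 * variance(dist)

print(round(expectation(dist), 4), round(variance(dist), 4))  # 2.7 0.81
```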
2.2 Binomial Distribution
The binomial distribution models the number of successes in $n$ independent trials, each with probability $p$ of success.
Conditions for a binomial model:
- Fixed number of trials $n$
- Each trial has exactly two outcomes (success / failure)
- Constant probability $p$ of success on each trial
- Trials are independent
If $X \sim B(n, p)$, then:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad E(X) = np, \qquad \text{Var}(X) = np(1 - p)$$
Binomial Quick Reference
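| Quantity | Formula |
|---|---|
| Distribution | $X \sim B(n, p)$ |
| P.M.F. | $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ |
| Mean | $E(X) = np$ |
| Variance | $\text{Var}(X) = np(1 - p)$ |
| GDC (Casio) | BinomPD(x, n, p) for exact $P(X = x)$; BinomCD(x, n, p) for $P(X \leq x)$ |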
Binomial — Exact Probability
A fair coin is tossed 8 times. Find the probability of getting exactly 5 heads.
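State the distribution: Let $X$ = number of heads. Since trials are independent with $n = 8$ and $p = 0.5$: $X \sim B(8, 0.5)$.
$$P(X = 5) = \binom{8}{5} (0.5)^5 (0.5)^3 = \frac{56}{256} \approx 0.219$$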
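As a cross-check on the GDC, the binomial formula can be evaluated directly. The sketch below uses only Python's standard library; the helper names `binom_pmf` and `binom_cdf` are illustrative and are not calculator functions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summing the pmf from 0 to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Exactly 5 heads in 8 tosses of a fair coin.
p_five_heads = binom_pmf(5, 8, 0.5)
print(p_five_heads)  # 0.21875, i.e. 56/256
```

`binom_pmf` plays the role of BinomPD and `binom_cdf` the role of BinomCD; an "at least" probability then follows from the complement, for example `1 - binom_cdf(k - 1, n, p)`.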
Binomial — Cumulative and Complement
In a production line, 15% of items are defective. A batch of 20 items is selected. Find: (a) , (b) , (c) .
State: $X \sim B(20, 0.15)$
Part (a): Use GDC cumulative binomial:
Part (b): Complement rule:
Part (c):
When NOT to use binomial: The binomial requires (1) fixed , (2) constant , (3) independence. Drawing without replacement from a small population violates independence — use the hypergeometric distribution or direct probability instead. “Selecting cards from a deck” without replacement is not binomial.
Binomial — Finding Parameters from Mean and Variance
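A random variable $X \sim B(n, p)$ has mean 6 and variance 4.2. Find $n$ and $p$.
Set up equations: $np = 6$ and $np(1 - p) = 4.2$
Divide the second by the first: $1 - p = \frac{4.2}{6} = 0.7$, so $p = 0.3$
Substitute back: $n = \frac{6}{0.3} = 20$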
2.3 Poisson Distribution HL
The Poisson distribution models the number of occurrences of a random event in a fixed interval of time or space, when events occur independently at a constant average rate.
Conditions for a Poisson model:
- Events occur independently
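- Events occur at a constant average rate $\lambda$ per unit interval
- Two events cannot occur simultaneously
If $X \sim \text{Po}(\lambda)$, then:
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad E(X) = \lambda, \qquad \text{Var}(X) = \lambda$$
The equality of mean and variance is a diagnostic property: if a dataset has mean $\approx$ variance, a Poisson model is plausible.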
Poisson Quick Reference
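| Quantity | Formula |
|---|---|
| Distribution | $X \sim \text{Po}(\lambda)$ |
| P.M.F. | $P(X = k) = \dfrac{e^{-\lambda} \lambda^k}{k!}$ |
| Mean | $E(X) = \lambda$ |
| Variance | $\text{Var}(X) = \lambda$ |
| GDC (Casio) | PoissonPD(x, m) for exact $P(X = x)$; PoissonCD(x, m) for $P(X \leq x)$ |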
Poisson — Direct Calculation
Calls arrive at a helpdesk at an average rate of 3 per hour. Find the probability that: (a) exactly 2 calls arrive in one hour, (b) fewer than 4 calls arrive in one hour.
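State: $X \sim \text{Po}(3)$
Part (a): $P(X = 2) = \dfrac{e^{-3}\,3^2}{2!} \approx 0.224$
Part (b): $P(X < 4) = P(X \leq 3) = e^{-3}\left(1 + 3 + \frac{9}{2} + \frac{27}{6}\right) = 13e^{-3} \approx 0.647$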
Poisson — Changing the Interval
Emails arrive at an average rate of 6 per hour. Find the probability that: (a) no emails arrive in 20 minutes, (b) more than 2 emails arrive in 30 minutes.
Key step: Rescale the rate to match the interval.
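- In 20 minutes: $\lambda = 6 \times \frac{20}{60} = 2$, so $X \sim \text{Po}(2)$
- In 30 minutes: $\lambda = 6 \times \frac{30}{60} = 3$, so $Y \sim \text{Po}(3)$
Part (a): $P(X = 0) = e^{-2} \approx 0.135$
Part (b): $P(Y > 2) = 1 - P(Y \leq 2) = 1 - e^{-3}\left(1 + 3 + \frac{9}{2}\right) \approx 1 - 0.423 = 0.577$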
When the Poisson interval changes, multiply the rate proportionally. If the rate is “$\lambda$ per hour” and you want a 15-minute interval, use $\frac{\lambda}{4}$. Forgetting to rescale is one of the most common Poisson errors.
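The rescaling step is easy to make explicit in code. The sketch below works through the email example using only the standard library; `poisson_pmf`, `poisson_cdf` and `rescale` are hypothetical helper names chosen for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def rescale(rate_per_hour, minutes):
    """Scale an hourly rate to an interval of the given length in minutes."""
    return rate_per_hour * minutes / 60

lam_20 = rescale(6, 20)  # 2 emails expected in 20 minutes
lam_30 = rescale(6, 30)  # 3 emails expected in 30 minutes

p_none = poisson_pmf(0, lam_20)              # (a) no emails in 20 minutes
p_more_than_2 = 1 - poisson_cdf(2, lam_30)   # (b) more than 2 in 30 minutes

print(round(p_none, 3), round(p_more_than_2, 3))  # 0.135 0.577
```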
Poisson as an approximation to binomial HL: When is large and is small (rule of thumb: , ), .
Poisson Approximation to Binomial HL
A rare genetic mutation occurs in 1 in 500 births. In a sample of 400 births, find the approximate probability that at most 2 have the mutation.
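Check conditions: $n = 400$ (large), $p = \frac{1}{500} = 0.002$ (small). Approximate with $X \sim \text{Po}(\lambda)$ where $\lambda = np = 400 \times 0.002 = 0.8$.
$$P(X \leq 2) \approx e^{-0.8}\left(1 + 0.8 + \frac{0.8^2}{2}\right) = 2.12\,e^{-0.8} \approx 0.953$$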
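To see how good the approximation is for this example, the exact binomial probability can be compared with the Poisson value. This is an illustrative sketch with hypothetical helper names, not part of the syllabus method.

```python
from math import comb, exp, factorial

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

n, p = 400, 1 / 500
exact = binom_cdf(2, n, p)          # binomial model
approx = poisson_cdf(2, n * p)      # Poisson model with lambda = np = 0.8

print(round(exact, 4), round(approx, 4))
print("absolute difference:", abs(exact - approx))
```

For these parameters the two values agree to about three decimal places, which is why the approximation is acceptable when $n$ is large and $p$ is small.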
Quick Recall — Section 2
Try to answer without scrolling up:
- Write the formula for the expected value of a discrete random variable.
- If $X \sim B(n, p)$, what are $E(X)$ and $\text{Var}(X)$?
- When can you approximate a binomial distribution with a Poisson distribution?
Reveal answers
- $E(X) = \sum_x x\,P(X = x)$.
- $E(X) = np$ and $\text{Var}(X) = np(1 - p)$.
- When $n$ is large and $p$ is small (close to 0), with $\lambda = np$.
Section 3: Continuous Random Variables (4.10–4.12)
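A continuous random variable (CRV) takes values in a continuous range. Unlike DRVs, we cannot assign probability to individual values; instead we work with a probability density function (pdf) $f(x)$.
Properties of a pdf:
- $f(x) \geq 0$ for all $x$
- $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (total area under the curve equals 1)
For a CRV: $P(X = a) = 0$ for any single value. Therefore $P(a \leq X \leq b) = P(a < X < b)$.
Mean and variance of a CRV:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad \text{Var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(X)]^2$$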
3.1 The Normal Distribution
The normal distribution is the most important continuous distribution. It models many natural phenomena — heights, measurement errors, exam scores — where data clusters symmetrically around a mean.
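If $X \sim N(\mu, \sigma^2)$, its bell-shaped pdf is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Key properties:
- Symmetric about the mean $\mu$
- Mean = Median = Mode
- Inflection points at $x = \mu \pm \sigma$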
The 68-95-99.7 Rule
| Interval | Approximate probability |
|---|---|
| Within $1\sigma$ of $\mu$ | 68.3% |
| Within $2\sigma$ of $\mu$ | 95.4% |
| Within $3\sigma$ of $\mu$ | 99.7% |
These are approximations. Use your GDC for exact values.
3.2 Standardisation and Z-Scores
The standard normal distribution $Z \sim N(0, 1)$ has mean 0 and variance 1. Any normal random variable $X \sim N(\mu, \sigma^2)$ can be converted to $Z$ via:
$$Z = \frac{X - \mu}{\sigma}$$
A z-score tells you how many standard deviations an observation is from the mean. $z = 1.5$ means the value is 1.5 standard deviations above the mean.
Normal Distribution — Finding Probabilities
Heights of adults are normally distributed with mean 170 cm and standard deviation 8 cm. Find: (a) , (b) .
State: $X \sim N(170, 8^2)$
Part (a):
Using GDC or standard normal table:
Part (b): Lower bound: ; Upper bound:
When using GDC for normal probabilities, use normalcdf(lower, upper, μ, σ). For $P(X > a)$, use a very large upper bound (e.g., $10^{99}$), or compute $1 - P(X \leq a)$ using the complement. Show your GDC setup in working: write out which distribution, the parameters, and the bounds.
Normal Distribution — Symmetry Shortcut
. Find without a table.
The interval is symmetric about , each end standard deviation away.
By the 68-95-99.7 rule: .
For an exact answer: bounds are and , so using GDC: .
3.3 Inverse Normal
The inverse normal problem asks: given a probability $p$, find the value $x$ such that $P(X \leq x) = p$.
On the GDC, use invNorm(p, μ, σ) to find $x$ directly.
Inverse Normal — Finding a Threshold
Test scores are distributed as $X \sim N(65, 10^2)$. The top 10% of students receive a distinction. Find the minimum score needed for a distinction.
We need $k$ such that $P(X > k) = 0.10$, i.e., $P(X \leq k) = 0.90$.
Using GDC: invNorm(0.90, 65, 10) $\approx 77.8$
A student needs at least 77.8 (round up to 78) to receive a distinction.
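Outside the exam room, the same inverse-normal lookup can be reproduced with Python's standard `statistics.NormalDist` class. This is offered as a sketch for checking answers; the variable names are illustrative.

```python
from statistics import NormalDist

# Test scores: mean 65, standard deviation 10.
scores = NormalDist(mu=65, sigma=10)

# Top 10% means 90% of the distribution lies below the threshold.
threshold = scores.inv_cdf(0.90)

# Forward check: the upper-tail area beyond the threshold should be 0.10.
upper_tail = 1 - scores.cdf(threshold)

print(round(threshold, 1), round(upper_tail, 2))  # 77.8 0.1
```

`inv_cdf` takes the area to the left, just like invNorm on the GDC, so an upper-tail requirement must first be converted to a lower-tail probability.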
Inverse Normal — Symmetric Interval
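$X \sim N(\mu, \sigma^2)$ with $\mu = 40$ and $\sigma = 5$. Find the value $a$ such that $P(40 - a < X < 40 + a) = 0.90$.
By symmetry, $P(X < 40 - a) = 0.05$.
Using GDC: invNorm(0.05, 40, 5) $\approx 31.78$, so $a \approx 40 - 31.78 = 8.22$.
Check: $P(31.78 < X < 48.22) \approx 0.90$. Confirmed.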
Finding or from a Normal Probability
. It is given that . Find .
Standardise:
Find the z-score:
Solve:
Section 4: Hypothesis Testing (4.13–4.15)
Statistical hypothesis testing provides a formal framework for deciding whether observed data provide sufficient evidence against a default assumption. Every test follows the same logical structure.
The universal procedure:
- State $H_0$ and $H_1$, the null and alternative hypotheses
- State the significance level $\alpha$ (typically 0.05 or 0.01)
- Identify the test statistic and its distribution under $H_0$
- Calculate the test statistic (or use GDC)
- Find the p-value (or compare test statistic to critical value)
- Make a decision: reject $H_0$ if $p < \alpha$
- Write a conclusion in context
Never omit $H_0$ and $H_1$, and never write them in terms of sample statistics. Hypotheses are always about population parameters. Write $H_0: \mu = 50$, not “$H_0$: the sample mean is 50.” Losing marks for missing hypotheses is one of the most avoidable errors in IB exams.
4.1 The p-value
The p-value is the probability of obtaining a result at least as extreme as the observed data, assuming $H_0$ is true. A small p-value means the observed data would be very unlikely under $H_0$, providing evidence against it.
Decision rule: Reject $H_0$ if $p < \alpha$.
The p-value is NOT the probability that $H_0$ is true. It is the probability of the observed data (or more extreme) given $H_0$ is true. This distinction matters in written conclusions: say “there is sufficient evidence to reject $H_0$” rather than “we have proved $H_0$ is false.”
4.2 Chi-Squared Goodness-of-Fit Test
The chi-squared goodness-of-fit test assesses whether observed frequency data are consistent with a proposed theoretical distribution.
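Test statistic:
$$\chi^2_{\text{calc}} = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed frequency and $E_i$ is the expected frequency under $H_0$.
Under $H_0$, $\chi^2_{\text{calc}} \sim \chi^2_\nu$ where the degrees of freedom $\nu = k - 1 - m$, with $k$ = number of categories and $m$ = number of parameters estimated from the data.
Conditions: All expected frequencies $E_i \geq 5$. If some are too small, combine adjacent categories.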
Chi-Squared GOF Procedure
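| Step | Action |
|---|---|
| $H_0$ | The data follow the proposed distribution |
| $H_1$ | The data do not follow the proposed distribution |
| Degrees of freedom | $\nu = k - 1$ (if no parameters estimated); $\nu = k - 1 - m$ (if $m$ estimated) |
| Decision | Reject $H_0$ if $\chi^2_{\text{calc}} > \chi^2_{\text{crit}}$ or $p < \alpha$ |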
Chi-Squared Goodness-of-Fit
A die is rolled 120 times. The observed frequencies are:
| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Observed | 17 | 22 | 18 | 25 | 19 | 19 |
Test at the 5% significance level whether the die is fair.
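Step 1 (Hypotheses):
- $H_0$: The die is fair (each face has probability $\frac{1}{6}$)
- $H_1$: The die is not fair
Step 2 (Significance level): $\alpha = 0.05$
Step 3 (Expected frequencies): If fair, $E_i = 120 \times \frac{1}{6} = 20$ for each face.
Step 4 (Test statistic):
$$\chi^2_{\text{calc}} = \frac{(-3)^2 + 2^2 + (-2)^2 + 5^2 + (-1)^2 + (-1)^2}{20} = \frac{44}{20} = 2.2$$
Step 5 (Degrees of freedom): $\nu = 6 - 1 = 5$
Step 6 (Critical value): $\chi^2_{0.05,\,5} = 11.070$
Step 7 (Decision): $2.2 < 11.070$, so we fail to reject $H_0$.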
Conclusion: At the 5% significance level, there is insufficient evidence to conclude the die is unfair.
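The test statistic for the die example can be recomputed in a few lines. The sketch below uses only the standard library, so the critical value is taken from tables rather than computed; the variable names are illustrative.

```python
# Observed counts for faces 1..6 over 120 rolls.
observed = [17, 22, 18, 25, 19, 19]

# Under H0 (fair die) every face has the same expected count.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# The chi-squared condition is on EXPECTED frequencies, not observed ones.
assert all(e >= 5 for e in expected)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

# Assumption: 5% critical value for 5 degrees of freedom, read from tables.
critical_value = 11.070
reject_h0 = chi2 > critical_value

print(round(chi2, 2), degrees_of_freedom, reject_h0)
```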
The chi-squared test requires all expected frequencies to be at least 5, not the observed frequencies. If any $E_i < 5$, merge that category with an adjacent one before calculating $\chi^2$.
4.3 Chi-Squared Test for Independence
The chi-squared test for independence tests whether two categorical variables are associated in a contingency table.
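Expected frequency for cell $(i, j)$:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
Degrees of freedom: $\nu = (r - 1)(c - 1)$ where $r$ = number of rows and $c$ = number of columns.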
Chi-Squared Test for Independence
Use the table from Section 1.2 (exercise vs. sleep). Test at 5% significance whether exercise and sleep are independent.
| | Sleep $\geq 7$ h | Sleep $< 7$ h | Total |
|---|---|---|---|
| Exercises | 42 | 18 | 60 |
| Does not exercise | 23 | 17 | 40 |
| Total | 65 | 35 | 100 |
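Hypotheses: $H_0$: Exercise and sleep hours are independent. $H_1$: They are associated.
Expected frequencies: $E_{11} = \frac{60 \times 65}{100} = 39$, $E_{12} = \frac{60 \times 35}{100} = 21$, $E_{21} = \frac{40 \times 65}{100} = 26$, $E_{22} = \frac{40 \times 35}{100} = 14$
Test statistic:
$$\chi^2_{\text{calc}} = \frac{3^2}{39} + \frac{(-3)^2}{21} + \frac{(-3)^2}{26} + \frac{3^2}{14} \approx 1.648$$
Degrees of freedom: $\nu = (2 - 1)(2 - 1) = 1$
Critical value: $\chi^2_{0.05,\,1} = 3.841$
Decision: $1.648 < 3.841$, fail to reject $H_0$.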
Conclusion: At the 5% level, there is insufficient evidence of an association between exercise habits and sleep duration.
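The expected frequencies and test statistic for a contingency table follow mechanically from the row and column totals. A minimal sketch, assuming the table is stored as a list of rows and the critical value is read from tables:

```python
# Observed counts: rows = exercises / does not exercise,
# columns = sleeps at least 7 h / fewer than 7 h.
observed = [[42, 18],
            [23, 17]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: 5% critical value for 1 degree of freedom, read from tables.
reject_h0 = chi2 > 3.841

print(expected, round(chi2, 3), degrees_of_freedom, reject_h0)
```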
4.4 t-Test for the Mean
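The one-sample t-test tests whether the population mean $\mu$ equals a specified value $\mu_0$, when the population standard deviation $\sigma$ is unknown and the population is approximately normal.
Hypotheses:
- Two-tailed: $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$
- One-tailed: $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$ (or $H_1: \mu < \mu_0$)
Test statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. Under $H_0$, $t \sim t_{n-1}$ (t-distribution with $n - 1$ degrees of freedom).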
One-Sample t-Test
A manufacturer claims their light bulbs last a mean of 1000 hours. A random sample of 15 bulbs gives a mean of 985 hours with a standard deviation of 30 hours. Test at the 5% level whether the mean lifetime is less than claimed.
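Hypotheses: $H_0: \mu = 1000$; $H_1: \mu < 1000$ (one-tailed)
Significance level: $\alpha = 0.05$
Test statistic:
$$t = \frac{985 - 1000}{30 / \sqrt{15}} \approx -1.936$$
Degrees of freedom: $\nu = 15 - 1 = 14$
Critical value (one-tailed): $-t_{0.05,\,14} = -1.761$
Decision: $-1.936 < -1.761$, so we reject $H_0$.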
Conclusion: At the 5% significance level, there is sufficient evidence to conclude the mean bulb lifetime is less than 1000 hours.
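The t-statistic for the light bulb example is a one-line calculation. The sketch below wraps it in a hypothetical helper, `one_sample_t`, and compares it with the tabulated critical value (an assumption taken from t-tables, since the standard library has no t-distribution).

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (sample_mean - mu0) / (sample_sd / sqrt(n))

t_stat = one_sample_t(sample_mean=985, mu0=1000, sample_sd=30, n=15)

# Assumption: lower-tail 5% critical value for 14 degrees of freedom, from tables.
critical_value = -1.761

# Lower-tailed test: reject H0 when t falls below the critical value.
reject_h0 = t_stat < critical_value

print(round(t_stat, 3), reject_h0)
```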
On a GDC (Casio), the t-test is under STAT → TEST → 1-Sample tTest. Enter $\mu_0$, $\bar{x}$, $s_x$, and $n$, then select the correct tail. The GDC outputs the t-statistic and p-value directly; always report both in your working.
4.5 Two-Sample t-Test and Paired t-Test HL
Two-sample t-test: Tests whether two independent populations have the same mean.
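Test statistic (assuming equal but unknown variances, pooled):
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
with $n_1 + n_2 - 2$ degrees of freedom.
Paired t-test HL: Used when observations come in matched pairs (before/after, two measurements on the same subject).
Define the differences $d_i = x_i - y_i$. Then:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \sim t_{n-1}$$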
Use a paired t-test when the same subject is measured twice (before/after). Use a two-sample t-test when comparing two different, independent groups. Applying the wrong test to paired data ignores the correlation between measurements and leads to incorrect inference.
Paired t-Test HL
Eight students take a memory test before and after a training programme. Their scores are:
| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Before | 62 | 71 | 58 | 80 | 65 | 73 | 55 | 70 |
| After | 68 | 76 | 60 | 84 | 67 | 78 | 61 | 74 |
Test at the 5% level whether the training improves scores.
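Step 1 (Compute differences $d_i = \text{After} - \text{Before}$):
| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| $d_i$ | 6 | 5 | 2 | 4 | 2 | 5 | 6 | 4 |
Step 2 (Statistics of $d$): $\bar{d} = \frac{34}{8} = 4.25$, $s_d = \sqrt{\frac{17.5}{7}} \approx 1.581$
Step 3 (Hypotheses): $H_0: \mu_d = 0$; $H_1: \mu_d > 0$ (one-tailed)
Step 4 (Test statistic): $t = \dfrac{4.25}{1.581 / \sqrt{8}} \approx 7.60$
Step 5 (Critical value): $t_{0.05,\,7} = 1.895$
Step 6 (Decision): $7.60 > 1.895$, reject $H_0$.
Conclusion: At the 5% level, there is strong evidence that the training programme improves mean scores.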
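The paired test reduces to a one-sample test on the differences, which the following sketch makes explicit. It uses the standard `statistics` module; the critical value is again an assumption read from t-tables.

```python
from math import sqrt
from statistics import mean, stdev

before = [62, 71, 58, 80, 65, 73, 55, 70]
after = [68, 76, 60, 84, 67, 78, 61, 74]

# Work with the paired differences only (after minus before).
diffs = [a - b for a, b in zip(after, before)]

d_bar = mean(diffs)    # mean difference
s_d = stdev(diffs)     # sample standard deviation (divides by n - 1)
n = len(diffs)

t_stat = d_bar / (s_d / sqrt(n))

# Assumption: upper-tail 5% critical value for 7 degrees of freedom, from tables.
reject_h0 = t_stat > 1.895

print(diffs, d_bar, round(s_d, 3), round(t_stat, 2), reject_h0)
```

Using `stdev` rather than `pstdev` matters here: the test statistic requires the sample standard deviation with the $n - 1$ divisor.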
Section 5: Correlation and Regression (4.1–4.4)
Bivariate data involves two variables measured on the same individual. We explore whether changes in one variable are associated with changes in the other.
5.1 Scatter Diagrams
A scatter diagram plots pairs $(x_i, y_i)$ to visualise the relationship between two variables. Key features to comment on:
- Direction: positive (up-right) or negative (down-right) trend
- Strength: how closely do points follow a line?
- Form: linear or non-linear?
- Outliers: points far from the general pattern
5.2 Pearson’s Correlation Coefficient
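The Pearson product-moment correlation coefficient (PMCC) measures the strength and direction of the linear relationship between two variables:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
where:
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2$$
Properties of $r$:
- $r = 1$: perfect positive linear correlation
- $r = -1$: perfect negative linear correlation
- $r = 0$: no linear correlation (but there may be a non-linear relationship)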
Interpreting $r$: Approximate Guidelines
| $\lvert r \rvert$ value | Interpretation |
|---|---|
| – | Very weak or no linear correlation |
| – | Weak linear correlation |
| – | Moderate linear correlation |
| – | Strong linear correlation |
| – | Very strong linear correlation |
These are guidelines, not rigid rules. Context matters.
$r = 0.9$ does NOT mean “90% of the variation is explained.” The coefficient of determination $r^2 = 0.81$ tells you that 81% of the variation in $y$ is explained by the linear relationship with $x$. Students regularly confuse $r$ with $r^2$.
Correlation does not imply causation. Even a very high value of $\lvert r \rvert$ does not mean that changes in $x$ cause changes in $y$. There may be a lurking variable, or the relationship may be coincidental.
Computing PMCC
Five students’ hours of study ($x$) and exam marks ($y$) are:
| Student | $x$ | $y$ |
|---|---|---|
| A | 2 | 50 |
| B | 4 | 65 |
| C | 6 | 72 |
| D | 8 | 80 |
| E | 10 | 90 |
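Calculate $r$.
Step 1 (Means): $\bar{x} = 6$, $\bar{y} = 71.4$
Step 2 (Sums of squares):
| Student | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ | $(x - \bar{x})^2$ | $(y - \bar{y})^2$ |
|---|---|---|---|---|---|
| A | $-4$ | $-21.4$ | 85.6 | 16 | 457.96 |
| B | $-2$ | $-6.4$ | 12.8 | 4 | 40.96 |
| C | 0 | 0.6 | 0 | 0 | 0.36 |
| D | 2 | 8.6 | 17.2 | 4 | 73.96 |
| E | 4 | 18.6 | 74.4 | 16 | 345.96 |
| Sum | | | 190 | 40 | 919.2 |
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{190}{\sqrt{40 \times 919.2}} \approx 0.991$$
Very strong positive linear correlation.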
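The table of deviations above can be generated rather than filled in by hand. The following sketch computes $S_{xy}$, $S_{xx}$, $S_{yy}$ and $r$ for the study data; the function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

hours = [2, 4, 6, 8, 10]
marks = [50, 65, 72, 80, 90]

def pearson_r(xs, ys):
    """Return (S_xy, S_xx, S_yy, r) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy, s_xy / sqrt(s_xx * s_yy)

s_xy, s_xx, s_yy, r = pearson_r(hours, marks)
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 3))
print("r squared:", round(r ** 2, 3))
```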
5.3 Spearman’s Rank Correlation Coefficient HL
Spearman’s rank correlation coefficient measures the strength and direction of a monotonic relationship between two variables. It is appropriate when:
- Data are ordinal (ranked categories), or
- The bivariate data are not normally distributed, or
- The relationship may be monotonic but not necessarily linear
Formula:
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference in ranks for the $i$-th pair and $n$ is the number of pairs.
Tied ranks: If two or more values are equal, assign each the average of the ranks they would have occupied. For example, if the 3rd and 4th values are tied, both receive rank $\frac{3 + 4}{2} = 3.5$.
On Paper 3, your GDC can calculate Spearman’s rank correlation directly, but you must also be able to rank data by hand and apply the formula step by step. Expect to show the ranking process and the $d^2$ table in your written working.
Two common errors: (1) forgetting to rank the raw data before computing differences: $d_i$ is the difference in ranks, not in raw values; (2) confusing Spearman’s $r_s$ with Pearson’s $r$. Pearson’s $r$ measures linear association and requires roughly bivariate-normal data; Spearman’s $r_s$ measures monotonic association and makes no distributional assumption.
Spearman’s Rank — Step-by-Step
Six athletes are assessed by two judges. Their scores are:
| Athlete | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Judge 1 score | 85 | 70 | 92 | 78 | 65 | 88 |
| Judge 2 score | 80 | 74 | 90 | 75 | 68 | 85 |
Calculate Spearman’s rank correlation coefficient.
Step 1 (Rank each judge’s scores, rank 1 = highest):
| Athlete | Judge 1 score | Rank 1 | Judge 2 score | Rank 2 |
|---|---|---|---|---|
| A | 85 | 3 | 80 | 3 |
| B | 70 | 5 | 74 | 5 |
| C | 92 | 1 | 90 | 1 |
| D | 78 | 4 | 75 | 4 |
| E | 65 | 6 | 68 | 6 |
| F | 88 | 2 | 85 | 2 |
Note that Judge 2 gives D a score of 75 and B a score of 74, so D ranks 4th and B ranks 5th.
Step 2 (Compute $d_i$ and $d_i^2$):
| Athlete | Rank 1 | Rank 2 | $d_i$ | $d_i^2$ |
|---|---|---|---|---|
| A | 3 | 3 | 0 | 0 |
| B | 5 | 5 | 0 | 0 |
| C | 1 | 1 | 0 | 0 |
| D | 4 | 4 | 0 | 0 |
| E | 6 | 6 | 0 | 0 |
| F | 2 | 2 | 0 | 0 |
| Sum | | | | 0 |
Step 3 (Apply the formula with $n = 6$, $\sum d_i^2 = 0$):
$$r_s = 1 - \frac{6 \times 0}{6(6^2 - 1)} = 1$$
Conclusion: $r_s = 1$ indicates perfect agreement between the two judges’ rankings, even though their raw scores differ.
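Ranking by hand is where most slips occur, so it can help to check the ranks by machine. The sketch below ranks each judge's scores (rank 1 = highest, ties sharing the average rank) and applies the $\sum d^2$ formula; the helper names are hypothetical, and the simple formula is only exact when there are no ties.

```python
def ranks_descending(values):
    """Rank 1 = highest value; tied values share the average of their ranks."""
    ordered = sorted(values, reverse=True)
    # A value first appearing at 0-based index i and occurring c times
    # occupies ranks i+1 .. i+c, whose average is (2i + c + 1) / 2.
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]

def spearman(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)), computed on ranks."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_1 = [85, 70, 92, 78, 65, 88]  # athletes A..F
judge_2 = [80, 74, 90, 75, 68, 85]

print(ranks_descending(judge_1))
print(ranks_descending(judge_2))
print(spearman(judge_1, judge_2))
```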
5.4 Linear Regression
The regression line of $y$ on $x$ (also called the least-squares regression line) minimises the sum of squared residuals. Its equation is:
$$y = ax + b$$
where the slope $a$ and intercept $b$ are:
$$a = \frac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\bar{x}$$
The regression line always passes through the point of means $(\bar{x}, \bar{y})$.
In IB notation the regression line is written $y = ax + b$ (not $y = mx + c$). The GDC (Casio) gives $a$ and $b$ directly via STAT → REG → ax+b. Always write out the full equation with the numerical values of $a$ and $b$, not just “the regression line.”
Finding and Using the Regression Line
Using the data from the PMCC example above, find the regression line and estimate the exam mark for a student who studies for 7 hours.
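Step 1 (Calculate $a$): $a = \dfrac{S_{xy}}{S_{xx}} = \dfrac{190}{40} = 4.75$
Step 2 (Calculate $b$): $b = \bar{y} - a\bar{x} = 71.4 - 4.75 \times 6 = 42.9$
Regression line: $y = 4.75x + 42.9$
Prediction at $x = 7$: $y = 4.75 \times 7 + 42.9 = 76.15$, so about 76 marks.
Do not extrapolate beyond the data range. The regression line is only reliable for values within the range of the original data (here, $2 \leq x \leq 10$). Predicting outside this range, for example estimating the score for 20 hours of study, may give absurd or meaningless results. State this limitation explicitly if asked.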
Coefficient of Determination
For the study data, $r \approx 0.991$. Interpret $r^2$.
$$r^2 \approx 0.991^2 \approx 0.982$$
Interpretation: Approximately 98.2% of the variation in exam marks is explained by the linear relationship with hours of study. This suggests the linear model is an excellent fit for these data.
Regression with GDC — Full Workflow
The following data shows temperature ($x$, in °C) and ice cream sales per day ($y$, in units):
| $x$ | 15 | 18 | 22 | 25 | 28 | 30 |
|---|---|---|---|---|---|---|
| $y$ | 80 | 110 | 145 | 170 | 200 | 220 |
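(a) Find $r$ and the regression line. (b) Estimate sales when temperature is 20°C. (c) Comment on reliability.
Part (a), using GDC (Casio):
- Enter $x$ values in List 1, $y$ values in List 2
- STAT → CALC → 2VAR to obtain: $\bar{x} = 23$, $\bar{y} \approx 154.2$, $r \approx 0.9996$
- STAT → REG → ax+b: $a \approx 9.196$, $b \approx -57.35$
Regression line: $y = 9.196x - 57.35$
Part (b): $y = 9.196 \times 20 - 57.35 \approx 126.6$, so about 127 units
Part (c): $r \approx 0.9996$, which indicates a very strong positive linear correlation. Since $x = 20$ lies within the data range $15 \leq x \leq 30$, this is interpolation and the prediction is reliable. The model explains $r^2 \approx 99.9\%$ of the variation in sales.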
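For readers who want to confirm the GDC output, the least-squares formulas can be applied to the temperature data directly. This is a sketch under the convention slope $= S_{xy}/S_{xx}$ and intercept $= \bar{y} - \text{slope} \times \bar{x}$; the function name `least_squares` is illustrative.

```python
from math import sqrt

temps = [15, 18, 22, 25, 28, 30]        # x: temperature in degrees Celsius
sales = [80, 110, 145, 170, 200, 220]   # y: ice cream sales per day

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the regression line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar   # line passes through (x_bar, y_bar)
    return slope, intercept, s_xy / sqrt(s_xx * s_yy)

slope, intercept, r = least_squares(temps, sales)

# Only predict inside the observed range (interpolation, not extrapolation).
x_new = 20
assert min(temps) <= x_new <= max(temps)
estimate = slope * x_new + intercept

print(round(slope, 3), round(intercept, 2), round(r, 4), round(estimate, 1))
```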
Section 6: Quick Reference
Probability Rules
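| Rule | Formula |
|---|---|
| Addition | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| Complement | $P(A') = 1 - P(A)$ |
| Conditional | $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ |
| Multiplication | $P(A \cap B) = P(A \mid B)\,P(B)$ |
| Independence | $P(A \cap B) = P(A)\,P(B)$ |
| Bayes’ theorem | $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A')\,P(A')}$ |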
Discrete Distributions
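| Distribution | Parameters | P.M.F. | Mean | Variance |
|---|---|---|---|---|
| General DRV | none | given | $\sum x\,P(X = x)$ | $E(X^2) - [E(X)]^2$ |
| Binomial | $n, p$ | $\binom{n}{k} p^k (1 - p)^{n - k}$ | $np$ | $np(1 - p)$ |
| Poisson | $\lambda$ | $\dfrac{e^{-\lambda} \lambda^k}{k!}$ | $\lambda$ | $\lambda$ |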
Continuous Distributions
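| Distribution | Parameters | Key formula | Mean | Variance |
|---|---|---|---|---|
| Normal | $\mu, \sigma$ | $Z = \dfrac{X - \mu}{\sigma}$ | $\mu$ | $\sigma^2$ |
| Standard Normal | $\mu = 0, \sigma = 1$ | $P(Z \leq z)$ from table/GDC | 0 | 1 |
68-95-99.7 rule: $P(\mu - k\sigma < X < \mu + k\sigma) \approx 0.683,\ 0.954,\ 0.997$ for $k = 1, 2, 3$.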
Hypothesis Testing Summary
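| Test | $H_0$ | Test statistic | Degrees of freedom | Conditions |
|---|---|---|---|---|
| $\chi^2$ GOF | Distribution fits | $\sum \dfrac{(O_i - E_i)^2}{E_i}$ | $k - 1 - m$ | $E_i \geq 5$ in all cells |
| $\chi^2$ independence | Variables independent | $\sum \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$ | $(r - 1)(c - 1)$ | $E_{ij} \geq 5$ in all cells |
| 1-sample $t$ | $\mu = \mu_0$ | $\dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$ | $n - 1$ | Normal population or large $n$ |
| 2-sample $t$ | $\mu_1 = \mu_2$ | $\dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$ (pooled) | $n_1 + n_2 - 2$ | Independent samples |
| Paired $t$ HL | $\mu_d = 0$ | $\dfrac{\bar{d}}{s_d / \sqrt{n}}$ | $n - 1$ | Matched pairs |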
Correlation and Regression
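| Quantity | Formula |
|---|---|
| PMCC | $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$ |
| Spearman’s rank HL | $r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$ |
| Regression slope | $a = \dfrac{S_{xy}}{S_{xx}}$ |
| Regression intercept | $b = \bar{y} - a\bar{x}$ |
| Regression line | $y = ax + b$; passes through $(\bar{x}, \bar{y})$ |
| Coefficient of determination | $r^2$ = proportion of variance in $y$ explained by $x$ |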
Linear Transformation Rules
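| Property | Formula |
|---|---|
| Expectation | $E(aX + b) = a\,E(X) + b$ |
| Variance | $\text{Var}(aX + b) = a^2\,\text{Var}(X)$ |
| Sum or difference ($X$, $Y$ independent) | $\text{Var}(X \pm Y) = \text{Var}(X) + \text{Var}(Y)$ |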
Mixed Practice — Exam Style
How to use this section: Unlike topic-specific practice, these questions are interleaved — they mix all topics from this guide in random order. Before answering, identify which concept or topic area the question is testing. This is exactly the skill you need on Paper 2 and Paper 3, where you don’t know in advance which topic each question covers.
-
[Normal Distribution] A continuous random variable . Find .
A. 0.7745
B. 0.8186
C. 0.9772
D. 0.6827
-
[Conditional Probability] A bag contains 4 red and 6 blue balls. Two balls are drawn without replacement. Given that the first ball drawn is red, what is the probability the second is also red?
A.
B.
C.
D.
-
[Poisson Distribution] A radioactive source emits on average 3 particles per second. Find the probability that exactly 5 particles are emitted in a given second. Leave your answer in exact form.
A.
B.
C.
D.
-
[Hypothesis Testing — Chi-Squared] A chi-squared test of independence at the 5% significance level gives with 2 degrees of freedom. The critical value is 5.991. What is the correct conclusion?
A. Accept $H_0$: there is sufficient evidence of association
B. Reject $H_0$: there is sufficient evidence of association at the 5% level
C. Reject $H_0$: the variables are definitely not independent
D. Accept $H_0$: the test is inconclusive because the p-value is unknown
-
[Bayes’ Theorem] A medical test for a disease has sensitivity 95% (probability of a positive result given disease) and specificity 90% (probability of a negative result given no disease). The disease prevalence is 2%. A patient tests positive. What is the approximate probability they have the disease?
A. 95%
B. 16%
C. 50%
D. 2%
-
[Binomial Distribution] A fair coin is tossed 8 times. Find the probability of obtaining fewer than 3 heads.
A.
B.
C.
D.
-
[Correlation and Regression] The regression line of on is and . Find .
A.
B.
C.
D.
-
[Normal Distribution — Inverse] with and . Which system of equations is correct?
A. and
B. and
C. and
D. and
-
[Conditional Probability — Independence] Events and satisfy , , and . Which statement is correct?
A. and are mutually exclusive
B. and are independent
C. and are neither mutually exclusive nor independent
D. and are both mutually exclusive and independent
-
[Hypothesis Testing: Interpretation] A student performs a one-tailed $t$-test and obtains $p = 0.032$. At the 5% significance level, the correct interpretation is:
A. There is a 3.2% probability that $H_0$ is true
B. There is a 3.2% probability that the result occurred by chance if $H_0$ is true; reject $H_0$
C. There is a 96.8% probability that $H_1$ is true
D. The result is not statistically significant at the 5% level
Show Answers
-
B — 0.8186. Standardise: , so and . From standard normal tables: . A (0.7745) corresponds to . C (0.9772) is , the one-tailed cumulative — a common error from reading the table without subtracting the lower tail. D (0.6827) is , ignoring the upper bound of .
-
C: $\frac{3}{9} = \frac{1}{3}$. After removing one red ball, 3 red remain out of 9 total. A uses the unconditional probability of red. D applies the wrong denominator. B treats draws as independent (with replacement).
-
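A: $\dfrac{e^{-3}\,3^5}{5!}$. Poisson formula: $P(X = k) = \dfrac{e^{-\lambda} \lambda^k}{k!}$ with $\lambda = 3$, $k = 5$. B reverses $\lambda$ and $k$. C uses a binomial formula incorrectly applied to a Poisson context.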
-
B: Reject $H_0$. Since the calculated $\chi^2$ exceeds 5.991 (critical value), we reject $H_0$ and conclude there is sufficient evidence of association at the 5% significance level. C is incorrect: rejecting $H_0$ means we have statistical evidence, not certainty. D is incorrect: we do not need the exact p-value to compare with the critical value.
-
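B: Approximately 16%. Using Bayes’ theorem: $P(D \mid +) = \dfrac{0.95 \times 0.02}{0.95 \times 0.02 + 0.10 \times 0.98} = \dfrac{0.019}{0.117} \approx 0.162$. This counterintuitive result (a “positive” test only gives 16% probability of disease) arises because the disease is rare; overlooking this is the base rate neglect fallacy.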
-
B — . “Fewer than 3” means . D is equivalent (complement) and also correct. A gives only . C has no probabilistic basis.
-
A — . The regression line always passes through : .
-
A — Correct system. means (upper tail). means (lower tail). B incorrectly uses the probabilities themselves as z-scores. This question requires knowing standard normal inverse values.
-
B: $A$ and $B$ are independent. Check: $P(A)\,P(B) = P(A \cap B)$. Since the product of probabilities equals the joint probability, the events are independent. A requires $P(A \cap B) = 0$, which is not the case here.
-
B: Reject $H_0$; the p-value (0.032) is less than the significance level (0.05). A is a classic misinterpretation of p-values: the p-value is NOT the probability that $H_0$ is true. C is another common misinterpretation; the p-value says nothing directly about the probability of $H_1$.
IB Math IA Ideas — Statistics and Probability
Exploration topics from this chapter:
- Does home advantage exist in football? Collect win/draw/loss records for home and away matches across a season and use a chi-squared test of independence to determine whether venue is statistically associated with result. Extend by comparing leagues across different countries to investigate whether the effect size varies.
- Modelling goal-scoring with the Poisson distribution: Goals per match in football (or other sports) often follow a Poisson distribution $\text{Po}(\lambda)$. Estimate $\lambda$ from real data, perform a goodness-of-fit chi-squared test, and investigate whether the Poisson assumption holds equally well for high- and low-scoring teams.
- The birthday problem and simulation: Derive analytically the probability that at least two people in a group of $n$ share a birthday, then verify with a Monte Carlo simulation. Extend to non-uniform birthday distributions using real birth-rate data (e.g., from national statistics offices) to see how much the real probability deviates from the uniform-distribution model.
- Income inequality and the Gini coefficient: Obtain income-distribution data from the World Bank or OECD. Fit a log-normal distribution to model incomes, compute the theoretical Gini coefficient from the distribution parameters, and compare to the empirical value. Investigate how the Gini coefficient has changed over time for a country of your choice.
- Regression analysis in sport or health: Choose two quantitative variables with a plausible causal link (e.g., hours of sleep and reaction time, training load and performance, or diet and cholesterol). Collect or source real data, compute the regression line and $r$, test the significance of the correlation, and critically evaluate confounding factors.
- Bayesian updating and medical testing: Use Bayes’ theorem to model how the probability that a patient has a disease changes as successive independent tests come back positive. Investigate how sensitivity, specificity, and prevalence interact, and calculate the number of positive tests needed to exceed a 95% posterior probability of disease.
- Does music tempo affect heart rate? A hypothesis test: Design a small experiment: measure resting heart rate, play fast and slow music, measure again. Use a paired $t$-test to test whether tempo has a significant effect. Discuss Type I and Type II errors and how sample size affects the power of the test.
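For the birthday-problem idea in the list above, the analytic result and the Monte Carlo check can be sketched in a few lines. This is only a starting point under the uniform-birthday model (365 equally likely days, no leap years); the function names, the trial count and the fixed seed are illustrative choices, not part of any IB requirement.

```python
import random


def birthday_exact(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday), uniform model.

    Uses the complement: 1 - P(all n birthdays are distinct).
    """
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (days - k) / days
    return 1.0 - p_distinct


def birthday_simulated(n: int, trials: int = 20000, days: int = 365,
                       seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability (seeded for repeatability)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        if len(set(birthdays)) < n:  # a repeated day means a shared birthday
            hits += 1
    return hits / trials


exact_23 = birthday_exact(23)
simulated_23 = birthday_simulated(23)
print(f"exact: {exact_23:.4f}, simulated: {simulated_23:.4f}")
```

For the non-uniform extension, one option would be to replace `rng.randrange(days)` with a weighted draw built from real birth-rate data, then compare the estimate against `birthday_exact`.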
Tip: A strong IA has a clear personal engagement angle. Pick a topic that connects to something you genuinely find interesting — sport, health, economics, or psychology — and let the mathematics serve your question, not the other way around.
May 2026 Prediction Questions
These are NOT official IB questions. These are trend-based practice questions written to reflect the topic areas and question styles most likely to appear on the May 2026 IB Math AA HL Paper 2. Based on recent exam patterns (2022-2025), expect heavy weighting on: hypothesis testing (chi-squared and $t$-tests), normal distribution calculations, Bayes’ theorem, and linear regression.
Question 1 [Hypothesis Testing] [~8 marks]
A factory claims that the mean mass of its cereal boxes is 500 g. A quality inspector takes a random sample of 12 boxes and records the following masses (in grams):
(a) State appropriate null and alternative hypotheses for a two-tailed test.
(b) The sample mean is g and the sample standard deviation is g. Calculate the $t$-statistic for this test.
(c) The critical values for a two-tailed $t$-test at the 5% significance level with 11 degrees of freedom are $\pm 2.201$. State the conclusion of the test, justifying your answer.
(d) State one assumption required for this test to be valid.
Solution
Part (a) — Hypotheses
$H_0: \mu = 500$ and $H_1: \mu \neq 500$, where $\mu$ is the population mean mass in grams.
(Two-tailed test because we are checking whether the mean differs from 500 in either direction.)
Part (b): $t$-statistic
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, with $\mu_0 = 500$ and $n = 12$.
Part (c) — Conclusion
$|t| < 2.201$ (the critical value).
Since the test statistic does not fall in the rejection region ($|t| > 2.201$), we do not reject $H_0$.
There is insufficient evidence at the 5% significance level to conclude that the mean mass of cereal boxes differs from 500 g.
Part (d) — Assumption
The masses of the cereal boxes are normally distributed in the population. (This is required for a $t$-test with a small sample size, $n = 12$.)
Answer: since $|t| < 2.201$, do not reject $H_0$. There is insufficient evidence that the mean mass differs from 500 g. The test assumes normality of the population.
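The steps of this test can be sketched in code. The twelve masses below are HYPOTHETICAL placeholders invented for illustration (the data for this question is not reproduced here), so the resulting $t$-value is not the answer to Question 1; only the method and the critical value 2.201 for 11 degrees of freedom carry over.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # divides by n - 1
    return (mean - mu0) / (s / math.sqrt(n))


# HYPOTHETICAL sample of 12 masses in grams, for illustration only.
masses = [498.2, 501.5, 499.8, 502.1, 497.6, 500.4,
          499.1, 501.0, 498.7, 500.9, 499.5, 500.2]

T_CRIT = 2.201  # two-tailed, 5% level, 11 degrees of freedom

t_stat = one_sample_t(masses, mu0=500.0)
reject_h0 = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

On a GDC the same calculation is done by the built-in one-sample $t$-test, which also reports the p-value; comparing $|t|$ with the critical value and comparing the p-value with 0.05 lead to the same decision.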
Question 2 [Bayes’ Theorem] [~7 marks]
A medical screening test for a disease has the following properties:
- The probability of a positive result given the patient has the disease (sensitivity) is 0.95.
- The probability of a negative result given the patient does not have the disease (specificity) is 0.90.
- The prevalence of the disease in the population is 0.02.
(a) Construct a tree diagram or define the events and their probabilities.
(b) Find the probability that a randomly selected person tests positive.
(c) Find the probability that a person who tests positive actually has the disease.
(d) Comment on the practical implications of your answer to part (c).
Solution
Part (a) — Events and probabilities
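Let $D$ = has disease, $D'$ = does not have disease, $T^+$ = tests positive, $T^-$ = tests negative.
$P(D) = 0.02$, $P(D') = 0.98$, $P(T^+ \mid D) = 0.95$, $P(T^- \mid D) = 0.05$, $P(T^- \mid D') = 0.90$, $P(T^+ \mid D') = 0.10$.
Part (b): $P(T^+)$ using the law of total probability
$P(T^+) = P(T^+ \mid D)\,P(D) + P(T^+ \mid D')\,P(D') = (0.95)(0.02) + (0.10)(0.98) = 0.019 + 0.098 = 0.117$
Part (c): Bayes’ theorem
$P(D \mid T^+) = \frac{P(T^+ \mid D)\,P(D)}{P(T^+)} = \frac{0.019}{0.117} \approx 0.162$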
Part (d) — Practical implications
The probability that a person who tests positive actually has the disease is only about 16.2%. This means approximately 84% of positive results are false positives.
This occurs because the disease prevalence is very low (2%). Even though the test is accurate (95% sensitivity, 90% specificity), the large number of healthy people tested generates many false positives that overwhelm the true positives.
Practically, a positive screening result should not be treated as a diagnosis — a confirmatory test should follow.
Answer: $P(\text{positive}) = 0.117$; $P(\text{disease} \mid \text{positive}) \approx 0.162$ (about 16.2%). Most positive results are false positives due to low disease prevalence, highlighting the need for confirmatory testing.
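As a check on the arithmetic, the two probabilities can be recomputed from the three values given in the question (sensitivity 0.95, specificity 0.90, prevalence 0.02). The function name `screening_posterior` is an illustrative choice.

```python
def screening_posterior(prevalence, sensitivity, specificity):
    """Return (P(positive), P(disease | positive)) for a screening test."""
    true_pos = sensitivity * prevalence                  # P(positive and disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(positive and no disease)
    p_positive = true_pos + false_pos                    # law of total probability
    return p_positive, true_pos / p_positive             # Bayes' theorem


p_positive, p_disease_given_positive = screening_posterior(
    prevalence=0.02, sensitivity=0.95, specificity=0.90
)

print(f"P(positive) = {p_positive:.3f}")
print(f"P(disease | positive) = {p_disease_given_positive:.3f}")
```

Re-running the function with a higher prevalence (for example 0.20) shows how strongly the answer to part (c) depends on the base rate rather than on the accuracy of the test alone.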
Question 3 [Regression and Correlation] [~6 marks]
A researcher collects data on the number of hours spent studying ($x$) and exam score ($y$) for 8 students. The regression line of $y$ on $x$ is found to be:
$y = 3.2x + 42.5$
The Pearson correlation coefficient is $r = 0.87$ and $\bar{x} = 6.5$.
(a) Interpret the value of the gradient (3.2) in context.
(b) Calculate the predicted exam score for a student who studies for 6.5 hours.
(c) Explain why it would be unreliable to use this model to predict the exam score for a student who studies for 20 hours.
(d) The coefficient of determination is $r^2$. Calculate $r^2$ and interpret it in context.
Solution
Part (a) — Interpretation of gradient
The gradient 3.2 means that for each additional hour of study, the predicted exam score increases by 3.2 marks on average.
Part (b): Predicted score at $x = 6.5$
$\hat{y} = 3.2(6.5) + 42.5 = 63.3$
Note: This also confirms that $\bar{y} = 63.3$, since the regression line always passes through $(\bar{x}, \bar{y})$.
Part (c) — Extrapolation
Predicting for $x = 20$ hours would be extrapolation: using the model outside the range of the observed data.
- The data was collected for a sample of students whose study hours were centred around $\bar{x} = 6.5$. The linear relationship may not hold at $x = 20$.
- Diminishing returns on study time, fatigue, or a maximum possible score could mean the relationship is non-linear at extreme values.
- The prediction would be unreliable because we have no data to support the model’s validity at 20 hours.
Part (d) — Coefficient of determination
$r^2 = 0.87^2 \approx 0.757$
Interpretation: Approximately 75.7% of the variation in exam scores can be explained by the linear relationship with hours spent studying. The remaining 24.3% is due to other factors (natural ability, exam technique, etc.).
Answer: Gradient means +3.2 marks per extra hour studied. Predicted score at 6.5 hours is 63.3. Extrapolation to 20 hours is unreliable. $r^2 \approx 0.757$, so 75.7% of score variation is explained by study hours.
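The numbers in this solution can be reproduced with a short sketch. Two values here are assumptions rather than quoted facts: the intercept 42.5 is inferred from the stated gradient (3.2) and the stated prediction (63.3 at 6.5 hours), and $r = 0.87$ is inferred from the stated 75.7%. The 100-mark ceiling in the last step is also an assumption made only to illustrate the extrapolation problem.

```python
GRADIENT = 3.2      # stated in the question
INTERCEPT = 42.5    # ASSUMPTION: inferred from 63.3 = 3.2 * 6.5 + intercept
R = 0.87            # ASSUMPTION: inferred from r^2 being about 0.757


def predict_score(hours: float) -> float:
    """Predicted exam score from the regression line of y on x."""
    return GRADIENT * hours + INTERCEPT


score_at_mean = predict_score(6.5)   # interpolation, near the centre of the data
score_at_20 = predict_score(20.0)    # extrapolation, far outside the data
r_squared = R ** 2

print(f"predicted score at 6.5 h: {score_at_mean:.1f}")
print(f"predicted score at 20 h:  {score_at_20:.1f}")
print(f"r^2 = {r_squared:.4f}")

# ASSUMPTION: if the exam were marked out of 100, the 20-hour prediction
# would exceed the maximum possible score, which is one concrete way the
# linear model can fail outside the observed range.
exceeds_assumed_maximum = score_at_20 > 100
```

The extrapolated value is arithmetically valid but has no data behind it, which is the point part (c) asks you to explain in words.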
IB Formula Booklet — Complex Numbers
Modulus & Polar Form
| GIVEN | z = r(cosθ + i sinθ) = r cis θ |
| GIVEN | z = re^(iθ) (Euler form) |
| MEMORISE | |z| = √(a² + b²) |
| MEMORISE | arg(z) — sketch point, use quadrant formula |
Polar Multiplication & Division
| GIVEN | z₁z₂ = r₁r₂ cis(θ₁ + θ₂) |
| GIVEN | z₁/z₂ = (r₁/r₂) cis(θ₁ − θ₂) |
De Moivre's Theorem
| GIVEN | (r cis θ)ⁿ = rⁿ cis(nθ) |
| MEMORISE | z + 1/z = 2cosθ (when |z|=1) |
| MEMORISE | z − 1/z = 2i sinθ (when |z|=1) |
nth Roots
| GIVEN | w^(1/n) = r^(1/n) cis((θ + 2πk)/n), k = 0, 1, ..., n − 1 |
| MEMORISE | Sum of nth roots of unity = 0 |
| MEMORISE | 1 + ω + ω² = 0 (cube roots) |
Conjugate & Arithmetic
| MEMORISE | z* = a − bi |
| MEMORISE | z · z* = |z|² (always real) |
| MEMORISE | z + z* = 2Re(z) |
| MEMORISE | z − z* = 2i Im(z) |
Loci
| MEMORISE | |z − a| = r → Circle, centre a, radius r |
| MEMORISE | |z − a| = |z − b| → Perpendicular bisector |
| MEMORISE | arg(z − a) = θ → Ray from a |
Vieta's Formulas
| MEMORISE | z² + az + b = 0: sum = −a, product = b |
| MEMORISE | Conjugate root theorem: real coeff → roots come in conjugate pairs |