Statistics and Probability
IB Math AI SL — Statistics and Probability
Complete Study Guide
Topics Covered
- Descriptive Statistics — measures of central tendency and spread
- Data Presentation — histograms, box plots, cumulative frequency
- Probability — combined events, conditional probability, tree diagrams
- Probability Distributions — binomial and normal
- Statistical Tests — chi-squared test for independence
- Correlation and Regression — linear regression, $r$ and $r^2$
- Practice Questions and Exam Alerts
Topic 4 of the IB Math AI SL syllabus — this is the largest topic at 36 SL hours.
The heart of Math AI: Statistics and probability makes up the largest portion of the syllabus and is heavily represented on both papers. Expect at least one full extended-response question on Paper 2 to be purely statistical. Master your GDC’s statistics functions — they are essential.
Key statistics formulas
| Measure | Formula |
|---|---|
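| Mean | $\bar{x} = \frac{\sum x_i}{n}$ or $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ |
| Standard deviation | $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$ (population) |
| Interquartile range | $\text{IQR} = Q_3 - Q_1$ |
| Probability | $P(A) = \frac{n(A)}{n(U)}$ |
| Combined events | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| Conditional probability | $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ |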
Section 1: Descriptive Statistics
1.1 Measures of Central Tendency
| Measure | What it tells you | When to use |
|---|---|---|
| Mean ($\bar{x}$) | Average value | Symmetric data, no extreme outliers |
| Median | Middle value | Skewed data or data with outliers |
| Mode | Most frequent value | Categorical data |
1.2 Measures of Spread
| Measure | Formula / Description |
|---|---|
| Range | $x_{\max} - x_{\min}$ |
| Interquartile range (IQR) | $Q_3 - Q_1$ (middle 50% of data) |
| Standard deviation ($\sigma$ or $s_x$) | Measures spread around the mean |
| Variance | $\sigma^2$ (the square of standard deviation) |
GDC for statistics: Enter data into a list and use 1-Var Stats (TI-84) or STAT CALC (Casio). This gives you the mean, median, quartiles, standard deviation, and more instantly. Never calculate these by hand on Paper 2.
1.3 Outliers
An outlier is typically defined as any value:
- Below $Q_1 - 1.5 \times \text{IQR}$, or
- Above $Q_3 + 1.5 \times \text{IQR}$
Descriptive statistics in context
The daily rainfall (mm) in a city over 10 days: 0, 0, 2, 3, 5, 7, 8, 12, 15, 48.
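Using GDC: $\bar{x} = 10.0$, median $= 6.0$, $Q_1 = 2$, $Q_3 = 12$, $\sigma \approx 13.5$.
$\text{IQR} = Q_3 - Q_1 = 12 - 2 = 10$.
Outlier boundaries: $2 - 1.5(10) = -13$ and $12 + 1.5(10) = 27$. The value 48 is above 27, so it is an outlier.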
The median (6.0) is a better measure of centre than the mean (10.0) because the outlier pulls the mean up.
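The rainfall figures can be cross-checked away from the GDC. The sketch below is illustrative only: the helper name `quartiles` is hypothetical, and it assumes the median-of-halves quartile convention, which reproduces the values quoted in this example (other software may use a different quartile rule).

```python
from statistics import mean, median, pstdev


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # values below the median position
    upper = xs[half + n % 2:]    # values above the median position
    return median(lower), median(xs), median(upper)


rainfall = [0, 0, 2, 3, 5, 7, 8, 12, 15, 48]

q1, med, q3 = quartiles(rainfall)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr       # lower outlier boundary
high_fence = q3 + 1.5 * iqr      # upper outlier boundary
outliers = [x for x in rainfall if x < low_fence or x > high_fence]

print("mean:", mean(rainfall), "median:", med)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("population sd:", round(pstdev(rainfall), 1))
print("fences:", low_fence, high_fence, "outliers:", outliers)
```

Removing the flagged value and rerunning shows how strongly a single outlier moves the mean compared with the median.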
Section 2: Data Presentation
2.1 Histograms
A histogram shows the distribution of continuous data. The area of each bar is proportional to the frequency.
For unequal class widths, the $y$-axis shows frequency density = $\frac{\text{frequency}}{\text{class width}}$.
2.2 Cumulative Frequency Curves
Plot cumulative frequency against the upper boundary of each class. Use $n$ (the total frequency) to read off:
- Median: at $\frac{n}{2}$
- $Q_1$: at $\frac{n}{4}$
- $Q_3$: at $\frac{3n}{4}$
- Percentiles: at the appropriate fraction of $n$
2.3 Box Plots (Box-and-Whisker Diagrams)
A box plot shows: minimum, $Q_1$, median, $Q_3$, maximum. Outliers are shown as individual points.
Comparing distributions: When two box plots are shown side by side, compare:
- Centre (median) — which group has higher/lower values
- Spread (IQR and range) — which group is more variable
- Skewness — if the median is closer to $Q_1$, data is positively skewed
Box plot comparison questions are almost guaranteed on the exam. Always make two comparisons (one about centre, one about spread) and refer to the context (not just “Dataset A has a higher median” but “Students in Class A scored higher on average”).
Section 3: Probability
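3.1 Basic Probability
$P(A) = \frac{n(A)}{n(U)}$
3.2 Combined Events
Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
If $A$ and $B$ are mutually exclusive ($P(A \cap B) = 0$): $P(A \cup B) = P(A) + P(B)$
Multiplication rule: $P(A \cap B) = P(A) \times P(B \mid A)$
If $A$ and $B$ are independent: $P(A \cap B) = P(A) \times P(B)$
3.3 Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$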
Conditional probability — medical testing
A disease affects 2% of a population. A test correctly identifies the disease 95% of the time (sensitivity) and correctly identifies healthy people 90% of the time (specificity). If a person tests positive, what is the probability they have the disease?
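Let $D$ = has disease, $T$ = tests positive.
$P(D) = 0.02$, $P(T \mid D) = 0.95$, $P(T \mid D') = 1 - 0.90 = 0.10$.
$P(T) = P(D)\,P(T \mid D) + P(D')\,P(T \mid D') = 0.02(0.95) + 0.98(0.10) = 0.117$
$P(D \mid T) = \frac{P(D \cap T)}{P(T)} = \frac{0.019}{0.117} \approx 0.162$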
Only a 16.2% chance of actually having the disease, despite the positive test. This is a classic result that highlights the importance of base rates.
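The same calculation can be written as a short function, which makes it easy to see how the answer changes with the base rate. This is a minimal sketch; the function name `posterior` and its parameter names are illustrative, not part of the syllabus.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) from the two branches of the tree diagram."""
    true_positive = prior * sensitivity                # has disease, tests positive
    false_positive = (1 - prior) * (1 - specificity)   # healthy, tests positive
    return true_positive / (true_positive + false_positive)


p_disease_given_positive = posterior(prior=0.02, sensitivity=0.95, specificity=0.90)
print(round(p_disease_given_positive, 3))
```

Calling `posterior(0.20, 0.95, 0.90)` instead shows that the same test is far more informative when the disease is more common.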
3.4 Tree Diagrams and Two-Way Tables
Tree diagrams are useful for sequential events. Multiply along branches, add between branches.
Two-way tables organize data for two categorical variables.
Which tool to use? If the events are sequential (first this, then that), use a tree diagram. If you have data classified by two categories, use a two-way table. On the exam, drawing the correct diagram usually earns a method mark even if the final answer is wrong.
Section 4: Probability Distributions
4.1 Discrete Random Variables
A discrete random variable takes specific values with known probabilities. The probabilities must sum to 1: $\sum P(X = x) = 1$.
Expected value (mean): $E(X) = \sum x \, P(X = x)$
4.2 Binomial Distribution
Use when:
- Fixed number of independent trials ($n$)
- Two outcomes only (success/failure)
- Constant probability of success ($p$)
$X \sim B(n, p)$, $E(X) = np$
Binomial distribution — quality control
In a factory, 8% of items are defective. A sample of 20 items is tested. Find (a) the probability that exactly 2 are defective, (b) the probability that at most 1 is defective, (c) the expected number of defective items.
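(a) Using GDC: $P(X = 2) = \text{binompdf}(20, 0.08, 2) \approx 0.271$
(b) $P(X \leq 1) = P(X = 0) + P(X = 1)$
Using GDC binomcdf: $\text{binomcdf}(20, 0.08, 1) \approx 0.517$
(c) $E(X) = np = 20 \times 0.08 = 1.6$ defective items.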
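For readers who want to see what the GDC's binomial functions compute, here is a minimal sketch built directly from the binomial formula. The helper names `binom_pmf` and `binom_cdf` are illustrative stand-ins for the calculator functions, not their actual implementation.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summing the individual probabilities."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 20, 0.08
p_exactly_2 = binom_pmf(2, n, p)   # part (a)
p_at_most_1 = binom_cdf(1, n, p)   # part (b)
expected_defective = n * p         # part (c)

print(round(p_exactly_2, 3), round(p_at_most_1, 3), expected_defective)
```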
4.3 Normal Distribution
The normal distribution is a continuous, bell-shaped, symmetric distribution. It is fully described by its mean $\mu$ and standard deviation $\sigma$.
Key properties:
- Symmetric about the mean
- 68% of data within 1$\sigma$ of the mean
- 95% within 2$\sigma$
- 99.7% within 3$\sigma$
The 68-95-99.7 rule
| Range | Percentage |
|---|---|
| $\mu \pm \sigma$ | 68% |
| $\mu \pm 2\sigma$ | 95% |
| $\mu \pm 3\sigma$ | 99.7% |
4.4 Calculating Normal Probabilities with GDC
Forward problem (given $x$, find probability):
- TI-84: `normalcdf(lower, upper, mean, sd)`
- Casio: `P(lower < X < upper)` in DIST menu
Inverse problem (given probability, find $x$):
- TI-84: `invNorm(area, mean, sd)`
- Casio: `InvN` in DIST menu
Normal distribution — exam scores
Exam scores are normally distributed with mean 65 and standard deviation 12. Find (a) the probability a student scores above 80, (b) the score that 90% of students exceed.
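(a) $P(X > 80) = \text{normalcdf}(80, 10^{99}, 65, 12) \approx 0.106$
About 10.6% of students score above 80.
(b) We need $x$ such that $P(X > x) = 0.90$, i.e., $P(X < x) = 0.10$.
$x = \text{invNorm}(0.10, 65, 12) \approx 49.6$
90% of students score above 49.6.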
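The forward and inverse calculations for this example can also be reproduced with Python's standard library, which may help when checking GDC output. This is a sketch, with `NormalDist.cdf` playing the role of the forward function and `NormalDist.inv_cdf` the inverse; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=65, sigma=12)

# Forward problem: area to the right of 80.
p_above_80 = 1 - scores.cdf(80)

# Inverse problem: the score with 10% of the area to its left
# (equivalently, the score that 90% of students exceed).
score_exceeded_by_90_percent = scores.inv_cdf(0.10)

print(round(p_above_80, 3), round(score_exceeded_by_90_percent, 1))
```

Note that the inverse function takes the area to the left, which is why 0.10 is passed rather than 0.90.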
Always sketch the normal curve. Draw the bell curve, mark the mean, shade the area you are finding. This helps you check whether your answer is reasonable (e.g., a probability above 0.5 for a value below the mean).
Section 5: Chi-Squared Test for Independence
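The $\chi^2$ test determines whether two categorical variables are independent (unrelated) or associated.
5.1 Setting Up the Test
1. State hypotheses:
   - $H_0$: The variables are independent
   - $H_1$: The variables are not independent
2. Create observed frequency table from data
3. Calculate expected frequencies: $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
4. Calculate the test statistic: $\chi^2_{\text{calc}} = \sum \frac{(O - E)^2}{E}$
5. Find degrees of freedom: $\text{df} = (\text{rows} - 1)(\text{columns} - 1)$
6. Compare $\chi^2_{\text{calc}}$ with the critical value at the given significance level, or compare the $p$-value with $\alpha$.
7. Conclude in context.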
GDC does it all: Enter the observed frequencies as a matrix. Run the $\chi^2$ test. The GDC gives you the test statistic, $p$-value, degrees of freedom, and expected frequencies. You just need to set up hypotheses, run the test, and write the conclusion.
Chi-squared test — favourite subject by gender
A survey asked 200 students their favourite subject. The results:
| | Science | Arts | Sport | Total |
|---|---|---|---|---|
| Male | 45 | 25 | 30 | 100 |
| Female | 35 | 40 | 25 | 100 |
| Total | 80 | 65 | 55 | 200 |
Test at the 5% significance level whether favourite subject is independent of gender.
$H_0$: Favourite subject is independent of gender.
$H_1$: Favourite subject is not independent of gender.
Expected frequencies: For Male-Science: $\frac{100 \times 80}{200} = 40$.
Full expected table:
| | Science | Arts | Sport |
|---|---|---|---|
| Male | 40 | 32.5 | 27.5 |
| Female | 40 | 32.5 | 27.5 |
$\chi^2_{\text{calc}} \approx 5.17$, $\text{df} = (2 - 1)(3 - 1) = 2$. Critical value at 5%: $5.991$.
Since $5.17 < 5.991$, we do not reject $H_0$.
There is insufficient evidence at the 5% level to conclude that favourite subject depends on gender.
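The steps of the test can be sketched in code for this table. The function names below are hypothetical, and the $p$-value helper uses a closed-form expression that is valid only when the degrees of freedom are even (as they are here); it is a stand-in for the GDC's built-in test, not a general method.

```python
from math import exp, factorial


def chi_squared(observed):
    """Return (test statistic, degrees of freedom, expected table)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected frequency = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def p_value_even_df(stat, df):
    """Upper-tail chi-squared probability; closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("this shortcut only works for even degrees of freedom")
    half = stat / 2
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))


observed = [[45, 25, 30],   # Male: Science, Arts, Sport
            [35, 40, 25]]   # Female
stat, df, expected = chi_squared(observed)
p_value = p_value_even_df(stat, df)

print(round(stat, 2), df, round(p_value, 3))
print(expected)
```

Comparing `p_value` with 0.05 leads to the same decision as comparing the statistic with the critical value.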
Writing the conclusion: Always state the conclusion in the context of the question, not in statistical jargon. Say “There is insufficient evidence that favourite subject depends on gender” rather than “We fail to reject the null hypothesis.” Also state whether you are comparing with a critical value or using the $p$-value.
5.2 Conditions for the Chi-Squared Test
- Data must be frequencies (counts), not percentages or proportions
- Expected frequencies should all be at least 5 (the IB may ask you to check this)
- Observations must be independent
Section 6: Correlation and Regression
6.1 Scatter Plots
A scatter plot shows the relationship between two quantitative variables. Describe the relationship using:
- Direction: positive (both increase), negative (one increases as other decreases)
- Strength: strong, moderate, weak
- Form: linear, non-linear, no correlation
6.2 Pearson’s Correlation Coefficient ($r$)
$r$ measures the strength and direction of a linear relationship.
| Value of $r$ | Interpretation |
|---|---|
| $r = 1$ | Perfect positive linear |
| $r$ close to 1 | Strong positive |
| $r$ moderately above 0 | Moderate positive |
| $r$ slightly above 0 | Weak positive |
| $r = 0$ | No linear correlation |
| Negative values | Same interpretation, negative direction |
$r$ only measures linear correlation. Two variables can have a strong non-linear relationship (e.g., quadratic) but $r \approx 0$. Always check the scatter plot.
6.3 Linear Regression
The least squares regression line $y = ax + b$ minimizes the sum of squared residuals.
- $a = \frac{S_{xy}}{S_{xx}}$ (gradient; given in the formula booklet)
- $b = \bar{y} - a\bar{x}$ (the line passes through $(\bar{x}, \bar{y})$)
Use the regression line for interpolation (predicting within the data range). Be cautious with extrapolation.
Regression — advertising and sales
A company records weekly advertising spend ($x$, in thousands of dollars) and sales ($y$, in thousands of dollars):
| $x$ | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| $y$ | 15 | 22 | 28 | 35 | 40 |
Using GDC linear regression: $y = 3.15x + 9.1$, $r \approx 0.999$.
Interpretation: For each additional \$1000 spent on advertising, sales increase by approximately \$3150. The very high $r$ value indicates a strong positive linear relationship.
Predict sales for a spend inside the data range, for example $x = 7$: $y = 3.15(7) + 9.1 = 31.15$ thousand dollars. This is interpolation (within the data range of 2 to 10) and is reliable.
Predict sales for a spend far outside the data range, for example $x = 50$: $y = 3.15(50) + 9.1 = 166.6$ thousand dollars. This is extrapolation (far beyond the data) and is unreliable, because the linear trend may not continue.
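The regression line and correlation coefficient for this table can be computed from the definitions, which is a useful check on GDC output. This is a minimal sketch; `linear_fit` is an illustrative name, and it returns the gradient, intercept, and Pearson's $r$ for the line $y = ax + b$.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Least squares fit y = a*x + b; returns (a, b, r)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx               # gradient
    b = y_bar - a * x_bar         # intercept: line passes through (x_bar, y_bar)
    r = s_xy / sqrt(s_xx * s_yy)  # Pearson's correlation coefficient
    return a, b, r


spend = [2, 4, 6, 8, 10]       # advertising, thousands of dollars
sales = [15, 22, 28, 35, 40]   # sales, thousands of dollars

a, b, r = linear_fit(spend, sales)
print(round(a, 2), round(b, 1), round(r, 3))
```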
6.4 Coefficient of Determination ($r^2$)
$r^2$ represents the proportion of variation in $y$ explained by the linear relationship with $x$.
For example, if $r = 0.9$, then $r^2 = 0.81$, meaning 81% of the variation in $y$ is explained by $x$.
Section 7: Practice Questions
Paper 1 Style (Short Answer)
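Q1. A dataset has lower quartile $Q_1$ and upper quartile $Q_3$. (a) Find the IQR. (b) Determine the outlier boundaries.
(a) $\text{IQR} = Q_3 - Q_1$.
(b) Lower: $Q_1 - 1.5 \times \text{IQR}$. Upper: $Q_3 + 1.5 \times \text{IQR} = 68$.
Any value below the lower boundary or above 68 is an outlier.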
Q2. Two fair dice are rolled. Find the probability that the sum is at least 10.
Outcomes with sum $\geq 10$: (4,6), (5,5), (5,6), (6,4), (6,5), (6,6) = 6 outcomes.
Total outcomes: $6 \times 6 = 36$.
$P(\text{sum} \geq 10) = \frac{6}{36} = \frac{1}{6} \approx 0.167$ (3 s.f.)
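Listing outcomes by hand is error-prone, so a brute-force enumeration of the sample space is a handy check. The sketch below counts every ordered pair of dice rolls; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes for two fair six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

favourable = [roll for roll in rolls if sum(roll) >= 10]
probability = Fraction(len(favourable), len(rolls))

print(favourable)
print(probability, round(float(probability), 3))
```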
Q3. Heights of students are normally distributed with mean 168 cm and standard deviation 7 cm. Find the probability that a randomly selected student is between 160 cm and 175 cm tall.
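$P(160 < X < 175) = \text{normalcdf}(160, 175, 168, 7) \approx 0.715$
Approximately 71.5% of students. (The interval runs from about 1.14 standard deviations below the mean to 1 standard deviation above it, so the answer is slightly more than the 68% given by the 68-95-99.7 rule.)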
Paper 2 Style (Extended Response)
Q4. A researcher records the hours of study ($x$) and exam score ($y$) for 8 students. The GDC output includes Pearson's correlation coefficient $r$ and a regression line of $y$ on $x$ with gradient 4.8. (a) Describe the correlation. (b) Interpret the gradient. (c) Predict the score for a student who studies 7 hours. (d) A student claims this proves studying causes higher scores. Comment.
(a) There is a strong positive linear correlation between hours of study and exam score ($r$ is close to 1).
(b) The gradient of 4.8 means that for each additional hour of study, the exam score increases by approximately 4.8 marks.
(c) Substitute $x = 7$ into the regression line to obtain the predicted score.
This is interpolation (7 is within the data range of approximately 1-10 hours), so the prediction is reasonably reliable.
(d) Correlation does not imply causation. Other factors may contribute to both study hours and exam scores (e.g., motivation, prior knowledge). A randomized controlled experiment would be needed to establish causation.
Q5. A company claims that 85% of its deliveries arrive on time. In a random sample of 25 deliveries, 18 arrived on time. (a) Using a binomial model, find the probability of 18 or fewer on-time deliveries if the claim is true. (b) Does this provide evidence against the company’s claim at the 5% significance level?
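(a) If the claim is true: $X \sim B(25, 0.85)$, so $P(X \leq 18) = \text{binomcdf}(25, 0.85, 18) \approx 0.0695$.
(b) Since $0.0695 > 0.05$, the result is not statistically significant at the 5% level. There is insufficient evidence to suggest the true on-time rate is less than 85%.
With a sample of only 25 the test has limited power, so this conclusion should be interpreted cautiously.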
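The tail probability in this question is sensitive to small arithmetic slips, so it is worth computing directly. The following is a minimal sketch that sums binomial probabilities from the formula; `binom_cdf` is an illustrative helper, not a GDC function.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, claimed_rate, observed_on_time = 25, 0.85, 18
alpha = 0.05

# Probability of 18 or fewer on-time deliveries if the claim is true.
p_value = binom_cdf(observed_on_time, n, claimed_rate)
significant = p_value < alpha

print(round(p_value, 4), significant)
```

The probability is a little under 0.07, so it lies above the 5% threshold but below 10%; the decision therefore depends on the significance level stated in the question.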
Q6. A survey of 150 employees classified by department and lunch preference gives the following data. Test at the 10% significance level whether lunch preference is independent of department.
| | Canteen | Packed | Eat out | Total |
|---|---|---|---|---|
| Engineering | 30 | 15 | 5 | 50 |
| Marketing | 20 | 10 | 20 | 50 |
| Admin | 25 | 10 | 15 | 50 |
| Total | 75 | 35 | 40 | 150 |
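$H_0$: Lunch preference is independent of department.
$H_1$: Lunch preference is not independent of department.
Expected frequencies: e.g., Engineering-Canteen: $\frac{50 \times 75}{150} = 25$.
Using GDC: $\chi^2 \approx 12.2$, $\text{df} = (3 - 1)(3 - 1) = 4$.
$p$-value $\approx 0.0161$.
Since $0.0161 < 0.10$, we reject $H_0$.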
There is sufficient evidence at the 10% significance level to conclude that lunch preference is not independent of department. Engineering staff are more likely to use the canteen, while Marketing staff are more likely to eat out.
Hypothesis testing checklist: (1) State $H_0$ and $H_1$ in context. (2) State the significance level. (3) Calculate the test statistic or $p$-value using GDC. (4) Compare with critical value or $\alpha$. (5) State conclusion in context. Missing any step loses marks.
May 2026 Prediction Questions
These are NOT official IB questions. These are trend-based practice questions written to reflect the topic areas and question styles most likely to appear on the May 2026 IB Math AI SL Paper 2. Based on recent exam patterns (2022–2025), expect heavy weighting on: normal distribution (finding probabilities and inverse values with GDC), chi-squared test of independence with a full contingency table, and linear regression including Pearson’s $r$ interpretation and prediction reliability.
Question 1 — Normal Distribution [~8 marks]
The masses of apples from an orchard are normally distributed with mean 182 g and standard deviation 24 g. An apple is selected at random.
(a) Find the probability that the apple has a mass greater than 200 g.
(b) Find the probability that the apple has a mass between 150 g and 210 g.
(c) The lightest 10% of apples are classified as “grade C” and sold at a discount. Find the maximum mass of a grade C apple.
(d) A crate holds 30 apples selected at random. Find the expected number of apples in the crate with mass greater than 200 g.
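Solution
Let $X \sim N(182, 24^2)$.
(a) $P(X > 200) = \text{normalcdf}(200, 10^{99}, 182, 24) \approx 0.227$
About 22.7% of apples have a mass greater than 200 g.
(b) $P(150 < X < 210) = \text{normalcdf}(150, 210, 182, 24) \approx 0.787$
About 78.7% of apples are between 150 g and 210 g.
(c) We need $m$ such that $P(X < m) = 0.10$.
$m = \text{invNorm}(0.10, 182, 24) \approx 151.2$ g (4 s.f.)
The maximum mass of a grade C apple is 151 g (3 s.f.).
(d) $30 \times 0.2266 \approx 6.80$ apples.
Expected number is approximately 6.80 (about 7 apples).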
Question 2 — Chi-Squared Test of Independence [~8 marks]
A sports club surveyed 180 of its members on their preferred activity, categorized by age group. The results are shown below.
| | Swimming | Tennis | Gym | Total |
|---|---|---|---|---|
| Under 30 | 22 | 18 | 40 | 80 |
| 30–50 | 28 | 24 | 18 | 70 |
| Over 50 | 20 | 8 | 2 | 30 |
| Total | 70 | 50 | 60 | 180 |
(a) State the null and alternative hypotheses for a chi-squared test.
(b) Calculate the expected frequency for the “Under 30 / Swimming” cell. Show your working.
(c) Use your GDC to find the chi-squared test statistic and the $p$-value.
(d) State the number of degrees of freedom.
(e) At the 5% significance level, determine whether preferred activity is independent of age group. State your conclusion in context.
Solution
(a)
$H_0$: Preferred activity is independent of age group.
$H_1$: Preferred activity is not independent of age group.
(b) Expected frequency for Under 30 / Swimming: $\frac{80 \times 70}{180} \approx 31.1$
(c) Enter the observed frequencies into the GDC as a matrix and run the $\chi^2$-Test.
GDC output: $\chi^2 \approx 24.8$ (3 s.f.), $p$-value $\approx 5.5 \times 10^{-5}$.
(d) $\text{df} = (3 - 1)(3 - 1) = 4$
(e) Since $p$-value $< 0.05$, we reject $H_0$.
There is sufficient evidence at the 5% significance level to conclude that preferred activity is not independent of age group. In particular, older members appear more likely to prefer swimming, while younger members show a stronger preference for the gym.
Question 3 — Linear Regression and Prediction [~7 marks]
A researcher collects data on the average daily temperature ($T$ degrees C) and the number of visitors ($V$) to an outdoor museum on 8 selected days.
| $T$ | 12 | 15 | 17 | 19 | 22 | 24 | 27 | 29 |
|---|---|---|---|---|---|---|---|---|
| $V$ | 210 | 265 | 290 | 330 | 410 | 450 | 500 | 545 |
(a) Use your GDC to find the equation of the regression line $V = aT + b$.
(b) Find Pearson’s correlation coefficient $r$ and describe the correlation.
(c) Use your model to predict the number of visitors on a day when the temperature is 20 degrees C.
(d) The researcher uses the model to predict visitor numbers on a day forecast to reach 38 degrees C. State whether this is interpolation or extrapolation, and comment on the reliability of this prediction.
Solution
(a) Enter the temperature values ($T$) in L1 and the visitor numbers ($V$) in L2. Run linear regression (LinReg).
GDC output: $V = 20.1T - 39.8$ (values may vary slightly by GDC model; accept equivalent forms to 3 s.f.)
(b) $r \approx 0.998$ (3 s.f.)
This indicates a very strong positive linear correlation between temperature and visitor numbers.
(c) $V = 20.1(20) - 39.8 \approx 362$ visitors.
Since $T = 20$ is within the data range (12 to 29), this is interpolation and the prediction is reliable.
(d) $T = 38$ is beyond the maximum data value of 29, so this is extrapolation. The prediction is unreliable because the linear trend may not continue at higher temperatures, and there may be physical limits on visitor capacity. The model should not be used for temperatures this far outside the data range.
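The interpolation and extrapolation distinction in parts (c) and (d) can be made explicit in code by attaching a range check to each prediction. This is a sketch under the assumption that "reliable" simply means "inside the observed temperature range"; the function `predict_visitors` and its labels are hypothetical.

```python
from math import sqrt

temps = [12, 15, 17, 19, 22, 24, 27, 29]              # degrees C
visitors = [210, 265, 290, 330, 410, 450, 500, 545]

n = len(temps)
t_bar = sum(temps) / n
v_bar = sum(visitors) / n
s_tt = sum((t - t_bar) ** 2 for t in temps)
s_vv = sum((v - v_bar) ** 2 for v in visitors)
s_tv = sum((t - t_bar) * (v - v_bar) for t, v in zip(temps, visitors))

gradient = s_tv / s_tt
intercept = v_bar - gradient * t_bar
r = s_tv / sqrt(s_tt * s_vv)


def predict_visitors(temperature):
    """Return (predicted visitors, 'interpolation' or 'extrapolation')."""
    inside = min(temps) <= temperature <= max(temps)
    kind = "interpolation" if inside else "extrapolation"
    return gradient * temperature + intercept, kind


print(round(gradient, 1), round(intercept, 1), round(r, 3))
print(predict_visitors(20))
print(predict_visitors(38))
```

The function still returns a number for 38 degrees C, which illustrates the point of part (d): the model will always produce a prediction, and it is the range check that signals when the prediction should not be trusted.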