
Probability & Statistics for Machine Learning

This tutorial introduces the probability and statistics fundamentals that are essential for understanding and applying machine learning. You'll learn to quantify uncertainty, work with distributions, and see how probabilistic thinking enables ML models to make predictions under uncertainty.

Estimated time: 70 minutes

Why This Matters

Problem statement: Machine learning models don't make perfectly accurate predictions—they work with noisy, incomplete data and must quantify uncertainty. Without probability theory, we cannot model randomness, evaluate model reliability, or understand why algorithms make certain predictions.

Practical benefits: Probability provides the mathematical framework for handling uncertainty. Understanding distributions helps you choose appropriate models, compute confidence intervals, detect outliers, and interpret what predictions actually mean. Statistics enables you to validate model improvements and make data-driven decisions backed by evidence.

Professional context: Every ML algorithm involves probability. Classification models output probability distributions over classes ("80% cat, 20% dog"), regression models have probabilistic error terms, and neural networks learn probability distributions. Understanding these foundations is essential for building and trusting ML systems.

Probability

Prerequisites & Learning Objectives

Required knowledge:

  • Basic algebra and arithmetic
  • Understanding of functions and graphs
  • Familiarity with summation notation
  • Basic Python (minimal needed)

Learning outcomes:

  • Understand what probability measures and how to compute it
  • Distinguish between independent and disjoint events
  • Apply addition and multiplication rules for probabilities
  • Work with probability distributions and density functions
  • Compute and interpret mean, variance, and standard deviation
  • Recognize common distributions and their ML applications

Core Concepts

What is Probability?

Probability is a number between 0 and 1 that measures the likelihood of an event occurring.

Properties:

  • \(0 \leq P(A) \leq 1\) for any event \(A\)
  • \[P(\text{certain event}) = 1\]
  • \[P(\text{impossible event}) = 0\]

Example: Rolling a fair die

  • \[P(\text{rolling a 6}) = \frac{1}{6} \approx 0.167\]
  • \[P(\text{rolling even number}) = \frac{3}{6} = 0.5\]
  • \(P(\text{rolling 7}) = 0\) (impossible)

In ML, probability models uncertainty. A classifier doesn't say "this IS a cat"—it says "I'm 85% confident this is a cat."

Random Variables

A random variable is a variable whose value is determined by chance.

Types:

  • Discrete: Countable values (coin flips, dice rolls, number of customers)
  • Continuous: Any value in a range (height, temperature, prediction error)

In ML, features are random variables (customer age varies from customer to customer), predictions carry uncertainty, and errors are random.

Basic Probability Notation

Key notation:

  • \(P(A)\): Probability event \(A\) occurs
  • \(P(A \cap B)\): Probability both \(A\) AND \(B\) occur
  • \(P(A \cup B)\): Probability \(A\) OR \(B\) occurs
  • \(P(A|B)\): Probability of \(A\) given \(B\) occurred (conditional)
  • \(P(A^c)\): Probability \(A\) does NOT occur (complement)

Complement rule:

\[ P(A^c) = 1 - P(A) \]

Example: Weather forecasting

  • \[P(\text{rain}) = 0.3\]
  • \[P(\text{no rain}) = 1 - 0.3 = 0.7\]

Independent vs Disjoint Events

These concepts are different and often confused.

Disjoint (Mutually Exclusive): Cannot both happen simultaneously

  • Rolling a 3 AND rolling a 5 on one die (impossible)
  • \(A \cap B = \emptyset\) (the intersection is empty)
  • If \(A\) happens, \(B\) definitely doesn't

Independent: One event doesn't affect the other's probability

  • Flipping two different coins
  • \[P(A \cap B) = P(A) \cdot P(B)\]
  • Knowing \(A\) occurred doesn't change \(P(B)\)


Critical distinction: Disjoint events are NEVER independent (unless probability is 0). If events can't occur together, they're dependent!

Naive Bayes (an ML model) assumes features are independent. Checking this assumption matters for model accuracy.

Addition Rule


Computes probability that at least one event occurs.

General addition rule: $$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$

Why subtract? Adding \(P(A) + P(B)\) double-counts the overlap.

Special case (disjoint): $$ P(A \cup B) = P(A) + P(B) $$

Example: Drawing a card

  • \[P(\text{Heart}) = \frac{13}{52} = 0.25\]
  • \[P(\text{Ace}) = \frac{4}{52} = 0.077\]
  • \[P(\text{Ace of Hearts}) = \frac{1}{52} = 0.019\]
  • \[P(\text{Heart OR Ace}) = 0.25 + 0.077 - 0.019 = 0.308\]
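To make this concrete, here is a minimal Python sketch (standard library only; the deck construction is purely illustrative) that enumerates a 52-card deck and checks the addition rule against direct counting:

```python
# Enumerate a standard 52-card deck and verify the addition rule.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

hearts = {card for card in deck if card[1] == "Hearts"}
aces = {card for card in deck if card[0] == "A"}

p_heart = len(hearts) / len(deck)          # 13/52 = 0.25
p_ace = len(aces) / len(deck)              # 4/52  ≈ 0.077
p_both = len(hearts & aces) / len(deck)    # 1/52  ≈ 0.019 (Ace of Hearts)
p_either = len(hearts | aces) / len(deck)  # 16/52 ≈ 0.308

# Addition rule: P(Heart or Ace) = P(Heart) + P(Ace) - P(Heart and Ace)
assert abs(p_either - (p_heart + p_ace - p_both)) < 1e-12
print(f"P(Heart or Ace) = {p_either:.3f}")  # 0.308
```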

Multiplication Rule

Computes probability that both events occur.

General multiplication rule: $$ P(A \cap B) = P(A) \cdot P(B|A) $$

Independent events: $$ P(A \cap B) = P(A) \cdot P(B) $$

Example: Drawing two cards without replacement

  • \[P(\text{first ace}) = \frac{4}{52}\]
  • \[P(\text{second ace | first ace}) = \frac{3}{51}\]
  • \[P(\text{both aces}) = \frac{4}{52} \times \frac{3}{51} = 0.0045\]
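The same numbers can be checked in Python, both directly via the multiplication rule and with a small simulation (standard library only; the simplified ace/non-ace deck is an illustrative assumption):

```python
import random

# Multiplication rule: P(both aces) = P(first ace) * P(second ace | first ace)
p_both_aces = (4 / 52) * (3 / 51)
print(f"Exact:     {p_both_aces:.4f}")  # ≈ 0.0045

# Monte Carlo check: draw two cards without replacement many times.
deck = ["A"] * 4 + ["x"] * 48           # 4 aces, 48 non-ace cards
trials = 200_000
hits = sum(random.sample(deck, 2) == ["A", "A"] for _ in range(trials))
print(f"Simulated: {hits / trials:.4f}")  # should land close to 0.0045
```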

ML connection: Computing joint probabilities in probabilistic models.

Conditional Probability and Bayes' Theorem

Conditional probability: Probability of \(A\) given \(B\) occurred

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

Bayes' Theorem (fundamental for ML): $$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$


Example: Medical testing

  • Disease prevalence: \(P(\text{disease}) = 0.01\) (1%)
  • Test sensitivity (true positive rate): \(P(\text{positive} \mid \text{disease}) = 0.95\)
  • False positive rate: \(P(\text{positive} \mid \text{no disease}) = 0.05\)

Even with a positive test, the actual probability of having the disease is only about 16% because the disease is rare!
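That counterintuitive 16% falls straight out of Bayes' theorem; here is a minimal sketch of the calculation (variable names are illustrative):

```python
# Bayes' theorem applied to the medical-test example above.
p_disease = 0.01              # prior: P(disease)
p_pos_given_disease = 0.95    # P(positive | disease)
p_pos_given_healthy = 0.05    # P(positive | no disease)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ≈ 0.161
```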

ML connection: Classification uses Bayes' theorem to compute \(P(\text{class} \mid \text{features})\) from \(P(\text{features} \mid \text{class})\).

Probability Distributions

A probability distribution describes how probability is allocated across possible values.

Types:

  • Discrete: Probability Mass Function (PMF) - specific values have probabilities
  • Continuous: Probability Density Function (PDF) - probabilities over ranges

All probabilities sum (discrete) or integrate (continuous) to 1.

Example: Fair die has uniform distribution

  • \[P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}\]

Probability Mass Function (PMF)

For discrete variables, the PMF gives \(P(X = x)\) for each value.

Properties:

  • \(p_X(x) \geq 0\) for all \(x\)
  • \(\sum_x p_X(x) = 1\)

Example: Number of heads in 3 coin flips

| Heads (x) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| P(X = x) | 1/8 | 3/8 | 3/8 | 1/8 |
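You can reproduce this table by enumerating all eight equally likely outcomes (a standard-library sketch):

```python
from itertools import product

# All 2^3 equally likely outcomes of three coin flips.
outcomes = list(product("HT", repeat=3))

# PMF: probability of each possible number of heads.
pmf = {k: sum(o.count("H") == k for o in outcomes) / len(outcomes) for k in range(4)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

assert abs(sum(pmf.values()) - 1.0) < 1e-12  # a valid PMF sums to 1
```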

Classification outputs are PMFs over classes.

Probability Density Function (PDF)

For continuous variables, the PDF \(f(x)\) describes relative likelihood.

Key difference: For continuous variables, \(P(X = x) = 0\) for any specific \(x\). We compute probabilities over intervals:

\[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]

Properties:

  • \[f(x) \geq 0\]
  • \[\int_{-\infty}^{\infty} f(x) \, dx = 1\]
  • Height \(f(x)\) is NOT a probability (it can exceed 1)

Regression assumes errors follow a PDF (usually Gaussian/normal distribution).

Cumulative Distribution Function (CDF)

The CDF gives probability that a random variable is less than or equal to a value:

\[ F(x) = P(X \leq x) \]

Properties:

  • \[0 \leq F(x) \leq 1\]
  • \(F(x)\) is non-decreasing
  • \(F(-\infty) = 0\), \(F(\infty) = 1\)

Use: Compute probability over range $$ P(a < X \leq b) = F(b) - F(a) $$

Example: For standard normal distribution

  • \(F(0) = 0.5\) (50% of values below 0)
  • \(F(1) = 0.841\) (84.1% of values below 1)
  • \(P(-1 \leq X \leq 1) = F(1) - F(-1) = 0.683\) (68.3%)
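If SciPy is available, these values come straight from the standard normal CDF:

```python
from scipy.stats import norm

# CDF of the standard normal distribution (mean 0, std dev 1)
print(norm.cdf(0))                 # 0.5
print(norm.cdf(1))                 # ≈ 0.841
print(norm.cdf(1) - norm.cdf(-1))  # ≈ 0.683 -> P(-1 <= X <= 1)
```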

Percentiles

The p-th percentile is the value below which p% of observations fall.

Common percentiles:

  • 25th (Q1), 50th (median), 75th (Q3)
  • Interquartile range: \(IQR = Q3 - Q1\)


Outlier detection: Values outside \([Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]\)

ML connection: "This model's 90th percentile error is 5 units" means 90% of predictions have error ≤ 5.

Mean, Variance, and Standard Deviation

Let's use a simple dataset to illustrate: Test scores: 70, 75, 80, 85, 90

Mean (average): $$ \mu = \frac{70 + 75 + 80 + 85 + 90}{5} = \frac{400}{5} = 80 $$

The mean is the "center" or typical value.

Variance (spread): $$ \sigma^2 = \frac{(70-80)^2 + (75-80)^2 + (80-80)^2 + (85-80)^2 + (90-80)^2}{5} $$ $$ = \frac{100 + 25 + 0 + 25 + 100}{5} = \frac{250}{5} = 50 $$

Variance measures average squared distance from mean.

Standard deviation: $$ \sigma = \sqrt{50} \approx 7.07 $$

Standard deviation is typical distance from mean, in original units.

Interpretation: Most scores are within ±7 points of the average (80).
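The same numbers drop out of a few lines of NumPy (assuming NumPy is installed; note that np.var and np.std divide by n by default, matching the population formulas used above):

```python
import numpy as np

scores = np.array([70, 75, 80, 85, 90])

print(scores.mean())  # 80.0
print(scores.var())   # 50.0   (divides by n; pass ddof=1 for the sample variance)
print(scores.std())   # ≈ 7.07
```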

Properties:

  • Higher variance/std dev = more spread out data
  • Zero variance = all values identical
  • Units: variance is squared, std dev matches data units

ML connection:

  • Mean Squared Error (MSE) is the average squared prediction error; when the errors are centered at zero it equals their variance
  • Standard deviation tells you the typical prediction error magnitude
  • Models are trained to keep prediction errors small and tightly clustered around zero

Common Probability Distributions

Uniform Distribution

Intuition: All values equally likely in a range.


Use cases:

  • Random initialization of neural network weights
  • Generating test data
  • Modeling "no prior knowledge" scenarios

Example: Roll a fair die - each number has probability 1/6

ML application: Dropout in neural networks switches individual neurons off at random, each with the same probability.

Normal (Gaussian) Distribution

Intuition: Bell-shaped curve, most values near the mean, symmetric.


Use cases:

  • Natural measurements (height, IQ)
  • Modeling errors and noise
  • Central Limit Theorem applications

Parameters: Mean \(\mu\), variance \(\sigma^2\)

68-95-99.7 rule:

  • 68% of data within 1 standard deviation of mean
  • 95% within 2 standard deviations
  • 99.7% within 3 standard deviations
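One way to convince yourself of this rule is to sample from a standard normal and count (a NumPy sketch, assuming NumPy is installed):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # mean 0, std dev 1

for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} std dev: {frac:.3f}")  # ≈ 0.683, 0.954, 0.997
```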

ML applications:

  • Linear regression assumes normal error distribution
  • Gaussian Naive Bayes classifier
  • Gaussian Processes
  • Many algorithms assume normality for mathematical convenience

Bernoulli Distribution

Intuition: Single yes/no trial (success/failure).


Use cases:

  • Coin flip (heads/tails)
  • Binary classification (cat/not cat)
  • Click/no click on ad

Parameters: \(p\) = probability of success

ML connection: Logistic regression outputs Bernoulli probabilities for binary classification.

Binomial Distribution

Intuition: Number of successes in \(n\) independent yes/no trials.


Use cases:

  • Number of heads in 10 coin flips
  • Number of customers who buy product
  • Number of correct answers on multiple-choice test

Parameters: \(n\) (trials), \(p\) (success probability)

Mean: \(np\), Variance: \(np(1-p)\)

Example: Flip coin 10 times, expect 5 heads on average if fair.
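With SciPy installed, the mean, variance, and PMF of this example can be checked directly:

```python
from scipy.stats import binom

n, p = 10, 0.5             # 10 fair coin flips

print(binom.mean(n, p))    # 5.0 -> np
print(binom.var(n, p))     # 2.5 -> np(1 - p)
print(binom.pmf(5, n, p))  # ≈ 0.246, probability of exactly 5 heads
```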

Poisson Distribution

Intuition: Number of events occurring in fixed time/space interval when events happen independently at constant average rate.


Use cases:

  • Number of emails received per hour
  • Customer arrivals at store per day
  • Website visits per minute
  • Rare events

Parameters: \(\lambda\) (average rate)

ML connection: Poisson regression for count data (predicting number of occurrences).

Exponential Distribution

Intuition: Time between events in a Poisson process.


Use cases:

  • Time until next customer arrives
  • Lifetime of electronic component
  • Time between failures

Parameters: \(\lambda\) (rate)

Relationship: If events follow Poisson, time between events follows exponential.
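A quick NumPy sketch of that relationship (assuming NumPy is installed; the rate of 3 events per unit time is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0  # average rate: 3 events per unit of time

event_counts = rng.poisson(lam=lam, size=100_000)          # events per interval (Poisson)
wait_times = rng.exponential(scale=1 / lam, size=100_000)  # time between events (exponential)

print("mean events per interval:", event_counts.mean())  # ≈ 3.0   (= lambda)
print("mean wait between events:", wait_times.mean())    # ≈ 0.333 (= 1/lambda)
```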

Quick Reference

| Concept | Definition | ML Application |
|---|---|---|
| Probability | Likelihood measure, 0 to 1 | Quantifying uncertainty |
| Independent | \(P(A \cap B) = P(A)P(B)\) | Naive Bayes assumption |
| Addition rule | \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) | Multiple outcomes |
| Conditional | \(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\) | Context-dependent prediction |
| Bayes' theorem | \(P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}\) | Classification |
| Mean | Center of distribution | Expected value |
| Variance | Spread measure | Error magnitude |
| Normal distribution | Bell curve | Error modeling |
| PMF | Discrete probabilities | Classification outputs |
| PDF | Continuous density | Regression errors |
| CDF | Cumulative probability | Percentiles, p-values |

Summary & Next Steps

Key accomplishments: You've learned how probability quantifies uncertainty, mastered probability rules, understood distributions as models of randomness, computed statistical measures, and connected concepts to ML applications.

Best practices:

  • Check distribution assumptions before applying methods
  • Visualize data distributions to verify model fit
  • Consider base rates (priors) in classification
  • Use appropriate metrics for your distribution type

Connections to ML:

  • Classification outputs probability distributions (softmax)
  • Regression assumes error distributions (usually normal)
  • Uncertainty quantification uses probability
  • Bayesian ML treats parameters as random variables

External resources:

  • Khan Academy: Probability & Statistics - interactive lessons
  • 3Blue1Brown: Bayes' Theorem - visual explanation
  • Seeing Theory - visual introduction to probability