Data Preprocessing: Preparing Data for Analysis
This tutorial covers the systematic process of preparing raw data for machine learning. You'll learn to identify data quality issues, apply appropriate corrections, and structure datasets so models can learn effectively from clean, consistent input.
Estimated time: 50 minutes
Why This Matters
Problem statement:
Raw data is never ready for models.
Real-world data arrives messy; missing values scatter throughout columns, categories use inconsistent spellings, numeric ranges vary wildly, and formats contradict each other.
Models require clean, numeric matrices with no gaps or errors.
The work between receiving data and training a model is data preprocessing, and it determines whether your model succeeds or fails.
Practical benefits: Preprocessing skills let you spot problems early, fix them systematically, and deliver data that models can actually use.
Professional context: Data scientists report spending 60-80% of their time on data preparation. Companies with rigorous preprocessing workflows build reliable systems; those that skip preprocessing steps face model failures in production.
Preprocessing isn't optional; it's the foundation every successful data project is built on.

Core Concepts
Understanding the Terminology
Data work involves several related but distinct activities. Understanding what each term means helps you communicate clearly with colleagues and organize your work.
Data Engineering: Building infrastructure to collect, move, store, and version data at scale. Data engineers create pipelines that deliver data from databases, APIs, and files to where analysts need it. This tutorial assumes data is already accessible.
Data Preprocessing: The systematic process of cleaning errors, handling missing values, standardizing formats, and converting data into numeric forms that models require. This is the focus of this tutorial.
Feature Engineering: Creating new variables from existing data to improve model predictions. This is a subset of preprocessing focused specifically on building better inputs.
Data Cleansing: Fixing quality problems: correcting errors, removing duplicates, standardizing inconsistent formats, and handling missing values.
Data Transformation: Converting data types and scaling numeric values so all features contribute appropriately to model training.
Data Munging/Wrangling: Reshaping data structures (pivoting, melting, and merging tables) to get data into the right format.
Exploratory Data Analysis (EDA): Examining data visually and statistically to understand distributions, spot outliers, identify patterns, and determine which preprocessing steps are needed.

The Preprocessing Workflow
Preprocessing follows a systematic sequence, though you'll often loop back to earlier steps as you discover new issues.
1. Explore → 2. Clean → 3. Transform → 4. Validate → 5. Split → 6. Apply
Each phase has specific goals and standard techniques. Skip a phase, and you'll likely face problems during training or deployment.
Step-by-Step Guide
Phase 1: Exploratory Analysis Guides Preprocessing
Discover what preprocessing your data needs before making changes.
Start by examining data visually and numerically to identify issues. Each pattern you find suggests specific preprocessing steps.
Quick EDA workflow:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('raw_data.csv')
# Structure check
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
# Missing values
print(f"\nMissing values:\n{df.isnull().sum()}")
# Numeric distributions
print(f"\nNumeric summary:\n{df.describe()}")
# Categorical distributions
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col} value counts:\n{df[col].value_counts()}")
Visual checks:
# Distribution of numeric features
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
# Box plots reveal outliers
for col in df.select_dtypes(include='number').columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f'{col} Distribution')
    plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlations')
plt.show()
Pattern → Preprocessing action mapping:
| What EDA Shows | Preprocessing Needed |
|---|---|
| Skewed distribution (histogram) | Log or power transformation |
| Extreme outliers (box plot) | Capping, removal, or investigation |
| Missing values >5% | Imputation strategy needed |
| Inconsistent categories | Standardization required |
| Wide numeric ranges | Scaling required |
| High correlation (>0.9) | Consider removing redundant features |
For comprehensive EDA techniques, see the Visualization tutorial.
Phase 2: Data Cleansing
Purpose: Fix quality problems that cause model errors or introduce bias.
2.1 Handling Missing Values
Missing data is common in real datasets. Your strategy depends on how much is missing and why.
Detection:
# Missing value summary
missing_summary = pd.DataFrame({
    'column': df.columns,
    'missing_count': df.isnull().sum(),
    'missing_pct': (df.isnull().sum() / len(df)) * 100
})
missing_summary = missing_summary[missing_summary['missing_count'] > 0]
print(missing_summary.sort_values('missing_pct', ascending=False))
Decision framework:
- Less than 5% missing: Usually safe to drop those rows
- 5-30% missing: Impute with statistical measures
- More than 30% missing: Consider dropping the column or using advanced imputation methods (see the sketch after the strategies below)
- Missing has pattern: Investigate why values are missing
Strategies:
# Strategy 1: Drop rows (only if missing is minimal)
df_clean = df.dropna(subset=['critical_column'])
# Strategy 2: Fill numeric with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())
# Strategy 3: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Strategy 4: Fill based on groups
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.median())
)
# Strategy 5: Create missing indicator
df['age_was_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
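For the "more than 30% missing" case in the decision framework above, model-based imputation is one of the advanced options. A minimal sketch using scikit-learn's KNNImputer; the column names are assumptions, and as with scalers, you'd fit it on training data only in a real pipeline:
# Strategy 6 (sketch): KNN imputation for heavier missingness
# Assumes 'age', 'income', and 'tenure' are numeric columns in df
from sklearn.impute import KNNImputer
numeric_cols = ['age', 'income', 'tenure']
knn_imputer = KNNImputer(n_neighbors=5)  # impute from the 5 most similar rows
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])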
For pandas mechanics, see the Pandas tutorial - Handling Missing Data.
2.2 Removing Duplicates
Detection:
# Exact duplicates
duplicates = df.duplicated()
print(f"Exact duplicates: {duplicates.sum()}")
# Duplicates on specific columns
duplicates_subset = df.duplicated(subset=['customer_id', 'date'])
print(f"Duplicate transactions: {duplicates_subset.sum()}")
# View duplicates
df[df.duplicated(keep=False)].sort_values('customer_id')
Removal:
# Remove exact duplicates, keep first occurrence
df_clean = df.drop_duplicates()
# Remove based on subset of columns
df_clean = df.drop_duplicates(subset=['customer_id', 'date'], keep='first')
When duplicates are valid: Sometimes duplicate rows represent real repeated events. Check with domain experts before removing.
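One way to check before dropping anything: count how often the suspected key repeats and inspect the repeated rows directly. A minimal sketch, assuming 'customer_id' and 'date' together identify a transaction:
# Count how many (customer_id, date) pairs appear more than once
repeat_counts = df.groupby(['customer_id', 'date']).size()
print(f"Keys appearing more than once: {(repeat_counts > 1).sum()}")
# Inspect the repeated rows to judge whether they are real repeated events
repeated_rows = df[df.duplicated(subset=['customer_id', 'date'], keep=False)]
print(repeated_rows.sort_values(['customer_id', 'date']).head(10))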
2.3 Fixing Inconsistencies
Standardizing text:
# Lowercase and strip whitespace
df['city'] = df['city'].str.lower().str.strip()
# Fix common typos
df['city'] = df['city'].replace({
    'new york': 'new_york',
    'ny': 'new_york',
    'newyork': 'new_york'
})
# Standardize date formats
df['date'] = pd.to_datetime(df['date'], errors='coerce')
Resolving contradictions:
# Example: End date before start date
invalid_dates = df[df['end_date'] < df['start_date']]
print(f"Invalid date ranges: {len(invalid_dates)}")
# Fix by investigation or removal
df = df[df['end_date'] >= df['start_date']]
2.4 Handling Outliers
Detection methods:
# Method 1: Z-score (assumes normal distribution)
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df['price']))
outliers_z = df[z_scores > 3]
# Method 2: IQR (more robust)
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['price'] < Q1 - 1.5*IQR) | (df['price'] > Q3 + 1.5*IQR)]
print(f"Outliers found: {len(outliers_iqr)}")
Treatment options:
# Option 1: Remove outliers
df_no_outliers = df[(df['price'] >= Q1 - 1.5*IQR) & (df['price'] <= Q3 + 1.5*IQR)]
# Option 2: Cap outliers
df['price_capped'] = df['price'].clip(lower=Q1 - 1.5*IQR, upper=Q3 + 1.5*IQR)
# Option 3: Keep outliers but flag them
df['is_outlier'] = ((df['price'] < Q1 - 1.5*IQR) | (df['price'] > Q3 + 1.5*IQR)).astype(int)
When to keep outliers: Domain knowledge is critical. A $10M house sale might be valid in certain markets. Always investigate before removing.
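A quick way to investigate is to look at the most extreme rows in full, alongside their other columns, so you or a domain expert can judge whether they are errors or legitimate values. A minimal sketch, reusing the Q1/IQR values computed above for the 'price' column:
# Inspect the most extreme rows before deciding on removal or capping
print(df.nlargest(10, 'price'))
print(df.nsmallest(10, 'price'))
# Compare flagged outliers against the rest on other features
outlier_mask = (df['price'] < Q1 - 1.5*IQR) | (df['price'] > Q3 + 1.5*IQR)
print(df[outlier_mask].describe())
print(df[~outlier_mask].describe())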
Phase 3: Data Transformation
Convert clean data into numeric formats with appropriate scales.
3.1 Scaling Numeric Features
Different features often have vastly different ranges. Scaling puts them on comparable scales.
Why scaling matters: Features with large values (e.g., income: $20,000-$150,000) can dominate features with small values (e.g., number of children: 0-5) during model training. Scaling prevents this.
Scaling techniques:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Min-Max Scaling: Scales values to the [0, 1] range by default
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Standardization: Mean=0, Std=1
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Robust Scaling: Uses median and IQR (good with outliers)
scaler = RobustScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
Before and after comparison:
# Visualize scaling effect
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Before scaling
axes[0].hist(df_original['income'], bins=30)
axes[0].set_title('Before Scaling')
axes[0].set_xlabel('Income')
# After scaling
axes[1].hist(df['income'], bins=30)
axes[1].set_title('After Scaling')
axes[1].set_xlabel('Scaled Income')
plt.tight_layout()
plt.show()
Critical warning: Fit scalers only on training data, then apply to validation and test sets. Fitting on all data causes data leakage.
# CORRECT approach
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on training only
X_test_scaled = scaler.transform(X_test) # Apply to test
# WRONG approach
# scaler.fit_transform(X) # Don't fit on entire dataset!
3.2 Encoding Categorical Variables
Models require numeric input, but many features are categorical. Encoding converts categories to numbers.
One-Hot Encoding (most common):
Creates binary columns for each category.
# Using pandas
df_encoded = pd.get_dummies(df, columns=['city', 'color'], drop_first=True)
# Using sklearn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(df[['city', 'color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
Example:
| Color | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|
| Red | 0 | 0 | 1 |
| Blue | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
Warning: One-hot encoding creates many columns with high-cardinality features (e.g., 50 unique cities → 50 new columns). This causes the "curse of dimensionality."
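One common workaround for high-cardinality features, listed in the Quick Reference below as "group rare categories", is to collapse infrequent values into a single bucket before encoding. A minimal sketch, assuming 'city' is the high-cardinality column and treating the 1% threshold as an arbitrary choice:
# Group categories covering less than 1% of rows into 'other'
freq = df['city'].value_counts(normalize=True)
rare_categories = freq[freq < 0.01].index
df['city_grouped'] = df['city'].where(~df['city'].isin(rare_categories), 'other')
# One-hot encoding the grouped column now produces far fewer new columns
df_encoded = pd.get_dummies(df, columns=['city_grouped'], drop_first=True)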
Label Encoding:
Assigns integers to categories. Use only for ordinal data (categories with natural order).
# Ordinal example: education level has order
from sklearn.preprocessing import LabelEncoder
# Define order
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['education_encoded'] = df['education'].map(education_order)
# LabelEncoder assigns arbitrary integer codes; avoid it for nominal features like city (see the danger note below)
encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])
Danger: Using label encoding on nominal categories (no natural order) misleads models into thinking higher numbers mean "more" of something.
3.3 Feature Engineering Basics
Creating new features from existing data often improves model performance more than choosing better models.
Common transformations:
# Ratios and rates
df['price_per_sqft'] = df['price'] / df['square_feet']
df['debt_to_income'] = df['debt'] / df['income']
# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 120],
                         labels=['<18', '18-35', '36-50', '51-65', '65+'])
# Date/time features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Interaction features
df['price_x_quality'] = df['price'] * df['quality_score']
# Polynomial features
df['age_squared'] = df['age'] ** 2
For more on creating columns, see the Pandas tutorial - Creating New Columns.
Phase 4: Avoiding Bias in Preprocessing
Recognize and prevent bias introduced during data preparation.
Preprocessing choices can inadvertently introduce bias that degrades model performance or creates unfair outcomes. Three main sources require attention:
4.1 Data Leakage
Problem: Using information that won't be available when making predictions.
Common causes:
# WRONG: Scaling before splitting
scaler.fit(X) # Fit on entire dataset
X_train, X_test = train_test_split(X, test_size=0.2)
# CORRECT: Split first, then scale
X_train, X_test = train_test_split(X, test_size=0.2)
scaler.fit(X_train) # Fit only on training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
Target leakage example:
# WRONG: Creating feature that contains target information
df['has_churned'] = (df['last_activity'] == df['churn_date']).astype(int)
# Problem: 'churn_date' is only known after customer churns!
# CORRECT: Use information available before prediction
snapshot_date = pd.Timestamp.today()  # or the date predictions will be made
df['days_since_last_activity'] = (snapshot_date - df['last_activity']).dt.days
4.2 Sampling Bias
Problem: Training and test sets don't represent the same population.
Solutions:
# Use stratified splitting for imbalanced classes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# For time-series: Use temporal splitting (never random)
train_cutoff = df['date'].quantile(0.8)
train = df[df['date'] <= train_cutoff]
test = df[df['date'] > train_cutoff]
4.3 Preprocessing-Induced Bias
Problem: Preprocessing choices that favor majority groups or create unfair patterns.
Example: Imputation bias
# Problem: Filling missing income with overall median
# If group A has missing values but different income distribution,
# this biases predictions for group A
df['income'].fillna(df['income'].median(), inplace=True)
# Better: Fill based on relevant groups
df['income'] = df.groupby('region')['income'].transform(
    lambda x: x.fillna(x.median())
)
Proxy variables: Some features indirectly encode protected attributes (e.g., ZIP code strongly correlates with race). Be aware of these relationships.
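A rough way to spot a proxy is to measure how strongly a candidate feature predicts the protected attribute. A minimal sketch using a normalized crosstab, with hypothetical 'zip_code' and 'protected_group' columns:
# Share of each protected group within each zip code
# Rows dominated by a single group suggest zip_code acts as a proxy
proxy_check = pd.crosstab(df['zip_code'], df['protected_group'], normalize='index')
print(proxy_check.head(10))
# Summary: the largest single-group share per zip code
print(proxy_check.max(axis=1).describe())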
Phase 5: The Complete Preprocessing Pipeline
Purpose: Apply all steps in the correct order.
Here's the standard sequence for a complete preprocessing workflow:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# 1. Load data
df = pd.read_csv('raw_data.csv')
# 2. Initial exploration
print(df.info())
print(df.describe())
print(df.isnull().sum())
# 3. Remove duplicates
df = df.drop_duplicates()
# 4. Handle missing values (before splitting)
# Drop columns with >50% missing
high_missing = df.columns[df.isnull().mean() > 0.5]
df = df.drop(columns=high_missing)
# 5. Fix data types
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
# 6. Remove obvious outliers (domain-specific)
df = df[df['age'] > 0]
df = df[df['price'] > 0]
# 7. Feature engineering (before splitting)
df['price_per_unit'] = df['price'] / df['units']
df['year'] = df['date'].dt.year
# 8. Separate features and target
X = df.drop('target', axis=1)
y = df['target']
# 9. CRITICAL: Split data BEFORE fitting any transformations
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 10. Impute missing values (fit on training only)
imputer = SimpleImputer(strategy='median')
numeric_cols = X_train.select_dtypes(include='number').columns
X_train[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = imputer.transform(X_test[numeric_cols])
# 11. Encode categorical variables (fit on training only)
categorical_cols = X_train.select_dtypes(include='object').columns
encoder = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
X_train_cat = encoder.fit_transform(X_train[categorical_cols])
X_test_cat = encoder.transform(X_test[categorical_cols])
# 12. Scale numeric features (fit on training only)
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
# 13. Combine encoded and scaled features
X_train_final = np.hstack([X_train[numeric_cols].values, X_train_cat])
X_test_final = np.hstack([X_test[numeric_cols].values, X_test_cat])
# 14. Verify shapes
print(f"Training shape: {X_train_final.shape}")
print(f"Test shape: {X_test_final.shape}")
# 15. Save preprocessing objects for production use
import joblib
joblib.dump(imputer, 'imputer.pkl')
joblib.dump(encoder, 'encoder.pkl')
joblib.dump(scaler, 'scaler.pkl')
Key sequence points (a Pipeline-based sketch follows this list):
- Explore before changing anything
- Remove duplicates early
- Basic cleaning before feature engineering
- Split data BEFORE fitting transformations
- Fit all transformations on training data only
- Apply transformations to test data
- Save fitted transformers for production
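The same sequence can be bundled so the fit-on-training-only rule is enforced automatically. Here is a minimal sketch using scikit-learn's Pipeline and ColumnTransformer; the column lists are assumptions standing in for whatever your exploration step identified:
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Assumed column lists (replace with your own)
numeric_cols = ['age', 'income', 'price_per_unit']
categorical_cols = ['city', 'category']
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)
])
# Fit on training data only; transform test data with the fitted object
X_train_final = preprocessor.fit_transform(X_train)
X_test_final = preprocessor.transform(X_test)
joblib.dump(preprocessor, 'preprocessor.pkl')  # one object to ship to production
The advantage over the manual steps above is that a single fitted object carries the entire transformation, which simplifies both the save step and production use.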
Common Preprocessing Challenges
Imbalanced Classes
Problem: One class has far more samples (e.g., 95% class A, 5% class B).
Impact: Models predict majority class for everything, achieving high accuracy but missing minority class entirely.
Solutions:
# Check balance
print(y.value_counts(normalize=True))
# Option 1: Oversample minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Option 2: Undersample majority class
from imblearn.under_sampling import RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)
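A third option, often the simplest, is to leave the data as-is and let the model weight classes during training. A minimal sketch using scikit-learn's class_weight parameter (shown with LogisticRegression, but many estimators accept it):
# Option 3: Class weights instead of resampling
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)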
High-Dimensional Data
Problem: Too many features relative to samples.
Impact: Models require more data, training slows, and performance suffers.
Solutions:
# Option 1: Remove low-variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
# Option 2: Remove highly correlated features
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
X_reduced = X.drop(columns=to_drop)
# Option 3: Use feature importance from a model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top_features = importances.nlargest(20).index
X_reduced = X[top_features]
Memory Issues with Large Datasets
Problem: Dataset doesn't fit in memory.
Solutions:
# Option 1: Load in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = preprocess_chunk(chunk)
    # Save or aggregate results
# Option 2: Use data types to reduce memory
df['age'] = df['age'].astype('int8') # Instead of int64
df['category'] = df['category'].astype('category') # Instead of object
# Option 3: Select relevant columns only
relevant_cols = ['col1', 'col2', 'col3']
df = pd.read_csv('large_file.csv', usecols=relevant_cols)
Best Practices
Always explore before preprocessing: Spend 30 minutes examining data before making any changes. Understanding what you have prevents mistakes.
Document every decision: Write comments explaining why you chose specific thresholds, imputation methods, or transformations. Future you will need this.
Split data early: Separate training and test sets immediately after loading. Never fit transformations on test data.
Validate after each step: Check data shape, null counts, and sample values after every preprocessing operation to catch errors immediately (a small helper sketch follows these practices).
Keep raw data untouched: Never modify original files. Apply all changes in code for reproducibility.
Version preprocessing code with models: A model is worthless without knowing how its input was prepared. Track them together.
Test pipeline with small samples: Run your full workflow on 1,000 rows before processing millions. Catch bugs fast.
Save fitted transformers: Production systems need the exact scalers, encoders, and imputers used during training. Save them as files.
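To make "validate after each step" concrete, a small helper can print the checks in one call. A minimal sketch; the function name and the checks it prints are just one reasonable choice:
# Sanity-check helper to call after each preprocessing step
def sanity_check(df, step_name):
    # Print shape, total null count, and a couple of rows so surprises surface immediately
    print(f"--- {step_name} ---")
    print(f"Shape: {df.shape}")
    print(f"Total nulls: {df.isnull().sum().sum()}")
    print(df.head(2))
sanity_check(df, 'after dropping duplicates')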
Quick Reference
Preprocessing Decision Guide
| Issue | When to Use | Method |
|---|---|---|
| Missing <5% | Small, random gaps | Drop rows |
| Missing 5-30% | Moderate gaps | Impute with median/mode |
| Missing >30% | Extensive gaps | Drop column or advanced imputation |
| Skewed distribution | Long right tail | Log transformation |
| Outliers present | Extreme values visible | IQR capping or removal |
| Wide numeric ranges | Features on different scales | StandardScaler or MinMaxScaler |
| Categorical features | Any non-numeric categories | One-hot encoding |
| High cardinality (>50 categories) | Too many unique values | Group rare categories or target encoding |
| Imbalanced classes | One class <10% of data | SMOTE or class weights |
Preprocessing Sequence Checklist
- Load and explore data (.info(), .describe(), .head())
- Check for missing values (.isnull().sum())
- Remove duplicates (.drop_duplicates())
- Fix data types (.astype(), pd.to_datetime())
- Handle outliers (based on domain knowledge)
- Engineer features (ratios, date parts, interactions)
- Split into train/test sets
- Impute missing values (fit on train only)
- Encode categorical variables (fit on train only)
- Scale numeric features (fit on train only)
- Validate final data (check shapes, null counts)
- Save preprocessing objects (.pkl files)
Summary & Next Steps
Key accomplishments: You've learned to:
- Identify data quality issues through exploration
- Apply systematic cleaning techniques for missing values, duplicates, and outliers
- Encode categorical variables and scale numeric features appropriately
- Recognize and prevent common sources of bias
- Build complete preprocessing pipelines with correct sequencing
Critical insights:
- Preprocessing determines model success: Clean, properly scaled data matters more than choosing the fanciest model
- Split data before fitting: Fit all transformations on training data only to prevent data leakage
- Document decisions: Your preprocessing choices need justification and reproducibility
- Iteration is normal: Expect to revisit earlier steps as you discover new issues
Connections to previous tutorials:
- Data Foundations: This tutorial implements the cleaning and transformation phases described there
- Pandas: Your primary tool for executing every preprocessing operation
- Visualization: Essential for discovering preprocessing needs and validating results
What's next:
With clean, preprocessed data, you're ready to train models. The next tutorials cover model selection, training workflows, and performance evaluation.
External resources:
- scikit-learn Preprocessing Documentation - comprehensive guide to transformation methods
- Kaggle Learn: Data Cleaning - interactive exercises with real datasets
- Feature Engineering Book - advanced feature creation techniques
Remember: Great models start with great preprocessing. Master these techniques, and you'll build reliable systems that work in production, not just in notebooks.