Understanding Missing Values: Missingness, Simple Imputer, and Missing Indicator in Machine Learning

In real-world datasets, missing values are common — whether it’s medical records, loan applications, or user profiles. Handling them correctly is crucial for model performance and trustworthy insights.

In this blog, we’ll break down:

  1. What is Missingness?
  2. What is Simple Imputer?
  3. What is Missing Indicator?
  4. Why and when we use them in the real world
  5. How to implement step-by-step, ending with a production-ready pipeline

1. What is Missingness?

“Missingness” refers to the presence of missing values in a dataset and, more importantly, the reason or pattern behind those missing values.

It’s not just about the absence of a value — it’s about asking:

  • Why is this value missing?
  • Does the fact that it’s missing tell us something useful?

There are 3 types of missingness:

Type Description Example
MCAR (Missing Completely At Random) Missing has no pattern A server crashed randomly
MAR (Missing At Random) Missing depends on other variables Young people more likely to skip income
MNAR (Missing Not At Random) Missing depends on itself Rich people don’t disclose their income

In MAR and MNAR, the fact that a value is missing can carry information. This is where the Missing Indicator becomes powerful.


2. What is Simple Imputer?

SimpleImputer is a tool from scikit-learn used to fill missing values using a simple strategy:

  • Mean: Average of column
  • Median: Middle value
  • Most Frequent: Most common value
  • Constant: A fixed value like 0 or “Unknown”

Example:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
'Age': [25, np.nan, 30, np.nan, 45],
'Income': [50000, 60000, np.nan, 65000, np.nan]
})

imputer = SimpleImputer(strategy='median')
filled_data = imputer.fit_transform(data)

pd.DataFrame(filled_data, columns=data.columns)

 Output:

   Age   Income
0 25.0 50000.0
1 30.0 60000.0
2 30.0 60000.0
3 30.0 65000.0
4 45.0 60000.0

💡 NaN values are replaced with the median of the column.

 Why is Simple Imputer Important in Industry?

 1. ML Models Can’t Handle NaNs

Most ML models (Linear Regression, Logistic Regression, Random Forest, etc.) don’t work if NaN is present. You must fill them.

 2. Fast and Efficient

Simple strategies (mean/median) are fast and effective for most numeric features.

 3. Preprocessing Pipelines

It integrates well into scikit-learn pipelines (used in production systems).

 4. Keeps the Distribution Stable

Median is especially useful when data has outliers.


 Where and When Do We Use Simple Imputer?

 When to Use:

Situation Use
Numeric data mean or median
Categorical data most_frequent or constant
During preprocessing Inside a scikit-learn Pipeline
When you want simplicity and speed SimpleImputer is best

 Real-World Industry Use Cases

 Loan Default Dataset

  • Feature: NumberOfDependents
  • Strategy: Fill missing with median (most realistic, avoids outliers)

 Medical Dataset

  • Feature: Blood Pressure
  • Strategy: Fill with mean or median

 Telecom (Churn) Dataset

  • Feature: LastRechargeAmount
  • Strategy: Fill with constant = 0 or median

 E-commerce Dataset

  • Feature: ProductRating
  • Strategy: Fill with most_frequent or constant = “unknown”

 Strategies Summary Table

Strategy Best For Example Use Case
mean Numeric, no outliers Age, Salary in a clean dataset
median Numeric, has outliers Loan Amount, Medical costs
most_frequent Categorical or repetitive Gender, Country, Product Brand
constant Fill all with a fixed value “Unknown”, 0, etc.

 3. Why Imputation Alone Can Be Risky

Let’s say a customer’s income was missing and we filled it with the median. That’s good, but:

We lose the signal that the income was originally missing.

Maybe customers who hide their income are more likely to default on a loan?

 To keep this information, we use a Missing Indicator.

🔹 What is Missing Indicator?

A Missing Indicator (MI) is a binary feature (0 or 1) that tells whether a value was missing in the original dataset for a particular feature.

  • 1 = Value was missing
  • 0 = Value was present

It is not a method to fill the missing value, but a signal to the model that a value was originally missing.


🔹 Why Do We Use Missing Indicator? (Importance)

In many real-world datasets, missingness itself can carry information.

✅ For example:

  • A high-income customer might refuse to disclose their salary — this “missing” value may actually indicate something about their behavior or category.
  • In a medical dataset, a missing lab test might mean that the doctor didn’t find it necessary — which is a kind of signal!

Hence, instead of blindly imputing (e.g., filling with mean), we can add an extra “Missing Indicator” column, so the model knows what was imputed.


Where and When is Missing Indicator Used?

Missing Indicator is especially useful:

  • In tree-based models (like Random Forest, XGBoost)
  • When data is not missing completely at random
  • When the fact that a value is missing is predictive
  • In financial, healthcare, telecom, insurance, etc. industries

Real-Life Examples Where MI is Mandatory

 1. Loan Default Dataset:

  • Feature: NumberOfDependents
  • Some people did not report their dependents.
  • Missing values could signal:
    • Hiding dependents → potential fraud
    • Single individuals → possibly fewer expenses → less risky

So we:

  • Impute missing values (e.g., with median)
  • Add a Missing Indicator column: NumberofDependents_missing

 2. Medical Dataset:

  • Feature: Cholestrol_Level
  • If missing, it might be:
    • Patient already healthy
    • Doctor didn’t request the test

So we:

  • Impute with median
  • Add Missing Indicator: Cholestrol_Level_missing

 3. Telecom Dataset (Churn Prediction):

  • Feature: LastRechargeAmount
  • If missing, maybe the user is inactive for a long time

Missingness = sign of likely churn!

How to Implement on Real Dataset (Using Python)

Let’s go step-by-step using pandas and scikit-learn.

 Sample Dataset (Simulated)

import pandas as pd
import numpy as np

# Simulate a small dataset
data = pd.DataFrame({
'Age': [25, 30, np.nan, 45, np.nan],
'Income': [50000, 60000, 65000, np.nan, 70000],
'Loan_Status': [1, 0, 1, 0, 1]
})

 Step 1: Add Missing Indicator Columns

pythonCopyEditfor col in ['Age', 'Income']:
data[col + '_missing'] = data[col].isnull().astype(int)

👉 Step 2: Impute the missing values (with median, for example)

for col in ['Age', 'Income']:
data[col].fillna(data[col].median(), inplace=True)

👉 Final Dataset

print(data)

Output:

     Age   Income  Loan_Status  Age_missing  Income_missing
0 25.00 50000.0 1 0 0
1 30.00 60000.0 0 0 0
2 32.50 65000.0 1 1 0
3 45.00 62500.0 0 0 1
4 32.50 70000.0 1 1 0

Age_missing and Income_missing now tell the model that the original value was missing.


🔹 Using MissingIndicator from Scikit-learn

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator

# Define numeric columns
numeric_cols = ['Age', 'Income']

# Pipeline for numeric columns: impute + missing indicator
numeric_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median', add_indicator=True))
])


# Apply column transformer
transformer = ColumnTransformer(transformers=[
('num', numeric_pipeline, numeric_cols)
])

# Fit-transform
transformed_data = transformer.fit_transform(data[numeric_cols])

Clarifying the Confusion

 SimpleImputer(add_indicator=True)

This is actually how scikit-learn recommends using the Missing Indicator!

  • SimpleImputer fills the missing values (e.g., with mean/median/etc.)
  • If you set add_indicator=True, it automatically adds a missing indicator column for each feature that has missing values

So When Do You Use MissingIndicator Alone?

You use the MissingIndicator class separately only if:

  1. You don’t want to fill missing values, but just flag them.
  2. You already filled missing values in a different way, and now want to generate missing indicators separately.

✅ If You Want to Use MissingIndicator Separately

from sklearn.impute import MissingIndicator

indicator = MissingIndicator()
missing_flags = indicator.fit_transform(data)

# Get indicator column names
missing_cols = [col + '_missing' for col, is_missing in zip(data.columns, data.isnull().any()) if is_missing]

df_flags = pd.DataFrame(missing_flags, columns=missing_cols)

print(df_flags)

This just creates binary indicators without imputing the data.

Question 2: Which imputation method should we use?

There’s no one-size-fits-all, but here’s a guide:

Imputation Method When to Use
SimpleImputer (mean/median) Fast, works well for numeric data if distribution is symmetric or slightly skewed
KNNImputer Use when nearby samples (rows) are similar. Best for small/medium datasets
IterativeImputer Best when you want a model-based estimation of missing values. Powerful but slower
Most Frequent Categorical variables — fill with mode

Important: No matter which imputer you use, if you believe “missingness” contains signal, add a missing indicator column too!

Question : Why should we add a new column like age_missing if we already filled the value?

Here’s the real logic:

💡 When we impute, we “guess” the missing value.

But if you just replace the missing value with median or mean, you hide the fact that it was missing — and that missingness might carry predictive power.

So, we add age_missing to tell the model:
“Hey! This value was originally missing. We filled it, but you should know it was NA.”

This gives the model two signals:

. The filled value (e.g., Age = 32.5)

. And the fact that it was originally missing (e.g., Age_missing = 1)

 Want to Think Like a Pro?

  • Use SimpleImputer + add_indicator=True for fast, scalable ML pipelines.
  • Use IterativeImputer or KNNImputer + manually add missing indicator columns if you’re working with sensitive/predictive features (like Age, Income, Lab Results).
  • Always test model performance with and without indicator columns → keep if model improves.

 Final Implementation Using Pipeline

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer

# Sample dataset
df = pd.DataFrame({
'Age': [25, 30, np.nan, 45, np.nan],
'Income': [50000, 60000, 65000, np.nan, 70000],
'Target': [0, 1, 0, 1, 0]
})

X = df[['Age', 'Income']]
y = df['Target']

# Create transformer: impute + indicator
imputer = SimpleImputer(strategy='median', add_indicator=True)

# Final pipeline with Random Forest
pipeline = Pipeline([
('imputer', imputer),
('model', RandomForestClassifier(random_state=42))
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and predict
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

# Accuracy
print("Test Accuracy:", accuracy_score(y_test, preds))

 Why Use a Pipeline?

When working on a real ML project, especially in production or with cross-validation, you want to ensure:

  • Preprocessing happens inside the training folds
  • Imputation rules are applied only from training data
  • No data leakage

That’s where pipelines help. They chain preprocessing + modeling into one safe and reproducible unit.

🎯 Final Thoughts

  • Missingness matters.
  • Don’t just fill the value — understand why it’s missing.
  • Simple Imputer ensures your model can run.
  • Missing Indicator ensures your model can learn from the absence.

Together, they help you build more robust, realistic, and high-performing models.


If you found this blog helpful, feel free to share it with your fellow learners or save it in your Machine Learning notes.

📌 Remember: In data science, small details like missing indicators can lead to big gains in model performance!

2 thoughts on “Understanding Missing Values: Missingness, Simple Imputer, and Missing Indicator in Machine Learning”

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top