Understanding Missing Values: Missingness, Simple Imputer, and Missing Indicator in Machine Learning

In real-world datasets, missing values are common, whether in medical records, loan applications, or user profiles. Handling them correctly is crucial for model performance and trustworthy insights. In this blog, we'll break down missingness, SimpleImputer, and the Missing Indicator.

1. What is Missingness?

"Missingness" refers to the presence of missing values in a dataset and, more importantly, the reason or pattern behind those missing values. It's not just about the absence of a value; it's about asking why the value is missing.

There are 3 types of missingness:

| Type | Description | Example |
|---|---|---|
| MCAR (Missing Completely At Random) | Missingness has no pattern | A server crashed randomly |
| MAR (Missing At Random) | Missingness depends on other observed variables | Young people are more likely to skip the income field |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | Rich people don't disclose their income |

In MAR and MNAR, the fact that a value is missing can carry information. This is where the Missing Indicator becomes powerful.

2. What is Simple Imputer?

SimpleImputer is a tool from scikit-learn used to fill missing values using a simple strategy: mean, median, most_frequent, or constant.

Example:

    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer

    data = pd.DataFrame({
        'Age': [25, np.nan, 30, np.nan, 45],
        'Income': [50000, 60000, np.nan, 65000, np.nan]
    })

    # Fill each column's NaNs with that column's median
    imputer = SimpleImputer(strategy='median')
    filled_data = imputer.fit_transform(data)
    pd.DataFrame(filled_data, columns=data.columns)

Output:

        Age   Income
    0  25.0  50000.0
    1  30.0  60000.0
    2  30.0  60000.0
    3  30.0  65000.0
    4  45.0  60000.0

💡 NaN values are replaced with the median of the column.

Why is Simple Imputer Important in Industry?

1. ML Models Can't Handle NaNs
Many ML models (Linear Regression, Logistic Regression, etc.) raise an error if NaN is present, so you must fill the gaps first. Some tree-based estimators, such as scikit-learn's HistGradientBoostingClassifier, accept NaN natively.

2. Fast and Efficient
Simple strategies (mean/median) are fast and effective for most numeric features.

3. Preprocessing Pipelines
It integrates well into scikit-learn pipelines (used in production systems).

4. Keeps the Distribution Stable
Median is especially useful when data has outliers, because extreme values do not pull it the way they pull the mean.

Where and When Do We Use Simple Imputer?

When to Use:

| Situation | Use |
|---|---|
| Numeric data | mean or median |
| Categorical data | most_frequent or constant |
| During preprocessing | Inside a scikit-learn Pipeline |
| When you want simplicity and speed | SimpleImputer is best |

Real-World Industry Use Cases

- Loan Default Dataset
- Medical Dataset
- Telecom (Churn) Dataset
- E-commerce Dataset

Strategies Summary Table

| Strategy | Best For | Example Use Case |
|---|---|---|
| mean | Numeric, no outliers | Age, Salary in a clean dataset |
| median | Numeric, has outliers | Loan Amount, Medical costs |
| most_frequent | Categorical or repetitive | Gender, Country, Product Brand |
| constant | Fill all with a fixed value | "Unknown", 0, etc. |

3. Why Imputation Alone Can Be Risky

Let's say a customer's income was missing and we filled it with the median. That gives the model a usable number, but we lose the signal that the income was originally missing. Maybe customers who hide their income are more likely to default on a loan?

To keep this information, we use a Missing Indicator.

🔹 What is Missing Indicator?

A Missing Indicator (MI) is a binary feature (0 or 1) that tells whether a value was missing in the original dataset for a particular feature. It is not a method to fill the missing value, but a signal to the model that a value was originally missing.

🔹 Why Do We Use Missing Indicator? (Importance)

In many real-world datasets, missingness itself can carry information.

✅ For example, customers who hide their income may be more likely to default on a loan.

Hence, instead of blindly imputing (e.g., filling with mean), we can add an extra "Missing Indicator" column, so the model knows what was imputed.

Where and When is Missing Indicator Used?

Missing Indicator is especially useful when the data is MAR or MNAR, where the fact that a value is missing can carry information.

Real-Life Examples Where MI is Mandatory

1. Loan Default Dataset
2. Medical Dataset
3. Telecom Dataset (Churn Prediction): missingness can be a sign of likely churn!

How to Implement on Real Dataset (Using Python)

Let's go step-by-step using pandas and scikit-learn.
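Before the walkthrough, here is a minimal sketch of why the flag is worth keeping. It assumes a made-up toy table (the `loans` DataFrame, its values, and the names `Income_missing` and `default_rate_by_flag` are all hypothetical, chosen only for illustration) in which the customers who hid their income happen to default more often. It is not a real dataset and proves nothing about real lending data; it only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every value here is invented for illustration.
loans = pd.DataFrame({
    "Income": [40000, np.nan, 52000, np.nan, 61000, np.nan, 58000, 45000],
    "Default": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Record the missingness BEFORE imputing; afterwards the information is gone.
loans["Income_missing"] = loans["Income"].isna().astype(int)

# Median imputation: every missing row now looks like a typical customer.
median_income = loans["Income"].median()
loans["Income"] = loans["Income"].fillna(median_income)

# Default rate per flag value exposes the signal the imputed column hides.
default_rate_by_flag = loans.groupby("Income_missing")["Default"].mean()
print(default_rate_by_flag)
```

On this toy table, the flagged rows all receive the same imputed income (the median of the observed values), so the Income column alone cannot separate them from an ordinary customer. Grouping by the flag shows a default rate of 1.0 for flagged rows against 0.2 for the rest, which is exactly the kind of pattern a model can only learn if the indicator column is present.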
Sample Dataset (Simulated)

    import pandas as pd
    import numpy as np

    # Simulate a small dataset
    data = pd.DataFrame({
        'Age': [25, 30, np.nan, 45, np.nan],
        'Income': [50000, 60000, 65000, np.nan, 70000],
        'Loan_Status': [1, 0, 1, 0, 1]
    })

Step 1: Add Missing Indicator Columns

    # Flag the missing rows before imputing, while the NaNs are still there
    for col in ['Age', 'Income']:
        data[col + '_missing'] = data[col].isnull().astype(int)

👉 Step 2: Impute the missing values (with median, for example)

    # Assign the result back instead of using inplace=True on a column,
    # which newer pandas versions warn about and may silently ignore
    for col in ['Age', 'Income']:
        data[col] = data[col].fillna(data[col].median())

👉 Final Dataset

    print(data)

Output:

        Age   Income  Loan_Status  Age_missing  Income_missing
    0  25.0  50000.0            1            0               0
    1  30.0  60000.0            0            0               0
    2  30.0  65000.0            1            1               0
    3  45.0  62500.0            0            0               1
    4  30.0  70000.0            1            1               0

The missing ages are filled with 30.0, the median of the observed ages (25, 30, 45), and the missing income with 62500.0, the median of the observed incomes.

✅ Age_missing and Income_missing now tell the model that the original value was missing.

🔹 Using MissingIndicator from Scikit-learn

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline

    # Start again from the raw data: after Step 2 there are no NaNs left to flag
    raw = pd.DataFrame({
        'Age': [25, 30, np.nan, 45, np.nan],
        'Income': [50000, 60000, 65000, np.nan, 70000]
    })

    # Define numeric columns
    numeric_cols = ['Age', 'Income']

    # Pipeline for numeric columns: impute + missing indicator
    numeric_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median', add_indicator=True))
    ])

    # Apply column transformer
    transformer = ColumnTransformer(transformers=[
        ('num', numeric_pipeline, numeric_cols)
    ])

    # Fit-transform: imputed Age and Income, then one flag column
    # for each input column that contained NaNs during fit
    transformed_data = transformer.fit_transform(raw)

Clarifying the Confusion

SimpleImputer(add_indicator=True) imputes the values and appends the MissingIndicator output in a single step. The scikit-learn documentation describes this as a convenient way to combine the two, so in most pipelines you do not need the MissingIndicator class directly.

So When Do You Use MissingIndicator Alone?
You use the MissingIndicator class separately only if you want the binary flags on their own, without imputing the data.

✅ If You Want to Use MissingIndicator Separately

    from sklearn.impute import MissingIndicator

    # Use the raw data (before imputation); otherwise there is nothing to flag
    raw = pd.DataFrame({
        'Age': [25, 30, np.nan, 45, np.nan],
        'Income': [50000, 60000, 65000, np.nan, 70000]
    })

    indicator = MissingIndicator()  # default: only columns with NaNs at fit time
    missing_flags = indicator.fit_transform(raw)

    # Get indicator column names from the fitted indicator
    missing_cols = [col + '_missing' for col in raw.columns[indicator.features_]]
    df_flags = pd.DataFrame(missing_flags, columns=missing_cols)
    print(df_flags)

This just creates binary indicators without imputing the data.

✅ Question 2: Which imputation method should we use?

There's no one-size-fits-all, but here's a guide:

| Imputation Method | When to Use |
|---|---|
| SimpleImputer (mean/median) | Fast, works well for numeric data if the distribution is symmetric or slightly skewed |
| KNNImputer | Use when nearby samples (rows) are similar. Best for small/medium datasets |
| IterativeImputer | Best when you want a model-based estimation of missing values. Powerful but slower |
| Most Frequent | Categorical variables: fill with the mode |

Important: No matter which imputer you use, if you believe "missingness" contains signal, add a missing indicator column too!

Question: Why should we add a new column like age_missing if we already filled the value?

Here's the real logic:

💡 When we impute, we "guess" the missing value. But if you just replace the missing value with the median or mean, you hide the fact that it was missing, and that missingness might carry predictive power.

So, we add age_missing to tell the model: “Hey! This value was originally missing. We filled it,