1. What Are Missing Values?
Imagine you’re conducting a survey in your city to collect information like:
- Name
- Age
- Gender
- Monthly Income
Now, suppose some people didn’t fill out the “Income” section, or the data entry system missed it.
That blank space is called a missing value.
In simple words:
Missing values are empty spaces in your data where information is supposed to be, but it’s not available.
In a dataset, it usually looks like this:
Name | Age | Income |
---|---|---|
Ali | 25 | NaN (missing) |
Sara | 30 | 80,000 |
Bilal | 22 | NaN (missing) |
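Here is a minimal pandas sketch of the same small table, showing how missing values are usually represented and counted in Python. The DataFrame name `survey` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# The small survey table from above; np.nan marks a missing income.
survey = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal"],
    "Age": [25, 30, 22],
    "Income": [np.nan, 80000, np.nan],
})

# isna() flags each missing cell; sum() counts the flags per column.
missing_per_column = survey.isna().sum()
print(missing_per_column)
```

Here the Income column reports 2 missing values, while Name and Age report 0.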
2. Why Should We Handle Missing Values in Data Science & ML?
Great question! Let’s see why we care about these empty spaces:
1. Machine Learning Hates Missing Data
Most ML algorithms as commonly implemented, such as Logistic Regression and SVM (and Random Forest in many library versions), raise an error if they see NaN.
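To see why libraries refuse such input, here is a minimal NumPy sketch. The weights and feature values are made up for illustration. A single NaN in a row turns the whole weighted sum, which is the core of a linear model’s prediction, into NaN.

```python
import numpy as np

# Hypothetical weights of a tiny linear model (illustrative values only).
weights = np.array([0.5, 0.001])

complete_row = np.array([30.0, 80000.0])   # age, income
missing_row = np.array([25.0, np.nan])     # income is missing

# A linear model's score is a weighted sum of the features.
score_complete = float(np.dot(weights, complete_row))
score_missing = float(np.dot(weights, missing_row))

print(score_complete)            # an ordinary number
print(np.isnan(score_missing))   # True: the NaN spread to the result
```

Since one blank would silently poison every downstream calculation, many libraries stop with an error instead.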
They say:
“I need clean input, not blanks!”
2. Ignoring Missing Values Can Mislead the Model
If we don’t handle missing values properly:
- The model can learn wrong patterns
- Your predictions may be biased or inaccurate
3. You Lose Valuable Data
If 25% of your rows have missing values and you just delete them all, you’re wasting good information.
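As a small illustration with made-up data (the column names are assumptions), here is what blind deletion does to an 8-row table in which 2 rows lack income:

```python
import numpy as np
import pandas as pd

# Made-up survey: 2 of 8 rows (25%) have no income.
df = pd.DataFrame({
    "Age": [22, 23, 35, 40, 38, 29, 51, 46],
    "Income": [np.nan, 45000, 80000, 90000, np.nan, 60000, 95000, 88000],
})

# dropna() removes every row that contains at least one NaN.
complete_rows = df.dropna()

rows_lost = len(df) - len(complete_rows)
share_lost = rows_lost / len(df)
print(rows_lost, share_lost)  # 2 0.25
```

Notice that the two deleted rows also take their perfectly valid Age values with them.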
4. Sometimes “Missing” Means Something Important
For example, if someone refuses to share their income, that might mean they’re very rich or hiding something.
You can create a flag feature such as income_missing (1 when income is blank, 0 otherwise) to help the model use that signal.
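A minimal pandas sketch of that flag, assuming the income column is called `Income`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal"],
    "Income": [np.nan, 80000, np.nan],
})

# 1 where income is missing, 0 where it was reported.
df["income_missing"] = df["Income"].isna().astype(int)
print(df)
```

Create the flag before you fill in the blanks. Once the NaNs are replaced, the information about who skipped the question is gone.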
3. Types of Missing Values: With Real-Life Examples
There are 3 types of missing data, based on why the value is missing.
Let’s understand them with an easy-to-remember story:
1. MCAR (Missing Completely at Random)
What it means:
Missing values happened by accident, not because of the person or their income.
Example:
- Your survey paper got wet in the rain, and some “income” fields got erased.
- It doesn’t depend on the person’s age, gender, or income.
Key Point:
The missingness is purely random, so this is the easiest case to handle: dropping the affected rows does not bias your estimates (it only shrinks the sample), and simple filling is usually acceptable.
Think: “It just got lost randomly. No bias. Lucky!”
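The “no bias” idea can be checked with a small simulation. This is only a sketch: the income distribution and the 20% loss rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "true" incomes (made-up distribution).
income = rng.normal(loc=60000, scale=15000, size=10000)

# MCAR: every value has the same 20% chance of being lost,
# regardless of the income itself or any other column.
lost = rng.random(income.size) < 0.20
observed = income[~lost]

true_mean = income.mean()
observed_mean = observed.mean()
print(round(true_mean), round(observed_mean))
```

The two means come out very close, because the rows that survived are a fair sample of everyone.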
2. MAR (Missing At Random)
What it means:
Missingness depends on some other variable that you do know.
Example:
- Younger people (age < 25) don’t like to share their income (maybe they have no job yet).
- But older people fill it in.
- So income is missing, but it depends on age (which you have).
Key Point:
The missingness is not purely random, but you have clues (like age) to handle it.
Think: “It depends on another question that was filled correctly.”
Let’s Revisit the Example
Let’s say we have this simple data:
Name | Age | Income |
---|---|---|
Ali | 22 | NaN |
Sara | 23 | NaN |
Bilal | 35 | 80,000 |
Ayesha | 40 | 90,000 |
Umar | 38 | 85,000 |
Case 1: If the missing incomes (NaNs) are for random ages:
→ like age 22 and age 38 both missing
→ and there’s no pattern
Then it’s MCAR
(Maybe erased by water, or a system glitch: no pattern)
Case 2: If the missing incomes are only in younger people:
→ age < 25 → missing
→ age ≥ 25 → present
Then it’s MAR
(Because missingness depends on another column: age)
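The five-row table above matches Case 2. Here is a minimal pandas sketch of that check. The 25-year cutoff comes from the example, and with only five rows this illustrates the idea rather than providing statistical evidence.

```python
import numpy as np
import pandas as pd

# The five-row table from above.
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal", "Ayesha", "Umar"],
    "Age": [22, 23, 35, 40, 38],
    "Income": [np.nan, np.nan, 80000, 90000, 85000],
})

# Share of missing incomes among people under 25 vs. 25 and over.
is_young = df["Age"] < 25
missing_rate_young = df.loc[is_young, "Income"].isna().mean()
missing_rate_older = df.loc[~is_young, "Income"].isna().mean()

print(missing_rate_young, missing_rate_older)  # 1.0 0.0
```

Every income is missing in the younger group and none in the older group, so the missingness clearly follows age.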
3. MNAR (Missing Not At Random)
What it means:
Missingness depends on the very value that is missing.
Example:
- People who earn a lot don’t want to show their income.
- They leave it blank to hide it.
- So income is missing BECAUSE of income itself.
Key Point:
This is dangerous: the reason for the missingness is hidden inside the data you don’t see.
Think: “I can’t see the value, and it’s missing because of itself. That’s tricky!”
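A small simulation shows why this is dangerous. This is a sketch under made-up assumptions: the income distribution is invented, and the rule that the top 20% of earners hide their income 80% of the time is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "true" incomes (made-up distribution).
income = rng.normal(loc=60000, scale=15000, size=10000)

# MNAR: the chance of hiding the value depends on the value itself.
# Assumed rule: top 20% of earners hide income 80% of the time,
# everyone else only 5% of the time.
threshold = np.quantile(income, 0.80)
hide_probability = np.where(income > threshold, 0.80, 0.05)
hidden = rng.random(income.size) < hide_probability
observed = income[~hidden]

true_mean = income.mean()
observed_mean = observed.mean()
print(round(true_mean), round(observed_mean))  # observed mean is lower
```

The observed average is noticeably too low, and nothing in the visible data tells you so.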
Summary in One Line Each:
Type | Real Meaning | Real Example |
---|---|---|
MCAR | Missing randomly | Paper got wet, some data erased |
MAR | Missing depends on other answers | Young people skipped income |
MNAR | Missing because of its own value | Rich people hide their income |
4. Why Is It Important to Detect the Type of Missingness?
Detecting the type of missing data is critical because:
Type | Handling Strategy | Danger of Ignoring |
---|---|---|
MCAR | Safe to drop or impute | Low |
MAR | Needs imputation using other variables | Moderate |
MNAR | Requires domain logic, hard to fix | High Risk |
If you don’t detect the type:
- You may choose the wrong technique
- Your model may become biased
- Your accuracy or fairness may drop
For example:
- Treating MNAR as MAR → adds bias
- Assuming MCAR blindly → may destroy subtle patterns
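For the MAR row of the table, “imputation using other variables” can be as simple as filling each missing income with the median income of people in the same age band. Here is a minimal pandas sketch with made-up data. The age bands and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 23, 24, 35, 40, 38],
    "Income": [np.nan, 30000, 34000, 80000, np.nan, 90000],
})

# Group people into two age bands: under 25, and 25 or older.
df["age_band"] = np.where(df["Age"] < 25, "under_25", "25_plus")

# Median income within each band, aligned back to every row.
band_median = df.groupby("age_band")["Income"].transform("median")

# Fill only the missing incomes with their band's median.
df["Income_filled"] = df["Income"].fillna(band_median)
print(df)
```

The 22-year-old gets 32,000 (the under-25 median) and the 40-year-old gets 85,000, instead of both receiving one overall average that would fit neither group.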
5. How to Detect the Type of Missingness?
Let’s now play detective. Here’s how we investigate:
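One simple investigation, sketched here with pandas on made-up data (the column names are assumptions): flag where income is missing, then check whether that flag is related to a column you do observe, such as age. A clear relationship points away from MCAR and toward MAR. No relationship is consistent with MCAR, but it cannot rule out MNAR.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [21, 22, 23, 24, 31, 35, 38, 40, 45, 52],
    "Income": [np.nan, np.nan, 30000, np.nan, 70000,
               80000, 85000, 90000, np.nan, 99000],
})

# 1 where income is missing, 0 otherwise.
df["income_missing"] = df["Income"].isna().astype(int)

# Compare the average age of rows with and without income.
mean_age_by_flag = df.groupby("income_missing")["Age"].mean()

# Correlation between the missingness flag and age
# (negative here: missing incomes sit with younger ages).
flag_age_corr = df["income_missing"].corr(df["Age"])

print(mean_age_by_flag)
print(round(flag_age_corr, 2))
```

In this toy data the people with missing income are on average 28 years old, against 36.5 for the rest, which hints that missingness is tied to age.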



MNAR is Hard to Detect
Because it depends on the value you don’t see, you need:
- Domain knowledge
- Business logic
Tip: If nothing explains the missingness and the column is sensitive (like income or health), it is safer to assume MNAR.
Final Thoughts
Type | Short Meaning | Detectable? | How to Handle |
---|---|---|---|
MCAR | Random | Yes, easy | Drop or mean impute |
MAR | Related to other columns | Yes, use logic or regression | KNN, MICE, etc. |
MNAR | Missing because of itself | No, hard | Domain knowledge, flag, careful modeling |
What’s Next?
In the next blog, we’ll go hands-on with how to handle missing values using Python:
- SimpleImputer (mean, median, mode)
- MissingIndicator
- KNN Imputer
- MICE (Multivariate Imputation by Chained Equations)
Stay tuned. I’ll explain not only how to use them, but also when to use which one, like a real Data Scientist.