Missing Values in Machine Learning: A Beginner’s Guide (With Real-Life Examples)

1. What Are Missing Values?

Imagine you’re conducting a survey in your city to collect information like:

  • Name
  • Age
  • Gender
  • Monthly Income

Now, suppose some people didn’t fill out the “Income” section, or the data entry system failed to record it.
That blank space is called a missing value.

In simple words:

Missing values are empty spaces in your data where information is supposed to be, but it’s not available.

In a dataset, it usually looks like this:

Name    Age   Income
Ali     25    NaN (missing)
Sara    30    80,000
Bilal   22    NaN (missing)
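
As a minimal sketch, the same table can be built in pandas. The names and values come from the example above; using pandas here is an assumption, since no library has been introduced yet. `isna()` marks each blank cell, and summing it counts the missing values per column.

```python
import numpy as np
import pandas as pd

# The survey table from above; np.nan marks a missing income.
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal"],
    "Age": [25, 30, 22],
    "Income": [np.nan, 80_000, np.nan],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Running a count like this is usually the first thing to do with a new dataset, before deciding how to treat the blanks.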

2. Why Should We Handle Missing Values in Data Science & ML?

Great question! Let’s see why we care about these empty spaces:

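1. Machine Learning Hates Missing Data

Many ML implementations, such as scikit-learn’s Logistic Regression and SVM (and Random Forest in older scikit-learn versions), raise an error if the input contains NaN. A few models, such as histogram-based gradient boosting, accept NaN directly, but they are the exception.
They say:

“I need clean input, not blanks!”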
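
Here is a small sketch of that complaint, assuming scikit-learn is installed. The four-row dataset is made up purely for illustration. Fitting a `LogisticRegression` on a feature matrix that contains NaN raises a `ValueError` instead of training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature (income) with one missing entry.
X = np.array([[np.nan], [80_000.0], [30_000.0], [95_000.0]])
y = np.array([0, 1, 0, 1])

try:
    LogisticRegression().fit(X, y)
    error_message = None
except ValueError as exc:
    # scikit-learn validates its input and rejects NaN for this estimator.
    error_message = str(exc)

print(error_message)
```

The exact wording of the error differs between scikit-learn versions, so the sketch only relies on the exception type.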

2. Ignoring Missing Values Can Mislead the Model

If we don’t handle missing values properly:

  • The model can learn wrong patterns
  • Your predictions may be biased or inaccurate

3. You Lose Valuable Data

If 25% of your rows have missing values and you just delete them all, you’re throwing away a quarter of your dataset, including every value those rows did contain.
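
To make the cost concrete, here is a rough sketch with a hypothetical 8-row survey in which 2 rows (25%) are missing only the income. The data is invented for illustration. `dropna()` removes the whole row, so the known age and gender in those rows are lost too.

```python
import numpy as np
import pandas as pd

# Hypothetical 8-row survey; 2 rows (25%) are missing Income only.
df = pd.DataFrame({
    "Age": [22, 23, 35, 40, 38, 29, 51, 45],
    "Gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "Income": [np.nan, np.nan, 80_000, 90_000, 85_000, 60_000, 120_000, 95_000],
})

cleaned = df.dropna()  # drops every row that has at least one NaN
rows_lost = len(df) - len(cleaned)

# Known values (Age, Gender) discarded together with the blanks.
incomplete_rows = df[df.isna().any(axis=1)]
known_values_lost = int(incomplete_rows.notna().sum().sum())

print(f"Rows lost: {rows_lost} of {len(df)}")
print(f"Known values discarded: {known_values_lost}")
```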

4. Sometimes “Missing” Means Something Important

For example, if someone refuses to share their income, that might mean they’re very rich or hiding something.
You can capture this with an extra indicator feature, for example:
income_missing = 1 when the income is blank (0 otherwise), to help the model.
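
One possible way to build that flag in pandas is sketched below. The column name `income_missing` comes from the text, and the small table is the survey example from earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal"],
    "Income": [np.nan, 80_000, np.nan],
})

# 1 where Income is blank, 0 where it was provided.
df["income_missing"] = df["Income"].isna().astype(int)
print(df)
```

Create the flag before you fill in the blanks. Once the NaNs are replaced, the information about which values were originally missing is gone.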

3. Types of Missing Values, With Real-Life Examples

There are 3 types of missing data, based on why the value is missing.

Let’s understand them with an easy-to-remember story:

1. MCAR (Missing Completely at Random)

What it means:

Missing values happened by accident, not because of the person or their income.

Example:

  • Your survey paper got wet in the rain, and some “income” fields got erased.
  • It doesn’t depend on the person’s age, gender, or income.

Key Point:

The missingness is purely random, so the rows you still have remain a fair sample. You can handle it with simple dropping or filling.

Think: “It just got lost randomly. No bias. Lucky!”

2. MAR (Missing At Random)

What it means:

Missingness depends on some other variable that you do know.

Example:

  • Younger people (age < 25) don’t like to share their income (maybe they have no job yet).
  • But older people fill it in.
  • So income is missing, but it depends on age (which you have).

Key Point:

Despite the name, the missingness is not purely random, but you have clues (like age) to handle it.

Think: “It depends on another question that was filled in correctly.”

Let’s Revisit the Example

Let’s say we have this simple data:

Name     Age   Income
Ali      22    NaN
Sara     23    NaN
Bilal    35    80,000
Ayesha   40    90,000
Umar     38    85,000

Case 1: If the missing incomes (NaNs) are for random ages:

→ like age 22 and age 38 both missing
→ and there’s no pattern

Then it’s MCAR
(Maybe erased by water, or a system glitch: no pattern)

Case 2: If the missing incomes are only in younger people (as in the table above):

→ age < 25 → missing
→ age ≥ 25 → present

Then it’s MAR
(Because missingness depends on another column: age)
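
The Case 2 check can be sketched in pandas by comparing the share of missing incomes across age groups. The data is the five-row table from this section. The age cut-off of 25 and the group labels are illustrative choices, not a standard test.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal", "Ayesha", "Umar"],
    "Age": [22, 23, 35, 40, 38],
    "Income": [np.nan, np.nan, 80_000, 90_000, 85_000],
})

# Label each person by age group (cut-off of 25 taken from the example).
age_group = np.where(df["Age"] < 25, "under_25", "25_and_over")

# Share of missing incomes inside each age group.
missing_rate = df["Income"].isna().groupby(age_group).mean()
print(missing_rate)
```

A clear gap between the groups points toward MAR rather than MCAR. With only five rows this is just an illustration, and a check like this can never rule out MNAR, because it only looks at columns you can see.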

3. MNAR (Missing Not At Random)

What it means:

Missingness depends on the very value that is missing.

Example:

  • People who earn a lot don’t want to reveal their income.
  • They leave it blank to hide it.
  • So income is missing BECAUSE of income itself.

Key Point:

This is dangerous: the reason for the missingness is hidden inside the data you don’t see.

Think: “I can’t see the value, and it’s missing because of itself. That’s tricky!”

Summary in One Line Each:

Type   Real Meaning                       Real Example
MCAR   Missing randomly                   Paper got wet, some data erased
MAR    Missing depends on other answers   Young people skipped income
MNAR   Missing because of its own value   Rich people hide their income

4. Why Is It Important to Detect the Type of Missingness?

Detecting the type of missing data is critical because:

Type   Handling Strategy                        Danger of Ignoring
MCAR   Safe to drop or impute                   Low
MAR    Needs imputation using other variables   Moderate
MNAR   Requires domain logic, hard to fix       High

If you don’t detect the type:

  • You may choose the wrong technique
  • Your model may become biased
  • Your accuracy or fairness may drop

For example:

  • Treating MNAR as MAR → adds bias
  • Assuming MCAR blindly → may destroy subtle patterns
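
A quick simulation shows why the type matters. This is only a sketch: the income distribution, the 30% missing rate, and the rule that the top earners hide their income are all made-up assumptions. It compares the average of the incomes you can still see under MCAR and under MNAR with the true average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes; the distribution and its parameters are invented.
income = rng.lognormal(mean=11.0, sigma=0.5, size=10_000)
true_mean = income.mean()

# MCAR: every value has the same 30% chance of going missing.
mcar_mask = rng.random(income.size) < 0.30
mcar_mean = income[~mcar_mask].mean()

# MNAR: the top 30% of earners hide their income.
mnar_mask = income > np.quantile(income, 0.70)
mnar_mean = income[~mnar_mask].mean()

print(f"True mean:           {true_mean:,.0f}")
print(f"Observed under MCAR: {mcar_mean:,.0f}")
print(f"Observed under MNAR: {mnar_mean:,.0f}")
```

Under MCAR the observed average stays close to the truth. Under MNAR it is pulled down, because the largest values are exactly the ones that vanished, and no amount of averaging over the visible rows brings them back.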

5. How to Detect the Type of Missingness?

Let’s now play detective. Here’s how we investigate:

MNAR Is Hard to Detect

Because it depends on the value you don’t see, you need:

  • Domain knowledge
  • Business logic

Tip: If nothing explains the missingness and the column is sensitive (like income or health), assume MNAR.

Final Thoughts

Type   Short Meaning               Detectable?                    How to Handle
MCAR   Random                      Yes, easy                      Drop or mean impute
MAR    Related to other columns    Yes, use logic or regression   KNN, MICE, etc.
MNAR   Missing because of itself   No, hard                       Domain knowledge, flag, careful modeling
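
As a rough preview of the first two rows of the table, here is a sketch in plain pandas. The six-row dataset is hypothetical. The group-wise median used for the MAR case is a simpler stand-in for model-based methods such as KNN or MICE, not an implementation of them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: incomes are missing only among the under-25 group.
df = pd.DataFrame({
    "Age": [22, 23, 24, 35, 40, 38],
    "Income": [np.nan, 30_000, np.nan, 80_000, 90_000, 85_000],
})

# MCAR-style fix: fill every blank with the overall mean.
mean_filled = df["Income"].fillna(df["Income"].mean())

# MAR-style fix: fill each blank using people from the same age group.
age_group = np.where(df["Age"] < 25, "under_25", "25_and_over")
group_median = df.groupby(age_group)["Income"].transform("median")
group_filled = df["Income"].fillna(group_median)

print(pd.DataFrame({"mean_filled": mean_filled, "group_filled": group_filled}))
```

When missingness depends on age, the overall mean assigns young respondents an income driven mostly by older ones, while the group-based fill uses the clue you actually have.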

What’s Next?

In the next blog, we’ll go hands-on with how to handle missing values using Python:

  • SimpleImputer (mean, median, mode)
  • MissingIndicator
  • KNN Imputer
  • MICE (Multivariate Imputation by Chained Equations)

Stay tuned. I’ll not only explain how to use them, but also when to use which one, like a real Data Scientist.
