Missing Values in Machine Learning: A Beginner’s Guide (With Real-Life Examples)

🧑‍🏫 1. What Are Missing Values?

Imagine you’re conducting a survey in your city to collect information like name, age, and income. Now, suppose some people didn’t fill out the “Income” section, or the data entry system missed it. That blank space is called a missing value.

In simple words: Missing values are empty spaces in your data where information is supposed to be, but it’s not available.

📊 In a dataset, it usually looks like this:

| Name | Age | Income |
|---|---|---|
| Ali | 25 | NaN (missing) |
| Sara | 30 | 80,000 |
| Bilal | 22 | NaN (missing) |

🤔 2. Why Should We Handle Missing Values in Data Science & ML?

Great question! Let’s see why we care about these empty spaces:

🚫 1. Machine Learning Hates Missing Data

Many implementations of ML algorithms like Logistic Regression, SVM, and Random Forest will raise an error if they see NaN. They say: “I need clean input, not blanks!”

⚠️ 2. Ignoring Missing Values Can Mislead the Model

If we don’t handle missing values properly, the model learns from incomplete information and its predictions can be misleading.

📉 3. You Lose Valuable Data

If 25% of your rows have missing values and you just delete them all, you’re also throwing away the good information in the other columns of those rows.

💡 4. Sometimes “Missing” Means Something Important

For example, if someone refuses to share their income, that might mean they’re very rich or hiding something. You can create a feature like income_missing = 1 to help the model.

🔍 3. Types of Missing Values (With Real-Life Examples)

There are 3 types of missing data, based on why the value is missing. Let’s understand them with an easy-to-remember story:

✅ 1. MCAR (Missing Completely at Random)

🔍 What it means: The values went missing by accident, not because of the person or their income.

📘 Example: The survey paper got wet and some answers were erased.

📌 Key Point: The missingness is random with no pattern, so you can usually handle it by dropping or filling.
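To make “dropping or filling” concrete, here is a minimal sketch using pandas and NumPy (my choice of libraries, since the post has not named any yet). It rebuilds the small Ali/Sara/Bilal table from section 1, counts the blanks, adds the income_missing flag mentioned earlier, and then shows both options side by side. The variable names dropped and filled are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy survey data from section 1; NaN marks a missing income.
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal"],
    "Age": [25, 30, 22],
    "Income": [np.nan, 80_000, np.nan],
})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Record where income was missing BEFORE changing anything,
# so the model can still "see" that the value was absent.
df["income_missing"] = df["Income"].isna().astype(int)

# Option 1: drop the rows that have no income.
dropped = df.dropna(subset=["Income"])

# Option 2: fill missing incomes with the mean of the observed incomes.
filled = df.assign(Income=df["Income"].fillna(df["Income"].mean()))

print(missing_counts)
print(dropped)
print(filled)
```

Notice that dropping leaves only one of the three rows, which is exactly the “you lose valuable data” problem from section 2. Mean filling keeps every row, but it is only a reasonable default when the missingness really is random.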
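✅ 2. MAR (Missing At Random)

🔍 What it means: Missing depends on other answers (another column), not on the missing value itself.

📘 Example: Young people skipped the income question.

| Name | Age | Income |
|---|---|---|
| Ali | 22 | NaN |
| Sara | 23 | NaN |
| Bilal | 35 | 80,000 |
| Ayesha | 40 | 90,000 |
| Umar | 38 | 85,000 |

💭 Case 1: If the missing incomes (NaNs) are at random ages
→ like age 22 and age 38 both missing
→ and there’s no pattern
👉 Then it’s MCAR (maybe erased by water, or a system glitch; no pattern)

💭 Case 2: If the missing incomes are only in younger people
→ age < 25 → missing
→ age > 25 → present
👉 Then it’s MAR (because missingness depends on another column: age)

✅ 3. MNAR (Missing Not At Random)

🔍 What it means: Missing depends on the same value that is missing.

📘 Example: Rich people hide their income.

📌 Key Point: This is dangerous, because the reason for the missingness is inside the data you don’t see.

🧠 Think: “I can’t see the value, and it’s missing because of itself. That’s tricky!”

🧠 Summary in One Line Each:

| Type | Real Meaning | Real Example |
|---|---|---|
| MCAR | Missing randomly | Paper got wet, some data erased |
| MAR | Missing depends on other answers | Young people skipped income |
| MNAR | Missing because of its own value | Rich people hide their income |

❓ 4. Why Is It Important to Detect the Type of Missingness?

Detecting the type of missing data is critical because each type calls for a different handling strategy:

| Type | Handling Strategy | Danger of Ignoring |
|---|---|---|
| MCAR | Safe to drop or impute | Low |
| MAR | Needs imputation using other variables | Moderate |
| MNAR | Requires domain logic, hard to fix | High Risk |

If you don’t detect the type, you can end up applying a strategy that doesn’t match the data.

🔍 For example: dropping rows as if the data were MCAR, when it is actually MNAR, removes exactly the people whose values you most needed to see.

🧪 5. How to Detect the Type of Missingness?

Let’s now play detective 🕵️. The basic check is the one from the age example above: look at whether the missing values follow a pattern in another column. No pattern points to MCAR; a pattern tied to another column points to MAR.

❗ MNAR is Hard to Detect

Because it depends on the value you don’t see, the data alone can’t confirm it. You need domain knowledge about why people would leave that value blank.

🧠 Tip: If nothing explains the missingness and the column is sensitive (like income or health), assume MNAR.

🧠 Final Thoughts

| Type | Short Meaning | Detectable? | How to Handle |
|---|---|---|---|
| MCAR | Random | ✅ Easy | Drop or mean impute |
| MAR | Related to other columns | ✅ Use logic or regression | KNN, MICE, etc. |
| MNAR | Missing because of itself | ❌ Hard | Domain knowledge, flag, careful modeling |

🔜 What’s Next?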
In the next blog, we’ll go hands-on with how to handle missing values using Python. Stay tuned: I’ll not only explain how to use each technique, but also when to use which one, like a real Data Scientist.
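Until then, here is a small taste of the detective work from section 5. This is a minimal sketch, assuming pandas and NumPy, that reuses the five-person age/income table from the types section and checks whether missing incomes cluster in one age group. The column names income_missing and age_group, and the cut-off at 25, are illustrative choices taken from the “age < 25” case in this post, not a general rule.

```python
import numpy as np
import pandas as pd

# The age/income table from the types section; NaN marks a missing income.
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Bilal", "Ayesha", "Umar"],
    "Age": [22, 23, 35, 40, 38],
    "Income": [np.nan, np.nan, 80_000, 90_000, 85_000],
})

# Flag the rows where income is missing.
df["income_missing"] = df["Income"].isna()

# Split people into two age groups (hypothetical cut-off at 25).
df["age_group"] = np.where(df["Age"] < 25, "under_25", "25_plus")

# Share of missing incomes inside each age group.
# Similar shares  -> no visible pattern (consistent with MCAR).
# Very different  -> missingness depends on age (points toward MAR).
missing_rate = df.groupby("age_group")["income_missing"].mean()

print(missing_rate)
```

In this toy table every missing income sits in the under-25 group, so the pattern points toward MAR. Keep in mind that a check like this can only reveal patterns in columns you can see, so it cannot rule out MNAR on its own.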