A diagnostic & treatment suite for missing values in tabular machine learning datasets.
Missing Data Doctor helps you:
- Quantify how much data is missing and where
- Visualize missingness patterns across features and rows
- Impute missing values using multiple strategies
- Evaluate how each imputation choice affects model performance
- Report everything in a portable, self-contained HTML report
It is designed as a practical data-science tool you can drop into real workflows or showcase as a professional project on GitHub.
```
missing-data-doctor/
├── src/
│   ├── cli.py            # Main CLI entrypoint
│   ├── loaders.py        # CSV loading & schema helpers
│   ├── profiling.py      # Missingness summary and stats
│   ├── imputers.py       # Imputation strategies (simple, KNN, iterative)
│   ├── impact.py         # Downstream model impact analysis
│   ├── viz.py            # Plotting utilities for missing data
│   └── report.py         # Jinja2 HTML report generation
│
├── templates/
│   └── report.html       # HTML report template (embeds plots & tables)
│
├── data/
│   └── example_with_missing.csv   # Example dataset with missing values
│
├── outputs/
│   └── runs/
│       └── demo/         # Example run (created after you run the demo)
│           ├── missing_data_doctor.html
│           └── plots/
│               ├── missing_bar.png
│               └── missing_heatmap.png
│
├── reports/              # Optional alternative report location (if you use --report)
└── README.md
```
Flat layout: all Python modules live directly under `src/` (no package folders, no `mdd` package).
Setup (Windows):

```bat
cd C:\Users\Amir\Desktop\missing-data-doctor
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
```

If you don't have `requirements.txt`, install manually:

```bat
pip install pandas numpy scikit-learn matplotlib seaborn jinja2
```

Run the demo with the command shown below. This command:
- Loads `data/example_with_missing.csv`
- Profiles missingness
- Generates plots
- Runs 3 imputation strategies
- Evaluates a model for each
- Writes a self-contained run folder with plots + JSON + HTML
```bat
python src\cli.py ^
  --data data\example_with_missing.csv ^
  --target target ^
  --task classification ^
  --out_dir outputs\runs\demo
```

You'll get:
```
outputs/runs/demo/
├── missing_summary.csv
├── impact.json                 # model metrics per imputation strategy (if target provided)
├── summary.json                # combined summary (missingness + impact)
├── plots/
│   ├── missing_bar.png
│   └── missing_heatmap.png
└── missing_data_doctor.html
```
start "" outputs\runs\demo\missing_data_doctor.htmldata/example_with_missing.csv is a small synthetic dataset:
| age | income | visits | score | target |
|-----|--------|--------|-------|--------|
| 25  | 30000  | 5      | 620   | 0      |
| 40  |        | 10     | 680   | 1      |
| 35  | 45000  |        | 640   | 0      |
|     | 70000  | 12     | 720   | 1      |
| 28  | 34000  | 6      |       | 0      |
| 46  | 66000  | 11     | 700   | 1      |
| 31  |        | 7      | 630   | 0      |
| 54  | 75000  | 13     | 730   | 1      |
| 29  | 35000  |        | 615   | 0      |
| 43  | 59000  | 9      | 690   | 1      |
Key properties:

- 10 rows with 5 columns: `age`, `income`, `visits`, `score`, `target`
- Missing values:
  - `income`: 2 missing → 20%
  - `visits`: 2 missing → 20%
  - `age`: 1 missing → 10%
  - `score`: 1 missing → 10%
  - `target`: no missing
- `target` is a binary label: `0`/`1` (classification problem)
This toy dataset is intentionally small so you can easily interpret the plots and metrics created by Missing Data Doctor.
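If you want to double-check these counts yourself, a quick pandas snippet (independent of the CLI) does the job:

```python
import pandas as pd

# Recompute the per-column missing counts and percentages quoted above.
df = pd.read_csv("data/example_with_missing.csv")

summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(summary)
```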
After running the demo, the key figures live here:
```
outputs/runs/demo/plots/
├── missing_bar.png
└── missing_heatmap.png
```
The bar chart (`missing_bar.png`) shows, for each column, the proportion of missing entries.
In the demo dataset, you should see:
- `income` and `visits` with the highest bars (~20% missing)
- `age` and `score` with shorter bars (~10% missing)
- `target` at 0% missing
How to read it:

- High-missing features (`income`, `visits`)
  - These may require more careful imputation (KNN or iterative)
  - If they are important predictors, poor imputation can heavily hurt model performance
  - In extreme real-world cases (>50% missing), you might even consider dropping the feature (see the snippet after this list)
- Moderate-missing features (`age`, `score`)
  - Simple imputation (median/mean) may be adequate
  - But you should check whether missingness is random or systematic (e.g., young users not reporting income)
- 0% missing label (`target`)
  - This is ideal: you don't want missing labels in supervised learning
  - If the label had missing values, you'd have to exclude those rows or treat it as a semi-supervised problem
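As a rough illustration of the ">50% missing" rule of thumb (a sketch only; the CLI does not drop columns for you):

```python
import pandas as pd

df = pd.read_csv("data/example_with_missing.csv")

# Flag columns whose missing fraction exceeds a chosen threshold (0.5 here).
threshold = 0.5
too_sparse = df.columns[df.isna().mean() > threshold]
print("Candidates for dropping:", list(too_sparse))  # empty for the demo dataset

# Keep the rest for imputation rather than discarding them.
df_kept = df.drop(columns=too_sparse)
```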
This plot is your first triage step: it answers
“Where is my dataset bleeding the most?”
The heatmap (`missing_heatmap.png`) displays a row × column matrix of missing values:
- Each row = one sample (up to a capped number of rows for large datasets)
- Each column = one feature
- Colored cell = value is missing
- Blank cell = value is present
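The matrix itself boils down to plotting `df.isna()`; a minimal seaborn sketch of the idea (not necessarily the exact code in `viz.py`) looks like this:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/example_with_missing.csv")

# Cap the number of plotted rows so the figure stays readable on large datasets.
sample = df.head(500)

plt.figure(figsize=(8, 4))
# df.isna() is a boolean row x column matrix: True (colored) where a value is missing.
sns.heatmap(sample.isna(), cbar=False)
plt.title("Missingness matrix (sampled rows)")
plt.tight_layout()
plt.savefig("missing_heatmap.png", dpi=150)
```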
In the demo dataset, you should notice:
- For some rows, only `income` is missing
- For some rows, only `visits` is missing
- For one row, `age` is missing but other features are present
- For one row, `score` is missing while others are filled
- There is no obvious block pattern (like full rows of missing, or a whole group of columns consistently missing together)
This plot provides intuition about missingness mechanisms:
- MCAR (Missing Completely At Random)
  - Missingness appears scattered with no obvious pattern → the demo dataset roughly looks like this
  - In such cases, simple imputation strategies are usually less risky
- MAR (Missing At Random)
  - You might see patterns where missing values in one column align with values in another (e.g., low income → more missing `visits`)
  - This is a signal to investigate feature interactions before imputing
- MNAR (Missing Not At Random)
  - Missingness in a variable is strongly related to its own (unobserved) values
  - Harder to see visually; you'd need more careful statistical tests and domain knowledge
In practice, this matrix helps answer:
“Do I have a random sprinkle of missing values, or is something structured (and dangerous) going on?”
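One cheap way to probe for MAR-style structure is to compare observed columns between rows where a feature is missing and rows where it is present; a sketch (not something the CLI computes for you):

```python
import pandas as pd

df = pd.read_csv("data/example_with_missing.csv")

# Does `income` look systematically different when `visits` is missing?
# A large gap hints that the missingness of `visits` depends on other observed
# features (MAR) rather than being completely random (MCAR).
visits_missing = df["visits"].isna()
print(df.groupby(visits_missing)["income"].mean())
```

On ten rows this is only illustrative; on real data you would back it up with statistical tests and domain knowledge.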
Beyond visualization, Missing Data Doctor evaluates how different imputations affect model performance.
Currently, the CLI runs three strategies:
"simple"→SimpleImputer(median / most frequent)"knn"→KNNImputer(nearest neighbors on numeric features)"iterative"→IterativeImputer(MICE-like multi-feature imputation)
Given a target and task (here: `target`, classification), the tool:

- Imputes missing values using each strategy
- Trains a `RandomForestClassifier` on each imputed dataset
- Evaluates metrics on a held-out test set
- Stores the results in `outputs/runs/demo/impact.json`
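Conceptually, the impact analysis reduces to a loop like the one below. This is a simplified sketch of the idea rather than the exact code in `impact.py`; the model settings, split parameters, and file location are assumptions:

```python
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/example_with_missing.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# All demo features are numeric, so median suffices for the "simple" strategy.
imputers = {
    "simple": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=3),
    "iterative": IterativeImputer(random_state=42),
}

impact = {}
for name, imputer in imputers.items():
    # Fit the imputer on the training split only, then transform both splits,
    # so no test-set information leaks into the imputation.
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train_imp, y_train)

    proba = model.predict_proba(X_test_imp)[:, 1]
    preds = model.predict(X_test_imp)
    impact[name] = {
        "AUC": round(roc_auc_score(y_test, proba), 3),
        "Accuracy": round(accuracy_score(y_test, preds), 3),
    }

# In the real run this file lives under the chosen out_dir.
with open("impact.json", "w") as f:
    json.dump(impact, f, indent=2)
```

Keep in mind the demo dataset is tiny, so the held-out metrics will be noisy; the point is the workflow, not the exact numbers.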
Example structure (schema, not actual values):

```json
{
  "simple": {
    "AUC": 0.85,
    "Accuracy": 0.80
  },
  "knn": {
    "AUC": 0.87,
    "Accuracy": 0.82
  },
  "iterative": {
    "AUC": 0.86,
    "Accuracy": 0.81
  }
}
```

The goal is not just to "fill NA values" but to quantify which imputation actually leads to a better model.
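Once `impact.json` exists, picking the winning strategy is a few lines (assuming the schema shown above):

```python
import json

with open("outputs/runs/demo/impact.json") as f:
    impact = json.load(f)

# Choose the imputation strategy with the best held-out AUC.
best = max(impact, key=lambda name: impact[name]["AUC"])
print(f"Best strategy: {best} (AUC={impact[best]['AUC']})")
```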
The report template at `templates/report.html` is rendered with a context including:

- `missing_summary`: list of columns with missing counts & percentages
- `missing_bar_path`: relative path to the bar chart, `plots/missing_bar.png`
- `missing_heatmap_path`: relative path to the heatmap, `plots/missing_heatmap.png`
- `impact`: metrics per imputation method (if a target is provided)
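Rendering then comes down to a standard Jinja2 call along these lines (a sketch; the exact shape of the `missing_summary` entries is an assumption, while the other context keys are the ones listed above):

```python
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("report.html")

html = template.render(
    # The per-column dict keys here are illustrative, not the tool's exact schema.
    missing_summary=[{"column": "income", "n_missing": 2, "pct_missing": 20.0}],
    missing_bar_path="plots/missing_bar.png",
    missing_heatmap_path="plots/missing_heatmap.png",
    impact={"simple": {"AUC": 0.85, "Accuracy": 0.80}},
)

Path("outputs/runs/demo/missing_data_doctor.html").write_text(html, encoding="utf-8")
```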
Inside the template, the figures are embedded like:

```html
<h3>Missingness per Feature</h3>
<img src="{{ missing_bar_path }}">

<h3>Missingness Matrix (Sampled Rows)</h3>
<img src="{{ missing_heatmap_path }}">
```

Because the report is saved inside the same directory as `plots/` (Option A), the relative paths

```
plots/missing_bar.png
plots/missing_heatmap.png
```

resolve correctly.
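One way to keep those paths relative, and therefore portable, is to compute them against the report's own directory, roughly like this (an assumed helper, not necessarily what `report.py` does):

```python
from pathlib import Path

out_dir = Path("outputs/runs/demo")
report_path = out_dir / "missing_data_doctor.html"
bar_path = out_dir / "plots" / "missing_bar.png"

# Relative to the report's folder, so the link survives moving or zipping the folder.
missing_bar_path = bar_path.relative_to(report_path.parent).as_posix()
print(missing_bar_path)  # -> plots/missing_bar.png
```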
This makes every `outputs/runs/<run_name>/` folder a self-contained artifact:
- You can zip it
- Send it to someone
- They can open the HTML and see plots without editing anything
Core CLI:

```bat
python src\cli.py --data <path> --target <column> --task <classification|regression> --out_dir <run_folder>
```

Optional HTML report name (if you want a custom path instead of the default `missing_data_doctor.html`):
```bat
python src\cli.py ^
  --data data\my_data.csv ^
  --target label ^
  --task classification ^
  --out_dir outputs\runs\experiment_01 ^
  --report outputs\runs\experiment_01\experiment_01_report.html
```

If the plots don't show up in the HTML:

- Make sure you did not put the report into a different folder than `out_dir`.
- With Option A (recommended), the report is inside `out_dir` and the images live in `out_dir/plots/`.
- Paths should be:

```html
<img src="plots/missing_bar.png">
<img src="plots/missing_heatmap.png">
```

Install dependencies:
```bat
pip install pandas numpy scikit-learn matplotlib seaborn jinja2
```

If `.venv` is broken, delete it and recreate:

```bat
rmdir /S /Q .venv
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
```

You can extend Missing Data Doctor with:
- More imputers:
  - Median vs. mean comparison
  - Domain-aware imputers (e.g., 0 for missing count-based features)
- Missingness / feature interaction analysis:
  - Correlation between "is_missing(feature)" and numeric features
  - Logistic models predicting missingness as a function of other columns (see the sketch after this list)
- Fairness & subgroup analysis:
  - Does missingness disproportionately affect certain subgroups?
- Time-aware gap analysis (for time series):
  - Length and location of consecutive missing segments
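As a starting point for the missingness / feature interaction idea, one could fit a logistic model that predicts an is-missing indicator from the other columns (a rough sketch, not implemented in the CLI):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data/example_with_missing.csv")

# Binary indicator: is `income` missing in this row?
y = df["income"].isna().astype(int)
X = df.drop(columns=["income"])

# The predictors contain NaNs themselves, so impute them before fitting.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X, y)

# Coefficients far from zero suggest that the missingness of `income`
# is related to other observed columns (a MAR-style signal).
coefs = model.named_steps["logisticregression"].coef_[0]
print(dict(zip(X.columns, coefs.round(3))))
```

On ten rows this is purely illustrative; with real data you would standardize features and check significance before reading much into the coefficients.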