Missingness handling

SemSynth provides utilities for learning realistic missingness patterns from real datasets and applying them to synthetic samples. The design pairs per-column logistic models with a dataframe-level wrapper so generators can emit data that mirrors conditional dropout rates observed in the source data. The chart below highlights which Heart Disease variables drop out most often before we fit the wrapper.

# Plot observed missingness to see which columns drive the model.
import pandas as pd
import matplotlib.pyplot as plt

heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
na_rates = heart.isna().mean().sort_values(ascending=False)
ax = na_rates.plot.bar(figsize=(8, 3), color="steelblue")
ax.set_ylabel("missing rate")
ax.set_title("Heart Disease missingness by column")
plt.tight_layout()

_images/7aa84fe60fcc87d44fda21e52bfabce16a6a318347dbb61755b1848e11e87090.png

Column-level modeling

ColumnMissingnessModel estimates the probability that a specific column is missing by fitting a logistic regression on one-hot encoded features of all other columns. It tracks the marginal missingness rate and skips modeling when a column is always present or always absent, ensuring stable behavior on edge cases.
The model samples boolean masks by scaling predicted probabilities back to the observed missingness rate when needed, preventing over- or under-estimation of missing cells during synthesis.

Dataframe-level application

DataFrameMissingnessModel fits a ColumnMissingnessModel for each column in the real dataset and stores them in a mapping. When applied to a synthetic dataframe, it iterates over the learned masks to inject NaN values column by column, respecting the fitted conditional patterns.
MissingnessWrappedGenerator wraps any base generator callable and first fits missingness on the real data. Subsequent calls to .sample(n) produce synthetic rows with the learned missingness applied, enabling drop-in realism without modifying the underlying generator implementation.

# Fit and apply the wrapper to a small sample.
from semsynth.missingness import DataFrameMissingnessModel

model = DataFrameMissingnessModel(random_state=7).fit(heart)
sampled = model.apply(heart.head(20).copy())
sampled.isna().mean()

/Users/benno/Documents/ud/projects/generative-metadata/semsynth/.venv/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:406: ConvergenceWarning: lbfgs failed to converge after 200 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Users/benno/Documents/ud/projects/generative-metadata/semsynth/.venv/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:406: ConvergenceWarning: lbfgs failed to converge after 200 iteration(s) (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

age         0.00
sex         0.00
cp          0.00
trestbps    0.00
chol        0.00
fbs         0.00
restecg     0.00
thalach     0.00
exang       0.00
oldpeak     0.00
slope       0.00
ca          0.00
thal        0.05
num         0.00
dtype: float64

References

Logistic regression missingness modeling with scikit-learn: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
One-hot encoding for categorical features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html