semsynth.missingness

Missingness modeling utilities for backend generators.

Functions

apply_missingness_to_outputs(*, run_dir, ...)

Apply missingness to backend artefacts and refresh derived metrics.

dataclass([cls, init, repr, eq, order, ...])

Add dunder methods based on the fields defined in the class.

field(*[, default, default_factory, init, ...])

Return an object to identify dataclass fields.

fit_missingness_model(df, *[, random_state])

Fit a dataframe-level missingness model with logging safeguards.

per_variable_distances(real_df, synth_df, ...)

summarize_distance_metrics(distances)

Compute aggregate statistics for per-variable distance metrics.

summarize_missingness_model(missingness_model)

Build a reporting-friendly summary of the missingness model.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

ColumnMissingnessModel(col[, p_missing_, ...])

Estimate conditional missingness for a single column.

ColumnTransformer(transformers, *[, ...])

Applies transformers to columns of an array or pandas DataFrame.

DataFrameMissingnessModel(random_state, ...)

Learn and apply missingness patterns across dataframe columns.

LogisticRegression([penalty, C, l1_ratio, ...])

Logistic Regression (aka logit, MaxEnt) classifier.

MissingnessWrappedGenerator(base_generator, ...)

Wrap a base generator to inject realistic missing values.

OneHotEncoder(*[, categories, drop, ...])

Encode categorical features as a one-hot numeric array.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

Pipeline(steps, *[, transform_input, ...])

A sequence of data transformers with an optional final predictor.

SimpleImputer(*[, missing_values, strategy, ...])

Univariate imputer for completing missing values with simple strategies.

make_column_selector([pattern, ...])

Create a callable to select columns to be used with ColumnTransformer.

class semsynth.missingness.ColumnMissingnessModel(col: str, p_missing_: float = 0.0, pipeline_: Pipeline | None = None)

Bases: object

Estimate conditional missingness for a single column.

col: str

fit(df: DataFrame) -> ColumnMissingnessModel

Fit the column-level missingness model.

Parameters:

df – Real dataframe that may contain missing values.

Returns:

Self after fitting conditional probability estimators.
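
One way to picture a column-level fit of this kind is a classifier that predicts whether `col` is missing from the remaining columns. The sketch below is an illustration only, not the library's implementation. The helper name `fit_missingness_classifier`, the imputation strategies, and the toy data are all assumptions; only the scikit-learn classes are taken from the module summary above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def fit_missingness_classifier(df: pd.DataFrame, col: str) -> Pipeline:
    """Hypothetical helper: model P(col is missing | other columns)."""
    y = df[col].isna().astype(int)   # 1 where the target column is missing
    X = df.drop(columns=[col])       # predictors are the remaining columns
    numeric = SimpleImputer(strategy="median")
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ])
    pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
    return pipe.fit(X, y)


# Toy data (assumed): income is missing more often for older rows.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(50, 10, n)
group = pd.Series(rng.choice(["a", "b"], n), dtype=object)
income = rng.normal(100, 20, n)
income[rng.random(n) < np.where(age > 50, 0.6, 0.1)] = np.nan
df = pd.DataFrame({"age": age, "group": group, "income": income})

pipe = fit_missingness_classifier(df, "income")
# Column 1 of predict_proba is the estimated probability of "missing".
proba = pipe.predict_proba(df.drop(columns=["income"]))[:, 1]
```

A classifier like this needs both missing and observed rows for the column. A column with no variation in missingness would have to fall back to a constant rate, which is plausibly the role of `p_missing_`, although the source does not say so.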

p_missing_: float = 0.0

pipeline_: Pipeline | None = None

sample_mask(df: DataFrame, rng: Generator) -> Series

Sample a boolean mask indicating where the column should be missing.

Parameters:
  • df – Synthetic dataframe prior to applying missingness.

  • rng – Random number generator used for reproducibility.

Returns:

Boolean series indexed like df with True for missing values.
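
The mask contract described above (a boolean series aligned with `df`, drawn from a caller-supplied generator) can be sketched with a plain Bernoulli draw. This is a minimal stand-in that assumes a constant missingness probability. `sample_mask_sketch` is a hypothetical name, and the real method may condition on other columns.

```python
import numpy as np
import pandas as pd


def sample_mask_sketch(df: pd.DataFrame, p_missing: float,
                       rng: np.random.Generator) -> pd.Series:
    """Hypothetical stand-in: one independent Bernoulli(p_missing) draw per row."""
    draws = rng.random(len(df))  # uniform values in [0, 1)
    return pd.Series(draws < p_missing, index=df.index)


# Non-default index to show that the mask stays aligned with the input.
synth = pd.DataFrame({"x": range(1000)}, index=range(100, 1100))
mask = sample_mask_sketch(synth, 0.25, np.random.default_rng(42))
same = sample_mask_sketch(synth, 0.25, np.random.default_rng(42))
```

Passing the generator in, rather than creating one inside the function, lets the caller decide how draws are shared across columns and makes the result repeatable for a fixed seed.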

class semsynth.missingness.DataFrameMissingnessModel(random_state: int | None = None, models_: Dict[str, ColumnMissingnessModel] = <factory>)

Bases: object

Learn and apply missingness patterns across dataframe columns.

apply(df: DataFrame) -> DataFrame

Apply learned missingness patterns to a synthetic dataframe.

Parameters:

df – Synthetic dataframe before introducing missing values.

Returns:

Copy of df with missing values injected according to the fitted column models.

fit(df: DataFrame) -> DataFrameMissingnessModel

Fit per-column missingness models on the provided dataframe.

Parameters:

df – Real dataframe used to learn missingness structure.

Returns:

Self with fitted column models.

models_: Dict[str, ColumnMissingnessModel]

random_state: int | None = None

class semsynth.missingness.MissingnessWrappedGenerator(base_generator: Callable[[...], DataFrame], missingness_model: DataFrameMissingnessModel)

Bases: object

Wrap a base generator to inject realistic missing values.

classmethod from_real_data(base_generator: Callable[[...], DataFrame], real_df: DataFrame, random_state: int | None = None) -> MissingnessWrappedGenerator

Create a wrapper by fitting missingness to real data.

Parameters:
  • base_generator – Callable producing synthetic samples.

  • real_df – Real dataframe used to estimate missingness.

  • random_state – Optional RNG seed for reproducibility.

Returns:

Configured MissingnessWrappedGenerator instance.

sample(n: int, **kwargs) -> DataFrame

Generate n samples with realistic missing values applied.
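
The wrapper pattern described here (generate complete rows, then blank out cells) can be sketched as follows. This is not the library's class. `WrappedGeneratorSketch` and `toy_generator` are hypothetical, and the sketch assumes that only per-column marginal missing rates are learned, which is simpler than a conditional model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

import numpy as np
import pandas as pd


@dataclass
class WrappedGeneratorSketch:
    """Hypothetical stand-in for the wrapper pattern, using marginal rates only."""

    base_generator: Callable[..., pd.DataFrame]
    rates: Dict[str, float]
    random_state: Optional[int] = None

    @classmethod
    def from_real_data(cls, base_generator, real_df, random_state=None):
        # Assumption: learn only the fraction of missing cells per column.
        rates = real_df.isna().mean().to_dict()
        return cls(base_generator, rates, random_state)

    def sample(self, n, **kwargs):
        synth = self.base_generator(n, **kwargs)  # complete synthetic rows
        rng = np.random.default_rng(self.random_state)
        out = synth.copy()                        # leave the generator output intact
        for col, p in self.rates.items():
            if col in out.columns:
                # Replace cells with NaN where the Bernoulli draw succeeds.
                out[col] = out[col].mask(rng.random(len(out)) < p)
        return out


def toy_generator(n, **kwargs):
    """Hypothetical base generator that never emits missing values."""
    return pd.DataFrame({"age": np.arange(n, dtype=float), "city": ["x"] * n})


real = pd.DataFrame({"age": [30.0, np.nan, 41.0, np.nan],
                     "city": ["a", "b", "c", "d"]})
gen = WrappedGeneratorSketch.from_real_data(toy_generator, real, random_state=0)
sample = gen.sample(500)
```

Because the sketch reseeds its generator on every call, repeated calls with the same `n` return identical frames. Whether the real class behaves that way is not stated in the source.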

semsynth.missingness.apply_missingness_to_outputs(*, run_dir: Path, synth_df: DataFrame, missingness_model: DataFrameMissingnessModel, real_df: DataFrame, disc_cols: Iterable[str], cont_cols: Iterable[str], backend_name: str) -> Tuple[DataFrame, bool]

Apply missingness to backend artefacts and refresh derived metrics.

Parameters:
  • run_dir – Directory containing backend outputs.

  • synth_df – Synthetic dataframe prior to missingness injection.

  • missingness_model – Learned missingness model to apply.

  • real_df – Real dataframe without missing values for metric refresh.

  • disc_cols – Iterable of discrete column names.

  • cont_cols – Iterable of continuous column names.

  • backend_name – Name of the backend used for logging context.

Returns:

Tuple of the updated synthetic dataframe and a boolean indicating whether missingness was successfully applied.

semsynth.missingness.fit_missingness_model(df: DataFrame, *, random_state: int | None = None) -> DataFrameMissingnessModel | None

Fit a dataframe-level missingness model with logging safeguards.

Parameters:
  • df – Dataframe used to estimate missingness behaviour.

  • random_state – Optional seed for reproducibility.

Returns:

Fitted DataFrameMissingnessModel or None if fitting failed.
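
Since the return type is optional, callers need to handle `None`. A "fit or return None" contract of this shape is commonly written as a guarded call that logs the failure. The sketch below shows that general pattern under assumed names (`fit_or_none`, `failing_fit`). It is not the function's actual body.

```python
import logging
from typing import Callable, Optional, TypeVar

import pandas as pd

logger = logging.getLogger(__name__)
T = TypeVar("T")


def fit_or_none(fit: Callable[[pd.DataFrame], T], df: pd.DataFrame) -> Optional[T]:
    """Hypothetical guard: run `fit`, log any failure, and return None instead."""
    try:
        return fit(df)
    except Exception:
        # logger.exception records the traceback at ERROR level.
        logger.exception("Missingness fitting failed; continuing without a model")
        return None


def failing_fit(df: pd.DataFrame):
    """Hypothetical fitter that always fails, to exercise the guard."""
    raise ValueError("no usable columns")


ok = fit_or_none(lambda d: d.isna().mean().to_dict(),
                 pd.DataFrame({"a": [1.0, None]}))
bad = fit_or_none(failing_fit, pd.DataFrame())

# Callers branch on None before using the model.
status = "skipped" if bad is None else "fitted"
```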

semsynth.missingness.summarize_missingness_model(missingness_model: DataFrameMissingnessModel | None) -> Dict[str, Any] | None

Build a reporting-friendly summary of the missingness model.

Parameters:

missingness_model – Optional fitted missingness model from preprocessing.

Returns:

Mapping describing fitted column-level missingness rates or None if no model was provided.
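
The source does not specify the exact keys of the returned mapping. As a rough illustration of a reporting-friendly summary that passes `None` through, the sketch below works on a plain dict of per-column rates. The function name and every key in the result are assumptions.

```python
from typing import Any, Dict, Optional


def summarize_rates_sketch(
    rates: Optional[Dict[str, float]],
) -> Optional[Dict[str, Any]]:
    """Hypothetical summary of per-column missing rates; None passes through."""
    if rates is None:
        return None
    return {
        "n_columns": len(rates),
        "columns_with_missingness": sorted(c for c, p in rates.items() if p > 0),
        # Plain rounded floats keep the mapping JSON-serializable.
        "missing_rate": {c: round(float(p), 4) for c, p in rates.items()},
    }


summary = summarize_rates_sketch({"age": 0.123456, "city": 0.0})
empty = summarize_rates_sketch(None)
```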