semsynth.missingness
Missingness modeling utilities for backend generators.
Functions
- apply_missingness_to_outputs: Apply missingness to backend artefacts and refresh derived metrics.
- Add dunder methods based on the fields defined in the class.
- Return an object to identify dataclass fields.
- fit_missingness_model: Fit a dataframe-level missingness model with logging safeguards.
- Compute aggregate statistics for per-variable distance metrics.
- summarize_missingness_model: Build a reporting-friendly summary of the missingness model.
Classes
- Special type indicating an unconstrained type.
- ColumnMissingnessModel: Estimate conditional missingness for a single column.
- Applies transformers to columns of an array or pandas DataFrame.
- DataFrameMissingnessModel: Learn and apply missingness patterns across dataframe columns.
- Logistic Regression (aka logit, MaxEnt) classifier.
- MissingnessWrappedGenerator: Wrap a base generator to inject realistic missing values.
- Encode categorical features as a one-hot numeric array.
- PurePath subclass that can make system calls.
- A sequence of data transformers with an optional final predictor.
- Univariate imputer for completing missing values with simple strategies.
- Create a callable to select columns to be used with
- class semsynth.missingness.ColumnMissingnessModel(col: str, p_missing_: float = 0.0, pipeline_: Pipeline | None = None)
Bases: object
Estimate conditional missingness for a single column.
- col: str
- fit(df: DataFrame) → ColumnMissingnessModel
Fit the column-level missingness model.
- Parameters:
df – Real dataframe that may contain missing values.
- Returns:
Self after fitting conditional probability estimators.
- p_missing_: float = 0.0
- pipeline_: Pipeline | None = None
- sample_mask(df: DataFrame, rng: Generator) → Series
Sample a boolean mask indicating where the column should be missing.
- Parameters:
df – Synthetic dataframe prior to applying missingness.
rng – Random number generator used for reproducibility.
- Returns:
Boolean series indexed like df, with True for missing values.
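As a rough illustration of the kind of mask sample_mask returns, the sketch below draws one boolean flag per row from a single marginal missing rate. It assumes the simplest case, where only a rate such as p_missing_ is used and no conditional pipeline is involved; the helper name sample_marginal_mask is hypothetical and is not part of semsynth.

```python
import numpy as np
import pandas as pd


def sample_marginal_mask(df: pd.DataFrame, p_missing: float,
                         rng: np.random.Generator) -> pd.Series:
    """Hypothetical helper: draw one Bernoulli(p_missing) flag per row."""
    # rng.random returns uniform floats in [0, 1); comparing against the
    # rate yields True with probability p_missing.
    flags = rng.random(len(df)) < p_missing
    # Reuse the dataframe index so the mask aligns with df row by row.
    return pd.Series(flags, index=df.index, dtype=bool)


df = pd.DataFrame({"age": [31, 45, 27, 60]}, index=["a", "b", "c", "d"])
mask = sample_marginal_mask(df, p_missing=0.25, rng=np.random.default_rng(0))
masked_age = df["age"].mask(mask)  # positions where mask is True become NaN
```

Passing the generator in explicitly, as the documented signature does, keeps the draw reproducible for a given seed.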
- class semsynth.missingness.DataFrameMissingnessModel(random_state: int | None = None, models_: Dict[str, ColumnMissingnessModel] = <factory>)
Bases: object
Learn and apply missingness patterns across dataframe columns.
- apply(df: DataFrame) → DataFrame
Apply learned missingness patterns to a synthetic dataframe.
- Parameters:
df – Synthetic dataframe before introducing missing values.
- Returns:
Copy of df with missingness injected per fitted distributions.
- fit(df: DataFrame) → DataFrameMissingnessModel
Fit per-column missingness models on the provided dataframe.
- Parameters:
df – Real dataframe used to learn missingness structure.
- Returns:
Self with fitted column models.
- models_: Dict[str, ColumnMissingnessModel]
- random_state: int | None = None
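The fit/apply contract described above can be sketched with a much simpler stand-in that learns only a marginal missing rate per column. This is an assumption-laden illustration, not the semsynth implementation: the class MarginalMissingness and its rates_ attribute are hypothetical, and the real model is documented as estimating conditional missingness per column.

```python
import numpy as np
import pandas as pd


class MarginalMissingness:
    """Hypothetical stand-in: per-column missing rates, no conditioning."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.rates_ = {}

    def fit(self, df: pd.DataFrame) -> "MarginalMissingness":
        # The fraction of NaN cells in each column is the learned rate.
        self.rates_ = df.isna().mean().to_dict()
        return self

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        rng = np.random.default_rng(self.random_state)
        out = df.copy()  # leave the caller's dataframe untouched
        for col, rate in self.rates_.items():
            if col in out.columns and rate > 0:
                hide = rng.random(len(out)) < rate
                out[col] = out[col].mask(hide)
        return out


real = pd.DataFrame({"x": [1.0, None, 3.0, None], "y": ["a", "b", "c", "d"]})
synth = pd.DataFrame({"x": [0.5, 1.5, 2.5, 3.5], "y": ["a", "a", "b", "b"]})
model = MarginalMissingness(random_state=0).fit(real)
with_gaps = model.apply(synth)
```

Returning a copy from apply mirrors the documented behaviour, so the complete synthetic dataframe stays available for comparison.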
- class semsynth.missingness.MissingnessWrappedGenerator(base_generator: Callable[[...], DataFrame], missingness_model: DataFrameMissingnessModel)
Bases: object
Wrap a base generator to inject realistic missing values.
- classmethod from_real_data(base_generator: Callable[[...], DataFrame], real_df: DataFrame, random_state: int | None = None) → MissingnessWrappedGenerator
Create a wrapper by fitting missingness to real data.
- Parameters:
base_generator – Callable producing synthetic samples.
real_df – Real dataframe used to estimate missingness.
random_state – Optional RNG seed for reproducibility.
- Returns:
Configured MissingnessWrappedGenerator instance.
- sample(n: int, **kwargs) → DataFrame
Generate n samples with realistic missing values applied.
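The wrapping pattern itself can be sketched without semsynth: sample from a base generator, then pass the result through a post-processing step. Everything in the block below is hypothetical (toy_generator, WrappedGenerator, blank_every_other_row), and forwarding **kwargs to the base generator is an assumption, since the reference does not say how sample uses them.

```python
import numpy as np
import pandas as pd


def toy_generator(n: int, **kwargs) -> pd.DataFrame:
    """Hypothetical base generator returning complete synthetic rows."""
    return pd.DataFrame({"x": np.arange(n, dtype=float)})


class WrappedGenerator:
    """Hypothetical wrapper: sample from the base, then post-process."""

    def __init__(self, base_generator, postprocess):
        self.base_generator = base_generator
        self.postprocess = postprocess

    def sample(self, n: int, **kwargs) -> pd.DataFrame:
        # Extra keyword arguments are forwarded to the base generator.
        return self.postprocess(self.base_generator(n, **kwargs))


def blank_every_other_row(df: pd.DataFrame) -> pd.DataFrame:
    # Deterministic stand-in for a fitted missingness model's apply step.
    out = df.copy()
    out.loc[out.index % 2 == 1, "x"] = np.nan
    return out


gen = WrappedGenerator(toy_generator, blank_every_other_row)
sample = gen.sample(6)
```

A deterministic post-processing step is used here only so the effect is easy to inspect; a fitted model would draw the mask at random.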
- semsynth.missingness.apply_missingness_to_outputs(*, run_dir: Path, synth_df: DataFrame, missingness_model: DataFrameMissingnessModel, real_df: DataFrame, disc_cols: Iterable[str], cont_cols: Iterable[str], backend_name: str) → Tuple[DataFrame, bool]
Apply missingness to backend artefacts and refresh derived metrics.
- Parameters:
run_dir – Directory containing backend outputs.
synth_df – Synthetic dataframe prior to missingness injection.
missingness_model – Learned missingness model to apply.
real_df – Real dataframe without missing values for metric refresh.
disc_cols – Iterable of discrete column names.
cont_cols – Iterable of continuous column names.
backend_name – Name of the backend used for logging context.
- Returns:
Tuple of the updated synthetic dataframe and a boolean indicating whether missingness was successfully applied.
- semsynth.missingness.fit_missingness_model(df: DataFrame, *, random_state: int | None = None) → DataFrameMissingnessModel | None
Fit a dataframe-level missingness model with logging safeguards.
- Parameters:
df – Dataframe used to estimate missingness behaviour.
random_state – Optional seed for reproducibility.
- Returns:
Fitted DataFrameMissingnessModel, or None if fitting failed.
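One plausible reading of "logging safeguards" combined with the None return is a guard that logs a failed fit instead of raising. The sketch below shows that pattern in isolation; fit_or_none, the logger name, and broken_fit are hypothetical, and the actual error handling in semsynth may differ.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("missingness_sketch")


def fit_or_none(fit: Callable[[], T]) -> Optional[T]:
    """Hypothetical helper: run fit(), log any failure, and return None."""
    try:
        return fit()
    except Exception:
        # logger.exception records the message and traceback at ERROR level.
        logger.exception("Missingness fitting failed; continuing without it")
        return None


def broken_fit() -> dict:
    raise ValueError("no usable columns")


ok = fit_or_none(lambda: {"age": 0.1})
failed = fit_or_none(broken_fit)
```

Callers then need to handle the None case explicitly, which matches the optional model accepted by summarize_missingness_model.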
- semsynth.missingness.summarize_missingness_model(missingness_model: DataFrameMissingnessModel | None) → Dict[str, Any] | None
Build a reporting-friendly summary of the missingness model.
- Parameters:
missingness_model – Optional fitted missingness model from preprocessing.
- Returns:
Mapping describing fitted column-level missingness rates, or None if no model was provided.
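To make the shape of such a summary concrete, the sketch below builds a small mapping from per-column missing rates and passes None through unchanged. The function summarize_rates and the key names in its result are hypothetical; the reference does not specify the keys that summarize_missingness_model actually returns.

```python
from typing import Dict, Optional


def summarize_rates(rates: Optional[Dict[str, float]]) -> Optional[dict]:
    """Hypothetical summary builder over per-column missing rates."""
    if rates is None:
        # Mirror the documented behaviour: no model in, no summary out.
        return None
    return {
        "n_columns": len(rates),
        "columns_with_missingness": sorted(c for c, r in rates.items() if r > 0),
        "p_missing": {c: round(r, 4) for c, r in rates.items()},
    }


summary = summarize_rates({"age": 0.125, "sex": 0.0})
```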