---
filetype: mystnb
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.3
kernelspec:
  name: python3
  display_name: Python 3
---
# Downstream fidelity
The downstream-fidelity module compares real and synthetic datasets by fitting the same predictive model to each and checking how well coefficients, directions of effect, and uncertainty carry over. It derives a modeling formula from metadata, performs multiple imputation for missing data, and reports side-by-side parameter estimates to highlight agreement or drift. The examples on this page use the Heart Disease data together with the curated SemMap metadata, which marks `num` as the target.
```{code-cell} python
# Visualize correlations to motivate downstream modeling choices.
import pandas as pd
import matplotlib.pyplot as plt
heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
corr = heart.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
ax.set_title("Numeric correlation (Heart Disease)")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.tight_layout()
```
## Formula discovery
- `auto_formula` builds a Patsy formula by inferring target roles from the dataset schema, coercing dtypes (including categorical levels from codebooks), and generating main effects plus interaction candidates. Cross-validated feature screening then enforces strong heredity: whenever an interaction is retained, its parent main effects are kept as well, which stabilizes the model.
```{code-cell} python
# Create a target-aware formula straight from the SemMap metadata.
import json
import sys
from pathlib import Path
from importlib import reload
sys.path.insert(0, str(Path("..").resolve()))
import semsynth.downstream_fidelity as dfid
dfid = reload(dfid)
# Use a context manager so the metadata file handle is closed promptly.
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = json.load(fh)
formula = dfid.auto_formula(heart, meta, dfid.DownstreamConfig())
formula
```
The curated metadata already declares the `num` diagnosis column as the target, so `auto_formula` discovers the correct prediction task without manual overrides.
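
The strong-heredity rule described above can be illustrated with a small sketch. This is not the code inside `auto_formula`: the helpers `enforce_strong_heredity` and `to_formula` are hypothetical, the screened term list is made up for illustration, and terms are split naively on Patsy's `:` interaction operator.

```python
def enforce_strong_heredity(selected_terms):
    """Return the selected terms plus any missing parents of interactions.

    Hypothetical helper: an interaction such as "age:thalach" stays in the
    model only together with both of its parent main effects.
    """
    kept = list(dict.fromkeys(selected_terms))  # de-duplicate, keep order
    for term in selected_terms:
        if ":" in term:  # Patsy writes a pure interaction as a:b
            for parent in term.split(":"):
                if parent not in kept:
                    kept.append(parent)
    # List main effects first, then interactions, for readability.
    mains = [t for t in kept if ":" not in t]
    interactions = [t for t in kept if ":" in t]
    return mains + interactions


def to_formula(target, terms):
    """Assemble a Patsy-style formula string (hypothetical helper)."""
    return f"{target} ~ " + " + ".join(terms)


# Illustrative screening result: "thalach" was dropped as a main effect
# even though its interaction with "age" survived.
screened = ["age", "age:thalach", "C(cp)"]
terms = enforce_strong_heredity(screened)
formula = to_formula("num", terms)
print(formula)
```

Here `thalach` is added back because the retained `age:thalach` interaction needs both parents. A real implementation would work on parsed Patsy terms rather than raw strings.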
## Multiple imputation and estimation
- `fit_with_mi` recodes categorical variables, replaces missing-value codes, and runs MICE (`statsmodels.imputation.mice`) to produce pooled estimates from a model appropriate to the target type (a binomial or Poisson GLM, or OLS).
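
Pooling across imputed datasets is conventionally done with Rubin's rules, which the References section points to. The sketch below shows that arithmetic for a single coefficient in plain Python. It is an illustration under stated assumptions, not the code path inside `fit_with_mi`: the function name `pool_rubin` and the per-imputation numbers are made up.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: per-imputation point estimates
    variances: per-imputation squared standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)           # average of the point estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Made-up estimates of one coefficient from three imputed datasets.
est = [0.50, 0.54, 0.46]
se = [0.10, 0.10, 0.10]
pooled, pooled_se = pool_rubin(est, [s ** 2 for s in se])
print(round(pooled, 3), round(pooled_se, 3))
```

The pooled standard error exceeds each per-imputation standard error because the between-imputation spread is added to the average within-imputation variance.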
## Comparative reporting
- `compare_real_vs_synth` sanitizes the inputs, fits the auto-discovered model with multiple imputation on both the real and the synthetic data, and returns a dataframe of paired coefficients and standard errors with a sign-match indicator. When model fitting fails, it falls back to a simpler logistic comparison and still reports to the caller why the primary fit was skipped.
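
To make the shape of such a report concrete, the following sketch pairs two coefficient tables by term and adds a sign-match flag. The helper `pair_coefficients`, the column names, and every number are hypothetical and chosen only for illustration. The dataframe that `compare_real_vs_synth` actually returns may be laid out differently.

```python
import numpy as np
import pandas as pd


def pair_coefficients(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Join two coefficient tables on term name and flag sign agreement.

    Both inputs are indexed by term and carry "coef" and "se" columns
    (an assumed layout for this sketch).
    """
    paired = real.join(synth, how="outer", lsuffix="_real", rsuffix="_synth")
    # A term fitted on only one side cannot count as a sign match.
    both = paired[["coef_real", "coef_synth"]].notna().all(axis=1)
    same_sign = np.sign(paired["coef_real"]) == np.sign(paired["coef_synth"])
    paired["sign_match"] = both & same_sign
    return paired


# Made-up coefficients, not results from the Heart Disease data.
real = pd.DataFrame(
    {"coef": [0.8, -0.3, 0.05], "se": [0.20, 0.10, 0.04]},
    index=["age", "thalach", "chol"],
)
synth = pd.DataFrame(
    {"coef": [0.6, 0.1, 0.02], "se": [0.25, 0.12, 0.05]},
    index=["age", "thalach", "chol"],
)
table = pair_coefficients(real, synth)
print(table)
```

In this toy table the `thalach` row is flagged as a mismatch because the effect changes direction between the real and synthetic fits, which is the kind of drift the report is meant to surface.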
## References
- Patsy formula language:
- Rubin's rules for multiple imputation:
- statsmodels MICE implementation: