---
filetype: mystnb
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.3
kernelspec:
  name: python3
  display_name: Python 3
---

# Backends overview

SemSynth ships three interchangeable generation backends (MetaSyn, PyBNesian, and SynthCity) that follow the common `run_experiment` contract and produce aligned artifacts: synthetic CSVs, per-variable metrics, and manifests. The backends share the same type inference and train/test splitting utilities, so synthetic rows and downstream metrics remain comparable across runs. If a backend's optional dependency is missing, the implementation raises a clear runtime error suggesting the corresponding extras install (for example, `pip install semsynth[metasyn]`).

All backend examples use the cached Heart Disease dataset so that comparisons stay consistent.

```{code-cell} python
# Quick look at the target dataset used below.
import pandas as pd
import matplotlib.pyplot as plt

heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
ax = heart["age"].plot.hist(bins=20, figsize=(6, 3), color="salmon", edgecolor="black")
ax.set_title("Heart Disease (UCI 45) age distribution")
ax.set_xlabel("age")
plt.tight_layout()
```

## MetaSyn

- Uses `MetaFrame.fit_dataframe` to learn column-wise distributions after inferring discrete and continuous fields, then synthesizes a user-specified number of rows. The backend coerces continuous features to floats, preserves the levels of categorical columns, and writes a `synthetic.csv` aligned to the original schema.
- Persists evaluation artifacts alongside the dataset, including per-variable distance metrics and summary statistics derived from the held-out test split. A manifest records the backend name, dataset identifiers, seed, requested rows, and split ratio to keep runs auditable.

## PyBNesian

- Learns a Bayesian network with hill climbing, a configurable network type (`clg` or `semiparametric`), and scoring/structure-search options.
  Sensitive roots (age, sex, race by default) are blacklisted from being child nodes to avoid trivial leakage pathways.
- Samples synthetic rows from the fitted network, exports optional SemMap parquet, and computes per-variable distances plus a held-out log-likelihood statistic. The backend serializes GraphViz and GraphML structures (when optional dependencies are present) and stores a manifest capturing structure parameters and dataset metadata.

## SynthCity

- Normalizes generator aliases (e.g., `ctgan`, `pategan`, `bayesiannetwork`) to the canonical SynthCity plugin names, then loads the chosen plugin with sanitized parameters. Before fitting on the training split, continuous features are coerced to numeric types and category values are made string-safe.
- Generates synthetic rows through the plugin's `generate` API, aligns columns to the training schema, and writes synthetic CSVs and optional SemMap parquet. Per-variable distances and summary statistics are emitted alongside a manifest documenting the generator, seed, row count, and split settings.

```{code-cell} python
# Inspect a tiny inline TOML snippet to see which backends would execute.
from semsynth.models import load_model_configs, parse_toml_config

toml_text = """
[pipeline]
generate_umap = true
compute_privacy = false

[[models]]
name = "metasyn"
backend = "metasyn"

[[models]]
name = "clg_mi2"
backend = "pybnesian"
"""

config = parse_toml_config(toml_text)
bundle = load_model_configs(config_data=config)
[(spec.name, spec.backend) for spec in bundle.specs]
```

After you run a report, check `output/Heart Disease/models//` to compare manifests, metrics, and synthetic CSVs across engines.

## References

- MetaSyn documentation:
- PyBNesian documentation:
- SynthCity documentation: