# SemSynth
SemSynth is a compact toolkit to profile tabular datasets, synthesize data with multiple backends, and generate a clean HTML report. It supports datasets from OpenML and the UCI Machine Learning Repository.
## Features
- Unified model interface: run PyBNesian, SynthCity, and MetaSyn models from a single TOML bundle.
- Uniform outputs: each model writes artifacts under `dataset/models/<model-name>/`.
- Optional MetaSyn baseline: enable or disable per report via the config bundle.
- Provider-aware metadata and UMAP visuals.
## Install
Clone the repository and run:

```shell
python -m pip install -e .
```

For extra features, run `python -m pip install -e .[EXTRA]` with `EXTRA` one of: `app`, `metasyn`, `pybnesian`, `synthcity`, `umap`, `statsmodels`, `mapping`.
## Quick start
### Search datasets

OpenML:

```shell
python -m semsynth search openml --name-substr adult
```

UCI ML:

```shell
python -m semsynth search uciml --area "Health and Medicine" --name-substr heart
```

The `search` command accepts:

- `--name-substr` (substring filter, applied case-insensitively)
- `--area` (UCI ML topic area; ignored for OpenML)
- `--cat-min` / `--num-min` (minimum number of categorical/numeric columns)
- `--verbose` (emit info logs while querying providers)
### Minimal report (metadata only)

Leave `--configs-toml` empty to skip model execution. Example:

```shell
python -m semsynth report uciml --datasets 45 -v
```

Optional flags for reports:

- `--datasets` (one or more dataset identifiers)
- `--outdir` (defaults to `output/`)
- `--configs-toml` (path to a TOML bundle; omit for metadata-only runs)
- `--verbose` (turn on info logging)
With the command above you receive dataset metadata, a real-data UMAP projection, and HTML/Markdown reports under `output/<Dataset Name>/`.

### Full report with synthetic models

Pick a configuration bundle from `configs/`:

- `configs/simple_config.toml` (MetaSyn + two PyBNesian models)
- `configs/advanced_config.toml` (MetaSyn, PyBNesian, and SynthCity models)
- `configs/maximal_config.toml` (keeps MetaSyn with aggressive options enabled while remaining runnable without GPU or shared-memory support)
- `configs/only_metasyn_config.toml` (MetaSyn baseline only)
Example:

```shell
python -m semsynth report openml --datasets adult --configs-toml @configs/advanced_config.toml
```
Pipeline toggles (UMAP, privacy, downstream, missingness, SemMap context URL, etc.) now live in the TOML bundle under `[pipeline]` rather than in separate CLI flags.

### Catalog + app helpers

Build a DCAT catalog and HTML index from existing outputs:

```shell
python -m semsynth build-catalog
```

Regenerate metadata-only reports for all curated SemMap mappings and refresh the catalog/index:

```shell
python -m semsynth mappings --pipeline-configs ""
```

The generated `output/index.html` now embeds YASGUI + Comunica (running in the browser) with a static endpoint identifier (`browser://semsynth-static-catalog`) and loads runnable query tabs from `output/sparql/*.rq` files for SemMap and provenance-linked synthetic artifacts.

Launch a minimal Flask UI for search and report actions:

```shell
python -m semsynth app --host 0.0.0.0 --port 5000
```
## TOML bundles
- `configs/simple_config.toml` mixes MetaSyn with two PyBNesian baselines.
- `configs/advanced_config.toml` extends the simple bundle with SynthCity generators.
- `configs/maximal_config.toml` toggles every optional report hook (UMAP, privacy, downstream) but ships with a MetaSyn-only bundle so it continues to execute inside restricted environments.
- `configs/only_metasyn_config.toml` keeps MetaSyn as the single synthetic-data baseline.
Example:

```toml
[[models]]
name = "metasyn"
backend = "metasyn"

[[models]]
name = "clg_mi2"
backend = "pybnesian"
rows = 1000
seed = 42

[models.model]
type = "clg"
score = "bic"
operators = ["arcs"]
max_indegree = 2

[[models]]
name = "ctgan_fast"
backend = "synthcity"
rows = 1000
seed = 42

[models.model]
type = "ctgan"
epochs = 5
batch_size = 256

[pipeline]
generate_umap = true
compute_privacy = true
compute_downstream = true
semmap_context_url = "https://w3id.org/semmap/context"
```
## Outputs
Per dataset (e.g., `output/Heart Disease/`):

- `index.html` and `report.md`
- `dataset.semmap.json` and `dataset.semmap.html` (when SemMap metadata is available)
- `umap_real.png` (when `generate_umap` is enabled in the `[pipeline]` table and the UMAP dependencies are installed)

Per model (e.g., `output/Heart Disease/models/<name>/`):

- `synthetic.csv`, `metrics.json`, and `umap.png`
- Optional metrics sidecars: `metrics.privacy.json`, `metrics.downstream.json`
- PyBNesian extras: `structure.png`, `structure.graphml`, `model.pickle`
- `synthetic.semmap.parquet` (when SemMap metadata is available)
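The per-model layout makes cross-model comparison easy to script. A minimal sketch (not part of SemSynth; it assumes only the `models/<name>/metrics.json` layout described above and makes no assumption about the keys inside each file):

```python
import json
from pathlib import Path


def collect_metrics(dataset_dir: str) -> dict[str, dict]:
    """Map each model name to its parsed metrics.json under <dataset_dir>/models/."""
    results = {}
    for metrics_file in Path(dataset_dir).glob("models/*/metrics.json"):
        # The parent directory name is the model name, e.g. "ctgan_fast".
        results[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return results
```

Calling `collect_metrics("output/Heart Disease")` after a run would return one dictionary per model directory, ready for tabulation.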
## Metadata templates & column mappings
`semsynth/dataproviders/uciml.py` exposes a CLI that turns the JSON payloads cached under `uciml-cache/` (sometimes referenced as `uci-cache/` in earlier docs) into DCAT + DSV JSON-LD that downstream tools can ingest.

Scripts under `map_columns/` take that JSON-LD and suggest or write terminology mappings (see `map_columns/README.md` for the full strategy catalog covering lexical, Datasette keyword, embedding, and LLM workflows).

`python -m semsynth create-mapping uciml --datasets 145 --method lexical` automates the end-to-end workflow (JSON-LD export → terminology scoring → SSSOM merge) and stores the results under `mappings/`.

Curated SemMap JSON files are repo-only under `mappings/` and are not bundled into wheels; when installing from PyPI, provide your own `mappings/` directory if you want curated mappings.
### Example: UCI dataset 45 (Heart Disease)
1. Fetch the dataset metadata. Any command that touches the UCI provider will populate `uciml-cache/<id>.json`. For example:

   ```shell
   python -m semsynth report uciml --datasets 45 --configs-toml @configs/only_metasyn_config.toml
   ```

   This creates `uciml-cache/45.json` alongside the cached CSV/metadata used by the reporting pipeline.

2. Convert the cached metadata into DCAT + DSV JSON-LD:

   ```shell
   python semsynth/dataproviders/uciml.py uciml-cache/45.json heart-dataset.jsonld
   ```

   The resulting `heart-dataset.jsonld` contains dataset-level `dcat:Dataset` fields plus a `dsv:datasetSchema` block with each variable from the Heart Disease dataset.

3. Suggest terminology mappings for every variable description using the keyword search helper (writes SSSOM when `--output` is provided):

   ```shell
   python map_columns/kwd_map_columns.py heart-dataset.jsonld \
     --datasette-db-url http://127.0.0.1:8001/terminology \
     --table codes \
     --limit 10 \
     --top-k 3 \
     --output mappings/uciml-45.keyword.sssom.tsv \
     --verbose
   ```

   Swap in `map_columns/embed_map_columns.py` or `map_columns/llm_map_columns.py` to use the embedding or LLM variants (see `map_columns/README.md` for full parameter lists).

4. Run the integrated helper to write mappings and SemMap metadata in one go:

   ```shell
   python -m semsynth create-mapping uciml \
     --datasets 45 \
     --method embed \
     --codes-tsv map_columns/codes.tsv \
     --datasette-url http://127.0.0.1:8001/terminology \
     --lexical-threshold 0.25 \
     --top-k 3 \
     --verbose
   ```

   The resulting files (e.g., `mappings/uciml-45.sssom.tsv`) can be fed back into the reporting pipeline without extra steps.
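The SSSOM TSVs produced here can be post-processed with the standard library alone. A hedged sketch, assuming the file uses the standard SSSOM columns `subject_id`, `object_id`, and `confidence` (the exact columns SemSynth emits may differ) and that any leading metadata lines are `#`-prefixed, as in the SSSOM spec:

```python
import csv


def best_mappings(tsv_text: str, min_confidence: float = 0.5) -> list[tuple[str, str]]:
    """Return (subject_id, object_id) pairs whose confidence clears the threshold."""
    # SSSOM files may open with a commented YAML metadata block; drop those lines.
    rows = [line for line in tsv_text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    return [
        (row["subject_id"], row["object_id"])
        for row in reader
        if float(row.get("confidence") or 0.0) >= min_confidence
    ]
```

With a file like `mappings/uciml-45.sssom.tsv`, `best_mappings(Path(...).read_text(), 0.7)` would keep only the higher-confidence variable-to-terminology suggestions for manual review.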
## Notes
- Metadata-only reports require no TOML file; pass `--configs-toml` to opt into synthetic runs.
- All models are treated uniformly in the report; UMAPs share the same projection, trained on real data.
- When optional dependencies (e.g., `umap-learn`, PyBNesian, SynthCity) are missing or blocked by runtime sandboxing, SemSynth logs a warning and continues with the available components. The shipped `maximal_config` keeps MetaSyn as the default to guarantee completion under those constraints.
- UCI dataset 1150 (Gallstone) is currently unavailable via `ucimlrepo`; the `mappings` target skips it with a warning until a cached stand-in or upstream access is restored.
- If external dataset hosts are unreachable, cached payloads under `downloads-cache/` may contain synthetic stand-ins that mimic the documented schema so pipeline checks can proceed. Keep README notes in sync when such fallbacks are introduced.
## Testing, Contributing, Documentation
- Install dev and documentation deps with `python -m pip install -e .[dev,docs]` (use `.[dev]` if you do not need to build docs).
- Run tests with `python -m pytest`.
- Build the Sphinx site with `sphinx-build -b html sphinx docs`.
## Static SPARQL endpoint (demo index)
`python -m semsynth build-catalog` writes `output/index.html`, `output/catalog.json`, `output/catalog.jsonld`, and `output/sparql/*.rq` with an embedded YASGUI editor and a browser-side Comunica query engine targeting static files only.

The page advertises the endpoint id `browser://semsynth-static-catalog` and includes ready-to-run query tabs covering:

- datasets that publish `dataset.semmap.json`,
- provenance-linked synthetic artifacts,
- distribution counts per dataset.

This mirrors the static-browser pattern from `data-catalog-sparql-playground` (catalog + local query UI, no server-side SPARQL service required). SPARQL query templates are also described as catalog distributions so tooling can discover and preload them as tabs.
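For orientation, here is a query in the spirit of the distribution-count tab. This is a generic DCAT sketch, not the exact text of any shipped `output/sparql/*.rq` file; it counts `dcat:distribution` links per `dcat:Dataset` using the standard DCAT vocabulary:

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?dataset (COUNT(?dist) AS ?distributions)
WHERE {
  ?dataset a dcat:Dataset ;
           dcat:distribution ?dist .
}
GROUP BY ?dataset
ORDER BY DESC(?distributions)
```

Pasting a query like this into the embedded YASGUI editor runs it in the browser via Comunica against the static catalog files, with no server-side SPARQL service involved.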