# Getting started
This guide summarises the minimum commands required to fetch a dataset, generate semantic mappings, and produce a SemSynth report driven by those mappings.
## Installation
```bash
python -m pip install -e .
```
Optional extras (UMAP, PyBNesian, SynthCity, etc.) can be layered on top by supplying the relevant extras group, for example:
```bash
python -m pip install -e .[umap,pybnesian,synthcity]
```
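To confirm the extras resolved, a quick check like the following works; this is a convenience sketch, not part of SemSynth itself:

```python
import importlib.util

# Report which optional extras are importable after installation.
extras = {
    mod: importlib.util.find_spec(mod) is not None
    for mod in ("umap", "pybnesian", "synthcity")
}
for mod, available in extras.items():
    print(f"{mod}: {'available' if available else 'missing'}")
```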
## Create semantic mappings
SemSynth now ships with a dedicated command that orchestrates the full mapping workflow: metadata parsing, terminology lookup, SSSOM emission, and SemMap enrichment. Pick a strategy with `--method`:
```bash
python -m semsynth create-mapping uciml \
  --datasets 45 \
  --method lexical \
  --codes-tsv map_columns/codes.tsv \
  --manual-overrides-dir map_columns/manual \
  --datasette-url http://127.0.0.1:8001/terminology \
  --lexical-threshold 0.3 \
  --top-k 3 \
  --verbose
```
The command writes `*.sssom.tsv` and `*.metadata.json` artefacts under `mappings/`. Manual overrides are optional JSON files where each key is a column identifier pointing to a list of SSSOM-style dictionaries. Alternate strategies include `--method keyword` (Datasette keyword search), `--method embed` (sentence-transformer re-ranking), and `--method llm` (LLM + Datasette). Each honours the flags documented in `map_columns/README.md`.
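Concretely, an override file might look like the sketch below. Only the overall shape (column identifier pointing to a list of SSSOM-style dictionaries) comes from the description above; the column key, field names, and the LOINC target are illustrative assumptions:

```python
import json

# Hypothetical manual override: one column identifier mapped to a list of
# SSSOM-style dictionaries. The specific IDs are invented for illustration.
overrides = {
    "chol": [
        {
            "subject_label": "chol",
            "predicate_id": "skos:exactMatch",
            "object_id": "LOINC:2093-3",
            "mapping_justification": "semapv:ManualMappingCuration",
        }
    ]
}
print(json.dumps(overrides, indent=2))
```

A file of this shape would be saved under the directory passed via `--manual-overrides-dir`.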
To rebuild the Wikidata terminology table offline, run:

```bash
python map_columns/build_wikidata_medical_codes_table.py
```

This produces an updated `map_columns/codes.tsv` enriched with descriptions and alternate labels.
## Generate reports
With mappings in place, execute the reporting pipeline. The `configs/only_metasyn_config.yaml` config keeps MetaSyn as the primary backend to ensure runs complete even without GPU or shared-memory support:
```bash
python -m semsynth report uciml --datasets 45 \
  --configs-yaml configs/only_metasyn_config.yaml \
  --verbose
```
SemSynth merges the curated mappings during preprocessing. Discrete versus continuous inference now honours the statistical data type hints stored in the SemMap metadata, so integer-coded categoricals remain categorical in the downstream analysis.
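The effect of those hints can be illustrated with a small pandas sketch. The `hints` dict here is hypothetical; SemSynth derives the types from the SemMap metadata rather than a hand-written mapping:

```python
import pandas as pd

# Integer-coded columns flagged as categorical are cast to a categorical
# dtype, while plain numeric columns keep their numeric dtype.
df = pd.DataFrame({"cp": [1, 4, 4, 3, 2], "chol": [233, 286, 229, 250, 204]})
hints = {"cp": "categorical", "chol": "numerical"}  # hypothetical hint mapping
for col, kind in hints.items():
    if kind == "categorical":
        df[col] = df[col].astype("category")
print(df.dtypes)
```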
After installation, confirm the cache and curated metadata are available:

```python
import json

import pandas as pd

from semsynth.semmap import Metadata

# Preview the cached dataset and verify the curated metadata loads cleanly.
heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = json.load(fh)
meta_obj = Metadata.from_dcat_dsv(meta)
heart.head()
```
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |