Getting started

This guide summarises the minimum commands required to fetch a dataset, generate semantic mappings, and produce a SemSynth report driven by those mappings.

Installation

python -m pip install -e .

Optional extras (UMAP, PyBNesian, SynthCity, etc.) can be layered on top by supplying the relevant extras group, for example:

python -m pip install -e .[umap,pybnesian,synthcity]
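A quick way to confirm which optional extras are present is to probe for a distinguishing package from each group. The import names below are assumptions inferred from the extras names (for instance, umap-learn is imported as umap):

```python
import importlib.util

# find_spec returns None when a package is absent, without importing it.
for pkg in ("umap", "pybnesian", "synthcity"):
    status = "installed" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")
```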

Create semantic mappings

SemSynth now ships with a dedicated command that orchestrates the full mapping workflow: metadata parsing, terminology lookup, SSSOM emission, and SemMap enrichment. Pick a strategy with --method:

python -m semsynth create-mapping uciml \
    --datasets 145 \
    --method lexical \
    --codes-tsv map_columns/codes.tsv \
    --manual-overrides-dir map_columns/manual \
    --datasette-url http://127.0.0.1:8001/terminology \
    --lexical-threshold 0.3 \
    --top-k 3 \
    --verbose

The command writes *.sssom.tsv and *.metadata.json artefacts under mappings/. Manual overrides are optional JSON files in which each key is a column identifier mapped to a list of SSSOM-style dictionaries.

Alternate strategies include --method keyword (Datasette keyword search), --method embed (sentence-transformer re-ranking), and --method llm (LLM + Datasette). Each honours the flags documented in map_columns/README.md.
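An override file might look like the following sketch, which uses standard SSSOM column names; the column key (chol) and the LOINC mapping are illustrative, and the exact schema expected by the command is documented in map_columns/README.md:

```json
{
  "chol": [
    {
      "subject_label": "chol",
      "predicate_id": "skos:exactMatch",
      "object_id": "LOINC:2093-3",
      "object_label": "Cholesterol [Mass/volume] in Serum or Plasma",
      "mapping_justification": "semapv:ManualMappingCuration"
    }
  ]
}
```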

To rebuild the Wikidata terminology table offline, run:

python map_columns/build_wikidata_medical_codes_table.py

This produces an updated map_columns/codes.tsv enriched with descriptions and alternate labels.
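The refreshed TSV can be sanity-checked with pandas. This sketch deliberately inspects the shape and headers rather than assuming specific column names, and the existence guard keeps it safe to run before the build script has produced the file:

```python
from pathlib import Path

import pandas as pd

codes_path = Path("map_columns/codes.tsv")
if codes_path.exists():
    # The table is tab-separated; print its dimensions and header row.
    codes = pd.read_csv(codes_path, sep="\t")
    print(codes.shape)
    print(list(codes.columns))
else:
    print(f"{codes_path} not found; run the build script first.")
```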

Generate reports

With mappings in place, execute the reporting pipeline. The configs/only_metasyn_config.yaml keeps MetaSyn as the primary backend to ensure runs complete even without GPU or shared-memory support:

python -m semsynth report uciml --datasets 145 \
    --configs-yaml configs/only_metasyn_config.yaml \
    --verbose

SemSynth merges the curated mappings during preprocessing. Discrete versus continuous inference now honours the statistical data type hints stored in the SemMap metadata, so integer-coded categoricals remain categorical in the downstream analysis.
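The effect of those hints can be illustrated with plain pandas. This is not the SemSynth implementation, just a sketch of the idea; the column names and the hint mapping are invented for the example:

```python
import pandas as pd

# Hypothetical statistical-type hints: integer-coded columns flagged as
# nominal are cast to categorical so they are not treated as continuous.
hints = {"cp": "nominal", "trestbps": "numeric"}

df = pd.DataFrame({"cp": [1, 4, 4, 3], "trestbps": [145, 160, 120, 130]})

for col, kind in hints.items():
    if kind == "nominal":
        df[col] = df[col].astype("category")

print(df.dtypes)
```

Without such a hint, an integer-coded column like cp would default to a numeric dtype and be modelled as continuous.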

After installation, confirm the cache and curated metadata are available.

# Preview the cached dataset and verify metadata loads cleanly.
import json

import pandas as pd

from semsynth.semmap import Metadata

heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = json.load(fh)
meta_obj = Metadata.from_dcat_dsv(meta)

heart.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0    0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0    2
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0    1
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0    0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0    0