Getting started

This guide summarises the minimum commands required to fetch a dataset, generate semantic mappings, and produce a SemSynth report driven by those mappings.

Installation

python -m pip install -e .

Optional extras (UMAP, PyBNesian, SynthCity, etc.) can be layered on top by supplying the relevant extras group, for example:

python -m pip install -e .[umap,pybnesian,synthcity]
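A quick way to confirm which optional extras are present is to probe for a distinguishing package from each group. The import names below are assumptions inferred from the extras names (for instance, umap-learn is imported as umap):

```python
import importlib.util

# find_spec returns None when a package is absent, without importing it.
for pkg in ("umap", "pybnesian", "synthcity"):
    status = "installed" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")
```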

Create semantic mappings

SemSynth now ships with a dedicated command that orchestrates the full mapping workflow: metadata parsing, terminology lookup, SSSOM emission, and SemMap enrichment. Pick a strategy with --method:

python -m semsynth create-mapping uciml \
    --datasets 145 \
    --method lexical \
    --codes-tsv map_columns/codes.tsv \
    --manual-overrides-dir map_columns/manual \
    --datasette-url http://127.0.0.1:8001/terminology \
    --lexical-threshold 0.3 \
    --top-k 3 \
    --verbose

The command writes *.sssom.tsv and *.metadata.json artefacts under mappings/. Manual overrides are optional JSON files in which each key is a column identifier mapped to a list of SSSOM-style dictionaries.

Alternate strategies include --method keyword (Datasette keyword search), --method embed (sentence-transformer re-ranking), and --method llm (LLM + Datasette). Each honours the flags documented in map_columns/README.md.
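An override file might look like the following sketch, which uses standard SSSOM column names; the column key (chol) and the LOINC mapping are illustrative, and the exact schema expected by the command is documented in map_columns/README.md:

```json
{
  "chol": [
    {
      "subject_label": "chol",
      "predicate_id": "skos:exactMatch",
      "object_id": "LOINC:2093-3",
      "object_label": "Cholesterol [Mass/volume] in Serum or Plasma",
      "mapping_justification": "semapv:ManualMappingCuration"
    }
  ]
}
```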

To rebuild the Wikidata terminology table offline, run:

python map_columns/build_wikidata_medical_codes_table.py

This produces an updated map_columns/codes.tsv enriched with descriptions and alternate labels.
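The refreshed TSV can be sanity-checked with pandas. This sketch deliberately inspects the shape and headers rather than assuming specific column names, and the existence guard keeps it safe to run before the build script has produced the file:

```python
from pathlib import Path

import pandas as pd

codes_path = Path("map_columns/codes.tsv")
if codes_path.exists():
    # The table is tab-separated; print its dimensions and header row.
    codes = pd.read_csv(codes_path, sep="\t")
    print(codes.shape)
    print(list(codes.columns))
else:
    print(f"{codes_path} not found; run the build script first.")
```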

Generate reports

With mappings in place, execute the reporting pipeline. The configs/only_metasyn_config.yaml keeps MetaSyn as the primary backend to ensure runs complete even without GPU or shared-memory support:

python -m semsynth report uciml --datasets 145 \
    --configs-yaml configs/only_metasyn_config.yaml \
    --verbose

SemSynth merges the curated mappings during preprocessing. Discrete versus continuous inference now honours the statistical data type hints stored in the SemMap metadata, so integer-coded categoricals remain categorical in the downstream analysis.
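The effect of those hints can be illustrated with plain pandas. This is not the SemSynth implementation, just a sketch of the idea; the column names and the hint mapping are invented for the example:

```python
import pandas as pd

# Hypothetical statistical-type hints: integer-coded columns flagged as
# nominal are cast to categorical so they are not treated as continuous.
hints = {"cp": "nominal", "trestbps": "numeric"}

df = pd.DataFrame({"cp": [1, 4, 4, 3], "trestbps": [145, 160, 120, 130]})

for col, kind in hints.items():
    if kind == "nominal":
        df[col] = df[col].astype("category")

print(df.dtypes)
```

Without such a hint, an integer-coded column like cp would default to a numeric dtype and be modelled as continuous.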

After installation, confirm the cache and curated metadata are available.

# Preview the cached dataset and verify metadata loads cleanly.
import json

import pandas as pd

from semsynth.semmap import Metadata

heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = json.load(fh)
meta_obj = Metadata.from_dcat_dsv(meta)

heart.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0    0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0    2
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0    1
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0    0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0    0