--- filetype: mystnb jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.16.3 kernelspec: name: python3 display_name: Python 3 --- # Getting started This guide summarises the minimum commands required to fetch a dataset, generate semantic mappings, and produce a SemSynth report driven by those mappings. ## Installation ```bash python -m pip install -e . ``` Optional extras (UMAP, PyBNesian, SynthCity, etc.) can be layered on top by supplying the relevant extras group, for example: ```bash python -m pip install -e .[umap,pybnesian,synthcity] ``` ## Create semantic mappings SemSynth now ships with a dedicated command that orchestrates the full mapping workflow – metadata parsing, terminology lookup, SSSOM emission, and SemMap enrichment. Pick a strategy with `--method`: ```bash python -m semsynth create-mapping uciml \ --datasets 145 \ --method lexical \ --codes-tsv map_columns/codes.tsv \ --manual-overrides-dir map_columns/manual \ --datasette-url http://127.0.0.1:8001/terminology \ --lexical-threshold 0.3 \ --top-k 3 \ --verbose ``` The command writes `*.sssom.tsv` and `*.metadata.json` artefacts under `mappings/`. Manual overrides are optional JSON files where each key is a column identifier pointing to a list of SSSOM-style dictionaries. Alternate strategies include `--method keyword` (Datasette keyword search), `--method embed` (sentence-transformer re-ranking), and `--method llm` (LLM + Datasette). Each honours the flags documented in `map_columns/README.md`. To rebuild the Wikidata terminology table offline, run: ```bash python map_columns/build_wikidata_medical_codes_table.py ``` This produces an updated `map_columns/codes.tsv` enriched with descriptions and alternate labels. ## Generate reports With mappings in place, execute the reporting pipeline. The `configs/only_metasyn_config.yaml` keeps MetaSyn as the primary backend to ensure runs complete even without GPU or shared-memory support: ```bash python -m semsynth report uciml --datasets 145 \ --configs-yaml configs/only_metasyn_config.yaml \ --verbose ``` SemSynth merges the curated mappings during preprocessing. Discrete versus continuous inference now honours the statistical data type hints stored in the SemMap metadata, so integer-coded categoricals remain categorical in the downstream analysis. After installation, confirm the cache and curated metadata are available. ```{code-cell} python # Preview the cached dataset and verify metadata loads cleanly. import pandas as pd import json from semsynth.semmap import Metadata heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz") meta = json.load(open("../mappings/uciml-45.metadata.json")) meta_obj = Metadata.from_dcat_dsv(meta) heart.head() ```