# Column terminology mapping
The `map_columns/` directory contains utilities for building terminology
resources, mapping dataset columns to codes, and evaluating the resulting SSSOM
artifacts. The tooling supports offline TSV lookups, Datasette-backed keyword
search, embedding re-ranking, and LLM-assisted coding.
## Scripts Overview

- `build_snomed_loinc_codes_table.py` – Build a TSV containing SNOMED CT and LOINC codes.
- `build_wikidata_medical_codes_table.py` – Extract medical terminology from Wikidata into `codes.tsv`.
- `codes_map_columns.py` – Perform offline lexical matching between dataset columns and entries in `codes.tsv`.
- `kwd_map_columns.py` – Query a Datasette instance and apply lexical scoring to rank results.
- `embed_map_columns.py` – Re-rank terminology candidates by combining sentence-transformer cosine similarity with lexical overlap diagnostics.
- `llm_map_columns.py` – Orchestrate an LLM with Datasette tool access to obtain curated mappings.
- `evaluate.py` – Compute micro/macro precision/recall/F1, MAP, and nDCG for SSSOM TSV files against a gold standard.
## Prerequisites

Install the baseline dependencies:

```bash
pip install defopt requests pandas numpy
```

Optional extras:

- Datasette helpers: `pip install datasette sqlite-utils llm-tools-datasette`
- Embedding re-ranking: `pip install sentence-transformers torch`
- LLM orchestration: `pip install llm`
- Evaluation: no additional packages beyond the baseline list
## Usage

### 1. Build Terminology Tables

#### Option A: SNOMED + LOINC

```bash
python build_snomed_loinc_codes_table.py \
  --snomed-description /path/to/Snapshot/Terminology/sct2_Description_Snapshot-en_INT_*.txt \
  --loinc /path/to/Loinc.csv \
  --out codes.tsv \
  --max-snomed 50000
```
#### Option B: Wikidata snapshot

```bash
python build_wikidata_medical_codes_table.py
```

After generating `codes.tsv`, you can load it into a SQLite/Datasette-friendly
database:

```bash
sqlite-utils insert terminology.db codes codes.tsv --tsv
sqlite-utils enable-fts terminology.db codes label synonyms --create-triggers
```
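Once the FTS index exists, candidate lookup reduces to a `MATCH` query against the indexed `label` and `synonyms` columns. A minimal sketch, using an in-memory database with made-up rows in place of `terminology.db`:

```python
import sqlite3

# Stand-in for terminology.db: the same columns the sqlite-utils commands index.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE codes (code TEXT, label TEXT, synonyms TEXT)")
db.executemany(
    "INSERT INTO codes VALUES (?, ?, ?)",
    [
        ("LOINC:8480-6", "Systolic blood pressure", "SBP"),
        ("SNOMED:38341003", "Hypertensive disorder", "high blood pressure"),
    ],
)
# Mirror `sqlite-utils enable-fts ... label synonyms` with an external-content
# FTS5 table, then populate it from the content table.
db.execute("CREATE VIRTUAL TABLE codes_fts USING fts5(label, synonyms, content='codes')")
db.execute("INSERT INTO codes_fts(codes_fts) VALUES ('rebuild')")

rows = db.execute(
    "SELECT c.code, c.label FROM codes_fts f "
    "JOIN codes c ON c.rowid = f.rowid "
    "WHERE codes_fts MATCH ? ORDER BY rank",
    ("blood pressure",),
).fetchall()
print(rows)  # both rows match: one via label, one via synonyms
```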
### 2. Mapping Approaches

#### 2.1 Offline TSV (lexical)

```bash
python -m map_columns.codes_map_columns \
  dataset.semmap.json \
  --codes-tsv map_columns/codes.tsv \
  --manual-overrides map_columns/manual/uciml-145.json \
  --output-tsv mappings/uciml-145.sssom.tsv \
  --verbose
```
This mode keeps everything offline by comparing dataset metadata with the
synonyms contained in `codes.tsv`. Optional manual overrides provide exact
matches when lexical scoring is insufficient.
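The exact scoring lives in `codes_map_columns.py`; as an illustration only (not the script's actual formula), lexical matching can be pictured as token-overlap (Jaccard) scoring between a column label and a synonym string:

```python
import re

def lexical_score(column_text: str, candidate_text: str) -> float:
    """Jaccard overlap between lower-cased word tokens (illustrative only)."""
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    a, b = tokenize(column_text), tokenize(candidate_text)
    return len(a & b) / len(a | b) if a | b else 0.0

# A column label scored against two codes.tsv-style synonym strings.
print(lexical_score("resting blood pressure", "Blood pressure; systolic"))  # 0.5
print(lexical_score("resting blood pressure", "Serum cholesterol"))         # 0.0
```

A threshold on this score (compare `--lexical-threshold` in the commands below) then decides which candidates are kept.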
#### 2.2 Datasette keyword search

```bash
python map_columns/kwd_map_columns.py dataset.json \
  --datasette-db-url http://127.0.0.1:8001/terminology \
  --table codes \
  --limit 10 \
  --lexical-threshold 0.3 \
  --top-k 3 \
  --output mappings/uciml-145.keyword.sssom.tsv
```
The script requests candidate rows from Datasette, re-scores them with lexical
overlap, and emits SSSOM TSV rows. Results are written to stdout when
`--output` is omitted.
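Under the hood this talks to Datasette's table JSON API. A sketch of the kind of request involved, assuming the FTS-enabled `codes` table from step 1 and Datasette's standard `_search`/`_size` parameters (the script's real request logic may differ):

```python
from urllib.parse import urlencode

def candidate_url(base_url: str, table: str, term: str, limit: int) -> str:
    """Build a Datasette table-JSON URL using the _search FTS parameter."""
    query = urlencode({"_search": term, "_size": limit})
    return f"{base_url}/{table}.json?{query}"

url = candidate_url("http://127.0.0.1:8001/terminology", "codes",
                    "blood pressure", 10)
print(url)
# Fetching could then be e.g.:
#   requests.get(url + "&_shape=objects").json()["rows"]
```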
#### 2.3 Embedding re-ranking

```bash
python map_columns/embed_map_columns.py dataset.json \
  map_columns/codes.tsv \
  --model-name sentence-transformers/all-MiniLM-L6-v2 \
  --top-k 5 \
  --cosine-threshold 0.25 \
  --lexical-threshold 0.25 \
  --output mappings/uciml-145.embed.sssom.tsv
```
Candidates are first retrieved by lexical similarity and then re-ranked using a sentence-transformer model. The exported TSV includes both lexical and cosine diagnostics in the comments.
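The combination step can be pictured with plain NumPy, using placeholder vectors in place of real sentence-transformer embeddings; the `alpha` weighting here is illustrative, not the script's actual formula:

```python
import numpy as np

def rerank(column_vec, candidates, alpha=0.7):
    """Sort candidates by alpha * cosine + (1 - alpha) * lexical score."""
    scored = []
    for label, vec, lexical in candidates:
        cosine = float(np.dot(column_vec, vec) /
                       (np.linalg.norm(column_vec) * np.linalg.norm(vec)))
        scored.append((alpha * cosine + (1 - alpha) * lexical, label))
    return sorted(scored, reverse=True)

col = np.array([1.0, 0.0])  # placeholder embedding of a column description
candidates = [
    ("Systolic blood pressure", np.array([0.9, 0.1]), 0.5),  # close vector
    ("Serum cholesterol",       np.array([0.0, 1.0]), 0.0),  # orthogonal
]
print(rerank(col, candidates))
```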
#### 2.4 LLM-assisted mapping

```bash
python map_columns/llm_map_columns.py dataset.json \
  --datasette-url http://127.0.0.1:8001/terminology \
  --model gpt-4.1-mini \
  --top-k 3 \
  --confidence-threshold 0.5 \
  --output mappings/uciml-145.llm.sssom.tsv
```
The LLM uses the Datasette tool to inspect code candidates and emits SSSOM rows
directly. Use `--extra-prompt` to provide project-specific guidance.
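All four modes emit the same SSSOM TSV shape. An illustrative, made-up fragment using standard SSSOM columns (the subject prefix is hypothetical; compare the CLI's `--llm-subject-prefix` flag):

```tsv
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification	confidence
uciml:145/trestbps	resting blood pressure	skos:exactMatch	LOINC:8480-6	Systolic blood pressure	semapv:LexicalMatching	0.92
```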
### 3. Evaluate SSSOM outputs

```bash
python map_columns/evaluate.py \
  gold/uciml-145.gold.sssom.tsv \
  --predictions mappings/uciml-145.sssom.tsv mappings/uciml-145.embed.sssom.tsv \
  --output eval/uciml-145.json
```
Metrics (micro/macro P/R/F1, MAP, nDCG) are printed for each prediction file.
When `--output` is supplied, the metrics are also written to a JSON report.
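As a reminder of what the micro-averaged numbers mean, precision/recall/F1 over pooled (column, code) pairs can be computed as follows (an illustrative sketch, not `evaluate.py` itself):

```python
def micro_prf(gold: set, predicted: set):
    """Micro-averaged P/R/F1 over pooled (column, code) mapping pairs."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("trestbps", "LOINC:8480-6"), ("chol", "LOINC:2093-3")}
pred = {("trestbps", "LOINC:8480-6"), ("thalach", "LOINC:8867-4")}
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Macro averaging instead computes P/R/F1 per column (or per file) and then takes the mean, so sparse columns weigh as much as dense ones.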
## SemSynth CLI integration

The end-to-end workflow can be launched via the SemSynth CLI. All mapping
strategies are now available under a single command:

```bash
python -m semsynth create-mapping uciml \
  --datasets 145 \
  --method embed \
  --codes-tsv map_columns/codes.tsv \
  --datasette-url http://127.0.0.1:8001/terminology \
  --lexical-threshold 0.25 \
  --top-k 3 \
  --outdir mappings/
```
Key flags:

- `--method`: `lexical` (default), `keyword`, `embed`, or `llm`
- Datasette-backed methods honour `--datasette-url`, `--datasette-table`, and `--datasette-limit`
- Embedding mode accepts `--embed-model`, `--candidate-pool-multiplier`, and `--cosine-threshold`
- LLM mode exposes `--llm-model`, `--llm-extra-prompt`, `--llm-subject-prefix`, and `--confidence-threshold`
Manual overrides continue to be respected. When overrides exist for a column, the CLI replaces any automatically generated matches with the curated entries.
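The replacement behaviour can be sketched as follows; the `overrides` shape here is hypothetical (the real manual-override JSON format is defined by the scripts), but the key point is that curated entries replace, rather than merge with, automatic matches:

```python
def apply_overrides(auto_matches: dict, overrides: dict) -> dict:
    """Curated entries replace (not merge with) automatic matches per column."""
    merged = dict(auto_matches)
    for column, curated in overrides.items():
        merged[column] = curated  # drop any auto-generated candidates
    return merged

auto = {"trestbps": ["SNOMED:271649006"], "chol": ["LOINC:2093-3"]}
overrides = {"trestbps": ["LOINC:8480-6"]}
print(apply_overrides(auto, overrides))
# {'trestbps': ['LOINC:8480-6'], 'chol': ['LOINC:2093-3']}
```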
## Example: inspecting a curated mapping

The mapping helpers are easiest to demonstrate on the curated Heart Disease
payload. This quick count shows which codelists appear most often, illustrating
how curated mappings guide downstream roles and privacy handling:

```python
# Summarize the most common codebook notations in the curated mapping.
import json
from collections import Counter

meta = json.load(open("../mappings/uciml-45.metadata.json"))

codes = []
for col in meta["datasetSchema"]["columns"]:
    cb = col.get("columnProperty", {}).get("hasCodeBook") or {}
    concepts = cb.get("hasConcept") or []
    codes.extend(c.get("notation") for c in concepts
                 if isinstance(c, dict) and c.get("notation"))

Counter(codes).most_common(5)
```

For the snapshot used here, the cell output was `[]`.