# Column terminology mapping
The `map_columns/` directory contains utilities for building terminology
resources, mapping dataset columns to codes, and evaluating the resulting SSSOM
artifacts. The tooling supports offline TSV lookups, Datasette-backed keyword
search, embedding re-ranking, and LLM-assisted coding.
## Scripts Overview

- `build_snomed_loinc_codes_table.py` – Build a TSV containing SNOMED CT and LOINC codes.
- `build_wikidata_medical_codes_table.py` – Extract medical terminology from Wikidata into `codes.tsv`.
- `codes_map_columns.py` – Perform offline lexical matching between dataset columns and entries in `codes.tsv`.
- `kwd_map_columns.py` – Query a Datasette instance and apply lexical scoring to rank results.
- `embed_map_columns.py` – Re-rank terminology candidates by combining sentence-transformer cosine similarity with lexical overlap diagnostics.
- `llm_map_columns.py` – Orchestrate an LLM with Datasette tool access to obtain curated mappings.
- `evaluate.py` – Compute micro/macro precision/recall/F1, MAP, and nDCG for SSSOM TSV files against a gold standard.
## Prerequisites

Install the baseline dependencies:

```bash
pip install defopt requests pandas numpy
```

Optional extras:

- Datasette helpers: `pip install datasette sqlite-utils llm-tools-datasette`
- Embedding re-ranking: `pip install sentence-transformers torch`
- LLM orchestration: `pip install llm`
- Evaluation: no additional packages beyond the baseline list
## Usage

### 1. Build Terminology Tables

#### Option A: SNOMED + LOINC

```bash
python build_snomed_loinc_codes_table.py \
  --snomed-description /path/to/Snapshot/Terminology/sct2_Description_Snapshot-en_INT_*.txt \
  --loinc /path/to/Loinc.csv \
  --out codes.tsv \
  --max-snomed 50000
```
#### Option B: Wikidata snapshot

```bash
python build_wikidata_medical_codes_table.py
```

After generating `codes.tsv`, you can load it into a SQLite/Datasette-friendly
database:

```bash
sqlite-utils insert terminology.db codes codes.tsv --tsv
sqlite-utils enable-fts terminology.db codes label synonyms --create-triggers
```
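Once the FTS index exists, candidate lookup reduces to a `MATCH` query against the indexed `label` and `synonyms` columns. A minimal sketch, using an in-memory database with made-up rows in place of `terminology.db`:

```python
import sqlite3

# Stand-in for terminology.db: the same columns the sqlite-utils commands index.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE codes (code TEXT, label TEXT, synonyms TEXT)")
db.executemany(
    "INSERT INTO codes VALUES (?, ?, ?)",
    [
        ("LOINC:8480-6", "Systolic blood pressure", "SBP"),
        ("SNOMED:38341003", "Hypertensive disorder", "high blood pressure"),
    ],
)
# Mirror `sqlite-utils enable-fts ... label synonyms` with an external-content
# FTS5 table, then populate it from the content table.
db.execute("CREATE VIRTUAL TABLE codes_fts USING fts5(label, synonyms, content='codes')")
db.execute("INSERT INTO codes_fts(codes_fts) VALUES ('rebuild')")

rows = db.execute(
    "SELECT c.code, c.label FROM codes_fts f "
    "JOIN codes c ON c.rowid = f.rowid "
    "WHERE codes_fts MATCH ? ORDER BY rank",
    ("blood pressure",),
).fetchall()
print(rows)  # both rows match: one via label, one via synonyms
```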
### 2. Mapping Approaches

#### 2.1 Offline TSV (lexical)

```bash
python -m map_columns.codes_map_columns \
  dataset.semmap.json \
  --codes-tsv map_columns/codes.tsv \
  --manual-overrides map_columns/manual/uciml-145.json \
  --output-tsv mappings/uciml-145.sssom.tsv \
  --verbose
```
This mode keeps everything offline by comparing dataset metadata with the
synonyms contained in `codes.tsv`. Optional manual overrides provide exact
matches when lexical scoring is insufficient.
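The exact scoring lives in `codes_map_columns.py`; as an illustration only (not the script's actual formula), lexical matching can be pictured as token-overlap (Jaccard) scoring between a column label and a synonym string:

```python
import re

def lexical_score(column_text: str, candidate_text: str) -> float:
    """Jaccard overlap between lower-cased word tokens (illustrative only)."""
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    a, b = tokenize(column_text), tokenize(candidate_text)
    return len(a & b) / len(a | b) if a | b else 0.0

# A column label scored against two codes.tsv-style synonym strings.
print(lexical_score("resting blood pressure", "Blood pressure; systolic"))  # 0.5
print(lexical_score("resting blood pressure", "Serum cholesterol"))         # 0.0
```

A threshold on this score (compare `--lexical-threshold` in the commands below) then decides which candidates are kept.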
#### 2.2 Datasette keyword search

```bash
python map_columns/kwd_map_columns.py dataset.json \
  --datasette-db-url http://127.0.0.1:8001/terminology \
  --table codes \
  --limit 10 \
  --lexical-threshold 0.3 \
  --top-k 3 \
  --output mappings/uciml-145.keyword.sssom.tsv
```
The script requests candidate rows from Datasette, re-scores them with lexical
overlap, and emits SSSOM TSV rows. Results are written to stdout when
`--output` is omitted.
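Under the hood this talks to Datasette's table JSON API. A sketch of the kind of request involved, assuming the FTS-enabled `codes` table from step 1 and Datasette's standard `_search`/`_size` parameters (the script's real request logic may differ):

```python
from urllib.parse import urlencode

def candidate_url(base_url: str, table: str, term: str, limit: int) -> str:
    """Build a Datasette table-JSON URL using the _search FTS parameter."""
    query = urlencode({"_search": term, "_size": limit})
    return f"{base_url}/{table}.json?{query}"

url = candidate_url("http://127.0.0.1:8001/terminology", "codes",
                    "blood pressure", 10)
print(url)
# Fetching could then be e.g.:
#   requests.get(url + "&_shape=objects").json()["rows"]
```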
#### 2.3 Embedding re-ranking

```bash
python map_columns/embed_map_columns.py dataset.json \
  map_columns/codes.tsv \
  --model-name sentence-transformers/all-MiniLM-L6-v2 \
  --top-k 5 \
  --cosine-threshold 0.25 \
  --lexical-threshold 0.25 \
  --output mappings/uciml-145.embed.sssom.tsv
```
Candidates are first retrieved by lexical similarity and then re-ranked using a sentence-transformer model. The exported TSV includes both lexical and cosine diagnostics in the comments.
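The combination step can be pictured with plain NumPy, using placeholder vectors in place of real sentence-transformer embeddings; the `alpha` weighting here is illustrative, not the script's actual formula:

```python
import numpy as np

def rerank(column_vec, candidates, alpha=0.7):
    """Sort candidates by alpha * cosine + (1 - alpha) * lexical score."""
    scored = []
    for label, vec, lexical in candidates:
        cosine = float(np.dot(column_vec, vec) /
                       (np.linalg.norm(column_vec) * np.linalg.norm(vec)))
        scored.append((alpha * cosine + (1 - alpha) * lexical, label))
    return sorted(scored, reverse=True)

col = np.array([1.0, 0.0])  # placeholder embedding of a column description
candidates = [
    ("Systolic blood pressure", np.array([0.9, 0.1]), 0.5),  # close vector
    ("Serum cholesterol",       np.array([0.0, 1.0]), 0.0),  # orthogonal
]
print(rerank(col, candidates))
```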
#### 2.4 LLM-assisted mapping

```bash
python map_columns/llm_map_columns.py dataset.json \
  --datasette-url http://127.0.0.1:8001/terminology \
  --model gpt-4.1-mini \
  --top-k 3 \
  --confidence-threshold 0.5 \
  --output mappings/uciml-145.llm.sssom.tsv
```
The LLM uses the Datasette tool to inspect code candidates and emits SSSOM rows
directly. Use `--extra-prompt` to provide project-specific guidance.
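All four modes emit the same SSSOM TSV shape. An illustrative, made-up fragment using standard SSSOM columns (the subject prefix is hypothetical; compare the CLI's `--llm-subject-prefix` flag):

```tsv
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification	confidence
uciml:145/trestbps	resting blood pressure	skos:exactMatch	LOINC:8480-6	Systolic blood pressure	semapv:LexicalMatching	0.92
```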
### 3. Evaluate SSSOM outputs

```bash
python map_columns/evaluate.py \
  gold/uciml-145.gold.sssom.tsv \
  --predictions mappings/uciml-145.sssom.tsv mappings/uciml-145.embed.sssom.tsv \
  --output eval/uciml-145.json
```
Metrics (micro/macro P/R/F1, MAP, nDCG) are printed for each prediction file.
When `--output` is supplied, the metrics are also written to a JSON report.
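As a reminder of what the micro-averaged numbers mean, precision/recall/F1 over pooled (column, code) pairs can be computed as follows (an illustrative sketch, not `evaluate.py` itself):

```python
def micro_prf(gold: set, predicted: set):
    """Micro-averaged P/R/F1 over pooled (column, code) mapping pairs."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("trestbps", "LOINC:8480-6"), ("chol", "LOINC:2093-3")}
pred = {("trestbps", "LOINC:8480-6"), ("thalach", "LOINC:8867-4")}
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Macro averaging instead computes P/R/F1 per column (or per file) and then takes the mean, so sparse columns weigh as much as dense ones.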
## SemSynth CLI integration

The end-to-end workflow can be launched via the SemSynth CLI. All mapping
strategies are now available under a single command:

```bash
python -m semsynth create-mapping uciml \
  --datasets 145 \
  --method embed \
  --codes-tsv map_columns/codes.tsv \
  --datasette-url http://127.0.0.1:8001/terminology \
  --lexical-threshold 0.25 \
  --top-k 3 \
  --outdir mappings/
```
Key flags:

- `--method`: `lexical` (default), `keyword`, `embed`, or `llm`
- Datasette-backed methods honour `--datasette-url`, `--datasette-table`, and `--datasette-limit`
- Embedding mode accepts `--embed-model`, `--candidate-pool-multiplier`, and `--cosine-threshold`
- LLM mode exposes `--llm-model`, `--llm-extra-prompt`, `--llm-subject-prefix`, and `--confidence-threshold`
Manual overrides continue to be respected. When overrides exist for a column, the CLI replaces any automatically generated matches with the curated entries.
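The replacement behaviour can be sketched as follows; the `overrides` shape here is hypothetical (the real manual-override JSON format is defined by the scripts), but the key point is that curated entries replace, rather than merge with, automatic matches:

```python
def apply_overrides(auto_matches: dict, overrides: dict) -> dict:
    """Curated entries replace (not merge with) automatic matches per column."""
    merged = dict(auto_matches)
    for column, curated in overrides.items():
        merged[column] = curated  # drop any auto-generated candidates
    return merged

auto = {"trestbps": ["SNOMED:271649006"], "chol": ["LOINC:2093-3"]}
overrides = {"trestbps": ["LOINC:8480-6"]}
print(apply_overrides(auto, overrides))
# {'trestbps': ['LOINC:8480-6'], 'chol': ['LOINC:2093-3']}
```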
## Example: inspecting a curated mapping

The mapping helpers are easiest to demonstrate on the curated Heart Disease
payload. This quick count shows which codelists appear most often, illustrating
how curated mappings guide downstream roles and privacy handling:

```python
# Summarize the most common codebook notations in the curated mapping.
import json
from collections import Counter

meta = json.load(open("../mappings/uciml-45.metadata.json"))

codes = []
for col in meta["datasetSchema"]["columns"]:
    cb = col.get("columnProperty", {}).get("hasCodeBook") or {}
    concepts = cb.get("hasConcept") or []
    codes.extend(c.get("notation") for c in concepts
                 if isinstance(c, dict) and c.get("notation"))

Counter(codes).most_common(5)
```

For the snapshot used here, the cell output was `[]`.