---
filetype: mystnb
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.3
kernelspec:
  name: python3
  display_name: Python 3
---

# SemMap metadata schema

This project represents dataset semantics using a SemMap-flavoured JSON-LD
profile that blends [DCAT](https://www.w3.org/TR/vocab-dcat-3/)/[DSV](https://w3id.org/dsv-ontology),
[CSVW](https://www.w3.org/TR/tabular-data-model/), [PROV](https://www.w3.org/TR/prov-o/),
[QUDT](https://qudt.org/)/[UCUM](https://ucum.org/), and
[SKOS](https://www.w3.org/TR/skos-reference/) mappings. The
canonical definitions live in `semsynth.semmap` and power every stage of the
pipeline, from ingestion to reporting.


## Core objects

- **Metadata**: Root document that carries dataset-level [DCAT](https://www.w3.org/TR/vocab-dcat-3/)
  information
  (title, description, purpose, landing page, citations, identifiers, funding,
  access rights) plus `summaryStatistics` such as dataset completeness and row
  counts. It contains a `DatasetSchema`.
- **DatasetSchema**: Holds an ordered list of `Column` nodes.
- **Column**: Captures [CSVW](https://www.w3.org/TR/tabular-data-model/)/[DSV](https://w3id.org/dsv-ontology)
  fields (`name`, `titles`, descriptions,
  `prov:hadRole`, defaults) and optional `summaryStatistics` with declared
  `statisticalDataType`, completeness, missing-value formats, and numeric
  aggregates. Columns may embed a `ColumnProperty`.
- **ColumnProperty**: Encodes richer semantics including units
  ([UCUM](https://ucum.org/)/[QUDT](https://qudt.org/)), codebooks
  (`hasCodeBook` with [SKOS](https://www.w3.org/TR/skos-reference/) concepts),
  links to variable definitions, provenance (`source`), and
  [SKOS](https://www.w3.org/TR/skos-reference/) mappings
  (`exactMatch`, `closeMatch`, etc.).
- **CodeBook/CodeConcept**: Capture enumerations or codelists. Each entry is a
  [SKOS](https://www.w3.org/TR/skos-reference/) concept that can carry mappings
  to external ontologies. The presence of a codebook implies a categorical type
  for privacy metadata.

Every dataclass inherits from `RDFMixin`, which enables round-trips via
`to_jsonld()`/`from_jsonld()` and storage inside Parquet files through the
pandas `semmap` accessor.

```{code-cell} python
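# Inspect the first few columns and their statistical data types.
import json

from semsynth.semmap import Metadata

# Use a context manager so the file handle is closed after loading.
with open("../mappings/uciml-45.metadata.json", encoding="utf-8") as fh:
    meta = Metadata.from_dcat_dsv(json.load(fh))

# summaryStatistics may be absent, hence the getattr fallback to None.
[(c.name, getattr(c.summaryStatistics, "statisticalDataType", None))
 for c in meta.datasetSchema.columns[:5]]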
```
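
The nesting of the core objects and the round-trip idea can be illustrated with plain dataclasses. This is only a sketch: the `Sketch*` classes and their `to_dict`/`from_dict` methods are hypothetical stand-ins, not the classes in `semsynth.semmap`, and the output is a plain dictionary rather than JSON-LD with a context.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SketchColumnProperty:
    """Hypothetical stand-in for ColumnProperty (units and SKOS mappings)."""
    unit: Optional[str] = None                            # e.g. a UCUM code
    exact_match: List[str] = field(default_factory=list)  # SKOS exactMatch IRIs


@dataclass
class SketchColumn:
    """Hypothetical stand-in for Column."""
    name: str
    column_property: Optional[SketchColumnProperty] = None


@dataclass
class SketchSchema:
    """Hypothetical stand-in for DatasetSchema: an ordered list of columns."""
    columns: List[SketchColumn] = field(default_factory=list)


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for the root Metadata document."""
    title: str
    schema: SketchSchema = field(default_factory=SketchSchema)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and lists.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SketchMetadata":
        columns = []
        for col in data["schema"]["columns"]:
            prop = col.get("column_property")
            columns.append(
                SketchColumn(
                    name=col["name"],
                    column_property=(
                        SketchColumnProperty(**prop) if prop is not None else None
                    ),
                )
            )
        return cls(title=data["title"], schema=SketchSchema(columns=columns))


original = SketchMetadata(
    title="Example dataset",
    schema=SketchSchema(
        columns=[
            SketchColumn("age", SketchColumnProperty(unit="a")),  # UCUM "a" = year
            SketchColumn("sex"),
        ]
    ),
)
restored = SketchMetadata.from_dict(original.to_dict())
```

Because dataclasses compare by value, `restored == original` holds when the round-trip loses nothing, which is the property the real `to_jsonld()`/`from_jsonld()` pair is meant to provide.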

## Creation and ingestion

1. **Templates**: `semsynth/dataproviders/uciml.py` emits [DCAT](https://www.w3.org/TR/vocab-dcat-3/)/[DSV](https://w3id.org/dsv-ontology)
   JSON-LD that matches the SemMap dataclasses. The same layout is loaded by
   `Metadata.from_dcat_dsv`.
2. **Column mapping**: `map_columns/shared.py` parses curated metadata for LLM
   prompts and returns a `Metadata` instance; `map_columns/sssom_to_semmap.py`
   merges [SSSOM](https://w3id.org/sssom/) mapping files into `ColumnProperty`
   [SKOS](https://www.w3.org/TR/skos-reference/) relationships to
   keep column mappings aligned with downstream semantics.
3. **Pipeline loading**: `semsynth.pipeline.Preprocessor` applies curated
   metadata (`DatasetSpec.meta` or JSON-LD files) to incoming dataframes through
   the pandas SemMap accessor, producing a shared `Metadata` object and a
   JSON-LD export for later stages.
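
The SSSOM merge in step 2 can be pictured as grouping mapping rows by column and filing each object under the matching SKOS property. The sketch below is not the code in `map_columns/sssom_to_semmap.py`: `merge_sssom` is a hypothetical helper, the `ex:` identifiers are made up, and it assumes that `subject_id` holds the column name. The column headers (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`) and the `#`-prefixed metadata lines follow the SSSOM TSV format.

```python
import csv
import io
from collections import defaultdict

# SSSOM predicate CURIEs mapped to SKOS property names on a column property.
PREDICATE_TO_FIELD = {
    "skos:exactMatch": "exactMatch",
    "skos:closeMatch": "closeMatch",
    "skos:broadMatch": "broadMatch",
    "skos:narrowMatch": "narrowMatch",
    "skos:relatedMatch": "relatedMatch",
}


def merge_sssom(tsv_text, column_properties):
    """Merge SSSOM rows into {column: {skos_property: [object ids]}}.

    Hypothetical helper: existing mappings are kept, duplicates are skipped,
    and rows with predicates outside PREDICATE_TO_FIELD are ignored.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for name, props in column_properties.items():
        target = merged[name]  # keep columns that have no mappings yet
        for key, values in props.items():
            target[key].extend(values)

    # SSSOM TSV files may start with a commented metadata block.
    rows = (line for line in io.StringIO(tsv_text) if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        skos_field = PREDICATE_TO_FIELD.get(row["predicate_id"])
        if skos_field is None:
            continue
        targets = merged[row["subject_id"]][skos_field]
        if row["object_id"] not in targets:
            targets.append(row["object_id"])
    return {name: dict(props) for name, props in merged.items()}


SSSOM_TSV = (
    "# mapping_set_id: https://example.org/mappings/demo\n"
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "age\tskos:exactMatch\tex:Age\tsemapv:ManualMappingCuration\n"
    "age\tskos:closeMatch\tex:AgeAtVisit\tsemapv:ManualMappingCuration\n"
    "sex\tskos:exactMatch\tex:Sex\tsemapv:ManualMappingCuration\n"
    "chol\towl:sameAs\tex:Cholesterol\tsemapv:ManualMappingCuration\n"
)

existing = {"age": {"exactMatch": ["ex:Age"]}, "chol": {}}
merged = merge_sssom(SSSOM_TSV, existing)
```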

## Updates during preprocessing

- **Type inference**: Inferred discrete/continuous hints are stored alongside
  metadata and used as fallbacks when semantics are incomplete.
- **Missingness**: When missingness wrapping is enabled, `Metadata` is updated
  via `update_completeness_from_missingness` to refresh dataset/column
  completeness and missing-value annotations derived from the fitted model.
- **Persistence**: The updated metadata is carried through
  `PreprocessingResult.semmap_metadata` and stored with artifacts for each run.
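
As a rough illustration of the completeness figures, the sketch below counts non-missing cells per column and across the whole table. It is an assumption-laden stand-in: `update_completeness_from_missingness` derives its values from the fitted missingness model, whereas this example counts observed values directly, and the marker set and helper names are hypothetical.

```python
import math

# Hypothetical missing-value formats; real ones come from the column metadata.
MISSING_MARKERS = {"", "NA", "?"}


def is_missing(value):
    """Treat None, NaN, and known marker strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS


def completeness(table):
    """Return (dataset completeness, per-column completeness) as fractions."""
    per_column = {}
    present_cells = 0
    total_cells = 0
    for name, values in table.items():
        present = sum(not is_missing(v) for v in values)
        per_column[name] = present / len(values) if values else 0.0
        present_cells += present
        total_cells += len(values)
    overall = present_cells / total_cells if total_cells else 0.0
    return overall, per_column


table = {
    "age": [63, None, 41, 57],
    "chol": ["233", "?", "?", "250"],
    "sex": ["M", "F", "F", "M"],
}
overall, per_column = completeness(table)
```

Here 9 of the 12 cells are present, so the dataset-level value is 0.75, while the per-column values differ (`chol` is only half complete).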

## Use in analytics

- **Privacy metrics**: `Metadata.to_privacy_frame` normalizes SemMap roles to
  the role labels the privacy metrics expect (`qi`, `sensitive`, `id`,
  `ignore`, `target`) and maps `statisticalDataType` or codebooks to coarse
  types (`numeric`/`categorical`). The resulting dataframe is passed to
  `privacy_metrics.summarize_privacy_synthcity`, so that curated roles and
  types drive quasi-identifier selection and sensitive feature handling. When
  SemMap metadata is unavailable, the pipeline falls back to the inferred type
  map.
- **Downstream fidelity**: The same `Metadata` instance is serialized to JSON-LD
  and provided to `downstream_fidelity.compare_real_vs_synth`, allowing role and
  codebook semantics to guide modeling and encoding.
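
The role and type normalization described for the privacy metrics might look roughly like the following. Everything here is illustrative: the alias table, the set of numeric `statisticalDataType` labels, the choice to default unknown roles to `ignore`, and the `privacy_rows` helper are assumptions, not the behavior of `Metadata.to_privacy_frame`. Only the five target roles, the two coarse types, the codebook rule, and the inferred-type fallback come from the text above.

```python
# Hypothetical spellings mapped to the five roles the privacy metrics expect.
ROLE_ALIASES = {
    "qi": "qi",
    "quasi-identifier": "qi",
    "sensitive": "sensitive",
    "id": "id",
    "identifier": "id",
    "target": "target",
    "ignore": "ignore",
}

# Hypothetical statisticalDataType labels that count as numeric.
NUMERIC_TYPES = {"interval", "ratio", "numeric"}


def privacy_rows(columns, inferred_types):
    """Build one {name, role, type} record per column description."""
    rows = []
    for col in columns:
        raw_role = str(col.get("role", "")).strip().lower()
        role = ROLE_ALIASES.get(raw_role, "ignore")  # assumed default

        declared = col.get("statisticalDataType")
        if col.get("codebook"):
            kind = "categorical"  # a codebook implies a categorical type
        elif declared is not None:
            kind = "numeric" if declared in NUMERIC_TYPES else "categorical"
        else:
            # No curated semantics: fall back to the inferred type map.
            hint = inferred_types.get(col["name"])
            kind = "numeric" if hint == "continuous" else "categorical"
        rows.append({"name": col["name"], "role": role, "type": kind})
    return rows


columns = [
    {"name": "age", "role": "quasi-identifier", "statisticalDataType": "ratio"},
    {"name": "sex", "role": "QI", "codebook": {"0": "female", "1": "male"}},
    {"name": "diagnosis", "role": "target", "statisticalDataType": "nominal"},
    {"name": "chol"},
]
rows = privacy_rows(columns, inferred_types={"chol": "continuous"})
```

A list of such records converts directly to a dataframe with `name`, `role`, and `type` columns if pandas is available.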

## Reporting and exports

`semsynth.reporting.write_report_md` receives the shared `Metadata` and renders
human-readable dataset descriptions, citations, and completeness figures that
match the SemMap export stored alongside backend outputs. This keeps reports,
privacy metrics, and SemMap artifacts consistent.
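
To make the consistency argument concrete, the sketch below renders a report fragment from the same dictionary that would be exported, so the two cannot drift apart. `render_dataset_section` and the dictionary layout are hypothetical and much simpler than `semsynth.reporting.write_report_md`.

```python
def render_dataset_section(meta):
    """Render a Markdown fragment with a title, description, and completeness table."""
    lines = [
        f"## {meta['title']}",
        "",
        meta["description"],
        "",
        "| Column | Completeness |",
        "| --- | --- |",
    ]
    for col in meta["columns"]:
        # Completeness is stored as a fraction and shown as a percentage.
        lines.append(f"| {col['name']} | {col['completeness']:.1%} |")
    return "\n".join(lines)


shared_meta = {
    "title": "Example dataset",
    "description": "Illustrative metadata, not a real SemMap export.",
    "columns": [
        {"name": "age", "completeness": 0.75},
        {"name": "sex", "completeness": 1.0},
    ],
}
report = render_dataset_section(shared_meta)
```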