SemMap metadata schema
This project represents dataset semantics using a SemMap-flavoured JSON-LD
profile that blends DCAT/DSV, CSVW, PROV, QUDT/UCUM, and SKOS mappings. The
canonical definitions live in semsynth.semmap and power every stage of the
pipeline, from ingestion to reporting.
Core objects
Metadata: Root document that carries dataset-level DCAT information (title, description, purpose, landing page, citations, identifiers, funding, access rights) plus summaryStatistics such as dataset completeness and row counts. It contains a DatasetSchema.
DatasetSchema: Holds an ordered list of Column nodes.
Column: Captures CSVW/DSV fields (name, titles, descriptions, prov:hadRole, defaults) and optional summaryStatistics with declared statisticalDataType, completeness, missing-value formats, and numeric aggregates. Columns may embed a ColumnProperty.
ColumnProperty: Encodes richer semantics including units (UCUM/QUDT), codebooks (hasCodeBook with SKOS concepts), links to variable definitions, provenance (source), and SKOS mappings (exactMatch, closeMatch, etc.).
CodeBook/CodeConcept: Capture enumerations or codelists, each SKOS concept supporting mappings to external ontologies. Presence of a codebook implies a categorical type for privacy metadata.
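The nesting described above can be pictured with a few stand-in dataclasses. This is a minimal sketch, not the definitions in semsynth.semmap: every class carries a Sketch suffix to mark it as hypothetical, and the field names, types, and example codes are assumptions based only on the descriptions in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CodeConceptSketch:
    notation: str                       # code value as it appears in the data
    pref_label: str                     # human-readable label
    exact_match: List[str] = field(default_factory=list)  # external IRIs


@dataclass
class CodeBookSketch:
    concepts: List[CodeConceptSketch] = field(default_factory=list)


@dataclass
class ColumnPropertySketch:
    unit: Optional[str] = None          # e.g. a UCUM code
    code_book: Optional[CodeBookSketch] = None


@dataclass
class ColumnSketch:
    name: str
    role: Optional[str] = None          # stands in for prov:hadRole
    column_property: Optional[ColumnPropertySketch] = None


@dataclass
class DatasetSchemaSketch:
    columns: List[ColumnSketch] = field(default_factory=list)


@dataclass
class MetadataSketch:
    title: str
    dataset_schema: DatasetSchemaSketch


def is_categorical(column: ColumnSketch) -> bool:
    """Treat a column as categorical when it carries a codebook."""
    prop = column.column_property
    return prop is not None and prop.code_book is not None


# Illustrative values only; the codes and labels are not taken from a dataset.
sex = ColumnSketch(
    name="sex",
    column_property=ColumnPropertySketch(
        code_book=CodeBookSketch(
            [CodeConceptSketch("0", "label A"), CodeConceptSketch("1", "label B")]
        )
    ),
)
trestbps = ColumnSketch(
    name="trestbps", column_property=ColumnPropertySketch(unit="mm[Hg]")
)
meta_sketch = MetadataSketch("Example dataset", DatasetSchemaSketch([sex, trestbps]))
```

Keeping the codebook on the column property rather than on the column itself mirrors the description above, where Column holds the CSVW/DSV fields and ColumnProperty holds the richer semantics.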
Every dataclass inherits RDFMixin, enabling round-trips via
to_jsonld()/from_jsonld and storage inside parquet files through the
pandas semmap accessor.
# Inspect the first few columns and their statistical data types.
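import json

from semsynth.semmap import Metadata

# Load the curated JSON-LD metadata; the context manager closes the file handle.
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = Metadata.from_dcat_dsv(json.load(fh))

# Falls back to None when summaryStatistics is missing or has no statisticalDataType.
[(c.name, getattr(c.summaryStatistics, "statisticalDataType", None)) for c in meta.datasetSchema.columns[:5]]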
[('age', None), ('sex', None), ('cp', None), ('fbs', None), ('trestbps', None)]
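The round-trip behaviour attributed to RDFMixin earlier in this section can be illustrated with a toy mixin. This is a hedged sketch only: JsonLdSketchMixin, ToyColumn, and the CONTEXT value are hypothetical, and the real RDFMixin may serialize its fields quite differently.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


class JsonLdSketchMixin:
    """Hypothetical stand-in for RDFMixin; not the semsynth implementation."""

    # Assumed context; the profile's real context is not shown in this section.
    CONTEXT = {"csvw": "http://www.w3.org/ns/csvw#"}

    def to_jsonld(self) -> dict:
        # Emit the dataclass fields, dropping unset (None) values.
        doc = {"@context": self.CONTEXT, "@type": type(self).__name__}
        doc.update({k: v for k, v in asdict(self).items() if v is not None})
        return doc

    @classmethod
    def from_jsonld(cls, doc: dict):
        # Keep only keys that correspond to dataclass fields.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in doc.items() if k in known})


@dataclass
class ToyColumn(JsonLdSketchMixin):
    name: str
    titles: Optional[str] = None


original = ToyColumn("age", "Age in years")
# Pass through a JSON string to mimic writing to and reading from storage.
restored = ToyColumn.from_jsonld(json.loads(json.dumps(original.to_jsonld())))
```

After the round trip, restored compares equal to original because dataclass equality is field-based.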
Creation and ingestion
Templates: semsynth/dataproviders/uciml.py emits DCAT/DSV JSON-LD that matches the SemMap dataclasses. The same layout is loaded by Metadata.from_dcat_dsv.
Column mapping: map_columns/shared.py parses curated metadata for LLM prompts and returns a Metadata instance; map_columns/sssom_to_semmap.py merges SSSOM mapping files into ColumnProperty SKOS relationships to keep column mappings aligned with downstream semantics.
Pipeline loading: semsynth.pipeline.Preprocessor applies curated metadata (DatasetSpec.meta or JSON-LD files) to incoming dataframes through the pandas SemMap accessor, producing a shared Metadata object and JSON-LD export for later stages.
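To make the SSSOM merge step more concrete, the sketch below groups the rows of an SSSOM TSV by column and SKOS predicate. It is an illustration under stated assumptions, not the code in map_columns/sssom_to_semmap.py: the function name group_sssom_by_column is hypothetical, the subject_id is assumed to be the column name, and the EX: identifiers are placeholders rather than real ontology terms.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

SKOS_PREDICATES = {
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "skos:relatedMatch",
}


def group_sssom_by_column(tsv_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Return {column: {skos_property: [object ids]}} from SSSOM TSV text."""
    # SSSOM files may start with a commented metadata header; skip those lines.
    body = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(body)), delimiter="\t")
    merged = defaultdict(lambda: defaultdict(list))
    for row in reader:
        predicate = row["predicate_id"]
        if predicate not in SKOS_PREDICATES:
            continue  # ignore non-SKOS mappings in this sketch
        skos_property = predicate.split(":", 1)[1]
        merged[row["subject_id"]][skos_property].append(row["object_id"])
    return {column: dict(props) for column, props in merged.items()}


# Placeholder mappings; the EX: identifiers are not real ontology terms.
SAMPLE = "\n".join(
    [
        "#curie_map:",
        "#  skos: http://www.w3.org/2004/02/skos/core#",
        "subject_id\tpredicate_id\tobject_id",
        "age\tskos:exactMatch\tEX:0001",
        "age\tskos:closeMatch\tEX:0002",
        "sex\tskos:exactMatch\tEX:0003",
        "sex\towl:sameAs\tEX:0004",
    ]
)
mappings = group_sssom_by_column(SAMPLE)
```

The grouped dictionary has the same shape as the exactMatch and closeMatch lists described for ColumnProperty, so each entry could be attached to the matching column.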
Updates during preprocessing
Type inference: Inferred discrete/continuous hints are stored alongside metadata and used as fallbacks when semantics are incomplete.
Missingness: When missingness wrapping is enabled, Metadata is updated via update_completeness_from_missingness to refresh dataset/column completeness and missing-value annotations derived from the fitted model.
Persistence: The updated metadata is carried through PreprocessingResult.semmap_metadata and stored with artifacts for each run.
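As a rough illustration of what a completeness figure means, the sketch below computes the share of non-missing values per column and averages it for the dataset. It does not reproduce update_completeness_from_missingness, which works from a fitted model whose interface is not shown here; the function names, the missing-value tokens, and the unweighted average are all assumptions.

```python
from typing import Any, Dict, List, Sequence


def column_completeness(
    records: List[Dict[str, Any]],
    missing_values: Sequence[Any] = (None, "", "?"),
) -> Dict[str, float]:
    """Fraction of non-missing values per column (columns taken from row 0)."""
    if not records:
        return {}
    present = {name: 0 for name in records[0]}
    for record in records:
        for name in present:
            if record.get(name) not in missing_values:
                present[name] += 1
    return {name: count / len(records) for name, count in present.items()}


def dataset_completeness(per_column: Dict[str, float]) -> float:
    """Unweighted mean of the column figures (an assumed definition)."""
    if not per_column:
        return 1.0
    return sum(per_column.values()) / len(per_column)


# Illustrative rows only; "?" stands for a missing-value token in the source.
records = [
    {"age": 63, "ca": "0"},
    {"age": 67, "ca": "?"},
    {"age": None, "ca": "2"},
    {"age": 41, "ca": "?"},
]
per_column = column_completeness(records)
overall = dataset_completeness(per_column)
```

This sketch does not treat floating-point NaN as missing; a dataframe-based version would normally check for that as well.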
Use in analytics
Privacy metrics: Metadata.to_privacy_frame normalizes SemMap roles to the privacy expectations (qi, sensitive, id, ignore, target) and maps statisticalDataType or codebooks to coarse types (numeric/categorical). The resulting dataframe is passed to privacy_metrics.summarize_privacy_synthcity, ensuring curated roles and types drive quasi-identifier selection and sensitive feature handling. When SemMap metadata is unavailable, the pipeline falls back to the inferred type map.
Downstream fidelity: The same Metadata instance is serialized to JSON-LD and provided to downstream_fidelity.compare_real_vs_synth, allowing role and codebook semantics to guide modeling and encoding.
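The role and type normalization described above can be sketched as two small lookups followed by a fallback. This is not the body of Metadata.to_privacy_frame: the alias table, the set of numeric type names, the default role of ignore, and the default type of categorical are assumptions made for illustration, and the example roles are not taken from any curated metadata file.

```python
from typing import Dict, List, Optional

ALLOWED_ROLES = ("qi", "sensitive", "id", "ignore", "target")
# Hypothetical aliases; the real role vocabulary may differ.
ROLE_ALIASES = {"quasi-identifier": "qi", "identifier": "id"}
# Hypothetical names that count as numeric in this sketch.
NUMERIC_TYPES = {"numeric", "continuous"}


def normalize_role(raw: Optional[str]) -> str:
    """Map a free-form role onto the five privacy roles (default: ignore)."""
    if raw is None:
        return "ignore"
    key = raw.strip().lower()
    if key in ALLOWED_ROLES:
        return key
    return ROLE_ALIASES.get(key, "ignore")


def coarse_type(
    statistical_type: Optional[str], has_codebook: bool, inferred: Optional[str]
) -> str:
    """Codebook first, then the declared type, then the inferred hint."""
    if has_codebook:
        return "categorical"
    if statistical_type is not None:
        return "numeric" if statistical_type.lower() in NUMERIC_TYPES else "categorical"
    if inferred is not None:
        return "numeric" if inferred.lower() in NUMERIC_TYPES else "categorical"
    return "categorical"


def privacy_rows(columns: List[dict], inferred: Dict[str, str]) -> List[dict]:
    """Build one {name, role, type} row per column description."""
    return [
        {
            "name": col["name"],
            "role": normalize_role(col.get("role")),
            "type": coarse_type(
                col.get("statistical_type"),
                bool(col.get("has_codebook")),
                inferred.get(col["name"]),
            ),
        }
        for col in columns
    ]


rows = privacy_rows(
    [
        {"name": "age", "role": "Quasi-Identifier", "statistical_type": "numeric"},
        {"name": "sex", "role": "qi", "has_codebook": True},
        {"name": "chol"},  # no curated semantics, so the inferred hint applies
        {"name": "num", "role": "target", "has_codebook": True},
    ],
    inferred={"chol": "continuous"},
)
```

Checking the codebook before the declared type follows the statement that the presence of a codebook implies a categorical type for privacy metadata.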
Reporting and exports
semsynth.reporting.write_report_md receives the shared Metadata and renders
human-readable dataset descriptions, citations, and completeness figures that
match the SemMap export stored alongside backend outputs. This keeps reports,
privacy metrics, and SemMap artifacts consistent.
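To show the kind of output such a report might contain, the sketch below renders a title, a description, and a completeness table as Markdown. It is not semsynth.reporting.write_report_md, whose signature and layout are not given here; render_report_section and its arguments are hypothetical.

```python
from typing import Dict


def render_report_section(
    title: str, description: str, completeness: Dict[str, float]
) -> str:
    """Render a small Markdown section with a per-column completeness table."""
    lines = [
        f"# {title}",
        "",
        description,
        "",
        "| Column | Completeness |",
        "| --- | --- |",
    ]
    for name, value in completeness.items():
        # Format the fraction as a percentage with one decimal place.
        lines.append(f"| {name} | {value:.1%} |")
    return "\n".join(lines)


# Illustrative inputs only.
report = render_report_section(
    "Example dataset",
    "Illustrative description text.",
    {"age": 0.75, "ca": 0.5},
)
```

Rendering from the same values that are exported as metadata is what keeps the report and the stored artifacts in agreement.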