Privacy metrics
SemSynth summarizes disclosure risk by combining SynthCity’s privacy and sanity checks with quasi-identifier analysis derived from dataset metadata. The DatasetPrivacySummary dataclass captures row counts, quasi-identifier lists, overlap/duplicate rates, nearest-neighbor distances, k-anonymity figures, t-closeness diagnostics, and (when available) identifiability and delta-presence metrics.
# Build the privacy metadata frame derived from curated roles/types.
import json
import pandas as pd
from semsynth.semmap import Metadata
from semsynth.utils import infer_types
heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
disc, cont = infer_types(heart)
inferred = {c: ("discrete" if c in disc else "continuous") for c in heart.columns}
# Load the curated metadata; the context manager closes the file handle.
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = Metadata.from_dcat_dsv(json.load(fh))
privacy_frame = meta.to_privacy_frame(inferred)
privacy_frame.head()
| | variable | role | type |
|---|---|---|---|
| 0 | age | qi | numeric |
| 1 | sex | qi | categorical |
| 2 | cp | qi | categorical |
| 3 | fbs | qi | categorical |
| 4 | trestbps | qi | numeric |
Metric suite
SynthCity’s CommonRowsProportion, CloseValuesProbability, and NearestSyntheticNeighborDistance quantify exact overlaps, near-duplicates (using SynthCity’s fixed 0.2 threshold), and neighbor distance statistics between real and synthetic records.
k-map and k-anonymity values are computed on quasi-identifier groupings drawn from the metadata roles, while rare quasi-identifier reproduction rates flag synthetic groups that repeat sparse real combinations.
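For intuition about the three SynthCity overlap metrics named above, the following is a minimal dependency-free sketch of the ideas behind them. It is not SynthCity's implementation: the helper names, the toy rows, the min-max scaling step, and the choice to measure distance from each synthetic row to its nearest real row are all assumptions made for illustration. Only the 0.2 threshold comes from the text.

```python
import math

def min_max_scale(rows, reference):
    """Scale each column to [0, 1] using the min/max of the reference rows."""
    columns = list(zip(*reference))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((value - low) / span for value, low, span in zip(row, lows, spans))
        for row in rows
    ]

def nearest_distances(synthetic, real):
    """Euclidean distance from each synthetic row to its closest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy (age, trestbps) rows, invented for this example.
real = [(63, 145.0), (37, 130.0), (41, 130.0), (56, 120.0)]
synthetic = [(63, 145.0), (40, 131.0), (70, 180.0)]

# Exact overlap: share of real rows that reappear verbatim in the synthetic set.
synthetic_set = set(synthetic)
common_rows = sum(row in synthetic_set for row in real) / len(real)

# Neighbor distances are computed on columns scaled by the real data's range.
real_scaled = min_max_scale(real, real)
synthetic_scaled = min_max_scale(synthetic, real)
distances = nearest_distances(synthetic_scaled, real_scaled)

# Near-duplicates: share of synthetic rows closer than 0.2 to some real row.
THRESHOLD = 0.2
close_rate = sum(d < THRESHOLD for d in distances) / len(distances)

print(common_rows, close_rate, min(distances), max(distances))
```

In this toy data one real row is copied exactly and a second synthetic row sits very close to a real one, so both the overlap and the near-duplicate rates are non-zero while the third synthetic row is far from every real record.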
t-closeness is reported per sensitive attribute: numerical sensitive variables use a Wasserstein-1 distance between group and global distributions, and categorical variables use total variation distance on normalized counts.
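The two distances described above can be sketched in plain Python. This is an illustration under stated assumptions rather than SemSynth's code: the function names and toy groups are hypothetical, and taking the maximum over groups follows the usual t-closeness definition, which may differ from how the report aggregates its diagnostics.

```python
from collections import Counter

def wasserstein_1(sample_a, sample_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Integrates the absolute gap between the two empirical CDFs.
    """
    points = sorted(set(sample_a) | set(sample_b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = sum(x <= left for x in sample_a) / len(sample_a)
        cdf_b = sum(x <= left for x in sample_b) / len(sample_b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total

def total_variation(sample_a, sample_b):
    """Total variation distance between normalized category counts."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    labels = set(freq_a) | set(freq_b)
    return 0.5 * sum(
        abs(freq_a[label] / len(sample_a) - freq_b[label] / len(sample_b))
        for label in labels
    )

def t_closeness(groups, distance):
    """Largest distance between any group's sensitive values and the pooled values."""
    pooled = [value for values in groups.values() for value in values]
    return max(distance(values, pooled) for values in groups.values())

# Sensitive values keyed by quasi-identifier group (toy data).
numeric_groups = {"group_a": [1.0, 2.0], "group_b": [3.0, 4.0]}
categorical_groups = {"group_a": ["yes", "yes"], "group_b": ["yes", "no"]}

t_numeric = t_closeness(numeric_groups, wasserstein_1)
t_categorical = t_closeness(categorical_groups, total_variation)
print(t_numeric, t_categorical)
```

A smaller t means each group's sensitive-value distribution looks more like the overall distribution, so group membership reveals less about the sensitive attribute.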
Optional identifiability and delta-presence scores are attempted via SynthCity; failures emit warnings but do not halt reporting, keeping the pipeline resilient to missing optional dependencies.
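The warn-and-continue behavior described above might look roughly like the sketch below. The helper `try_optional_metric`, the stand-in metric callables, and the placeholder value are hypothetical and are not part of SemSynth or SynthCity; the point is only that a failing optional metric yields a warning and a missing value instead of an exception.

```python
import warnings

def try_optional_metric(name, compute):
    """Run an optional metric; on failure, warn and return None."""
    try:
        return compute()
    except Exception as exc:  # optional metrics must never abort the report
        warnings.warn(
            f"Skipping optional privacy metric {name!r}: {exc}", RuntimeWarning
        )
        return None

def missing_backend():
    # Stand-in for a metric whose optional dependency is not installed.
    raise ImportError("optional dependency not installed")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report = {
        "identifiability": try_optional_metric("identifiability", missing_backend),
        # 0.12 is a placeholder value, not a real measurement.
        "delta_presence": try_optional_metric("delta_presence", lambda: 0.12),
    }

print(report, [str(w.message) for w in caught])
```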
# Check quasi-identifier group sizes that feed k-anonymity.
qi = list(privacy_frame.loc[privacy_frame.role == "qi", "variable"])
group_sizes = heart[qi].value_counts().head()
group_sizes
age sex cp fbs trestbps chol restecg thalach exang oldpeak slope ca thal
77 1 4 0 125 304 2 162 1 0.0 1 3.0 3.0 1
29 1 2 0 130 204 2 202 0 0.0 1 0.0 3.0 1
34 0 2 0 118 210 0 192 0 0.7 1 0.0 3.0 1
1 1 0 118 182 2 174 0 0.0 1 0.0 3.0 1
35 0 4 0 138 183 0 182 0 1.4 1 0.0 3.0 1
Name: count, dtype: int64
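These counts are the sizes of the quasi-identifier groups, most frequent first; larger groups typically raise k-anonymity in the privacy report. Here even the most frequent combinations occur only once, so each of these records is unique on its quasi-identifiers.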
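To make the link between group sizes and the k figures concrete, here is a small sketch using the textbook definitions. The function names, the toy (age, sex) tuples, and the rarity cutoff are assumptions for illustration; treating the real table as the population for k-map is also an assumption, and SemSynth's exact definitions may differ.

```python
from collections import Counter

def k_anonymity(qi_rows):
    """Size of the smallest quasi-identifier group in a table."""
    return min(Counter(qi_rows).values())

def k_map(released_rows, population_rows):
    """Smallest number of population records matching any released record.

    A released combination that never occurs in the population counts as 0.
    """
    population = Counter(population_rows)
    return min(population[row] for row in released_rows)

def rare_reproduction_rate(synthetic_rows, real_rows, max_real_count=1):
    """Share of synthetic rows whose combination is rare in the real data."""
    real_counts = Counter(real_rows)
    rare = {row for row, n in real_counts.items() if n <= max_real_count}
    return sum(row in rare for row in synthetic_rows) / len(synthetic_rows)

# Toy (age, sex) quasi-identifier tuples.
real_qi = [(34, 1), (34, 1), (35, 0), (35, 0), (77, 1)]
synthetic_qi = [(34, 1), (35, 0), (77, 1), (34, 1)]

k_real = k_anonymity(real_qi)                # the lone (77, 1) record limits k
k_without_outlier = k_anonymity(real_qi[:4])
k_map_value = k_map(synthetic_qi, real_qi)
rare_rate = rare_reproduction_rate(synthetic_qi, real_qi)
print(k_real, k_without_outlier, k_map_value, rare_rate)
```

A single unique combination is enough to pull k down to 1, which is why coarsening or dropping high-cardinality quasi-identifiers usually has a large effect on these figures.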
References
SynthCity metric APIs: https://synthcity.readthedocs.io/en/latest/generated/synthcity.metrics.eval_privacy.html
k-map and k-anonymity background: https://doi.org/10.1145/1401890.1401904
t-closeness definition: https://doi.org/10.1109/ICDE.2007.367856