Privacy metrics

SemSynth summarizes disclosure risk by combining SynthCity’s privacy and sanity checks with quasi-identifier analysis derived from dataset metadata. The DatasetPrivacySummary dataclass captures row counts, quasi-identifier lists, overlap/duplicate rates, nearest-neighbor distances, k-anonymity figures, t-closeness diagnostics, and (when available) identifiability and delta-presence metrics.

# Build the privacy metadata frame derived from curated roles/types.
import json
import pandas as pd
from semsynth.semmap import Metadata
from semsynth.utils import infer_types

heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
disc, cont = infer_types(heart)
inferred = {c: ("discrete" if c in disc else "continuous") for c in heart.columns}

# Load the curated metadata; the context manager closes the file handle.
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = Metadata.from_dcat_dsv(json.load(fh))
privacy_frame = meta.to_privacy_frame(inferred)
privacy_frame.head()

	variable	role	type
0	age	qi	numeric
1	sex	qi	categorical
2	cp	qi	categorical
3	fbs	qi	categorical
4	trestbps	qi	numeric

Metric suite

SynthCity’s CommonRowsProportion, CloseValuesProbability, and NearestSyntheticNeighborDistance quantify exact overlaps, near-duplicates (using SynthCity’s fixed 0.2 threshold), and neighbor distance statistics between real and synthetic records.
k-map and k-anonymity values are computed on quasi-identifier groupings drawn from the metadata roles, while rare quasi-identifier reproduction rates flag synthetic groups that repeat sparse real combinations.
t-closeness is reported per sensitive attribute: numerical sensitive variables use a Wasserstein-1 distance between group and global distributions, and categorical variables use total variation distance on normalized counts.
Optional identifiability and delta-presence scores are attempted via SynthCity; failures emit warnings but do not halt reporting, keeping the pipeline resilient to missing optional dependencies.
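
To make the t-closeness description concrete, here is a minimal sketch of the two distances, taking the worst-case (maximum) distance over quasi-identifier groups as in the standard definition. It is an illustration under stated assumptions, not SemSynth's implementation: the helper names (wasserstein_1, total_variation, t_closeness) and the toy columns are hypothetical, and whether SemSynth rescales the Wasserstein distance or aggregates groups differently is not stated here.

```python
import numpy as np
import pandas as pd


def wasserstein_1(u, v):
    """Wasserstein-1 distance between two empirical samples (area between CDFs)."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    # Empirical CDFs evaluated on each interval between consecutive values.
    u_cdf = np.searchsorted(u, all_vals[:-1], side="right") / len(u)
    v_cdf = np.searchsorted(v, all_vals[:-1], side="right") / len(v)
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))


def total_variation(group, overall):
    """Total variation distance between normalized category counts."""
    p = group.value_counts(normalize=True)
    q = overall.value_counts(normalize=True)
    cats = p.index.union(q.index)
    p = p.reindex(cats, fill_value=0.0)
    q = q.reindex(cats, fill_value=0.0)
    return float(0.5 * np.abs(p - q).sum())


def t_closeness(df, qi, sensitive, numeric):
    """Largest group-vs-global distance over all quasi-identifier groups."""
    overall = df[sensitive]
    dist = wasserstein_1 if numeric else total_variation
    return max(dist(g, overall) for _, g in df.groupby(qi)[sensitive])


# Hypothetical toy data: one quasi-identifier and two sensitive attributes.
toy = pd.DataFrame({
    "sex": [0, 0, 1, 1],
    "chol": [200.0, 220.0, 240.0, 260.0],
    "diagnosis": ["yes", "no", "yes", "yes"],
})

t_num = t_closeness(toy, ["sex"], "chol", numeric=True)        # 20.0
t_cat = t_closeness(toy, ["sex"], "diagnosis", numeric=False)  # 0.25
print(t_num, t_cat)
```

The numeric result is in the units of the sensitive variable (here 20 units of chol), so values are only comparable across attributes after some normalization; the categorical result always lies between 0 and 1.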

# Check quasi-identifier group sizes that feed k-anonymity.
qi = list(privacy_frame.loc[privacy_frame.role == "qi", "variable"])
group_sizes = heart[qi].value_counts().head()
group_sizes

age  sex  cp  fbs  trestbps  chol  restecg  thalach  exang  oldpeak  slope  ca   thal
 1    4   0    125       304   2        162      1      0.0      1      3.0  3.0     1
 1    2   0    130       204   2        202      0      0.0      1      0.0  3.0     1
 0    2   0    118       210   0        192      0      0.7      1      0.0  3.0     1
  1   0    118       182   2        174      0      0.0      1      0.0  3.0     1
 0    4   0    138       183   0        182      0      1.4      1      0.0  3.0     1
Name: count, dtype: int64

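These counts show how often the most frequent quasi-identifier combinations occur. k-anonymity is set by the smallest group, so larger groups raise it. In the output above even the most frequent combinations occur only once, which means every counted record is unique on its quasi-identifiers and k-anonymity is 1.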
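
The link between group sizes and k-anonymity can be sketched directly in pandas. This is a minimal illustration, assuming k-anonymity is the size of the smallest quasi-identifier group; the helper names (qi_group_sizes, k_anonymity, share_below_k) and the toy columns are hypothetical and are not part of SemSynth's API. Note that DataFrame.value_counts drops rows with missing values by default, whereas groupby with dropna=False keeps them as their own groups.

```python
import pandas as pd


def qi_group_sizes(df, qi):
    """Number of records sharing each quasi-identifier combination."""
    return df.groupby(qi, dropna=False).size()


def k_anonymity(df, qi):
    """Size of the smallest quasi-identifier group."""
    return int(qi_group_sizes(df, qi).min())


def share_below_k(df, qi, k):
    """Fraction of records that sit in groups smaller than k."""
    sizes = qi_group_sizes(df, qi)
    return float(sizes[sizes < k].sum() / len(df))


# Hypothetical toy data with coarsened quasi-identifiers.
toy = pd.DataFrame({
    "age_band": ["50-59", "50-59", "60-69", "60-69", "60-69", "70-79"],
    "sex": [1, 1, 0, 0, 0, 1],
})
toy_qi = ["age_band", "sex"]

k = k_anonymity(toy, toy_qi)          # 1: the single 70-79 record is unique
risky = share_below_k(toy, toy_qi, 3)  # 0.5: three of six records are in groups smaller than 3
print(k, risky)
```

Coarsening continuous quasi-identifiers (for example, banding age as in the toy data) is one common way to raise the smallest group size, at the cost of analytic detail.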

References

SynthCity metric APIs: https://synthcity.readthedocs.io/en/latest/generated/synthcity.metrics.eval_privacy.html
k-map and k-anonymity background: https://doi.org/10.1145/1401890.1401904
t-closeness definition: https://doi.org/10.1109/ICDE.2007.367856