Privacy metrics

SemSynth summarizes disclosure risk by combining SynthCity’s privacy and sanity checks with quasi-identifier analysis derived from dataset metadata. The DatasetPrivacySummary dataclass captures row counts, quasi-identifier lists, overlap/duplicate rates, nearest-neighbor distances, k-anonymity figures, t-closeness diagnostics, and (when available) identifiability and delta-presence metrics.

# Build the privacy metadata frame derived from curated roles/types.
import json
import pandas as pd
from semsynth.semmap import Metadata
from semsynth.utils import infer_types

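# Load the cached heart dataset and infer a coarse type for each column.
heart = pd.read_csv("../downloads-cache/uciml/45.csv.gz")
disc, cont = infer_types(heart)
inferred = {c: ("discrete" if c in disc else "continuous") for c in heart.columns}

# Read the curated metadata; the context manager closes the file after parsing.
with open("../mappings/uciml-45.metadata.json") as fh:
    meta = Metadata.from_dcat_dsv(json.load(fh))
privacy_frame = meta.to_privacy_frame(inferred)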
privacy_frame.head()
   variable role         type
0       age   qi      numeric
1       sex   qi  categorical
2        cp   qi  categorical
3       fbs   qi  categorical
4  trestbps   qi      numeric

Metric suite

  • SynthCity’s CommonRowsProportion, CloseValuesProbability, and NearestSyntheticNeighborDistance quantify exact overlaps, near-duplicates (using SynthCity’s fixed 0.2 threshold), and neighbor distance statistics between real and synthetic records.

  • k-map and k-anonymity values are computed on quasi-identifier groupings drawn from the metadata roles, while rare quasi-identifier reproduction rates flag synthetic groups that repeat sparse real combinations.

  • t-closeness is reported per sensitive attribute: numerical sensitive variables use a Wasserstein-1 distance between group and global distributions, and categorical variables use total variation distance on normalized counts.

  • Optional identifiability and delta-presence scores are attempted via SynthCity; failures emit warnings but do not halt reporting, keeping the pipeline resilient to missing optional dependencies.
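
The t-closeness description above can be illustrated with a small sketch. It is not SemSynth's implementation: the helper names (wasserstein_1, total_variation, t_closeness) and the toy frame are hypothetical, and taking the maximum over quasi-identifier groups is an assumption about how the per-group distances are summarized. The Wasserstein-1 helper integrates the absolute difference between two empirical CDFs, which is the same quantity scipy.stats.wasserstein_distance computes for unweighted samples.

```python
import numpy as np
import pandas as pd


def wasserstein_1(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    # Empirical CDFs evaluated at the left endpoint of each grid interval.
    u_cdf = np.searchsorted(u, grid[:-1], side="right") / len(u)
    v_cdf = np.searchsorted(v, grid[:-1], side="right") / len(v)
    return float(np.sum(np.abs(u_cdf - v_cdf) * widths))


def total_variation(p_values, q_values):
    """Total variation distance between two categorical samples."""
    p = pd.Series(p_values).value_counts(normalize=True)
    q = pd.Series(q_values).value_counts(normalize=True)
    # Align on the union of categories; missing categories get probability 0.
    p, q = p.align(q, fill_value=0.0)
    return float(0.5 * (p - q).abs().sum())


def t_closeness(df, qi, sensitive, numeric):
    """Largest group-vs-global distance for one sensitive attribute (assumed summary)."""
    dist = wasserstein_1 if numeric else total_variation
    overall = df[sensitive]
    return max(dist(group[sensitive], overall) for _, group in df.groupby(qi))


# Hypothetical toy data: one quasi-identifier and two sensitive attributes.
toy = pd.DataFrame({
    "sex": [0, 0, 1, 1],
    "chol": [200.0, 220.0, 240.0, 260.0],
    "num": ["no", "no", "no", "yes"],
})
t_numeric = t_closeness(toy, ["sex"], "chol", numeric=True)      # 20.0
t_categorical = t_closeness(toy, ["sex"], "num", numeric=False)  # 0.25
```

The numeric value is in the units of the sensitive column (here 20 units of chol), so it is not directly comparable across attributes unless the columns are rescaled first; the total variation distance always lies in [0, 1].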

# Check quasi-identifier group sizes that feed k-anonymity.
qi = list(privacy_frame.loc[privacy_frame.role == "qi", "variable"])
group_sizes = heart[qi].value_counts().head()
group_sizes
age  sex  cp  fbs  trestbps  chol  restecg  thalach  exang  oldpeak  slope  ca   thal
77   1    4   0    125       304   2        162      1      0.0      1      3.0  3.0     1
29   1    2   0    130       204   2        202      0      0.0      1      0.0  3.0     1
34   0    2   0    118       210   0        192      0      0.7      1      0.0  3.0     1
     1    1   0    118       182   2        174      0      0.0      1      0.0  3.0     1
35   0    4   0    138       183   0        182      0      1.4      1      0.0  3.0     1
Name: count, dtype: int64

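value_counts() sorts combinations by frequency, so head() lists the most common ones. k-anonymity is determined by the smallest group, so larger groups raise it in the privacy report. Here even the most frequent combinations occur only once, which means each of these records is unique on the full quasi-identifier set.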
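
To make the link between group sizes and the reported figures concrete, the following sketch computes k-anonymity, k-map, and a rare quasi-identifier reproduction rate on toy data. It uses common textbook definitions rather than SemSynth's or SynthCity's code: the function names, the max_size cutoff for "rare", and the toy frames are all hypothetical, and the library's exact calculations may differ.

```python
import pandas as pd


def k_anonymity(df, qi):
    """Size of the smallest quasi-identifier group in df."""
    return int(df.groupby(qi).size().min())


def k_map(synthetic, real, qi):
    """Smallest real-data group size among combinations present in synthetic."""
    real_sizes = real.groupby(qi).size().rename("real_count").reset_index()
    merged = synthetic[qi].drop_duplicates().merge(real_sizes, on=qi, how="left")
    # Combinations never seen in the real data map to a count of 0.
    return int(merged["real_count"].fillna(0).min())


def rare_qi_reproduction_rate(synthetic, real, qi, max_size=1):
    """Share of rare real combinations (size <= max_size) that reappear in synthetic."""
    real_sizes = real.groupby(qi).size().rename("real_count").reset_index()
    rare = real_sizes[real_sizes["real_count"] <= max_size]
    if rare.empty:
        return 0.0
    hits = rare.merge(synthetic[qi].drop_duplicates(), on=qi, how="inner")
    return len(hits) / len(rare)


# Hypothetical toy data with two quasi-identifiers.
real = pd.DataFrame({"age": [34, 34, 35, 77], "sex": [0, 0, 0, 1]})
synthetic = pd.DataFrame({"age": [34, 34, 77], "sex": [0, 0, 1]})
qi = ["age", "sex"]

k_real = k_anonymity(real, qi)                               # 1
k_mapped = k_map(synthetic, real, qi)                        # 1
rare_rate = rare_qi_reproduction_rate(synthetic, real, qi)   # 0.5
```

In this toy case the synthetic data repeats (77, 1), one of the two real combinations that occur only once, which gives the 0.5 reproduction rate. Note that groupby drops rows with missing quasi-identifier values by default, so columns with gaps may need explicit handling before these counts are trusted.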

References