semsynth.datasets

Functions

`get_default_openml`([cache_dir])
`get_default_uciml`([area, cache_dir])
`list_openml`([name_substr, cat_min, num_min, ...])
`list_uciml`([area, name_substr, cat_min, ...])	Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.
`load_dataset`(spec, *[, openml_cache_dir, ...])
`load_openml_by_name`(name, cache_dir)	Load an OpenML dataset by name, with local caching of the data payload.
`load_uciml_by_id`(dataset_id, cache_dir)	Load a UCI ML dataset by ID, with local caching of the data payload.
`rule`(*[, name, phony, base_iri, prov_dir, ...])	Decorate a function as a build rule with automatic provenance.
`specs_from_input`(provider[, datasets, area, ...])

Classes

`DatasetPayload`(spec, frame[, color, metadata])	Bundled dataset artefacts returned by provider loaders.
`DatasetSpec`(provider[, name, id, target, meta])	Container describing how to locate and identify a dataset.
`InPath`(*paths)	Marker for input paths where `"-"` maps to stdin.
`OutPath`(*paths)	Marker for output paths where `"-"` maps to stdout.

class semsynth.datasets.DatasetPayload(spec: DatasetSpec, frame: DataFrame, color: Series | None = None, metadata: Mapping[str, Any] | None = None)

Bases: object

Bundled dataset artefacts returned by provider loaders.

color: Series | None

frame: DataFrame

metadata: Mapping[str, Any] | None

spec: DatasetSpec

class semsynth.datasets.DatasetSpec(provider: str, name: str | None = None, id: int | None = None, target: str | None = None, meta: Any | None = None)

Bases: object

Container describing how to locate and identify a dataset.

id: int | None = None

meta: Any | None = None

name: str | None = None

provider: str

target: str | None = None
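
As a minimal sketch of how the fields above might be filled in, the stand-in dataclass below mirrors the documented signature so that the example runs without semsynth installed. The provider strings `"openml"` and `"uciml"`, the dataset name, the target column, and the numeric ID are assumptions for illustration only; this reference does not confirm them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DatasetSpec:
    """Stand-in mirroring the documented fields; not the semsynth class itself."""

    provider: str
    name: Optional[str] = None
    id: Optional[int] = None
    target: Optional[str] = None
    meta: Any = None


# Assumed provider strings and illustrative identifiers (hypothetical values).
by_name = DatasetSpec(provider="openml", name="adult", target="class")
by_id = DatasetSpec(provider="uciml", id=45)

print(by_name)
print(by_id)
```

Only `provider` is required; a spec located by name leaves `id` as `None`, and a spec located by ID leaves `name` as `None`.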

semsynth.datasets.list_openml(name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, cache_dir: Path = PosixPath('downloads-cache/openml')) → DataFrame

semsynth.datasets.list_uciml(area: str = 'Health and Medicine', name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, *, cachedir: Path = PosixPath('uciml-cache')) → DataFrame

Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.

Pulls the dataset list for the given area from the UCI API, then fetches each dataset via ucimlrepo and infers its variable types to decide whether it is mixed, that is, whether it has at least one categorical and at least one numeric variable. Only mixed datasets are returned.
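
The mixed-dataset check described above can be sketched with pandas. This is an illustration under stated assumptions, not the function's actual inference logic: the helper names `count_variable_types` and `is_mixed` are hypothetical, a column counts as numeric when pandas reports a numeric, non-boolean dtype, and every other column counts as categorical. It also assumes that `cat_min` and `num_min` act as minimum counts for each type.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def count_variable_types(frame: pd.DataFrame) -> tuple[int, int]:
    """Return (n_categorical, n_numeric) using a simple dtype heuristic."""
    n_numeric = sum(
        is_numeric_dtype(frame[col]) and not is_bool_dtype(frame[col])
        for col in frame.columns
    )
    # Assumption: anything that is not numeric is treated as categorical.
    n_categorical = len(frame.columns) - n_numeric
    return n_categorical, n_numeric


def is_mixed(frame: pd.DataFrame, cat_min: int = 1, num_min: int = 1) -> bool:
    """True if the frame has enough categorical and enough numeric columns."""
    n_categorical, n_numeric = count_variable_types(frame)
    return n_categorical >= cat_min and n_numeric >= num_min


# Small made-up frames, for illustration only.
mixed = pd.DataFrame(
    {"age": [63, 41, 57], "sex": ["F", "M", "F"], "chol": [233.0, 250.5, 204.0]}
)
numeric_only = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

print(count_variable_types(mixed))
print(is_mixed(mixed), is_mixed(numeric_only))
```

A dtype-based heuristic like this one misclassifies integer-coded categories as numeric, so a real implementation may rely on the variable metadata that the provider supplies instead.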

semsynth.datasets.load_dataset(spec: DatasetSpec, *, openml_cache_dir: Path = PosixPath('downloads-cache/openml'), uciml_cache_dir: Path = PosixPath('downloads-cache/uciml')) → DatasetPayload

semsynth.datasets.specs_from_input(provider: str, datasets: Iterable[str] | None = None, area: str = 'Health and Medicine', *, openml_cache_dir: OutPath = OutPath('downloads-cache/openml'), uciml_cache_dir: OutPath = OutPath('downloads-cache/uciml')) → List[DatasetSpec]