semsynth.datasets

Functions

get_default_openml([cache_dir])

get_default_uciml([area, cache_dir])

list_openml([name_substr, cat_min, num_min, ...])

list_uciml([area, name_substr, cat_min, ...])

Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.

load_dataset(spec, *[, openml_cache_dir, ...])

load_openml_by_name(name, cache_dir)

Load an OpenML dataset by name, with local caching of the data payload.

load_uciml_by_id(dataset_id, cache_dir)

Load a UCI ML dataset by ID, with local caching of the data payload.

rule(*[, name, phony, base_iri, prov_dir, ...])

Decorate a function as a build rule with automatic provenance.

specs_from_input(provider[, datasets, area, ...])

Classes

DatasetPayload(spec, frame[, color, metadata])

Bundled dataset artefacts returned by provider loaders.

DatasetSpec(provider[, name, id, target, meta])

Container describing how to locate and identify a dataset.

InPath(*paths)

Marker for input paths where "-" maps to stdin.

OutPath(*paths)

Marker for output paths where "-" maps to stdout.
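
The "-" convention used by InPath and OutPath can be illustrated with a minimal standalone sketch. The helpers `open_input` and `open_output` below are hypothetical and are not part of semsynth; they only show one common way a "-" path is mapped onto the standard streams.

```python
import contextlib
import sys
from pathlib import Path


@contextlib.contextmanager
def open_input(path: str):
    """Yield a readable text stream; "-" means stdin (hypothetical helper)."""
    if path == "-":
        # stdin is not owned by the caller, so it is yielded without closing.
        yield sys.stdin
    else:
        with Path(path).open("r", encoding="utf-8") as handle:
            yield handle


@contextlib.contextmanager
def open_output(path: str):
    """Yield a writable text stream; "-" means stdout (hypothetical helper)."""
    if path == "-":
        # stdout is likewise left open for the rest of the program.
        yield sys.stdout
    else:
        with Path(path).open("w", encoding="utf-8") as handle:
            yield handle
```

With helpers like these, a command-line tool can accept either a file name or "-" for the same argument and treat both as ordinary text streams.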

class semsynth.datasets.DatasetPayload(spec: DatasetSpec, frame: DataFrame, color: Series | None = None, metadata: Mapping[str, Any] | None = None)

Bases: object

Bundled dataset artefacts returned by provider loaders.

color: Series | None
frame: DataFrame
metadata: Mapping[str, Any] | None
spec: DatasetSpec

class semsynth.datasets.DatasetSpec(provider: str, name: str | None = None, id: int | None = None, target: str | None = None, meta: Any | None = None)

Bases: object

Container describing how to locate and identify a dataset.
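
As a rough illustration of how the fields fit together, the sketch below uses a stand-in dataclass that mirrors the documented signature. It is not the semsynth class itself, and the provider strings ("openml", "uciml"), the dataset name, the target column, and the id are all placeholder assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DatasetSpec:
    """Stand-in mirroring the documented fields; not the semsynth class."""

    provider: str
    name: Optional[str] = None
    id: Optional[int] = None
    target: Optional[str] = None
    meta: Any = None


# Placeholder values: a dataset located by name and one located by numeric id.
by_name = DatasetSpec(provider="openml", name="example-dataset", target="label")
by_id = DatasetSpec(provider="uciml", id=1)
```

Only `provider` is required; every other field defaults to None, so a spec can carry either a name or an id depending on how the provider looks datasets up.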

id: int | None = None
meta: Any | None = None
name: str | None = None
provider: str
target: str | None = None

semsynth.datasets.list_openml(name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, cache_dir: Path = PosixPath('downloads-cache/openml')) → DataFrame

semsynth.datasets.list_uciml(area: str = 'Health and Medicine', name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, *, cachedir: Path = PosixPath('uciml-cache')) → DataFrame

Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.

Pulls the dataset list for the given area from the UCI API, then fetches each dataset via ucimlrepo and infers its variable types to decide whether it is mixed, that is, whether it has at least one categorical and at least one numeric variable. Only mixed datasets are returned.
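
The "mixed" test described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper names `count_variable_types` and `is_mixed` are hypothetical, the data is modelled as a mapping from column name to values rather than a DataFrame, and the rule used to classify a column as numeric is an assumption.

```python
from numbers import Real
from typing import Any, Mapping, Sequence, Tuple


def count_variable_types(columns: Mapping[str, Sequence[Any]]) -> Tuple[int, int]:
    """Return (n_categorical, n_numeric) for a column-name -> values mapping."""
    n_categorical = 0
    n_numeric = 0
    for values in columns.values():
        present = [v for v in values if v is not None]
        # Assumed rule: a column is numeric when every non-missing value is a
        # real number. Booleans and empty columns are counted as categorical.
        if present and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in present
        ):
            n_numeric += 1
        else:
            n_categorical += 1
    return n_categorical, n_numeric


def is_mixed(
    columns: Mapping[str, Sequence[Any]], cat_min: int = 1, num_min: int = 1
) -> bool:
    """True when the data meets both the categorical and numeric minimums."""
    n_categorical, n_numeric = count_variable_types(columns)
    return n_categorical >= cat_min and n_numeric >= num_min
```

The `cat_min` and `num_min` defaults of 1 match the documented signature, so with default arguments a dataset qualifies as soon as it has one column of each kind.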

semsynth.datasets.load_dataset(spec: DatasetSpec, *, openml_cache_dir: Path = PosixPath('downloads-cache/openml'), uciml_cache_dir: Path = PosixPath('downloads-cache/uciml')) → DatasetPayload

semsynth.datasets.specs_from_input(provider: str, datasets: Iterable[str] | None = None, area: str = 'Health and Medicine', *, openml_cache_dir: OutPath = OutPath('downloads-cache/openml'), uciml_cache_dir: OutPath = OutPath('downloads-cache/uciml')) → List[DatasetSpec]