semsynth.datasets
Functions
- list_uciml: Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.
- Load an OpenML dataset by name, with local caching of the data payload.
- Load a UCI ML dataset by ID, with local caching of the data payload.
- Decorate a function as a build rule with automatic provenance.
Classes
- DatasetPayload: Bundled dataset artefacts returned by provider loaders.
- DatasetSpec: Container describing how to locate and identify a dataset.
- Marker for input paths where
- Marker for output paths where
- class semsynth.datasets.DatasetPayload(spec: DatasetSpec, frame: DataFrame, color: Series | None = None, metadata: Mapping[str, Any] | None = None)
Bases: object
Bundled dataset artefacts returned by provider loaders.
- color: Series | None
- frame: DataFrame
- metadata: Mapping[str, Any] | None
- spec: DatasetSpec
- class semsynth.datasets.DatasetSpec(provider: str, name: str | None = None, id: int | None = None, target: str | None = None, meta: Any | None = None)
Bases: object
Container describing how to locate and identify a dataset.
- id: int | None = None
- meta: Any | None = None
- name: str | None = None
- provider: str
- target: str | None = None
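The field list above can be illustrated with a small stand-in. This is a minimal sketch, not the library's definition: `DatasetSpecSketch` is a hypothetical class that only mirrors the documented field names and defaults, the choice of a frozen dataclass is an assumption, and the provider strings, dataset name, and ID are placeholder values that the source does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class DatasetSpecSketch:
    """Hypothetical stand-in mirroring the documented DatasetSpec fields."""

    provider: str                 # the only required field
    name: Optional[str] = None    # dataset name, when located by name
    id: Optional[int] = None      # dataset ID, when located by ID
    target: Optional[str] = None  # optional target column
    meta: Optional[Any] = None    # free-form extra information


# Placeholder values: provider strings, name, and ID are assumptions.
by_name = DatasetSpecSketch(provider="openml", name="example-dataset", target="label")
by_id = DatasetSpecSketch(provider="uciml", id=1)

print(by_name)
print(by_id)
```

Only `provider` is required, so a spec can identify a dataset by name, by ID, or both, and every other field falls back to `None`.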
- semsynth.datasets.list_openml(name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, cache_dir: Path = PosixPath('downloads-cache/openml')) -> DataFrame
- semsynth.datasets.list_uciml(area: str = 'Health and Medicine', name_substr: str | None = None, cat_min: int = 1, num_min: int = 1, *, cachedir: Path = PosixPath('uciml-cache')) -> DataFrame
Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets in area.
The function pulls the dataset list for the given area from the UCI API. For each dataset, it then fetches the data via ucimlrepo and infers variable types to decide whether the dataset is mixed, meaning it has at least one categorical and one numeric variable. Only mixed datasets are returned.
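The mixed-dataset check described above can be sketched without any network access. This is a rough illustration under stated assumptions, not the library's implementation: `DatasetInfo` and `filter_mixed` are hypothetical names, the variable types are supplied directly rather than inferred from data fetched via ucimlrepo, `cat_min` and `num_min` are treated as minimum counts (an assumption based on their names), and the case-insensitive `name_substr` match is also an assumption.

```python
from typing import Iterable, List, NamedTuple, Optional, Sequence, Tuple


class DatasetInfo(NamedTuple):
    """Hypothetical record: one dataset with an already-inferred type per variable."""

    id: int
    name: str
    n_instances: int
    variable_types: Sequence[str]  # each entry is "categorical" or "numeric"


def filter_mixed(
    datasets: Iterable[DatasetInfo],
    cat_min: int = 1,
    num_min: int = 1,
    name_substr: Optional[str] = None,
) -> List[Tuple[int, str, int, int, int]]:
    """Return (id, name, n_instances, n_categorical, n_numeric) for mixed datasets."""
    rows = []
    for ds in datasets:
        # Optional name filter (case-insensitive matching is an assumption).
        if name_substr is not None and name_substr.lower() not in ds.name.lower():
            continue
        n_cat = sum(1 for t in ds.variable_types if t == "categorical")
        n_num = sum(1 for t in ds.variable_types if t == "numeric")
        # Keep the dataset only if it meets both minimum counts.
        if n_cat >= cat_min and n_num >= num_min:
            rows.append((ds.id, ds.name, ds.n_instances, n_cat, n_num))
    return rows


# Toy catalog with made-up entries, for illustration only.
catalog = [
    DatasetInfo(1, "toy-mixed", 100, ["categorical", "numeric", "numeric"]),
    DatasetInfo(2, "toy-numeric", 50, ["numeric", "numeric"]),
]
mixed = filter_mixed(catalog)
print(mixed)  # [(1, 'toy-mixed', 100, 1, 2)]
```

With the default thresholds of 1, the purely numeric entry is dropped because it has no categorical variable.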
- semsynth.datasets.load_dataset(spec: DatasetSpec, *, openml_cache_dir: Path = PosixPath('downloads-cache/openml'), uciml_cache_dir: Path = PosixPath('downloads-cache/uciml')) -> DatasetPayload
- semsynth.datasets.specs_from_input(provider: str, datasets: Iterable[str] | None = None, area: str = 'Health and Medicine', *, openml_cache_dir: OutPath = OutPath('downloads-cache/openml'), uciml_cache_dir: OutPath = OutPath('downloads-cache/uciml')) -> List[DatasetSpec]