semsynth.utils

Functions

`coerce_continuous_to_float`(df, continuous_cols)	Convert continuous columns to floating point when possible.
`coerce_discrete_to_category`(df, discrete_cols)	Convert selected columns to the categorical dtype.
`dataframe_to_markdown_table`(df[, float_fmt])	Render a dataframe as a GitHub-flavoured markdown table.
`ensure_dir`(path)	Create the directory at `path` if it is missing.
`infer_types`(df[, cardinality_threshold])	Split dataframe columns into discrete and continuous lists.
`is_discrete_series`(s[, cardinality_threshold])	Return `True` if `s` should be treated as discrete.
`is_numeric_series`(s)	Return `True` when `s` represents a numeric series.
`rename_categorical_categories_to_str`(df, ...)	Ensure categorical levels are strings to keep outputs JSON-friendly.
`seed_all`(seed)	Create a deterministic NumPy random generator.
`summarize_dataframe`(df, discrete_cols, ...)	Summarise dataframe columns with statistics tailored by type.

Classes

PintType([units, subdtype])

A Pint duck-typed class, suitable for holding a quantity (with unit specified) dtype.

semsynth.utils.coerce_continuous_to_float(df: DataFrame, continuous_cols: List[str]) → DataFrame

Convert continuous columns to floating point when possible.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"value": [1, 2, 3]})
>>> converted = coerce_continuous_to_float(df, ["value"])
>>> str(converted.dtypes['value'])
'float64'
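
The phrase "when possible" suggests that columns which cannot be parsed as numbers are left untouched. The sketch below illustrates that reading with a hypothetical stand-in, `coerce_to_float_sketch`. It is an assumption about the behaviour, not the semsynth implementation.

```python
import pandas as pd


def coerce_to_float_sketch(df, continuous_cols):
    """Hypothetical stand-in: cast each listed column to float if it parses."""
    out = df.copy()
    for col in continuous_cols:
        try:
            out[col] = pd.to_numeric(out[col]).astype(float)
        except (ValueError, TypeError):
            # Assumed fallback: keep columns that are not numeric as they are.
            pass
    return out


df = pd.DataFrame({"value": ["1", "2", "3"], "label": ["x", "y", "z"]})
converted = coerce_to_float_sketch(df, ["value", "label"])
print(converted.dtypes["value"])
```

Here `value` parses cleanly and becomes a float column, while `label` raises inside `pd.to_numeric` and is returned unchanged.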

semsynth.utils.coerce_discrete_to_category(df: DataFrame, discrete_cols: List[str]) → DataFrame

Convert selected columns to the categorical dtype.

Parameters:

df – Source dataframe.
discrete_cols – Columns expected to be discrete.

Returns:

A copy of df with categorical conversions applied.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 1, 2], "b": [0.1, 0.2, 0.3]})
>>> converted = coerce_discrete_to_category(df, ["a"])
>>> str(converted.dtypes['a'])
'category'

semsynth.utils.dataframe_to_markdown_table(df: DataFrame, float_fmt: str = '{:.4f}') → str

Render a dataframe as a GitHub-flavoured markdown table.

Parameters:

df – Table to render.
float_fmt – Format string applied to floating values.

Returns:

Markdown string representing the table.

Examples

>>> import pandas as pd
>>> table = pd.DataFrame({"variable": ["grade"], "type": ["discrete"]})
>>> print(dataframe_to_markdown_table(table))
| variable | type |
| --- | --- |
| grade | discrete |
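
The example above contains no floating values, so `float_fmt` has no visible effect. As a rough illustration of how such a format string might be applied cell by cell, here is a hypothetical renderer, `to_markdown_sketch`. It mirrors the documented output layout but is not the library's code.

```python
import pandas as pd


def to_markdown_sketch(df, float_fmt="{:.4f}"):
    """Hypothetical renderer: format floats with float_fmt, everything else with str()."""

    def fmt(value):
        return float_fmt.format(value) if isinstance(value, float) else str(value)

    header = "| " + " | ".join(str(c) for c in df.columns) + " |"
    divider = "| " + " | ".join("---" for _ in df.columns) + " |"
    rows = [
        "| " + " | ".join(fmt(v) for v in row) + " |"
        for row in df.itertuples(index=False)
    ]
    return "\n".join([header, divider, *rows])


table = pd.DataFrame({"variable": ["score"], "mean": [0.5]})
rendered = to_markdown_sketch(table)
print(rendered)
```

With the default format string the `mean` cell is rendered with four decimal places. Passing `float_fmt="{:.1f}"` would shorten it to one.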

semsynth.utils.ensure_dir(path: str) → None

Create the directory at path if it is missing.

Parameters:

path – Filesystem location to create.

Examples

>>> import os, tempfile
>>> base = tempfile.mkdtemp()
>>> target = os.path.join(base, "nested")
>>> ensure_dir(target)
>>> os.path.isdir(target)
True
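
A helper like this is typically safe to call repeatedly. The sketch below shows one common way to get that behaviour with the standard library. `ensure_dir_sketch` is a hypothetical stand-in, and creating intermediate parent directories is an assumption, not something the documentation above confirms.

```python
import os
import tempfile


def ensure_dir_sketch(path):
    """Hypothetical stand-in: create path (and parents) without failing if it exists."""
    os.makedirs(path, exist_ok=True)


base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
ensure_dir_sketch(target)
ensure_dir_sketch(target)  # Second call finds the directory and does nothing.
created = os.path.isdir(target)
print(created)
```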

semsynth.utils.infer_types(df: DataFrame, cardinality_threshold: int = 20) → Tuple[List[str], List[str]]

Split dataframe columns into discrete and continuous lists.

Parameters:

df – DataFrame to analyse.
cardinality_threshold – Threshold used by is_discrete_series().

Returns:

Two lists containing discrete and continuous column names.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})
>>> infer_types(df)
(['a', 'c'], ['b'])

semsynth.utils.is_discrete_series(s: Series, cardinality_threshold: int = 20) → bool

Return True if s should be treated as discrete.

Parameters:

s – Series to inspect.
cardinality_threshold – Maximum unique values before treating as continuous.

Returns:

Whether the series is discrete.

Examples

>>> import pandas as pd
>>> is_discrete_series(pd.Series(["a", "b", "a"]))
True
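
The exact decision rule is not spelled out here. One rule that agrees with the documented examples (strings and small integer columns are discrete, float columns are continuous) is sketched below. `is_discrete_sketch` is hypothetical and may differ from the real heuristic.

```python
import pandas as pd
from pandas.api import types as ptypes


def is_discrete_sketch(s, cardinality_threshold=20):
    """Hypothetical rule, inferred from the documented examples only."""
    # Booleans and non-numeric values (strings, categories) count as discrete.
    if ptypes.is_bool_dtype(s) or not ptypes.is_numeric_dtype(s):
        return True
    # Integers are discrete while the number of distinct values stays small.
    if ptypes.is_integer_dtype(s):
        return s.nunique(dropna=True) <= cardinality_threshold
    # Assumed: floating-point columns are treated as continuous.
    return False


labels = is_discrete_sketch(pd.Series(["a", "b", "a"]))
small_ints = is_discrete_sketch(pd.Series([1, 2, 3]))
many_ints = is_discrete_sketch(pd.Series(range(100)))
floats = is_discrete_sketch(pd.Series([0.1, 0.2, 0.3]))
```

Under this rule, an integer identifier column with 100 distinct values exceeds the default threshold of 20 and is classified as continuous.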

semsynth.utils.is_numeric_series(s: Series) → bool

Return True when s represents a numeric series.

Examples

>>> import pandas as pd
>>> is_numeric_series(pd.Series([1.0, 2.5]))
True

semsynth.utils.rename_categorical_categories_to_str(df: DataFrame, discrete_cols: List[str]) → DataFrame

Ensure categorical levels are strings to keep outputs JSON-friendly.

Examples

>>> import pandas as pd
>>> cat = pd.Series(pd.Categorical([1, 2, 1], categories=[1, 2]))
>>> renamed = rename_categorical_categories_to_str(pd.DataFrame({"value": cat}), ["value"])
>>> list(renamed['value'].cat.categories)
['1', '2']
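
To see why string levels matter for JSON output, consider the sketch below. It uses pandas' `rename_categories` with `str` as the mapper. `categories_to_str_sketch` is a hypothetical stand-in and may not match how the library performs the conversion.

```python
import json

import pandas as pd


def categories_to_str_sketch(df, discrete_cols):
    """Hypothetical stand-in: stringify the levels of categorical columns."""
    out = df.copy()
    for col in discrete_cols:
        # Only categorical columns have levels to rename.
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            out[col] = out[col].cat.rename_categories(str)
    return out


cat = pd.Series(pd.Categorical([1, 2, 1], categories=[1, 2]))
renamed = categories_to_str_sketch(pd.DataFrame({"value": cat}), ["value"])
levels = list(renamed["value"].cat.categories)
payload = json.dumps(levels)
print(payload)
```

Integer levels come back from pandas as NumPy integers, which `json.dumps` rejects. String levels serialise without a custom encoder.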

semsynth.utils.seed_all(seed: int) → Generator

Create a deterministic NumPy random generator.

Parameters:

seed – Random seed used for the generator.

Returns:

A numpy.random.Generator seeded with seed.

Examples

>>> rng = seed_all(0)
>>> rng.integers(0, 10, size=3).tolist()
[8, 6, 5]
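
Determinism here means that two generators built from the same seed yield the same stream. The sketch below demonstrates this with `numpy.random.default_rng`. `seed_all_sketch` is a hypothetical stand-in, and whether the real function also seeds other random sources is not stated above.

```python
import numpy as np


def seed_all_sketch(seed):
    """Hypothetical stand-in: build a seeded NumPy Generator."""
    return np.random.default_rng(seed)


first = seed_all_sketch(0).integers(0, 10, size=3).tolist()
second = seed_all_sketch(0).integers(0, 10, size=3).tolist()
same_stream = first == second
print(same_stream)
```

Passing the returned generator explicitly to each sampling routine keeps runs reproducible without relying on global random state.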

semsynth.utils.summarize_dataframe(df: DataFrame, discrete_cols: List[str], continuous_cols: List[str]) → DataFrame

Summarise dataframe columns with statistics tailored by type.

Parameters:

df – Dataframe to summarise.
discrete_cols – Columns considered discrete.
continuous_cols – Columns considered continuous.

Returns:

A dataframe with summary statistics per column.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"grade": ["A", "B", "A"], "score": [0.1, 0.2, 0.3]})
>>> summary = summarize_dataframe(df, ["grade"], ["score"])
>>> summary.loc[summary['variable'] == 'grade', 'top3'].iloc[0]
'A:2; B:1'
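
The `top3` value in the example pairs each level with its count, most frequent first. A string in that format can be built as sketched below. `top_k_sketch` is a hypothetical helper, and the other columns of the summary are not reproduced here.

```python
import pandas as pd


def top_k_sketch(s, k=3):
    """Hypothetical helper: 'level:count' pairs for the k most frequent levels."""
    counts = s.value_counts().head(k)  # value_counts sorts by descending frequency
    return "; ".join(f"{level}:{count}" for level, count in counts.items())


grades = pd.Series(["A", "B", "A"])
top3 = top_k_sketch(grades)
print(top3)
```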