semsynth.utils

Functions

coerce_continuous_to_float(df, continuous_cols)

Convert continuous columns to floating point when possible.

coerce_discrete_to_category(df, discrete_cols)

Convert selected columns to the categorical dtype.

dataframe_to_markdown_table(df[, float_fmt])

Render a dataframe as a GitHub-flavoured markdown table.

ensure_dir(path)

Create the directory at path if it is missing.

infer_types(df[, cardinality_threshold])

Split dataframe columns into discrete and continuous lists.

is_discrete_series(s[, cardinality_threshold])

Return True if s should be treated as discrete.

is_numeric_series(s)

Return True when s represents a numeric series.

rename_categorical_categories_to_str(df, ...)

Ensure categorical levels are strings to keep outputs JSON-friendly.

seed_all(seed)

Create a deterministic NumPy random generator.

summarize_dataframe(df, discrete_cols, ...)

Summarise dataframe columns with statistics tailored by type.

Classes

PintType([units, subdtype])

A Pint duck-typed class, suitable for holding a quantity (with unit specified) dtype.

semsynth.utils.coerce_continuous_to_float(df: DataFrame, continuous_cols: List[str]) → DataFrame

Convert continuous columns to floating point when possible.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"value": [1, 2, 3]})
>>> converted = coerce_continuous_to_float(df, ["value"])
>>> str(converted.dtypes['value'])
'float64'

semsynth.utils.coerce_discrete_to_category(df: DataFrame, discrete_cols: List[str]) → DataFrame

Convert selected columns to the categorical dtype.

Parameters:
  • df – Source dataframe.

  • discrete_cols – Columns expected to be discrete.

Returns:

A copy of df with categorical conversions applied.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 1, 2], "b": [0.1, 0.2, 0.3]})
>>> converted = coerce_discrete_to_category(df, ["a"])
>>> str(converted.dtypes['a'])
'category'
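
The copy semantics described under Returns can be illustrated with plain pandas. This is a minimal sketch that assumes the conversion amounts to astype("category") on a copy; it is not taken from the semsynth source.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [0.1, 0.2, 0.3]})

# Work on a copy so the caller's dataframe keeps its original dtypes.
converted = df.copy()
converted["a"] = converted["a"].astype("category")

original_dtype = str(df.dtypes["a"])        # unchanged integer dtype
converted_dtype = str(converted.dtypes["a"])  # 'category'
```

Columns not listed as discrete, such as "b" here, are left untouched in this sketch.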

semsynth.utils.dataframe_to_markdown_table(df: DataFrame, float_fmt: str = '{:.4f}') → str

Render a dataframe as a GitHub-flavoured markdown table.

Parameters:
  • df – Table to render.

  • float_fmt – Format string applied to floating values.

Returns:

Markdown string representing the table.

Examples

>>> import pandas as pd
>>> table = pd.DataFrame({"variable": ["grade"], "type": ["discrete"]})
>>> print(dataframe_to_markdown_table(table))
| variable | type |
| --- | --- |
| grade | discrete |
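
To show how float_fmt could affect the rendered cells, here is a small stand-in renderer. The name markdown_table_sketch is hypothetical, and the rule that only float values pass through float_fmt is an assumption based on the parameter description, not the actual semsynth implementation.

```python
import pandas as pd


def markdown_table_sketch(df: pd.DataFrame, float_fmt: str = "{:.4f}") -> str:
    """Hypothetical stand-in for illustration only."""

    def fmt(value):
        # Assumption: floats use float_fmt, everything else falls back to str().
        return float_fmt.format(value) if isinstance(value, float) else str(value)

    header = "| " + " | ".join(str(c) for c in df.columns) + " |"
    divider = "| " + " | ".join("---" for _ in df.columns) + " |"
    rows = [
        "| " + " | ".join(fmt(v) for v in row) + " |"
        for row in df.itertuples(index=False, name=None)
    ]
    return "\n".join([header, divider, *rows])


table = pd.DataFrame({"variable": ["score"], "mean": [0.2]})
rendered = markdown_table_sketch(table, float_fmt="{:.2f}")
print(rendered)
```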

semsynth.utils.ensure_dir(path: str) → None

Create the directory at path if it is missing.

Parameters:

path – Filesystem location to create.

Examples

>>> import os, tempfile
>>> base = tempfile.mkdtemp()
>>> target = os.path.join(base, "nested")
>>> ensure_dir(target)
>>> os.path.isdir(target)
True
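
A common way to get "create if missing" behaviour with the standard library is os.makedirs with exist_ok=True. The sketch below uses the hypothetical name ensure_dir_sketch; whether the real ensure_dir also creates intermediate directories is an assumption.

```python
import os
import tempfile


def ensure_dir_sketch(path: str) -> None:
    """Hypothetical stand-in: create path (and parents) if missing."""
    # exist_ok=True makes repeated calls safe instead of raising FileExistsError.
    os.makedirs(path, exist_ok=True)


base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
ensure_dir_sketch(target)
ensure_dir_sketch(target)  # calling again on an existing directory is a no-op
nested_exists = os.path.isdir(target)
```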

semsynth.utils.infer_types(df: DataFrame, cardinality_threshold: int = 20) → Tuple[List[str], List[str]]

Split dataframe columns into discrete and continuous lists.

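Parameters:
  • df – Dataframe whose columns are classified.

  • cardinality_threshold – Maximum unique values before treating as continuous.

Returns:

A tuple of two lists: discrete column names, then continuous column names.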

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})
>>> infer_types(df)
(['a', 'c'], ['b'])
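
The exact classification rule is not spelled out here. The sketch below shows one heuristic that reproduces the example above: non-numeric and boolean columns are discrete, integer columns are discrete while their cardinality stays at or below the threshold, and floats are continuous. The names is_discrete_sketch and infer_types_sketch are hypothetical, and the real semsynth rule may differ.

```python
import pandas as pd
from pandas.api import types as ptypes


def is_discrete_sketch(s: pd.Series, cardinality_threshold: int = 20) -> bool:
    """Hypothetical heuristic, not the semsynth implementation."""
    if ptypes.is_bool_dtype(s) or not ptypes.is_numeric_dtype(s):
        return True  # booleans, strings, categoricals
    if ptypes.is_integer_dtype(s):
        # Low-cardinality integers tend to act as codes or levels.
        return s.nunique(dropna=True) <= cardinality_threshold
    return False  # assumption: floats are always continuous


def infer_types_sketch(df: pd.DataFrame, cardinality_threshold: int = 20):
    discrete, continuous = [], []
    for col in df.columns:
        if is_discrete_sketch(df[col], cardinality_threshold):
            discrete.append(col)
        else:
            continuous.append(col)
    return discrete, continuous


df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})
result = infer_types_sketch(df)

# An integer column with 100 distinct values exceeds the default threshold.
wide = infer_types_sketch(pd.DataFrame({"id": range(100)}))
```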

semsynth.utils.is_discrete_series(s: Series, cardinality_threshold: int = 20) → bool

Return True if s should be treated as discrete.

Parameters:
  • s – Series to inspect.

  • cardinality_threshold – Maximum unique values before treating as continuous.

Returns:

Whether the series is discrete.

Examples

>>> import pandas as pd
>>> is_discrete_series(pd.Series(["a", "b", "a"]))
True

semsynth.utils.is_numeric_series(s: Series) → bool

Return True when s represents a numeric series.

Examples

>>> import pandas as pd
>>> is_numeric_series(pd.Series([1.0, 2.5]))
True
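
For comparison, pandas ships its own dtype check, pandas.api.types.is_numeric_dtype. The sketch below shows how that check treats a few inputs; whether is_numeric_series delegates to it, and whether it counts booleans as numeric, is not stated here and is only an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

floats = is_numeric_dtype(pd.Series([1.0, 2.5]))       # float64 is numeric
strings = is_numeric_dtype(pd.Series(["1.0", "2.5"]))  # digit strings are not
bools = is_numeric_dtype(pd.Series([True, False]))     # pandas counts bool as numeric
```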

semsynth.utils.rename_categorical_categories_to_str(df: DataFrame, discrete_cols: List[str]) → DataFrame

Ensure categorical levels are strings to keep outputs JSON-friendly.

Examples

>>> import pandas as pd
>>> cat = pd.Series(pd.Categorical([1, 2, 1], categories=[1, 2]))
>>> renamed = rename_categorical_categories_to_str(pd.DataFrame({"value": cat}), ["value"])
>>> list(renamed['value'].cat.categories)
['1', '2']
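
For a single column, the same effect can be obtained with the pandas categorical accessor. This sketch assumes the renaming is equivalent to passing str to Series.cat.rename_categories; it is an illustration, not the semsynth code.

```python
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 1], categories=[1, 2]))

# rename_categories accepts a callable applied to every level.
renamed = cat.cat.rename_categories(str)

labels = list(renamed.cat.categories)
values = renamed.tolist()
```

Only the level labels change; each row still refers to the same level it did before.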

semsynth.utils.seed_all(seed: int) → Generator

Create a deterministic NumPy random generator.

Parameters:

seed – Random seed used for the generator.

Returns:

A numpy.random.Generator seeded with seed.

Examples

>>> rng = seed_all(0)
>>> rng.integers(0, 10, size=3).tolist()
[8, 6, 5]
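
Determinism here means that two generators built from the same seed yield the same stream. The sketch below demonstrates this with numpy.random.default_rng directly, on the assumption that seed_all returns such a generator; any other seeding that seed_all may perform is not covered.

```python
import numpy as np

rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

draws_a = rng_a.integers(0, 10, size=5).tolist()
draws_b = rng_b.integers(0, 10, size=5).tolist()

# Same seed, same sequence; a different seed gives an independent stream.
same_stream = draws_a == draws_b
```

Passing the returned generator explicitly to downstream code, rather than relying on global state, keeps each run reproducible.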

semsynth.utils.summarize_dataframe(df: DataFrame, discrete_cols: List[str], continuous_cols: List[str]) → DataFrame

Summarise dataframe columns with statistics tailored by type.

Parameters:
  • df – Dataframe to summarise.

  • discrete_cols – Columns considered discrete.

  • continuous_cols – Columns considered continuous.

Returns:

A dataframe with summary statistics per column.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"grade": ["A", "B", "A"], "score": [0.1, 0.2, 0.3]})
>>> summary = summarize_dataframe(df, ["grade"], ["score"])
>>> summary.loc[summary['variable'] == 'grade', 'top3'].iloc[0]
'A:2; B:1'
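
The 'top3' string in the example has the shape "level:count" joined by "; ". A value in that shape can be built from value_counts, as in the sketch below. This is an assumed reconstruction of the format for one discrete column, not the semsynth implementation.

```python
import pandas as pd

grades = pd.Series(["A", "B", "A"])

# value_counts sorts by frequency, so head(3) keeps the three most common levels.
counts = grades.value_counts().head(3)
top3 = "; ".join(f"{level}:{count}" for level, count in counts.items())
```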