semsynth.utils
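Functions
| coerce_continuous_to_float | Convert continuous columns to floating point when possible. |
| coerce_discrete_to_category | Convert selected columns to the categorical dtype. |
| dataframe_to_markdown_table | Render a dataframe as a GitHub-flavoured markdown table. |
| ensure_dir | Create the directory at path if it is missing. |
| infer_types | Split dataframe columns into discrete and continuous lists. |
| is_discrete_series | Return True if s should be treated as discrete. |
| is_numeric_series | Return True when s represents a numeric series. |
| rename_categorical_categories_to_str | Ensure categorical levels are strings to keep outputs JSON-friendly. |
| seed_all | Create a deterministic NumPy random generator. |
| summarize_dataframe | Summarise dataframe columns with statistics tailored by type. |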
Classes
|
A Pint duck-typed class, suitable for holding a quantity (with unit specified) dtype. |
- semsynth.utils.coerce_continuous_to_float(df: DataFrame, continuous_cols: List[str]) -> DataFrame
Convert continuous columns to floating point when possible.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"value": [1, 2, 3]})
>>> converted = coerce_continuous_to_float(df, ["value"])
>>> str(converted.dtypes['value'])
'float64'
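The phrase "when possible" leaves open what happens to a column that cannot be cast. The sketch below shows one plausible reading, assuming that such columns are left unchanged, that unknown column names are ignored, and that the input frame is not mutated. `coerce_to_float_sketch` is a hypothetical name used for illustration; it is not the library's implementation.

```python
from typing import List

import pandas as pd


def coerce_to_float_sketch(df: pd.DataFrame, continuous_cols: List[str]) -> pd.DataFrame:
    """Hypothetical stand-in: cast the listed columns to float, skipping failures."""
    out = df.copy()  # assumption: the caller's frame is not modified
    for col in continuous_cols:
        if col not in out.columns:
            continue  # assumption: unknown column names are ignored
        try:
            out[col] = out[col].astype(float)
        except (TypeError, ValueError):
            pass  # assumption: a column that cannot be cast is left as-is
    return out


frame = pd.DataFrame({"value": [1, 2, 3], "label": ["x", "y", "z"]})
result = coerce_to_float_sketch(frame, ["value", "label"])
print(result.dtypes["value"])    # float64
print(result["label"].tolist())  # ['x', 'y', 'z']
```

Under these assumptions the text column passes through untouched, so the function is safe to call on a list of columns that only mostly hold numbers.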
- semsynth.utils.coerce_discrete_to_category(df: DataFrame, discrete_cols: List[str]) -> DataFrame
Convert selected columns to the categorical dtype.
- Parameters:
df – Source dataframe.
discrete_cols – Columns expected to be discrete.
- Returns:
A copy of df with categorical conversions applied.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 1, 2], "b": [0.1, 0.2, 0.3]})
>>> converted = coerce_discrete_to_category(df, ["a"])
>>> str(converted.dtypes['a'])
'category'
- semsynth.utils.dataframe_to_markdown_table(df: DataFrame, float_fmt: str = '{:.4f}') -> str
Render a dataframe as a GitHub-flavoured markdown table.
- Parameters:
df – Table to render.
float_fmt – Format string applied to floating values.
- Returns:
Markdown string representing the table.
Examples
>>> import pandas as pd
>>> table = pd.DataFrame({"variable": ["grade"], "type": ["discrete"]})
>>> print(dataframe_to_markdown_table(table))
| variable | type |
| --- | --- |
| grade | discrete |
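The example above contains no floating values, so it does not show what float_fmt does. The sketch below illustrates the idea with plain dictionaries instead of a dataframe, assuming the format string is applied only to float cells and everything else is rendered with str(). `rows_to_markdown_sketch` is a hypothetical helper, not part of semsynth.utils.

```python
from typing import Any, Dict, List


def rows_to_markdown_sketch(rows: List[Dict[str, Any]], float_fmt: str = "{:.4f}") -> str:
    """Hypothetical stand-in: render dict rows as a pipe-delimited table."""
    if not rows:
        return ""
    headers = list(rows[0].keys())

    def cell(value: Any) -> str:
        # Assumption: only floats go through float_fmt; other values use str().
        return float_fmt.format(value) if isinstance(value, float) else str(value)

    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(row[h]) for h in headers) + " |")
    return "\n".join(lines)


table_text = rows_to_markdown_sketch([{"variable": "score", "mean": 0.2}])
print(table_text)
# | variable | mean |
# | --- | --- |
# | score | 0.2000 |
```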
- semsynth.utils.ensure_dir(path: str) -> None
Create the directory at path if it is missing.
- Parameters:
path – Filesystem location to create.
Examples
>>> import os, tempfile
>>> base = tempfile.mkdtemp()
>>> target = os.path.join(base, "nested")
>>> ensure_dir(target)
>>> os.path.isdir(target)
True
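A common way to get "create if missing" behaviour is `os.makedirs` with `exist_ok=True`. The sketch below assumes that approach; whether the real function also creates intermediate parent directories is not stated here, so treat that part as an assumption. `ensure_dir_sketch` is a hypothetical name.

```python
import os
import tempfile


def ensure_dir_sketch(path: str) -> None:
    """Hypothetical stand-in: create path (and parents) unless it already exists."""
    # exist_ok=True makes the call idempotent instead of raising FileExistsError.
    os.makedirs(path, exist_ok=True)


base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
ensure_dir_sketch(target)
ensure_dir_sketch(target)  # second call is a no-op
created = os.path.isdir(target)
print(created)  # True
```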
- semsynth.utils.infer_types(df: DataFrame, cardinality_threshold: int = 20) -> Tuple[List[str], List[str]]
Split dataframe columns into discrete and continuous lists.
- Parameters:
df – DataFrame to analyse.
cardinality_threshold – Threshold used by is_discrete_series().
- Returns:
Two lists containing discrete and continuous column names.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})
>>> infer_types(df)
(['a', 'c'], ['b'])
- semsynth.utils.is_discrete_series(s: Series, cardinality_threshold: int = 20) -> bool
Return True if s should be treated as discrete.
- Parameters:
s – Series to inspect.
cardinality_threshold – Maximum number of unique values a series may have before it is treated as continuous.
- Returns:
Whether the series is discrete.
Examples
>>> import pandas as pd
>>> is_discrete_series(pd.Series(["a", "b", "a"]))
True
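The exact rule behind is_discrete_series() is not documented on this page. The sketch below is one heuristic that is consistent with the infer_types() and is_discrete_series() examples, assuming that non-numeric columns are always discrete, float columns are always continuous, and other numeric columns are discrete only while their number of unique values stays at or below the threshold. `is_discrete_sketch` and `infer_types_sketch` are hypothetical names, and the real rule may differ.

```python
from typing import List, Tuple

import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype


def is_discrete_sketch(s: pd.Series, cardinality_threshold: int = 20) -> bool:
    """Hypothetical heuristic, not the library's rule."""
    if not is_numeric_dtype(s):
        return True  # assumption: text-like columns are always discrete
    if is_float_dtype(s):
        return False  # assumption: float columns are always continuous
    # Remaining numeric columns: discrete only when cardinality is low.
    return s.nunique(dropna=True) <= cardinality_threshold


def infer_types_sketch(
    df: pd.DataFrame, cardinality_threshold: int = 20
) -> Tuple[List[str], List[str]]:
    """Split column names into (discrete, continuous), preserving column order."""
    discrete: List[str] = []
    continuous: List[str] = []
    for col in df.columns:
        if is_discrete_sketch(df[col], cardinality_threshold):
            discrete.append(col)
        else:
            continuous.append(col)
    return discrete, continuous


df = pd.DataFrame(
    {
        "a": [1, 2, 3] * 10,                # 3 unique integers
        "b": [i / 10 for i in range(30)],   # floats
        "c": list("xyz") * 10,              # strings
        "row_id": range(30),                # 30 unique integers, above the threshold
    }
)
discrete_cols, continuous_cols = infer_types_sketch(df)
print(discrete_cols)    # ['a', 'c']
print(continuous_cols)  # ['b', 'row_id']
```

Under this heuristic an integer identifier column with more unique values than the threshold lands in the continuous list, which is worth checking for before modelling.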
- semsynth.utils.is_numeric_series(s: Series) -> bool
Return True when s represents a numeric series.
Examples
>>> import pandas as pd
>>> is_numeric_series(pd.Series([1.0, 2.5]))
True
- semsynth.utils.rename_categorical_categories_to_str(df: DataFrame, discrete_cols: List[str]) -> DataFrame
Ensure categorical levels are strings to keep outputs JSON-friendly.
Examples
>>> import pandas as pd
>>> cat = pd.Series(pd.Categorical([1, 2, 1], categories=[1, 2]))
>>> renamed = rename_categorical_categories_to_str(pd.DataFrame({"value": cat}), ["value"])
>>> list(renamed['value'].cat.categories)
['1', '2']
- semsynth.utils.seed_all(seed: int) -> Generator
Create a deterministic NumPy random generator.
- Parameters:
seed – Random seed used for the generator.
- Returns:
A numpy.random.Generator seeded with seed.
Examples
>>> rng = seed_all(0)
>>> rng.integers(0, 10, size=3).tolist()
[8, 6, 5]
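"Deterministic" here means that two generators built from the same seed produce the same stream of draws. The sketch below demonstrates that property, assuming the function is a thin wrapper around `numpy.random.default_rng`. `seed_all_sketch` is a hypothetical name, and the real function may do more, such as seeding other libraries.

```python
import numpy as np


def seed_all_sketch(seed: int) -> np.random.Generator:
    """Hypothetical stand-in: return a Generator seeded with seed."""
    return np.random.default_rng(seed)


first = seed_all_sketch(42).normal(size=5)
second = seed_all_sketch(42).normal(size=5)
other = seed_all_sketch(43).normal(size=5)

same = bool(np.array_equal(first, second))       # identical seed, identical draws
differs = not bool(np.array_equal(first, other)) # different seed, different draws
print(same, differs)  # True True
```

Passing the returned generator explicitly to each sampling step, rather than relying on global state, keeps a run reproducible even if other code draws random numbers in between.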
- semsynth.utils.summarize_dataframe(df: DataFrame, discrete_cols: List[str], continuous_cols: List[str]) -> DataFrame
Summarise dataframe columns with statistics tailored by type.
- Parameters:
df – Dataframe to summarise.
discrete_cols – Columns considered discrete.
continuous_cols – Columns considered continuous.
- Returns:
A dataframe with summary statistics per column.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"grade": ["A", "B", "A"], "score": [0.1, 0.2, 0.3]})
>>> summary = summarize_dataframe(df, ["grade"], ["score"])
>>> summary.loc[summary['variable'] == 'grade', 'top3'].iloc[0]
'A:2; B:1'
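The top3 value in the example reads as the most frequent levels paired with their counts. The sketch below reproduces that string format with the standard library, assuming "value:count" pairs joined by "; " and ordered by descending frequency. `top_values_sketch` is a hypothetical helper; how the real function breaks ties or handles missing values is not shown on this page.

```python
from collections import Counter
from typing import Iterable


def top_values_sketch(values: Iterable[str], n: int = 3) -> str:
    """Hypothetical stand-in: format the n most common values as 'value:count'."""
    counts = Counter(values)
    # most_common() sorts by count; ties keep first-seen order.
    return "; ".join(f"{value}:{count}" for value, count in counts.most_common(n))


top3 = top_values_sketch(["A", "B", "A"])
print(top3)  # A:2; B:1
```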