Python API Reference
The Python API offers the most flexibility: it lets you integrate the generator directly into machine learning pipelines, test suites, or simulation environments.
🏛️ The DataGen Class
The DataGen class is the central orchestrator that coordinates dates, timestamps, dimensions, composed metric trends, anomalies, and transforms.
from ts_data_generator import DataGen
dg = DataGen(seed=42)
Initializer Parameters:
- dimensions (list[Dimensions] | None): Initial dimensions list (default None).
- metrics (list[Metrics] | None): Initial metrics list (default None).
- multi_items (list[MultiItems] | None): Initial multi-items list (default None).
- start_datetime (str | None): Start date/time string (ISO format YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS).
- end_datetime (str | None): End date/time string (ISO format).
- granularity (Granularity | str): Time step interval (default Granularity.FIVE_MIN).
- seed (int | None): Seed for deterministic PCG64 random generation. When set, all randomness flows through an isolated SeedableRNG instance backed by PCG64.
Properties:
- .data: The generated pd.DataFrame, indexed by timestamp. Triggers lazy generation if not yet built.
- .granularity: Read/write property. Get or set the current granularity (accepts a Granularity enum or a string like "h", "D"). Writing triggers regeneration.
- .start_datetime / .end_datetime: Read/write ISO datetime strings. Writing triggers regeneration.
- .dimensions: Mapping of dimension name to Dimensions instance.
- .metrics: Mapping of metric name to Metrics instance.
- .multi_items: Mapping of comma-joined names to MultiItems instance.
- .trends: Nested mapping {metric_name: {trend_name: trend_instance}}.
⚙️ Configuration Methods
.to_granularity(granularity: Granularity | str)
Sets the generation time step using a predefined frequency or Pandas alias string.
- Examples: "s", "min", "5min", "h", "D", "W", "ME", "YE".
.add_dimension(name: str, function: int | float | str | list | Generator)
Adds a categorical or context column mapping to the index.
- Parameters:
  - name: The resulting column name in the DataFrame.
  - function: An infinite generator, static value, or list that cycles. Static values (int, float, str) are wrapped as constants; lists are cycled infinitely.
- Raises:
  - DimensionError if a dimension with this name already exists.
  - ValidationError if the function type is unsupported.
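The wrapping rule described for function can be pictured with a minimal standard-library sketch. The helper as_infinite_generator is hypothetical and not part of ts_data_generator; it only illustrates how a static value or a list might be turned into an endless stream with one value per timestamp.

```python
import itertools
from collections.abc import Iterator


def as_infinite_generator(function):
    """Hypothetical helper: normalise a dimension source into an endless iterator."""
    if isinstance(function, (int, float, str)):
        # Static values become a constant stream.
        return itertools.repeat(function)
    if isinstance(function, list):
        # Lists restart from the beginning once exhausted.
        return itertools.cycle(function)
    if isinstance(function, Iterator):
        # Generators are assumed to be infinite already and are used as-is.
        return function
    raise TypeError(f"unsupported dimension source: {type(function).__name__}")


constant = list(itertools.islice(as_infinite_generator("eu-west"), 3))
cycled = list(itertools.islice(as_infinite_generator(["low", "high"]), 5))
print(constant)  # ['eu-west', 'eu-west', 'eu-west']
print(cycled)    # ['low', 'high', 'low', 'high', 'low']
```

The library itself reports unsupported types with ValidationError; the TypeError here is only a stand-in for that behaviour.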
.update_dimension(name: str, function: int | str | float | Generator | None)
Update an existing dimension’s generator function.
- Parameters:
  - name: The dimension name to update.
  - function: New generator or static value. Pass None to skip.
- Raises:
  - DimensionError if the dimension does not exist.
.remove_dimension(name: str)
Remove a dimension and its column from the data.
- Parameters:
  - name: The dimension name to remove.
.add_metric(name: str, trends: list[Trends] | set[Trends], aggregation_type: AggregationType = AggregationType.AVG, anomalies: list[Anomaly] | None = None)
Composes and adds a numeric metric column by summing multiple trends together.
- Parameters:
  - name: The resulting column name in the DataFrame.
  - trends: A list or set of Trends subclass instances (e.g. SinusoidalTrend, LinearTrend). Their generated arrays are summed to form the base signal.
  - aggregation_type: The AggregationType enum (e.g. AVG, SUM, MIN, MAX) used when resampling via .aggregate(). Defaults to AVG.
  - anomalies: An optional list of Anomaly instances applied sequentially after trend composition.
- Raises:
  - MetricError if a metric with this name already exists, or if duplicate trends are detected.
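As a rough illustration of what "summing trends" means, the sketch below builds a linear component and a sinusoidal component by hand and adds them element by element. The functions linear, sinusoid, and compose are hypothetical stand-ins written for this example; they do not reproduce the parameters or internals of LinearTrend or SinusoidalTrend.

```python
import math


def linear(n, offset, slope):
    """Straight line: offset + slope * step index."""
    return [offset + slope * i for i in range(n)]


def sinusoid(n, amplitude, period):
    """Sine wave that completes one cycle every `period` steps."""
    return [amplitude * math.sin(2 * math.pi * i / period) for i in range(n)]


def compose(*components):
    """Sum the components element by element to form the base signal."""
    return [sum(values) for values in zip(*components)]


base = compose(
    linear(8, offset=40.0, slope=2.0),
    sinusoid(8, amplitude=12.0, period=4),
)
print([round(v, 2) for v in base])
# [40.0, 54.0, 44.0, 34.0, 48.0, 62.0, 52.0, 42.0]
```

Anomalies would then be applied on top of this base signal, one after another, in the order they appear in the list.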
.remove_metric(name: str)
Remove a metric and its column from the data.
- Parameters:
  - name: The metric name to remove.
.add_multi_items(names: list[str], function: int | float | str | list | Generator, aggregation_type: list[AggregationType | str] | None = None)
Adds multiple correlated columns that are generated together from a single iterator (e.g., city and country).
- Parameters:
  - names: A list of column names.
  - function: A generator yielding tuples of values matching the length of names. Static values are wrapped as constants; lists are cycled.
  - aggregation_type: Optional list of aggregation methods for resampling.
- Raises:
  - MultiItemError if any name overlaps with existing multi-items.
  - ValidationError if generation fails.
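The idea of linked columns can be sketched without the library: one generator yields a whole tuple per row, and the tuple is split across the column names. The generator server_specs and the use of random.Random are assumptions made for this illustration, not the library's own mechanism.

```python
import random


def server_specs(rng):
    """Yield one (name, vendor, ram) tuple per row, forever."""
    specs = [("srv_alpha", "intel", "16GB"), ("srv_beta", "amd", "32GB")]
    while True:
        yield rng.choice(specs)


names = ["server_name", "cpu_vendor", "ram_capacity"]
source = server_specs(random.Random(7))

# Each yielded tuple is split across the named columns,
# so the values within a row always belong together.
rows = [dict(zip(names, next(source))) for _ in range(4)]
for row in rows:
    print(row)
```

Because the tuple is drawn as a unit, a row can never pair "srv_alpha" with "amd", which is the property that independent dimensions cannot guarantee.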
.remove_multi_item(names: str | list[str])
Remove a multi-item group and its columns from the data.
- If any of the given names overlap with a multi-item group, that entire group is removed.
📊 Retrieval, Aggregation, Normalization & Visualization
.data (Property)
Generates the data (if not already generated) and returns a clean, fully aligned pandas.DataFrame indexed by timestamp. Generation runs a pipeline combining dimensions, metrics, and multi-items, applying any configured anomalies.
.state (Property)
Returns the current PipelineState (CONFIGURED, GENERATED, or NORMALIZED). Guard rails are in place to ensure you don’t call .normalize() before generating data, or .denormalize() on unnormalized data.
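A small state machine makes the guard rails easier to picture. TinyPipeline below is a hypothetical stand-in, not the DataGen class, and the choice to raise on an early normalize() while treating an unneeded denormalize() as a no-op is an assumption for this sketch.

```python
from enum import Enum, auto


class PipelineState(Enum):
    CONFIGURED = auto()
    GENERATED = auto()
    NORMALIZED = auto()


class TinyPipeline:
    """Hypothetical stand-in for the state handling; not the DataGen class."""

    def __init__(self):
        self.state = PipelineState.CONFIGURED

    def generate(self):
        self.state = PipelineState.GENERATED

    def normalize(self):
        # Assumed guard: data must exist and must not already be normalized.
        if self.state is not PipelineState.GENERATED:
            raise RuntimeError(f"cannot normalize in state {self.state.name}")
        self.state = PipelineState.NORMALIZED

    def denormalize(self):
        # Assumed guard: silently do nothing unless data is normalized.
        if self.state is PipelineState.NORMALIZED:
            self.state = PipelineState.GENERATED


pipeline = TinyPipeline()
try:
    pipeline.normalize()
except RuntimeError as exc:
    print(exc)  # cannot normalize in state CONFIGURED

pipeline.generate()
pipeline.normalize()
pipeline.denormalize()
print(pipeline.state.name)  # GENERATED
```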
.baselines (Property)
Returns a dictionary mapping metric names to their clean pandas.DataFrame baseline (i.e. the signal generated by Trends before any Anomalies were applied). Useful for training anomaly detection models.
.shape() -> tuple[int, int]
Return the (rows, columns) shape of the generated data.
.head(n: int = 5) -> pd.DataFrame
Return the first n rows of generated data.
.tail(n: int = 5) -> pd.DataFrame
Return the last n rows of generated data.
.aggregate(granularity: str) -> pd.DataFrame
Aggregates the generated data to a coarser granularity (e.g., daily to weekly, or hourly to daily).
- Rule: You can only aggregate to a coarser granularity than the current one. Validation uses Granularity.coarser_than() and Granularity.finer_than(), which replaced the old module-level _GRANULARITY_ORDER dict.
- It automatically applies the metric-specific aggregation types (AVG, SUM, etc.) and multi-item aggregation types defined when the metrics were added.
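The two rules above (only coarser targets are allowed, and each metric keeps its own reduction) can be sketched with the standard library alone. The ORDER list, is_coarser, and aggregate_daily are hypothetical names for this illustration; the library performs the equivalent checks through Granularity and resamples a DataFrame instead of plain Python rows.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Assumed ordering from fine to coarse; the library's own ordering may differ.
ORDER = ["s", "min", "h", "D", "W"]
AGGREGATORS = {"AVG": mean, "SUM": sum, "MIN": min, "MAX": max}


def is_coarser(target, current):
    """True when `target` is a strictly larger time step than `current`."""
    return ORDER.index(target) > ORDER.index(current)


def aggregate_daily(rows, agg_types):
    """Group rows by calendar day and reduce each metric with its own rule."""
    buckets = defaultdict(lambda: defaultdict(list))
    for timestamp, metrics in rows:
        for name, value in metrics.items():
            buckets[timestamp.date()][name].append(value)
    return {
        day: {name: AGGREGATORS[agg_types[name]](values) for name, values in cols.items()}
        for day, cols in buckets.items()
    }


start = datetime(2024, 1, 1)
hourly = [(start + timedelta(hours=h), {"cpu": 50.0, "transactions": 10}) for h in range(48)]

assert is_coarser("D", "h")  # hourly -> daily is allowed
daily = aggregate_daily(hourly, {"cpu": "AVG", "transactions": "SUM"})
print(daily[start.date()])  # {'cpu': 50.0, 'transactions': 240}
```

Averaging suits level-like metrics such as utilization, while summing suits counts such as transactions, which is why the aggregation type is chosen per metric.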
.normalize(method: str = "min-max")
Apply normalization to numeric columns in place.
- method: "min-max" or "mean-std" (default "min-max").
- Uses the Normalizer class from ts_data_generator.transforms.normalizer.
.denormalize()
Reverse the last normalization in place. Safe to call even if no normalization has been applied.
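For readers unfamiliar with the two methods, here is a minimal sketch of min-max and mean-std scaling together with the inverse step. It is not the Normalizer class: the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev


def fit_min_max(values):
    """Return (offset, scale) so that (v - offset) / scale lands in [0, 1]."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) or 1.0  # avoid dividing by zero for constant columns


def fit_mean_std(values):
    """Return (offset, scale) for zero mean and unit (population) deviation."""
    return mean(values), pstdev(values) or 1.0


def normalize(values, params):
    offset, scale = params
    return [(v - offset) / scale for v in values]


def denormalize(scaled, params):
    # Reversal only needs the stored (offset, scale) pair.
    offset, scale = params
    return [v * scale + offset for v in scaled]


raw = [10.0, 20.0, 40.0]
params = fit_min_max(raw)
scaled = normalize(raw, params)
restored = denormalize(scaled, params)
print(scaled)
print(all(math.isclose(a, b) for a, b in zip(raw, restored)))  # True
```

Keeping the fitted parameters around is what makes the reversal possible; the round trip matches the original values up to floating-point rounding.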
.plot(include: list[str] | None = None, exclude: list[str] | None = None, **matplotlib_kwargs)
Renders a quick, native line plot of your numeric columns using matplotlib.
- include: Explicit list of column names to plot.
- exclude: List of column names to omit from plotting.
- matplotlib_kwargs: Additional keyword arguments passed to matplotlib's plot function (e.g. figsize, color, linestyle).
- Raises: ImportError if matplotlib is not installed. Install with uv add 'ts-data-generator[plotting]'.
🐍 Full End-to-End Lifecycle Script
Here is a complete, copy-pasteable script that exercises the full DataGen lifecycle: setup, dimensions, composition of metrics with trends and anomalies, linked multi-items, dataframe extraction, normalization, aggregation, and plotting.
from ts_data_generator import DataGen
from ts_data_generator.schema.models import AggregationType
from ts_data_generator.utils.functions import random_choice, ordered_choice
from ts_data_generator.utils.trends import LinearTrend, SinusoidalTrend, ARNoiseTrend
from ts_data_generator.anomalies import PointAnomaly, MissingData
# 1. Initialize with dates, granularity, and seed
dg = DataGen(
    start_datetime="2024-01-01T00:00:00",
    end_datetime="2024-01-07T23:00:00",
    granularity="h",
    seed=12345,
)
# 2. Add categorical dimensions
dg.add_dimension("region", random_choice(["North", "South", "East"]))
dg.add_dimension("priority", ordered_choice(["low", "high"]))
# 3. Add correlated Multi-Item dimensions (linked columns)
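def server_specs_generator():
    specs = [
        ("srv_alpha", "intel", "16GB"),
        ("srv_beta", "amd", "32GB"),
        ("srv_gamma", "arm", "8GB"),
    ]
    # random_choice(...) is used above as an infinite generator, so delegate to it
    # rather than yielding the generator object itself.
    yield from random_choice(specs)  # each item is a tuple: (srv, CPU, RAM)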
dg.add_multi_items(
    names=["server_name", "cpu_vendor", "ram_capacity"],
    function=server_specs_generator(),
)
# 4. Compose Metric 1: CPU load (summing growth + cycle, adding spikes)
cpu_trends = {
    LinearTrend(offset=40.0, slope=2.0),
    SinusoidalTrend(amplitude=12.0, freq=1.0),
}
cpu_anomalies = [
    PointAnomaly(probability=0.015, mode="additive", magnitude=(30.0, 45.0)),
]
dg.add_metric(
    name="cpu_utilization",
    trends=cpu_trends,
    aggregation_type=AggregationType.AVG,  # CPU load aggregated via average
    anomalies=cpu_anomalies,
)
# 5. Compose Metric 2: Completed Transactions (using SUM aggregation and bursty dropouts)
trans_trends = {
    LinearTrend(offset=500.0, slope=10.0),
    SinusoidalTrend(amplitude=150.0, freq=1.0, phase=4.0),
}
trans_anomalies = [
    MissingData(mode="burst", burst_probability=0.01, min_length=2, max_length=4),
]
dg.add_metric(
    name="completed_transactions",
    trends=trans_trends,
    aggregation_type=AggregationType.SUM,  # Revenue/Transactions aggregated via sum
    anomalies=trans_anomalies,
)
# 6. Retrieve the generated Pandas DataFrame
df = dg.data
print("--- Raw Generated DataFrame ---")
print(df.head(10))
# 7. Normalize numeric columns in-place
dg.normalize(method="min-max")
print("\n--- Normalized DataFrame ---")
print(dg.data.head())
# 8. Denormalize back to original values
dg.denormalize()
# 9. Aggregate to daily granularity
# cpu_utilization is automatically averaged, completed_transactions is summed!
daily_df = dg.aggregate(granularity="D")
print("\n--- Daily Aggregated DataFrame ---")
print(daily_df.head())
# 10. Render quick built-in line charts of our numeric metrics
dg.plot(include=["cpu_utilization"])
🏗️ Internal Architecture
DataGen Pipeline
DataGen directly orchestrates generation across dimensions, metrics, and multi-items in a deterministic pipeline. It maintains state via PipelineState (CONFIGURED, GENERATED, NORMALIZED).
MetricResult (ts_data_generator.schema.models)
When a metric generates data, it now returns a MetricResult NamedTuple containing two pd.DataFrames: the baseline (pure trends) and the signal (trends + anomalies). This guarantees access to the clean pre-contamination signal.
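The baseline/signal split can be illustrated with a small NamedTuple. MetricResultSketch and apply_spike are hypothetical names, and plain lists stand in for the two DataFrames; the point is only that the anomaly is applied to a copy so the clean baseline survives.

```python
from typing import NamedTuple


class MetricResultSketch(NamedTuple):
    baseline: list[float]  # pure trends
    signal: list[float]    # trends + anomalies


def apply_spike(baseline, index, magnitude):
    """Add one additive spike to a copy, leaving the baseline untouched."""
    signal = list(baseline)
    signal[index] += magnitude
    return MetricResultSketch(baseline=baseline, signal=signal)


result = apply_spike([40.0, 42.0, 44.0, 46.0], index=2, magnitude=30.0)

# The difference between the two series marks exactly where anomalies were injected.
residual = [s - b for s, b in zip(result.signal, result.baseline)]
print(residual)  # [0.0, 0.0, 30.0, 0.0]
```

A residual like this is one way to derive ground-truth anomaly labels when training or evaluating a detector.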
SeedableRNG / DefaultRNG / RNGProtocol (ts_data_generator.random)
Handles deterministic randomness. SeedableRNG wraps a PCG64-backed numpy.random.Generator. When DataGen is not seeded, a DefaultRNG is used instead. Both are accessed through a unified RNGProtocol, which threads deterministic behaviour through Trends, Anomalies, and Dimensions without global side effects.
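The "no global side effects" idea can be shown with a minimal sketch. To stay dependency-free it uses the standard library's random.Random (a Mersenne Twister) rather than PCG64, and RNGLike, SeededSource, and sample_regions are hypothetical names, not the library's classes.

```python
import random
from typing import Protocol, Sequence, TypeVar

T = TypeVar("T")


class RNGLike(Protocol):
    """Hypothetical interface: anything that can pick an item and draw a float."""

    def choice(self, items: Sequence[T]) -> T: ...
    def uniform(self, low: float, high: float) -> float: ...


class SeededSource:
    """Owns a private generator, so the global `random` state is never touched."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def choice(self, items):
        return self._rng.choice(items)

    def uniform(self, low, high):
        return self._rng.uniform(low, high)


def sample_regions(rng: RNGLike, n: int) -> list[str]:
    """Components receive the source explicitly instead of using module-level state."""
    return [rng.choice(["North", "South", "East"]) for _ in range(n)]


first = sample_regions(SeededSource(42), 5)
random.seed(0)   # disturbing the global generator...
random.random()
second = sample_regions(SeededSource(42), 5)  # ...does not change the result
print(first == second)  # True
```

Passing the source explicitly is what lets the same seed reproduce the same dataset regardless of other code that uses random numbers in the same process.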
Schema Parser (ts_data_generator.schema.parser)
Isolates string parsing and validation into strict dataclasses (DimensionSpec, TrendSpec, AnomalySpec, PresetConfig). The CLI and Python interfaces use this parser alongside the lightweight Registry to look up available components.
Normalizer (ts_data_generator.transforms.normalizer)
Provides min-max and mean-std normalization with exact denormalization support.
aggregate_dataframe (ts_data_generator.aggregator)
Handles DataFrame resampling to coarser granularities, respecting per-metric aggregation types and granularity ordering via Granularity.coarser_than() and Granularity.finer_than().