Core Concepts
ts-data-generator builds realistic time series data by executing a deterministic, multi-stage pipeline. Instead of trying to generate whole datasets as a single block of values, it models time series as a set of decoupled primitives that are combined sequentially.
Understanding how these primitives work and interact is key to generating high-fidelity datasets.
📐 The Lifecycle of Data Generation
Every dataset passes through a five-stage processing pipeline before it is returned as a pandas DataFrame. This pipeline ensures that context (dimensions), mathematical signals (trends), errors (anomalies), and scales (normalization) are resolved in a strict, logical sequence.
Here is the sequential flow of execution, which is strictly guarded by the underlying PipelineState (CONFIGURED -> GENERATED -> NORMALIZED):
graph TD
A[1. Datetime Range & Granularity] -->|Index Generation| B[2. Dimension Broadcasting]
B -->|Categorical Alignment| C[3. Trend Composition & Layering]
C -->|Numeric Base Signal| D[4. Anomaly Injection & Perturbation]
D -->|Stochastic Failure Simulation| E[5. Normalization & Transforms]
E -->|In-Place Scaling| F[6. Final pandas DataFrame]
style A fill:#e1f5fe,stroke:#03a9f4,stroke-width:2px
style B fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
style C fill:#fff3e0,stroke:#ff9800,stroke-width:2px
style D fill:#ffebee,stroke:#f44336,stroke-width:2px
style E fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
style F fill:#eceff1,stroke:#607d8b,stroke-width:2px
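To illustrate how a state guard of this kind can enforce the ordering, here is a minimal sketch. Only the three state names (CONFIGURED, GENERATED, NORMALIZED) come from the text above; the `PipelineSketch` class, its methods, and the error type are hypothetical and are not the library's actual API.

```python
from enum import Enum


class PipelineState(Enum):
    # State names taken from the text; the numeric values are arbitrary.
    CONFIGURED = 1
    GENERATED = 2
    NORMALIZED = 3


class PipelineSketch:
    """Hypothetical pipeline that refuses out-of-order calls."""

    def __init__(self):
        self.state = PipelineState.CONFIGURED

    def _require(self, expected):
        # Guard: every step declares the state it must start from.
        if self.state is not expected:
            raise RuntimeError(
                f"expected {expected.name}, but pipeline is {self.state.name}"
            )

    def generate(self):
        self._require(PipelineState.CONFIGURED)
        # ... index, dimensions, trends, and anomalies would be resolved here ...
        self.state = PipelineState.GENERATED

    def normalize(self):
        self._require(PipelineState.GENERATED)
        # ... scaling would be applied here ...
        self.state = PipelineState.NORMALIZED
```

In this sketch, calling `normalize()` before `generate()` raises a `RuntimeError`, which is one simple way to keep the stages in sequence.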
🏛️ The Three Primary Primitives
The core architecture relies on three distinct primitives:
1. Dimensions (Context)
Dimensions represent the contextual axes of your time series data (e.g., store_id, region, ip_address, client_version).
- Infinite Iteration: Dimensions are implemented as infinite Python iterators (generators). This ensures they can produce values for a series of any length without exhausting memory.
- Broadcasting: When multiple dimensions are combined, they are mapped across the datetime index. When using dimensions, the generator produces multivariate series where metrics are generated for each unique dimensional combination (e.g., revenue generated per store, per region, per timestamp).
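The two ideas above can be sketched with the standard library and pandas. This is an illustration only: the `store_ids` generator and the column names are assumptions made for the example, and the library's own dimension classes may work differently.

```python
import itertools

import pandas as pd


def store_ids(prefix="store"):
    """Hypothetical dimension: an infinite generator of store identifiers."""
    for i in itertools.count():
        yield f"{prefix}_{i}"


# Datetime index for three daily timestamps.
index = pd.date_range("2024-01-01", periods=3, freq="D")

# Values are pulled lazily; only the requested number is ever materialized.
stores = list(itertools.islice(store_ids(), 2))
regions = list(itertools.islice(itertools.cycle(["north", "south"]), 2))

# Broadcasting: one row per (timestamp, store, region) combination.
frame = pd.MultiIndex.from_product(
    [index, stores, regions],
    names=["timestamp", "store_id", "region"],
).to_frame(index=False)
```

Here `frame` holds 3 timestamps × 2 stores × 2 regions = 12 rows, one for each combination that a metric would then be generated for.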
Learn more about Dimension Generators
2. Metrics & Trends (Base Signal)
Metrics represent the numeric observations you want to track (e.g., cpu_utilization, sales_revenue, temperature).
- Compositional Math: A metric is not defined by a single equation. Instead, you compose it by summing multiple Trends together: \(\text{Metric}(t) = \sum \text{Trend}_i(t)\)
- Modular Layers: This allows you to stack a stable base level (LinearTrend), a periodic daily oscillation (SinusoidalTrend), a weekend drop-off (WeekendTrend), and autocorrelated volatility (ARNoiseTrend) to easily build highly complex signals.
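As a rough illustration of the summation, the sketch below composes three plain NumPy functions. They are hypothetical stand-ins for the trend classes named above, with parameter names and defaults chosen for the example rather than taken from the library.

```python
import numpy as np


def linear_trend(t, offset=10.0, slope=0.01):
    # Stable base level with a gentle upward slope.
    return offset + slope * t


def sinusoidal_trend(t, amplitude=2.0, period=24.0):
    # Periodic oscillation, e.g. a daily cycle on hourly data.
    return amplitude * np.sin(2 * np.pi * t / period)


def ar_noise_trend(t, phi=0.8, sigma=0.5, seed=0):
    # AR(1) noise: each value depends on the previous one plus a random shock.
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0.0, sigma, size=len(t))
    noise = np.zeros(len(t))
    for i in range(1, len(t)):
        noise[i] = phi * noise[i - 1] + shocks[i]
    return noise


def compose(t, trends):
    # Metric(t) is the element-wise sum of all trend components.
    return np.sum([trend(t) for trend in trends], axis=0)


t = np.arange(48, dtype=float)
metric = compose(t, [linear_trend, sinusoidal_trend, ar_noise_trend])
```

Because each component is an independent function of `t`, a layer can be added or removed without touching the others.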
Learn more about Trend Functions
3. Anomalies (Perturbation)
Anomalies represent real-world failure events or regime shifts (e.g., network spikes, missing sensor data, or system recalibration drift).
- Decoupled Intervention: Anomalies are injected after the base metrics are generated. This is critical: it keeps the clean baseline mathematical trend completely separate from the failure events. Because metrics return a MetricResult object (containing both a baseline and a signal), you can directly compare the clean baseline against the contaminated dataset to get perfect labels when benchmarking anomaly detection models.
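The labeling idea can be sketched as follows. The `MetricResultSketch` dataclass only mirrors the two fields mentioned above (baseline and signal); it and the `inject_spikes` helper are hypothetical and do not reproduce the library's MetricResult or its anomaly classes.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MetricResultSketch:
    """Hypothetical container holding the clean and contaminated series."""
    baseline: np.ndarray
    signal: np.ndarray


def inject_spikes(baseline, positions, magnitude):
    # Copy first so the clean baseline is never modified.
    signal = baseline.copy()
    signal[positions] += magnitude
    return MetricResultSketch(baseline=baseline, signal=signal)


baseline = np.linspace(10.0, 12.0, 20)
result = inject_spikes(baseline, positions=[5, 13], magnitude=8.0)

# Ground-truth labels: True wherever the signal departs from the baseline.
labels = ~np.isclose(result.signal, result.baseline)
```

The resulting boolean mask marks exactly the injected positions, so it can serve as ground truth when scoring a detector.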
Learn more about Anomaly Injection
🚀 Secondary Transforms
Once the primitives are fully composed and stochastically contaminated, ts-data-generator provides post-processing transforms to make the data model-ready:
- Coarser Aggregation: Easily resample your generated granular data (e.g., converting 5-minute raw sensor logs to 1-hour average logs) while automatically respecting each metric’s specific aggregation rules (e.g., taking the AVG of temperature but the SUM of revenue).
- Normalization: Scale your numeric data in place using standard methods like min-max scaling or mean-std Z-score normalization, with built-in support for exact denormalization (restoration).
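Both transforms can be approximated with plain pandas, as in the sketch below. The column names, the `agg_rules` mapping, and the two helper functions are assumptions for illustration; the library's own interface for aggregation and normalization is not shown here.

```python
import numpy as np
import pandas as pd

# Two hours of 5-minute observations.
index = pd.date_range("2024-01-01", periods=24, freq="5min")
raw = pd.DataFrame(
    {
        "temperature": np.linspace(20.0, 22.3, 24),
        "revenue": np.full(24, 5.0),
    },
    index=index,
)

# Per-metric aggregation rules: average temperature, total revenue.
agg_rules = {"temperature": "mean", "revenue": "sum"}
hourly = raw.resample("1h").agg(agg_rules)


def min_max_normalize(series):
    # Return the scaled series plus the parameters needed to undo the scaling.
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo), (lo, hi)


def min_max_denormalize(scaled, params):
    lo, hi = params
    return scaled * (hi - lo) + lo


scaled, params = min_max_normalize(raw["temperature"])
restored = min_max_denormalize(scaled, params)
```

Keeping the `(lo, hi)` pair alongside the scaled data is what makes restoration possible; in this sketch the round trip matches the original values up to floating-point rounding.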