The TemporalMixing layer is a core component of the TSMixer architecture that applies MLP-based transformations across the time dimension. It mixes information across time steps while preserving the multivariate structure, combining batch normalization, a learnable time-wise linear projection, and a residual connection that keeps deep stacks trainable.
This layer is particularly effective for capturing temporal dependencies and patterns in multivariate time series forecasting tasks where you need to learn complex temporal interactions.
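A minimal standalone usage sketch, using the constructor arguments documented in the parameter guide below:

```python
from kerasfactory.layers import TemporalMixing
import keras

# Mix information across the time dimension of a (batch, time, features) tensor;
# the output shape matches the input shape.
layer = TemporalMixing(n_series=7, input_size=96, dropout=0.1)
x = keras.random.normal((32, 96, 7))  # 32 samples, 96 time steps, 7 series
y = layer(x)                          # shape: (32, 96, 7)
```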
## 🔍 How It Works
The TemporalMixing layer processes data through the following steps (a code sketch of this forward pass follows the diagram below):
1. **Transpose**: Converts input from (batch, time, features) to (batch, features, time)
2. **Flatten**: Reshapes to (batch, features × time) for batch normalization
3. **Batch Normalization**: Normalizes across the flattened feature-time dimension (epsilon=0.001, momentum=0.01)
4. **Reshape**: Restores to (batch, features, time)
5. **Linear Transformation**: Learnable dense layer across the time dimension
6. **ReLU Activation**: Non-linear activation function
7. **Transpose Back**: Converts back to (batch, time, features)
8. **Dropout**: Stochastic regularization during training
9. **Residual Connection**: Adds the input to the output for improved gradient flow
```mermaid
graph TD
    A["Input<br/>(batch, time, features)"] --> B["Transpose<br/>→ (batch, features, time)"]
    B --> C["Reshape<br/>→ (batch, feat×time)"]
    C --> D["Batch Norm<br/>ε=0.001, m=0.01"]
    D --> E["Reshape<br/>→ (batch, feat, time)"]
    E --> F["Dense Layer<br/>output_size=time"]
    F --> G["ReLU Activation"]
    G --> H["Transpose<br/>→ (batch, time, feat)"]
    H --> I["Dropout<br/>rate=dropout"]
    I --> J["Residual Connection<br/>output + input"]
    J --> K["Output<br/>(batch, time, features)"]

    style A fill:#e6f3ff,stroke:#4a86e8
    style K fill:#e8f5e9,stroke:#66bb6a
    style D fill:#fff9e6,stroke:#ffb74d
    style J fill:#f3e5f5,stroke:#9c27b0
```
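For reference, here is a minimal Keras sketch of the forward pass described above. It is an illustration built from the listed steps, not the library's actual implementation; the exact batch-normalization axis and the internal layer names are assumptions.

```python
import keras
from keras import layers, ops


class TemporalMixingSketch(layers.Layer):
    """Illustrative re-implementation of the steps above (not the library code)."""

    def __init__(self, n_series: int, input_size: int, dropout: float, **kwargs):
        super().__init__(**kwargs)
        self.n_series = n_series
        self.input_size = input_size
        self.norm = layers.BatchNormalization(epsilon=0.001, momentum=0.01)
        self.dense = layers.Dense(input_size)  # maps time -> time
        self.drop = layers.Dropout(dropout)

    def call(self, x, training=None):
        # x: (batch, time, features)
        h = ops.transpose(x, (0, 2, 1))                             # (batch, features, time)
        h = ops.reshape(h, (-1, self.n_series * self.input_size))   # (batch, features*time)
        h = self.norm(h, training=training)
        h = ops.reshape(h, (-1, self.n_series, self.input_size))    # (batch, features, time)
        h = ops.relu(self.dense(h))                                 # dense + ReLU across time
        h = ops.transpose(h, (0, 2, 1))                             # (batch, time, features)
        h = self.drop(h, training=training)
        return x + h                                                # residual connection
```

The residual connection on the last line is what allows many of these blocks to be stacked without degrading gradient flow.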
## 💡 Why Use This Layer?
| Challenge | Traditional Approach | TemporalMixing Solution |
|---|---|---|
| Temporal Dependencies | Fixed pattern matching | 🎯 Learnable temporal projections |
| Multivariate Learning | Treats features independently | 🔄 Joint temporal-feature optimization |
| Deep Models | Vanishing gradients | ✨ Residual connections stabilize training |
| Regularization | Manual dropout insertion | 🎲 Integrated dropout in mixing |
## 📋 Use Cases
- **Multivariate Time Series Forecasting**: When multiple related time series have temporal dependencies
- **Temporal Pattern Learning**: For complex temporal patterns requiring non-linear transformations
- **Deep Models**: As a building block in stacked TSMixer architectures
- **Dropout Regularization**: When training data is limited and overfitting is a concern
- **Feature Interaction**: When temporal relationships between time steps are critical
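The basic example below uses TemporalMixing indirectly through MixingLayer, which applies temporal mixing first and feature mixing second: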
```python
from kerasfactory.layers import TemporalMixing, FeatureMixing, MixingLayer
import keras

# TemporalMixing is used inside MixingLayer
mixing_layer = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)

# MixingLayer internally uses TemporalMixing first, then FeatureMixing
x = keras.random.normal((32, 96, 7))
output = mixing_layer(x)
```
## 🚀 Advanced Usage
### Training vs Inference
```python
import keras
from kerasfactory.layers import TemporalMixing

layer = TemporalMixing(n_series=7, input_size=96, dropout=0.2)
x = keras.random.normal((32, 96, 7))

# Training mode: dropout is active
output_train1 = layer(x, training=True)
output_train2 = layer(x, training=True)
# Outputs differ due to stochastic dropout

# Inference mode: dropout is disabled
output_infer1 = layer(x, training=False)
output_infer2 = layer(x, training=False)
# Outputs are identical
```
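### Serialization

The layer's configuration round-trips through the standard Keras `get_config()` / `from_config()` mechanism: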
```python
from kerasfactory.layers import TemporalMixing

# Get the configuration
layer = TemporalMixing(n_series=7, input_size=96, dropout=0.1)
config = layer.get_config()
print(config)

# Recreate the layer from its configuration
new_layer = TemporalMixing.from_config(config)

# Verify that the parameters match
assert new_layer.n_series == layer.n_series
assert new_layer.input_size == layer.input_size
assert new_layer.dropout_rate == layer.dropout_rate
```
## 📊 Performance Characteristics
| Aspect | Value | Notes |
|---|---|---|
| Time Complexity | O(B × D × T²) | B = batch, T = time steps, D = features; the time-to-time dense layer dominates |
| Space Complexity | O(B × T × D) | Residual connection overhead is minimal |
| Gradient Flow | ✅ Excellent | Residual connections prevent vanishing gradients |
| Trainability | ★★★★★ | Very stable with batch normalization |
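As a back-of-the-envelope check under the structure described above (a dense layer mapping time → time, batch norm over the flattened feature×time axis; both are assumptions from the step list, not numbers reported by the library), the parameter counts and dominant compute for the example shapes used on this page are:

```python
B, T, D = 32, 96, 7                  # batch, time steps (input_size), features (n_series)

dense_params = T * T + T             # time -> time kernel plus bias
bn_params = 2 * D * T                # gamma and beta over the flattened feature*time axis
print("Dense params:", dense_params)             # 9312
print("BatchNorm trainable params:", bn_params)  # 1344

# The time-wise dense layer dominates compute: one (T x T) matmul per feature per sample.
dense_flops = 2 * B * D * T * T
print(f"Approx. dense FLOPs per forward pass: {dense_flops:,}")  # ~4.1 million
```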
## 🔧 Parameter Guide
| Parameter | Type | Range | Impact |
|---|---|---|---|
| n_series | int | > 0 | Number of multivariate features/channels |
| input_size | int | > 0 | Temporal sequence length |
| dropout | float | [0, 1] | Higher values = more regularization |
### Tuning Recommendations

- **Small datasets**: Use dropout ≥ 0.2 to prevent overfitting
- **Deep models**: Use lower dropout (0.05-0.1) to maintain information flow
- **Limited features**: Increase the impact of n_series through feature-expansion layers
- **Long sequences**: Consider the computational cost of a large input_size
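For illustration, the first two recommendations translate directly into constructor settings (the specific values are hypothetical picks within the suggested ranges):

```python
from kerasfactory.layers import TemporalMixing

# Small dataset: stronger regularization
layer_small_data = TemporalMixing(n_series=7, input_size=96, dropout=0.3)

# Deep stack of mixing blocks: lighter dropout to preserve information flow
layer_deep_stack = TemporalMixing(n_series=7, input_size=96, dropout=0.05)
```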
## 🧪 Testing & Validation
### Unit Tests
```python
import tensorflow as tf
from kerasfactory.layers import TemporalMixing

# Test 1: Output shape preservation
layer = TemporalMixing(n_series=7, input_size=96, dropout=0.1)
x = tf.random.normal((32, 96, 7))
output = layer(x)
assert output.shape == x.shape, "Shape mismatch!"

# Test 2: Dropout effect (training mode should be stochastic)
output1 = layer(x, training=True)
output2 = layer(x, training=True)
diff = tf.reduce_mean(tf.abs(output1 - output2))
assert diff > 0, "Dropout not working!"

# Test 3: Inference determinism
output1 = layer(x, training=False)
output2 = layer(x, training=False)
tf.debugging.assert_near(output1, output2)
```
## ⚠️ Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| NaN values in output | Unstable batch norm or extreme inputs | Normalize inputs to the [-1, 1] range |
| Slow gradient updates | Batch norm momentum too high | Use the default momentum=0.01 |
| Poor performance | Dropout too high | Reduce the dropout rate to 0.05-0.1 |
| Memory overflow | Large input_size with many features | Use smaller batch sizes |
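For the NaN case specifically, one simple way (among many) to bound the inputs before the layer is per-series max-abs scaling; this preprocessing is a sketch and not part of the library:

```python
import keras

x = keras.random.normal((32, 96, 7)) * 50.0          # raw inputs with a large scale
max_abs = keras.ops.max(keras.ops.abs(x), axis=1, keepdims=True)
x_scaled = x / (max_abs + 1e-8)                       # each series now lies within [-1, 1]
```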
## 🔗 Related Layers
- **FeatureMixing**: Complements TemporalMixing by mixing across the feature dimension