TSMixer
MLP-based Multivariate Time Series Forecasting Model
Overview
TSMixer (Time-Series Mixer) is an all-MLP architecture for multivariate time series forecasting. It jointly learns temporal and cross-variate representations by repeatedly mixing information along the time and feature dimensions in stacked mixing layers. Unlike transformer-based architectures, TSMixer is computationally efficient and comparatively easy to interpret.
Key Features
- All-MLP Architecture: No attention mechanisms or recurrence; the model consists entirely of MLPs
- Temporal & Feature Mixing: Alternating MLPs across time and feature dimensions
- Reversible Instance Normalization: Optional normalization for improved training
- Multivariate Support: Handles multiple related time series simultaneously
- Residual Connections: Enables training of deep architectures
- Efficient: Much lower computational and memory cost than attention-based models
Parameters
- seq_len (int): Sequence length (number of lookback steps). Must be positive.
- pred_len (int): Prediction length (forecast horizon). Must be positive.
- n_features (int): Number of features/time series. Must be positive.
- n_blocks (int, default=2): Number of mixing layers in the model.
- ff_dim (int, default=64): Hidden dimension for feed-forward networks.
- dropout (float, default=0.1): Dropout rate between 0 and 1.
- use_norm (bool, default=True): Whether to use Reversible Instance Normalization.
- name (str, optional): Model name.
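A minimal construction sketch follows. The import path `tsmixer` is an assumption (this page does not name the package), but the keyword arguments mirror the parameter list above.

```python
# Hypothetical import path; adjust to wherever TSMixer is defined in your codebase.
from tsmixer import TSMixer

model = TSMixer(
    seq_len=96,        # lookback window of 96 steps
    pred_len=24,       # forecast the next 24 steps
    n_features=7,      # seven parallel series
    n_blocks=2,        # two stacked mixing layers
    ff_dim=64,         # hidden width of the feed-forward MLPs
    dropout=0.1,       # dropout rate inside the mixing layers
    use_norm=True,     # enable Reversible Instance Normalization
    name="tsmixer_example",
)
```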
Input/Output Shapes
- Input: shape (batch_size, seq_len, n_features), type float32
- Output: shape (batch_size, pred_len, n_features), type float32
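Continuing from the construction sketch above, the shape contract can be checked with random data; calling the model on a batch is standard Keras behavior.

```python
import numpy as np

# Random batch with the documented input shape (batch_size, seq_len, n_features).
x = np.random.randn(32, 96, 7).astype("float32")
y = model(x)
print(y.shape)  # (32, 24, 7) == (batch_size, pred_len, n_features)
```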
Architecture Flow
- Instance Normalization (optional): Normalize each input series to zero mean and unit variance
- Stacked Mixing Layers: Apply n_blocks mixing layers sequentially; each layer combines TemporalMixing and FeatureMixing
- Output Projection: Project the temporal dimension from seq_len to pred_len
- Reverse Instance Normalization (optional): Denormalize the output (a schematic sketch of this flow follows the list)
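The sketch below expresses this flow in Keras/TensorFlow terms. The names (`tsmixer_forward`, `blocks`, `projection`) are illustrative placeholders, not the library's actual internals; it shows the ordering of the steps, nothing more.

```python
import tensorflow as tf

def tsmixer_forward(x, blocks, projection, use_norm=True):
    """Schematic forward pass for x of shape (batch, seq_len, n_features)."""
    if use_norm:
        # Step 1: instance-normalize each series over the time axis (RevIN).
        mean = tf.reduce_mean(x, axis=1, keepdims=True)
        std = tf.math.reduce_std(x, axis=1, keepdims=True) + 1e-5
        x = (x - mean) / std

    # Step 2: apply the n_blocks mixing layers sequentially. Each block mixes
    # along time (TemporalMixing), then along features (FeatureMixing),
    # with residual connections around both sub-layers.
    for block in blocks:
        x = block(x)

    # Step 3: project the temporal dimension seq_len -> pred_len.
    # Dense layers act on the last axis, so move time to the back first.
    x = tf.transpose(x, [0, 2, 1])   # (batch, n_features, seq_len)
    x = projection(x)                # (batch, n_features, pred_len)
    x = tf.transpose(x, [0, 2, 1])   # (batch, pred_len, n_features)

    if use_norm:
        # Step 4: reverse the instance normalization on the forecast.
        x = x * std + mean
    return x
```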
Usage Example
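A minimal end-to-end sketch, again assuming a Keras-style `TSMixer` class importable from a `tsmixer` module; the synthetic data and hyperparameters are illustrative only.

```python
import numpy as np
from tsmixer import TSMixer  # hypothetical import path, as above

# Synthetic multivariate data: 1000 windows of 96 steps over 7 series.
x_train = np.random.randn(1000, 96, 7).astype("float32")
y_train = np.random.randn(1000, 24, 7).astype("float32")

model = TSMixer(seq_len=96, pred_len=24, n_features=7)

# Any Keras optimizer/loss works; the model trains like a standard Keras model.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Forecasts come back as (batch_size, pred_len, n_features).
forecast = model.predict(x_train[:8])
print(forecast.shape)  # (8, 24, 7)
```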
Advanced Usage
Model with Different Configurations
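Illustrative configurations under the same import assumption; the specific hyperparameter values are examples, not tuned recommendations.

```python
from tsmixer import TSMixer  # hypothetical import path

# Lightweight configuration for short sequences or constrained hardware.
small = TSMixer(
    seq_len=48, pred_len=12, n_features=7,
    n_blocks=1, ff_dim=32, dropout=0.05,
)

# Deeper, wider configuration for long lookbacks and complex dependencies.
large = TSMixer(
    seq_len=336, pred_len=96, n_features=7,
    n_blocks=6, ff_dim=256, dropout=0.2,
)

# Variant without Reversible Instance Normalization,
# e.g. for inputs that are already standardized.
plain = TSMixer(seq_len=96, pred_len=24, n_features=7, use_norm=False)

for m in (small, large, plain):
    m.compile(optimizer="adam", loss="mse")
```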
Serialization
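A sketch of standard Keras save/load, continuing from the usage example. Whether TSMixer registers itself for Keras serialization (get_config/from_config) is an assumption here, so custom_objects is passed defensively.

```python
import keras

# Save the trained model in the native Keras format.
model.save("tsmixer.keras")

# Reload for inference; custom model classes may need to be passed explicitly.
restored = keras.models.load_model(
    "tsmixer.keras",
    custom_objects={"TSMixer": TSMixer},
)
restored.predict(x_train[:8])
```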
Best Use Cases
- Multivariate Time Series Forecasting: Multiple related time series with complex dependencies
- Efficient Models: When computational efficiency is critical
- Interpretability: All-MLP models are more interpretable than attention-based methods
- Long Sequences: The lightweight mixing layers stay practical for long lookback windows
- Resource-Constrained Environments: Lower memory footprint than transformers
Performance Considerations
- seq_len: Larger values capture longer-term dependencies but increase computation
- n_blocks: More blocks can improve accuracy but increase model size and training time
- ff_dim: Larger dimensions improve expressiveness but increase parameters
- dropout: Helps prevent overfitting; use higher values with limited data
- use_norm: Instance normalization can improve training stability
Comparison with Other Architectures
vs. Transformers
- Advantage: Simpler, more efficient, linear complexity
- Disadvantage: May not capture long-range dependencies as well
vs. LSTM/GRU
- Advantage: Parallel processing, faster training
- Disadvantage: Lacks the sequential inductive bias that recurrent models provide by construction
vs. NLinear/DLinear
- Advantage: Captures both temporal and feature interactions
- Disadvantage: More parameters and complexity
References
Chen, Si-An, Chun-Liang Li, Nate Yoder, Sercan O. Arik, and Tomas Pfister (2023). "TSMixer: An All-MLP Architecture for Time Series Forecasting." arXiv preprint arXiv:2303.06053.
Notes
- Instance normalization (RevIN) is enabled by default and helps with training
- Residual connections in mixing layers prevent gradient issues in deep models
- Batch normalization parameters in mixing layers are learned during training
- The model is fully differentiable and supports all Keras optimizers and losses