🔀 MixingLayer

🔴 Advanced ✅ Stable ⏱️ Time Series

🎯 Overview

The MixingLayer is the core building block of the TSMixer architecture, combining sequential TemporalMixing and FeatureMixing layers. It jointly learns temporal and cross-sectional representations by alternating between time and feature dimension mixing. This layer is essential for capturing complex interdependencies in multivariate time series data.

The architecture enables the model to learn both temporal patterns and feature correlations in a unified framework, making it highly effective for multivariate forecasting tasks.

๐Ÿ” How It Works

The MixingLayer processes data sequentially through two distinct mixing phases (a minimal forward-pass sketch follows the diagram below):

  1. TemporalMixing Phase: Mixes information across the time dimension
     • Batch normalization across time-feature space
     • Linear projection across the temporal axis
     • ReLU activation for non-linearity
     • Residual connection for gradient flow
  2. FeatureMixing Phase: Mixes information across the feature dimension
     • Batch normalization across feature-time space
     • Feed-forward network with configurable hidden dimension
     • Two-layer MLP for feature interactions
     • Residual connection for gradient flow

```mermaid
graph TD
    A["Input<br/>(batch, time, features)"] --> B["TemporalMixing<br/>Temporal MLP + ResNet"]
    B --> C["Intermediate<br/>(batch, time, features)"]
    C --> D["FeatureMixing<br/>Feature FFN + ResNet"]
    D --> E["Output<br/>(batch, time, features)"]

    B1["BatchNorm<br/>Linear(time)"] -.-> B
    B2["ReLU<br/>Dropout"] -.-> B

    D1["BatchNorm<br/>Dense(ff_dim)"] -.-> D
    D2["Dense(feat)<br/>Dropout"] -.-> D

    style A fill:#e6f3ff,stroke:#4a86e8
    style E fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style D fill:#f3e5f5,stroke:#9c27b0
```
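
To make the two phases concrete, here is a minimal sketch of a mixing block written in plain Keras 3. It follows the structure described above (norm → mix → dropout → residual, first over time, then over features) but is illustrative only, not the kerasfactory implementation; the class name MixingBlockSketch, the normalization axes, and the exact placement of dropout are assumptions.

```python
import keras
from keras import layers, ops


class MixingBlockSketch(keras.layers.Layer):
    """Illustrative two-phase mixing block: temporal MLP + feature MLP, each with a residual."""

    def __init__(self, input_size, n_series, ff_dim, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        # Temporal phase: one Dense layer applied along the time axis.
        self.temporal_norm = layers.BatchNormalization()
        self.temporal_dense = layers.Dense(input_size, activation="relu")
        # Feature phase: a two-layer MLP applied along the feature axis.
        self.feature_norm = layers.BatchNormalization()
        self.feature_hidden = layers.Dense(ff_dim, activation="relu")
        self.feature_out = layers.Dense(n_series)
        self.drop = layers.Dropout(dropout)

    def call(self, x, training=None):
        # Temporal mixing: transpose so the Dense layer acts on the time dimension.
        h = self.temporal_norm(x, training=training)
        h = ops.transpose(h, (0, 2, 1))              # (batch, features, time)
        h = self.temporal_dense(h)                   # mix across time steps
        h = ops.transpose(h, (0, 2, 1))              # back to (batch, time, features)
        x = x + self.drop(h, training=training)      # residual connection

        # Feature mixing: two-layer MLP over the feature dimension.
        h = self.feature_norm(x, training=training)
        h = self.feature_hidden(h)                   # expand to ff_dim
        h = self.feature_out(self.drop(h, training=training))
        return x + self.drop(h, training=training)   # residual connection


# Shapes are preserved: (batch, time, features) in and out.
block = MixingBlockSketch(input_size=96, n_series=7, ff_dim=64)
y = block(keras.random.normal((32, 96, 7)), training=False)
print(y.shape)  # (32, 96, 7)
```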

💡 Why Use This Layer?

| Challenge | Traditional Approach | MixingLayer Solution |
| --- | --- | --- |
| Temporal Dependencies | Fixed patterns | 🎯 Learnable temporal mixing |
| Feature Correlations | Independent features | 🔗 Joint feature learning |
| Deep Models | Gradient vanishing | ✨ Residual connections stabilize training |
| Complex Interactions | Simple architectures | 🧩 Dual-phase mixing strategy |

📊 Use Cases

  • Multivariate Time Series Forecasting: Multiple related time series with temporal and cross-series dependencies
  • Deep Architectures: As a stackable building block for very deep models (4+ layers)
  • Complex Pattern Learning: When both temporal and feature interactions are important
  • High-Dimensional Data: When there are more than ~10 features with strong cross-feature correlations
  • Transfer Learning: As a feature extractor in downstream forecasting tasks

🚀 Quick Start

Basic Usage

```python
import keras
from kerasfactory.layers import MixingLayer

# Create sample multivariate time series
batch_size, time_steps, features = 32, 96, 7
x = keras.random.normal((batch_size, time_steps, features))

# Apply mixing layer (temporal + feature)
layer = MixingLayer(
    n_series=features,
    input_size=time_steps,
    dropout=0.1,
    ff_dim=64
)
output = layer(x, training=True)

print(f"Input shape:  {x.shape}")        # (32, 96, 7)
print(f"Output shape: {output.shape}")   # (32, 96, 7)
```

Stacking Multiple Layers

```python
from kerasfactory.layers import MixingLayer
import keras

# Create stacked mixing layers (core of TSMixer)
n_blocks = 4
mixing_layers = [
    MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)
    for _ in range(n_blocks)
]

x = keras.random.normal((32, 96, 7))
for layer in mixing_layers:
    x = layer(x, training=True)

print(f"After {n_blocks} mixing blocks: {x.shape}")  # (32, 96, 7)
```

🎓 Advanced Usage

Parameter Tuning for Different Scenarios

```python
from kerasfactory.layers import MixingLayer

# Scenario 1: Large dataset with high dimensionality
large_model = MixingLayer(
    n_series=50,      # Many features
    input_size=512,   # Long sequences
    dropout=0.05,     # Low dropout (sufficient data)
    ff_dim=256        # Larger capacity
)

# Scenario 2: Small dataset with few features
small_model = MixingLayer(
    n_series=5,       # Few features
    input_size=48,    # Short sequences
    dropout=0.3,      # Higher dropout (prevent overfitting)
    ff_dim=32         # Reduced capacity
)

# Scenario 3: Bottleneck architecture
bottleneck = MixingLayer(
    n_series=20,
    input_size=96,
    dropout=0.1,
    ff_dim=8  # ff_dim < n_series for compression
)
```

Training Mode Effects

```python
import keras
import tensorflow as tf
from kerasfactory.layers import MixingLayer

layer = MixingLayer(n_series=7, input_size=96, dropout=0.2, ff_dim=64)
x = keras.random.normal((32, 96, 7))

# Training: dropout active, batch norm statistics updated
output_train1 = layer(x, training=True)
output_train2 = layer(x, training=True)
train_diff = tf.reduce_mean(tf.abs(output_train1 - output_train2))
print(f"Training mode difference: {train_diff:.6f}")  # > 0 due to dropout

# Inference: dropout disabled, batch norm uses running statistics
output_infer1 = layer(x, training=False)
output_infer2 = layer(x, training=False)
tf.debugging.assert_near(output_infer1, output_infer2)
print("Inference mode: outputs are identical ✓")
```

Serialization & Model Checkpointing

```python
import keras
from kerasfactory.layers import MixingLayer

# Create and configure layer
layer = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)

# Get config for saving
config = layer.get_config()
print(f"Config keys: {config.keys()}")

# Recreate from config
new_layer = MixingLayer.from_config(config)

# Verify parameters match
assert new_layer.n_series == layer.n_series
assert new_layer.input_size == layer.input_size
assert new_layer.dropout_rate == layer.dropout_rate
assert new_layer.ff_dim == layer.ff_dim
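```

For whole-model checkpointing, the layer can also be saved and reloaded as part of a .keras file. Below is a minimal sketch of the round trip; passing custom_objects is an assumption and may be unnecessary if MixingLayer is already registered for serialization in your environment.

```python
import keras
from kerasfactory.layers import MixingLayer

# Wrap the layer in a tiny functional model and round-trip it through a file.
inputs = keras.Input(shape=(96, 7))
outputs = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)(inputs)
model = keras.Model(inputs, outputs)

model.save("mixing_layer_demo.keras")
restored = keras.models.load_model(
    "mixing_layer_demo.keras",
    custom_objects={"MixingLayer": MixingLayer},  # assumption: only needed if not auto-registered
)
restored.summary()
```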

📈 Performance Characteristics

| Aspect | Value | Details |
| --- | --- | --- |
| Time Complexity | O(B × D × T² + B × T × D × H) | B = batch, T = time steps, D = features, H = ff_dim |
| Space Complexity | O(B × T × D) | Residual connection overhead minimal |
| Gradient Flow | ✅ Excellent | Dual residual connections prevent vanishing gradients |
| Trainability | ⭐⭐⭐⭐⭐ | Very stable with batch norm in both phases |
| Memory Usage | Moderate | More than a single mixing phase, less than attention |
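
As a rough sanity check of these numbers, assuming the structure described above (a T × T temporal projection plus a D → H → D feature MLP, ignoring normalization and bias terms), the cost of one block for the quick-start configuration works out as follows; the formulas are back-of-envelope approximations, not measurements of the actual implementation.

```python
# Rough cost estimate for one mixing block (illustrative approximation only).
B, T, D, H = 32, 96, 7, 64   # batch, time steps, features, ff_dim

temporal_macs = B * D * T * T            # Dense(T) applied to each feature series
feature_macs = B * T * (D * H + H * D)   # two-layer MLP: D -> H -> D

params_temporal = T * T + T              # T x T weight matrix + bias
params_feature = D * H + H + H * D + D   # two Dense layers with biases

print(f"~{(temporal_macs + feature_macs) / 1e6:.1f}M MACs per forward pass")
print(f"~{params_temporal + params_feature:,} trainable weights (excluding norm parameters)")
```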

🔧 Parameter Guide

| Parameter | Type | Range | Impact | Recommendation |
| --- | --- | --- | --- | --- |
| n_series | int | > 0 | Number of features/channels | Match your data dimensionality |
| input_size | int | > 0 | Temporal sequence length | Match your time series length |
| dropout | float | [0, 1] | Regularization strength | 0.1-0.2 for training stability |
| ff_dim | int | > 0 | Feature mixing capacity | 1-2x n_series for expressiveness |

Tuning Strategy

```python
from kerasfactory.layers import MixingLayer

# Start with baseline
base = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)

# For overfitting: increase dropout or reduce ff_dim
overfit_fix = MixingLayer(n_series=7, input_size=96, dropout=0.2, ff_dim=32)

# For underfitting: decrease dropout or increase ff_dim
underfit_fix = MixingLayer(n_series=7, input_size=96, dropout=0.05, ff_dim=128)

# For efficiency: reduce ff_dim
efficient = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=32)
```

🧪 Testing & Validation

Comprehensive Tests

```python
import tensorflow as tf
from kerasfactory.layers import MixingLayer

layer = MixingLayer(n_series=7, input_size=96, dropout=0.1, ff_dim=64)
x = tf.random.normal((32, 96, 7))

# Test 1: Shape preservation
output = layer(x)
assert output.shape == x.shape, "Shape mismatch!"

# Test 2: Residual effect (output differs from input)
output = layer(x, training=False)
diff = tf.reduce_max(tf.abs(output - x))
assert diff > 0, "Output should differ from input due to mixing"

# Test 3: Dropout effect
outputs_train = [layer(x, training=True) for _ in range(5)]
diffs = [tf.reduce_mean(tf.abs(outputs_train[i] - outputs_train[i+1])).numpy()
         for i in range(4)]
assert all(d > 0 for d in diffs), "Dropout should cause variation"

# Test 4: Batch norm stability
outputs = [layer(x, training=False) for _ in range(5)]
for o1, o2 in zip(outputs[:-1], outputs[1:]):
    tf.debugging.assert_near(o1, o2)

print("✅ All tests passed!")
```

โš ๏ธ Common Issues & Solutions

| Issue | Cause | Solution |
| --- | --- | --- |
| NaN in output | Unstable batch norm or extreme inputs | Normalize inputs to [-1, 1]; check initial weights |
| Slow convergence | Dropout too high or ff_dim too small | Reduce dropout to 0.05-0.1; increase ff_dim |
| High memory usage | Large ff_dim or sequence length | Reduce ff_dim; use gradient accumulation |
| Poor generalization | Insufficient regularization | Increase dropout or add weight regularization |
| Vanishing gradients | Very deep stacking | Use skip connections between mixing blocks |
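
For the NaN case, one lightweight option is to standardize each series along the time axis before it reaches the mixing blocks. Below is a minimal sketch with plain keras.ops; inside TSMixer this role is normally played by ReversibleInstanceNorm (see the related layers below).

```python
from keras import ops


def standardize_per_series(x, eps=1e-6):
    """Scale each series to zero mean and unit variance over the time axis."""
    mean = ops.mean(x, axis=1, keepdims=True)   # (batch, 1, features)
    std = ops.std(x, axis=1, keepdims=True)
    return (x - mean) / (std + eps)
```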

📚 Related Layers

  • TemporalMixing: Handles temporal dimension mixing
  • FeatureMixing: Handles feature dimension mixing
  • ReversibleInstanceNorm: Normalization layer for TSMixer
  • MovingAverage: Alternative temporal processing
  • MultiScaleSeasonMixing: Multi-scale seasonal patterns

🔗 Integration with TSMixer

```text
TSMixer Architecture:
    Input
      ↓
  [RevIN Normalize]
      ↓
  [MixingLayer] × n_blocks ← You are here!
      ↓
  [Dense Projection]
      ↓
  [RevIN Denormalize]
      ↓
  Output Forecast
```
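
Putting the pieces together, below is a minimal sketch of a TSMixer-style forecaster built from stacked MixingLayer blocks. It omits the RevIN normalize/denormalize steps for brevity, and the horizon, block count, and ff_dim values are illustrative placeholders rather than library defaults.

```python
import keras
from keras import layers
from kerasfactory.layers import MixingLayer

n_series, input_size, horizon, n_blocks = 7, 96, 24, 4

inputs = keras.Input(shape=(input_size, n_series))
x = inputs
for _ in range(n_blocks):                       # stacked mixing blocks
    x = MixingLayer(n_series=n_series, input_size=input_size,
                    dropout=0.1, ff_dim=64)(x)

# Dense projection: map the history length to the forecast horizon.
x = layers.Permute((2, 1))(x)                   # (batch, features, time)
x = layers.Dense(horizon)(x)
outputs = layers.Permute((2, 1))(x)             # (batch, horizon, features)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")
model.summary()
```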

📖 References

  • Chen, S.-A., et al. (2023). "TSMixer: An All-MLP Architecture for Time Series Forecasting." arXiv:2303.06053
  • Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift."
  • He, K., et al. (2015). "Deep Residual Learning for Image Recognition."

💻 Implementation Details

  • Backend: Pure Keras 3 ops module
  • Computation: CPU/GPU optimized
  • Memory: Efficient with residual shortcuts
  • Serialization: Full get_config() / from_config() support
  • Compatibility: Works with any Keras optimizer and loss function