๐ŸŽซ TokenEmbedding

๐ŸŸข Beginner โœ… Stable โฑ๏ธ Time Series

๐ŸŽฏ Overview

The TokenEmbedding layer embeds raw time series values using 1D convolution with learnable filters and bias. It transforms raw numerical input values into rich, learnable feature representations suitable for transformer-based models and deep learning architectures.

This layer is inspired by the TokenEmbedding component used in state-of-the-art time series forecasting models like Informer and TimeMixer. It provides a learnable alternative to fixed embeddings, allowing the model to discover optimal feature representations during training.

๐Ÿ” How It Works

The TokenEmbedding layer processes data through a 1D convolutional transformation:

  1. Input Reception: Receives raw time series values of shape (batch, time_steps, channels)
  2. Transposition: Rearranges to (batch, channels, time_steps) for Conv1D
  3. 1D Convolution: Applies learnable convolution kernels of size 3 across the time dimension
  4. Same Padding: Preserves the temporal dimension with "same" padding
  5. Output Generation: Returns embedded features of shape (batch, time_steps, d_model); a minimal Keras sketch follows the diagram below
graph TD
    A["Input: (batch, time, c_in)"] -->|Transpose| B["(batch, c_in, time)"]
    B -->|Conv1D kernel=3<br/>filters=d_model| C["(batch, d_model, time)"]
    C -->|Transpose| D["Output: (batch, time, d_model)"]

    style A fill:#e6f3ff,stroke:#4a86e8
    style D fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style C fill:#f3e5f5,stroke:#9c27b0
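
As a mental model, the whole transform is a single learnable Conv1D over the time axis. The snippet below is a minimal sketch of that idea in plain Keras, not the actual kerasfactory implementation; since Keras' Conv1D already works in channels-last layout, the explicit transposes above become unnecessary.

import keras
from keras import layers

# Minimal sketch of the transform described above (assumed equivalent,
# not the actual kerasfactory internals).
class SimpleTokenEmbedding(layers.Layer):
    def __init__(self, c_in, d_model, **kwargs):
        super().__init__(**kwargs)
        self.c_in = c_in
        self.conv = layers.Conv1D(
            filters=d_model,                 # output embedding dimension
            kernel_size=3,                   # receptive field over the time axis
            padding="same",                  # preserves the number of time steps
            kernel_initializer="he_normal",  # Kaiming normal initialization
        )

    def call(self, x):
        # Keras Conv1D expects (batch, time, channels), so no transposes are needed.
        return self.conv(x)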

๐Ÿ’ก Why Use This Layer?

Challenge              Fixed Embeddings   Learnable Tokens   TokenEmbedding's Solution
Feature Learning       No learning        Limited            ✨ Learnable 1D convolution
Contextual Awareness   No context         Local only         🎯 Kernel-size receptive field
Adaptation             Static             Slow               ⚡ Trained end-to-end
Multivariate Support   Single channel     Per-channel        🔄 True multi-channel learning
Initialization         Random/fixed       Basic              🔧 Kaiming normal init

๐Ÿ“Š Use Cases

  • Time Series Forecasting: Embedding raw values in LSTM/Transformer models
  • Anomaly Detection: Feature extraction for anomaly detection models
  • Time Series Classification: Converting raw series to embeddings for classification
  • Multivariate Analysis: Processing multiple correlated time series simultaneously
  • Feature Engineering: Automatic feature extraction from raw temporal data
  • Preprocessing Pipeline: As first layer in deep time series models
  • Pre-training: For self-supervised learning on time series

๐Ÿš€ Quick Start

Basic Usage

import keras
from kerasfactory.layers import TokenEmbedding

# Create token embedding layer
token_emb = TokenEmbedding(c_in=7, d_model=64)

# Create sample time series data
batch_size, time_steps, n_features = 32, 100, 7
x = keras.random.normal((batch_size, time_steps, n_features))

# Apply embedding
output = token_emb(x)

print(f"Input shape: {x.shape}")      # (32, 100, 7)
print(f"Output shape: {output.shape}") # (32, 100, 64)

In a Time Series Forecasting Model

import keras
from kerasfactory.layers import TokenEmbedding, PositionalEmbedding

# Build forecasting model
def create_forecasting_model():
    inputs = keras.Input(shape=(96, 7))  # 96 time steps, 7 features

    # Embed raw values
    x = TokenEmbedding(c_in=7, d_model=64)(inputs)

    # Add positional encoding
    x = x + PositionalEmbedding(max_len=96, d_model=64)(x)

    # Process with transformers
    x = keras.layers.MultiHeadAttention(num_heads=8, key_dim=8)(x, x)
    x = keras.layers.Dense(128, activation='relu')(x)
    x = keras.layers.Dense(32, activation='relu')(x)

    # Project to the 7 output features at each time step
    outputs = keras.layers.Dense(7)(x)  # output shape: (batch, 96, 7)

    return keras.Model(inputs, outputs)

model = create_forecasting_model()
model.compile(optimizer='adam', loss='mse')

With Multivariate Time Series

import keras
from kerasfactory.layers import TokenEmbedding, TemporalEmbedding

# Multi-feature time series embedding
token_emb = TokenEmbedding(c_in=12, d_model=96)
temporal_emb = TemporalEmbedding(d_model=96, embed_type='fixed')

# Input data
x = keras.random.normal((32, 100, 12))  # 12 features
x_mark = keras.random.randint((32, 100, 5), minval=0, maxval=24)  # integer time marks

# Embed values
x_embedded = token_emb(x)

# Add temporal context
temporal_features = temporal_emb(x_mark)
combined = x_embedded + temporal_features

print(f"Combined embedding shape: {combined.shape}")  # (32, 100, 96)

Advanced Multi-Scale Architecture

import keras
from keras import ops
from kerasfactory.layers import TokenEmbedding

class MultiScaleTimeSeriesModel(keras.Model):
    def __init__(self, c_in, d_model, num_scales=3):
        super().__init__()
        self.token_emb = TokenEmbedding(c_in, d_model)
        # Coarser scales use progressively smaller embedding dimensions.
        self.scale_embeddings = [
            TokenEmbedding(c_in, d_model // (2 ** i))
            for i in range(num_scales)
        ]
        # Project the concatenated multi-scale features back to d_model so
        # they can be added to the primary embedding.
        self.scale_proj = keras.layers.Dense(d_model)

    def call(self, inputs):
        # Primary embedding
        x = self.token_emb(inputs)

        # Multi-scale embeddings
        scales = [emb(inputs) for emb in self.scale_embeddings]

        # Combine scales and add to the primary embedding
        multi_scale = ops.concatenate(scales, axis=-1)
        return x + self.scale_proj(multi_scale)

๐Ÿ”ง API Reference

TokenEmbedding

kerasfactory.layers.TokenEmbedding(
    c_in: int,
    d_model: int,
    name: str | None = None,
    **kwargs: Any
)

Parameters

Parameter   Type         Default   Description
c_in        int          —         Number of input channels (features)
d_model     int          —         Output embedding dimension
name        str | None   None      Optional layer name for identification

Input Shape

  • (batch_size, time_steps, c_in)

Output Shape

  • (batch_size, time_steps, d_model)

Returns

  • Embedded time series tensor with learned representations

๐Ÿ“ˆ Performance Characteristics

  • Time Complexity: O(time_steps ร— c_in ร— d_model ร— kernel_size) per forward pass
  • Space Complexity: O(c_in ร— d_model ร— kernel_size) for weights
  • Trainable Parameters: c_in × d_model × kernel_size + d_model (weights + bias; see the check below)
  • Training Efficiency: Fast convergence with proper initialization
  • Inference Speed: Optimized for batch processing
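
The snippet below sanity-checks the parameter-count formula, assuming the layer consists of a single biased Conv1D with kernel size 3 as described above:

import keras
from kerasfactory.layers import TokenEmbedding

token_emb = TokenEmbedding(c_in=7, d_model=64)
_ = token_emb(keras.random.normal((1, 96, 7)))  # call once so the weights are built

# Formula: c_in × d_model × kernel_size + d_model (assumes one biased Conv1D with kernel_size=3)
expected = 7 * 64 * 3 + 64  # 1,408
print(f"Expected parameters: {expected}")
print(f"Actual parameters:   {token_emb.count_params()}")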

๐ŸŽจ Advanced Usage

Custom Initialization

import keras
from kerasfactory.layers import TokenEmbedding

# Create layer with custom initialization
token_emb = TokenEmbedding(c_in=8, d_model=64)

# Access the internal conv layer and override its initializer.
# This must happen before the layer is built (i.e., before its first call);
# once the weights exist, changing the initializer has no effect.
conv_layer = token_emb.conv
conv_layer.kernel_initializer = keras.initializers.HeNormal()

Integration with Preprocessing

import keras
from kerasfactory.layers import TokenEmbedding, ReversibleInstanceNorm

# Preprocessing pipeline
normalizer = ReversibleInstanceNorm(num_features=7)
token_emb = TokenEmbedding(c_in=7, d_model=64)

# Apply normalization then embedding
x = keras.random.normal((32, 100, 7))
x_normalized = normalizer(x, mode='norm')
x_embedded = token_emb(x_normalized)

print(f"Embedded shape: {x_embedded.shape}")  # (32, 100, 64)

Ensemble of Embeddings

import keras
from kerasfactory.layers import TokenEmbedding

class EnsembleTokenEmbedding(keras.layers.Layer):
    def __init__(self, c_in, d_model, num_embeddings=3):
        super().__init__()
        # d_model should be divisible by num_embeddings so the concatenated
        # output has exactly d_model features.
        self.embeddings = [
            TokenEmbedding(c_in, d_model // num_embeddings)
            for _ in range(num_embeddings)
        ]

    def call(self, inputs):
        outputs = [emb(inputs) for emb in self.embeddings]
        return keras.ops.concatenate(outputs, axis=-1)

๐Ÿ” Visual Representation

Input Time Series (Raw Values)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Shape: (batch, time, channels)  โ”‚
โ”‚ Example: (32, 96, 7)            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ”‚
               โ–ผ
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚  Transposition    โ”‚
       โ”‚ (batch, ch, time) โ”‚
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ”‚
               โ–ผ
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚  Conv1D Layer     โ”‚
       โ”‚  kernel_size=3    โ”‚
       โ”‚  filters=d_model  โ”‚
       โ”‚  padding='same'   โ”‚
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ”‚
               โ–ผ
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚  Transposition    โ”‚
       โ”‚(batch, time, d_m) โ”‚
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ”‚
               โ–ผ
     Output Embeddings (Learned)
     ┌───────────────────────────────┐
     │ Shape: (batch, time, d_model) │
     │ Example: (32, 96, 64)         │
     └───────────────────────────────┘

๐Ÿ’ก Best Practices

  1. Match d_model: Ensure d_model matches downstream layer dimensions
  2. Normalize First: Apply normalization before embedding for stability
  3. Proper Initialization: Kaiming normal is applied automatically
  4. Batch Consistency: Use consistent batch sizes for training
  5. Feature Scaling: Consider scaling inputs to [-1, 1] range
  6. Layer Stacking: Combine with positional embeddings for transformers (see the combined sketch after this list)
  7. Learning Rate: Use moderate learning rates (0.001-0.01)
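
The sketch below combines practices 2, 6, and 7 into a single embedding front-end; it reuses the ReversibleInstanceNorm and PositionalEmbedding calls shown earlier on this page, and their exact signatures are assumed from those examples:

import keras
from kerasfactory.layers import TokenEmbedding, PositionalEmbedding, ReversibleInstanceNorm

# Normalize first, embed the raw values, then add positional context.
inputs = keras.Input(shape=(96, 7))
x = ReversibleInstanceNorm(num_features=7)(inputs, mode='norm')
x = TokenEmbedding(c_in=7, d_model=64)(x)
x = x + PositionalEmbedding(max_len=96, d_model=64)(x)

embedding_frontend = keras.Model(inputs, x)

# Moderate learning rate, as recommended above
optimizer = keras.optimizers.Adam(learning_rate=1e-3)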

โš ๏ธ Common Pitfalls

  • โŒ c_in mismatch: Using wrong input channel count causes shape errors
  • โŒ d_model too small: Underfitting if embedding dimension too small
  • โŒ Missing normalization: Training instability without preprocessing
  • โŒ Batch size 1: Can cause issues with layer normalization (if used)
  • โŒ Extreme values: Very large input values can cause training issues
  • โŒ Forgetting temporal position: Don't use alone; add positional encoding

๐Ÿ“š References

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"
  • Vaswani, A., et al. (2017). "Attention Is All You Need"
  • Zhou, H., et al. (2021). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"

โœ… Serialization

from kerasfactory.layers import TokenEmbedding

token_emb = TokenEmbedding(c_in=7, d_model=64)

# Get layer configuration
config = token_emb.get_config()

# Save to file
import json
with open('token_embedding_config.json', 'w') as f:
    json.dump(config, f)

# Recreate from config
new_layer = TokenEmbedding.from_config(config)
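
Beyond get_config, a whole model containing the layer can be saved and reloaded with Keras' native format. The example below passes TokenEmbedding through custom_objects as a precaution; if kerasfactory registers the layer as a serializable Keras object (an assumption about the library), that argument is unnecessary:

import keras
from kerasfactory.layers import TokenEmbedding

# Small model containing the layer
inputs = keras.Input(shape=(96, 7))
outputs = TokenEmbedding(c_in=7, d_model=64)(inputs)
model = keras.Model(inputs, outputs)

# Save and reload using the native .keras format
model.save("token_embedding_model.keras")
reloaded = keras.models.load_model(
    "token_embedding_model.keras",
    custom_objects={"TokenEmbedding": TokenEmbedding},
)
reloaded.summary()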

๐Ÿงช Testing & Validation

import keras
from kerasfactory.layers import TokenEmbedding

# Test with different input sizes
token_emb = TokenEmbedding(c_in=7, d_model=64)

# Small batch
x_small = keras.random.normal((1, 96, 7))
out_small = token_emb(x_small)
assert out_small.shape == (1, 96, 64)

# Large batch
x_large = keras.random.normal((256, 96, 7))
out_large = token_emb(x_large)
assert out_large.shape == (256, 96, 64)

# Different time steps
x_diff_time = keras.random.normal((32, 200, 7))
out_diff_time = token_emb(x_diff_time)
assert out_diff_time.shape == (32, 200, 64)

print("โœ“ All shape tests passed!")

Last Updated: 2025-11-04
Version: 1.0
Keras: 3.0+
Status: โœ… Production Ready