πŸ”„ TransformerBlock

πŸ”΄ Advanced βœ… Stable πŸ”₯ Popular

🎯 Overview

The TransformerBlock implements a standard transformer block: multi-head self-attention followed by a feed-forward network, with a residual connection and layer normalization around each sub-layer. It is useful for capturing complex relationships in tabular data and for sequence processing.

It is particularly powerful on tabular data with complex feature interactions, which makes it well suited to sophisticated feature processing and relationship modeling.

πŸ” How It Works

The TransformerBlock processes data through a standard transformer architecture (a Keras sketch of this flow follows the diagram below):

  1. Multi-Head Attention: Applies multi-head self-attention to capture relationships
  2. Residual Connection: Adds input to attention output for gradient flow
  3. Layer Normalization: Normalizes the attention output
  4. Feed-Forward Network: Applies two-layer feed-forward network
  5. Residual Connection: Adds attention output to feed-forward output
  6. Layer Normalization: Normalizes the final output
graph TD
    A[Input Features] --> B[Multi-Head Attention]
    B --> C[Add & Norm]
    A --> C
    C --> D[Feed-Forward Network]
    D --> E[Add & Norm]
    C --> E
    E --> F[Output Features]

    G[Layer Normalization] --> C
    H[Layer Normalization] --> E

    style A fill:#e6f3ff,stroke:#4a86e8
    style F fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style D fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e1f5fe,stroke:#03a9f4
    style E fill:#e1f5fe,stroke:#03a9f4
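
For reference, the post-norm flow above can be sketched with stock Keras layers. This is an illustrative re-implementation of the six steps, not the kerasfactory source; in particular, the key_dim choice (dim_model divided by num_heads) is an assumption made for the sketch.

import keras
from keras import layers

def transformer_block_sketch(x, dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1):
    # Steps 1-3: multi-head self-attention, residual connection, layer normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim_model // num_heads)(x, x)
    attn = layers.Dropout(dropout_rate)(attn)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))

    # Steps 4-6: two-layer feed-forward network, residual connection, layer normalization
    ff = layers.Dense(ff_units, activation="relu")(x)
    ff = layers.Dropout(dropout_rate)(ff)
    ff = layers.Dense(dim_model)(ff)
    return layers.LayerNormalization()(layers.Add()([x, ff]))

inputs = keras.Input(shape=(10, 64))
outputs = transformer_block_sketch(inputs)
print(keras.Model(inputs, outputs).count_params())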

πŸ’‘ Why Use This Layer?

| Challenge | Traditional Approach | TransformerBlock's Solution |
| --- | --- | --- |
| Feature Interactions | Limited interaction modeling | 🎯 Multi-head attention captures complex interactions |
| Sequence Processing | RNN-based processing | ⚑ Parallel processing with attention mechanisms |
| Long Dependencies | Limited by sequence length | 🧠 Self-attention captures long-range dependencies |
| Tabular Data | Simple feature processing | πŸ”— Sophisticated processing for tabular data |

πŸ“Š Use Cases

  • Tabular Data Processing: Complex feature interaction modeling
  • Sequence Processing: Time series and sequential data
  • Feature Engineering: Sophisticated feature transformation
  • Attention Mechanisms: Implementing attention-based processing
  • Deep Learning: Building deep transformer architectures

πŸš€ Quick Start

Basic Usage

import keras
from kerasfactory.layers import TransformerBlock

# Create sample input data
batch_size, seq_len, dim_model = 32, 10, 64
x = keras.random.normal((batch_size, seq_len, dim_model))

# Apply transformer block
transformer = TransformerBlock(
    dim_model=64,
    num_heads=4,
    ff_units=128,
    dropout_rate=0.1
)
output = transformer(x)

print(f"Input shape: {x.shape}")           # (32, 10, 64)
print(f"Output shape: {output.shape}")     # (32, 10, 64)

In a Sequential Model

import keras
from kerasfactory.layers import TransformerBlock

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu'),
    TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1),
    keras.layers.Dense(32, activation='relu'),
    TransformerBlock(dim_model=32, num_heads=2, ff_units=64, dropout_rate=0.1),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In a Functional Model

import keras
from kerasfactory.layers import TransformerBlock

# Define inputs
inputs = keras.Input(shape=(20, 32))  # 20 time steps, 32 features

# Apply transformer block
x = TransformerBlock(
    dim_model=32,
    num_heads=4,
    ff_units=64,
    dropout_rate=0.1
)(inputs)

# Continue processing
x = keras.layers.Dense(64, activation='relu')(x)
x = TransformerBlock(
    dim_model=64,
    num_heads=4,
    ff_units=128,
    dropout_rate=0.1
)(x)
x = keras.layers.Dense(32, activation='relu')(x)
# Without pooling, Dense(1) acts on each time step, so the output shape is (batch, 20, 1)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

Advanced Configuration

import keras
from kerasfactory.layers import TransformerBlock

# Advanced configuration with multiple transformer blocks
def create_transformer_network():
    inputs = keras.Input(shape=(15, 48))  # 15 time steps, 48 features

    # Multiple transformer blocks
    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(x)

    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(64, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation='relu')(x)

    # Multi-task output
    classification = keras.layers.Dense(3, activation='softmax', name='classification')(x)
    regression = keras.layers.Dense(1, name='regression')(x)

    return keras.Model(inputs, [classification, regression])

model = create_transformer_network()
model.compile(
    optimizer='adam',
    loss={'classification': 'categorical_crossentropy', 'regression': 'mse'},
    loss_weights={'classification': 1.0, 'regression': 0.5}
)

πŸ“– API Reference

kerasfactory.layers.TransformerBlock

This module implements a TransformerBlock layer that applies transformer-style self-attention and feed-forward processing to input tensors. It's particularly useful for capturing complex relationships in tabular data.

Classes

TransformerBlock
TransformerBlock(
    dim_model: int = 32,
    num_heads: int = 3,
    ff_units: int = 16,
    dropout_rate: float = 0.2,
    name: str | None = None,
    **kwargs: Any
)

Transformer block with multi-head attention and feed-forward layers.

This layer implements a standard transformer block with multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim_model | int | Dimensionality of the model. | 32 |
| num_heads | int | Number of attention heads. | 3 |
| ff_units | int | Number of units in the feed-forward network. | 16 |
| dropout_rate | float | Dropout rate for regularization. | 0.2 |
| name | str | Name for the layer. | None |

Input shape

Tensor with shape: (batch_size, sequence_length, dim_model) or (batch_size, dim_model) which will be automatically reshaped.

Output shape

Tensor with shape: (batch_size, sequence_length, dim_model) or (batch_size, dim_model) matching the input shape.

Example
import keras
from kerasfactory.layers import TransformerBlock

# Create sample input data
x = keras.random.normal((32, 10, 64))  # 32 samples, 10 time steps, 64 features

# Apply transformer block
transformer = TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1)
y = transformer(x)
print("Output shape:", y.shape)  # (32, 10, 64)

Initialize the TransformerBlock layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim_model | int | Model dimension. | 32 |
| num_heads | int | Number of attention heads. | 3 |
| ff_units | int | Feed-forward units. | 16 |
| dropout_rate | float | Dropout rate. | 0.2 |
| name | str \| None | Name of the layer. | None |
| **kwargs | Any | Additional keyword arguments. | {} |

Source code in kerasfactory/layers/TransformerBlock.py
def __init__(
    self,
    dim_model: int = 32,
    num_heads: int = 3,
    ff_units: int = 16,
    dropout_rate: float = 0.2,
    name: str | None = None,
    **kwargs: Any,
) -> None:
    """Initialize the TransformerBlock layer.

    Args:
        dim_model: Model dimension.
        num_heads: Number of attention heads.
        ff_units: Feed-forward units.
        dropout_rate: Dropout rate.
        name: Name of the layer.
        **kwargs: Additional keyword arguments.
    """
    # Set private attributes first
    self._dim_model = dim_model
    self._num_heads = num_heads
    self._ff_units = ff_units
    self._dropout_rate = dropout_rate

    # Validate parameters
    self._validate_params()

    # Set public attributes BEFORE calling parent's __init__
    self.dim_model = self._dim_model
    self.num_heads = self._num_heads
    self.ff_units = self._ff_units
    self.dropout_rate = self._dropout_rate

    # Initialize layers
    self.multihead_attention: layers.MultiHeadAttention | None = None
    self.dropout1: layers.Dropout | None = None
    self.add1: layers.Add | None = None
    self.layer_norm1: layers.LayerNormalization | None = None
    self.ff1: layers.Dense | None = None
    self.dropout2: layers.Dropout | None = None
    self.ff2: layers.Dense | None = None
    self.add2: layers.Add | None = None
    self.layer_norm2: layers.LayerNormalization | None = None

    # Call parent's __init__ after setting public attributes
    super().__init__(name=name, **kwargs)
Functions
compute_output_shape
compute_output_shape(
    input_shape: tuple[int, ...]
) -> tuple[int, ...]

Compute the output shape of the layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_shape | tuple[int, ...] | Shape of the input tensor. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[int, ...] | Shape of the output tensor. |

Source code in kerasfactory/layers/TransformerBlock.py
def compute_output_shape(self, input_shape: tuple[int, ...]) -> tuple[int, ...]:
    """Compute the output shape of the layer.

    Args:
        input_shape: Shape of the input tensor.

    Returns:
        Shape of the output tensor.
    """
    return input_shape

πŸ”§ Parameters Deep Dive

dim_model (int)

  • Purpose: Dimensionality of the model
  • Range: 8 to 512+ (typically 32-128)
  • Impact: Determines the size of the feature space
  • Recommendation: Start with 32-64, scale based on data complexity

num_heads (int)

  • Purpose: Number of attention heads
  • Range: 1 to 16+ (typically 2-8)
  • Impact: More heads = more attention patterns
  • Recommendation: Start with 4-6, adjust based on data complexity

ff_units (int)

  • Purpose: Number of units in the feed-forward network
  • Range: 16 to 512+ (typically 64-256)
  • Impact: Larger values = more complex transformations
  • Recommendation: Start with 2x dim_model, scale as needed

dropout_rate (float)

  • Purpose: Dropout rate for regularization
  • Range: 0.0 to 0.5 (typically 0.1-0.2)
  • Impact: Higher values = more regularization
  • Recommendation: Start with 0.1, adjust based on overfitting
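
Putting the recommendations above together, a reasonable starting configuration derives ff_units from dim_model and keeps dropout modest. The concrete numbers are illustrative starting points, not prescriptions:

from kerasfactory.layers import TransformerBlock

dim_model = 64                # start with 32-64, scale with data complexity
num_heads = 4                 # 4-6 heads cover most applications
ff_units = 2 * dim_model      # feed-forward width of about 2x the model dimension
dropout_rate = 0.1            # move toward 0.2 if the model overfits

block = TransformerBlock(
    dim_model=dim_model,
    num_heads=num_heads,
    ff_units=ff_units,
    dropout_rate=dropout_rate,
)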

πŸ“ˆ Performance Characteristics

  • Speed: ⚑⚑⚑ Fast for small to medium models; self-attention cost grows quadratically with sequence length
  • Memory: πŸ’ΎπŸ’ΎπŸ’Ύ Moderate; attention score storage grows with the number of heads and quadratically with sequence length
  • Accuracy: 🎯🎯🎯🎯 Excellent for complex relationship modeling
  • Best For: Tabular data with complex feature interactions
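
As a rough back-of-envelope for the memory point above: the attention score tensor alone has num_heads * seq_len * seq_len entries per example. The helper below only computes that count; it is an estimate for one intermediate tensor, not a full memory profile:

def attention_score_entries(seq_len: int, num_heads: int = 4) -> int:
    # One (seq_len x seq_len) attention score matrix per head, per example
    return num_heads * seq_len * seq_len

for seq_len in (10, 100, 1000):
    print(seq_len, attention_score_entries(seq_len))
# 10 -> 400, 100 -> 40,000, 1000 -> 4,000,000 entries per example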

🎨 Examples

Example 1: Tabular Data Processing

import keras
from kerasfactory.layers import TransformerBlock

# Create a transformer for tabular data
def create_tabular_transformer():
    inputs = keras.Input(shape=(25, 32))  # 25 features, 32 dimensions

    # Transformer processing
    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(64, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation='relu')(x)

    # Output
    outputs = keras.layers.Dense(1, activation='sigmoid')(x)

    return keras.Model(inputs, outputs)

model = create_tabular_transformer()
model.compile(optimizer='adam', loss='binary_crossentropy')

# Test with sample data
sample_data = keras.random.normal((100, 25, 32))
predictions = model(sample_data)
print(f"Tabular transformer predictions shape: {predictions.shape}")

Example 2: Time Series Processing

import keras
from kerasfactory.layers import TransformerBlock

# Create a transformer for time series data
def create_time_series_transformer():
    inputs = keras.Input(shape=(30, 16))  # 30 time steps, 16 features

    # Multiple transformer blocks
    x = TransformerBlock(
        dim_model=16,
        num_heads=4,
        ff_units=32,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=16,
        num_heads=4,
        ff_units=32,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(32, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)

    # Multi-task output
    trend = keras.layers.Dense(1, name='trend')(x)
    seasonality = keras.layers.Dense(1, name='seasonality')(x)
    anomaly = keras.layers.Dense(1, activation='sigmoid', name='anomaly')(x)

    return keras.Model(inputs, [trend, seasonality, anomaly])

model = create_time_series_transformer()
model.compile(
    optimizer='adam',
    loss={'trend': 'mse', 'seasonality': 'mse', 'anomaly': 'binary_crossentropy'},
    loss_weights={'trend': 1.0, 'seasonality': 0.5, 'anomaly': 0.3}
)

Example 3: Inspecting a Transformer Model

import keras
from kerasfactory.layers import TransformerBlock

# Inspect a small transformer model: input/output shapes and parameter count
def inspect_transformer_model():
    # Create model with transformer
    inputs = keras.Input(shape=(10, 32))
    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(inputs)
    outputs = keras.layers.Dense(1, activation='sigmoid')(x)

    model = keras.Model(inputs, outputs)

    # Test with sample data
    sample_data = keras.random.normal((5, 10, 32))
    predictions = model(sample_data)

    print("Attention Analysis:")
    print("=" * 40)
    print(f"Input shape: {sample_data.shape}")
    print(f"Output shape: {predictions.shape}")
    print(f"Model parameters: {model.count_params()}")

    return model

# Inspect the model
# model = inspect_transformer_model()

πŸ’‘ Tips & Best Practices

  • Model Dimension: Start with 32-64, scale based on data complexity
  • Attention Heads: Use 4-6 heads for most applications
  • Feed-Forward Units: Use 2x model dimension as starting point
  • Dropout Rate: Use 0.1-0.2 for regularization
  • Residual Connections: Built-in residual connections for gradient flow
  • Layer Normalization: Built-in layer normalization for stable training

⚠️ Common Pitfalls

  • Model Dimension: Must match input feature dimension
  • Attention Heads: Must divide the model dimension evenly (see the sanity check below)
  • Memory Usage: Attention memory grows quadratically with sequence length and linearly with the number of heads
  • Overfitting: Monitor for overfitting with complex models
  • Gradient Flow: Residual connections help but monitor training
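
The head-divisibility pitfall above can be caught before building a model with a small pre-check. This helper is illustrative and not part of the kerasfactory API:

def check_transformer_config(dim_model: int, num_heads: int) -> None:
    # Raise early if the configuration violates the head-divisibility requirement
    if dim_model % num_heads != 0:
        raise ValueError(
            f"dim_model ({dim_model}) should be divisible by num_heads ({num_heads}); "
            f"each head would otherwise get a fractional width of {dim_model / num_heads:.2f}."
        )

check_transformer_config(dim_model=64, num_heads=4)    # passes silently
# check_transformer_config(dim_model=64, num_heads=5)  # would raise ValueError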

πŸ“š Further Reading