# TransformerBlock

## Overview
The TransformerBlock implements a standard transformer block: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. It is particularly useful for tabular data with complex feature interactions and for sequence processing, where it enables sophisticated feature transformation and relationship modeling.
## How It Works
The TransformerBlock processes data through a standard transformer pipeline (a conceptual Keras sketch follows the diagram below):
- Multi-Head Attention: Applies multi-head self-attention to capture relationships between positions and features
- Residual Connection: Adds the input to the attention output to preserve gradient flow
- Layer Normalization: Normalizes the attention branch
- Feed-Forward Network: Applies a two-layer feed-forward network
- Residual Connection: Adds the attention branch output to the feed-forward output
- Layer Normalization: Normalizes the final output
```mermaid
graph TD
    A[Input Features] --> B[Multi-Head Attention]
    B --> C[Add & Norm]
    A --> C
    C --> D[Feed-Forward Network]
    D --> E[Add & Norm]
    C --> E
    E --> F[Output Features]
    G[Layer Normalization] --> C
    H[Layer Normalization] --> E
    style A fill:#e6f3ff,stroke:#4a86e8
    style F fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style D fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e1f5fe,stroke:#03a9f4
    style E fill:#e1f5fe,stroke:#03a9f4
```
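The flow above maps directly onto stock Keras building blocks. The following is a conceptual sketch of that computation, not the library's actual implementation; details such as the activation function and `key_dim` split are assumptions.

```python
import keras
from keras import layers


def transformer_block_sketch(x, dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1):
    """Conceptual restatement of the flow above using stock Keras layers."""
    # Multi-head self-attention over the input sequence.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim_model // num_heads)(x, x)
    attn = layers.Dropout(dropout_rate)(attn)
    x = layers.LayerNormalization()(x + attn)  # residual connection + layer norm

    # Two-layer position-wise feed-forward network.
    ff = layers.Dense(ff_units, activation="relu")(x)
    ff = layers.Dense(dim_model)(ff)
    ff = layers.Dropout(dropout_rate)(ff)
    return layers.LayerNormalization()(x + ff)  # residual connection + layer norm


inputs = keras.Input(shape=(10, 64))
outputs = transformer_block_sketch(inputs)
model = keras.Model(inputs, outputs)
```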
## Why Use This Layer?
| Challenge | Traditional Approach | TransformerBlock's Solution |
|---|---|---|
| Feature Interactions | Limited interaction modeling | Multi-head attention captures complex interactions |
| Sequence Processing | RNN-based processing | Parallel processing with attention mechanisms |
| Long Dependencies | Limited by sequence length | Self-attention captures long-range dependencies |
| Tabular Data | Simple feature processing | Sophisticated processing for tabular data |
## Use Cases
- Tabular Data Processing: Complex feature interaction modeling
- Sequence Processing: Time series and sequential data
- Feature Engineering: Sophisticated feature transformation
- Attention Mechanisms: Implementing attention-based processing
- Deep Learning: Building deep transformer architectures
## Quick Start

### Basic Usage
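A minimal sketch of standalone usage. It assumes `TransformerBlock` is importable from `kerasfactory.layers`; the hyperparameter values are illustrative.

```python
import numpy as np
from kerasfactory.layers import TransformerBlock  # assumed import path

# dim_model must match the last dimension of the input.
block = TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1)

# 3D input: (batch_size, sequence_length, dim_model)
sequence = np.random.rand(32, 10, 64).astype("float32")
print(block(sequence).shape)  # (32, 10, 64)

# 2D input: (batch_size, dim_model) is reshaped internally and returned in the same shape.
tabular = np.random.rand(32, 64).astype("float32")
print(block(tabular).shape)   # (32, 64)
```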
### In a Sequential Model
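A sketch of the layer inside a `keras.Sequential` model; the surrounding layers are illustrative and the import path is assumed.

```python
import keras
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

model = keras.Sequential(
    [
        keras.Input(shape=(64,)),  # feature dimension must equal dim_model
        TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```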
### In a Functional Model
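A sketch using the functional API with a sequence input and two stacked blocks; import path and hyperparameters are assumptions.

```python
import keras
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

# Sequence input: (batch_size, sequence_length, dim_model)
inputs = keras.Input(shape=(10, 64))
x = TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1)(inputs)
x = TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1)(x)  # blocks stack cleanly
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(3, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```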
### Advanced Configuration
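A sketch of a deeper configuration with named blocks and stronger regularization; it assumes the layer follows the standard Keras `name`/`get_config` conventions described in the API reference below.

```python
import keras
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

DIM_MODEL = 128
NUM_HEADS = 8          # divides DIM_MODEL evenly
FF_UNITS = 256         # roughly 2x dim_model
DROPOUT_RATE = 0.2

inputs = keras.Input(shape=(20, DIM_MODEL))
x = inputs
for i in range(3):     # three stacked transformer blocks
    x = TransformerBlock(
        dim_model=DIM_MODEL,
        num_heads=NUM_HEADS,
        ff_units=FF_UNITS,
        dropout_rate=DROPOUT_RATE,
        name=f"transformer_block_{i}",
    )(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1)(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# Layer configuration can be inspected through the standard Keras config machinery.
config = model.get_layer("transformer_block_0").get_config()
```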
## API Reference

### kerasfactory.layers.TransformerBlock
This module implements a TransformerBlock layer that applies transformer-style self-attention and feed-forward processing to input tensors. It's particularly useful for capturing complex relationships in tabular data.
#### Classes

##### TransformerBlock
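A sketch of the class signature, reconstructed from the parameters and defaults documented below; the `keras.layers.Layer` base class and the type annotations are assumptions.

```python
from typing import Any
import keras


class TransformerBlock(keras.layers.Layer):  # base class assumed
    def __init__(
        self,
        dim_model: int = 32,
        num_heads: int = 3,
        ff_units: int = 16,
        dropout_rate: float = 0.2,
        name: str | None = None,
        **kwargs: Any,
    ) -> None: ...
```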
Transformer block with multi-head attention and feed-forward layers.
This layer implements a standard transformer block with multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim_model` | `int` | Dimensionality of the model. | `32` |
| `num_heads` | `int` | Number of attention heads. | `3` |
| `ff_units` | `int` | Number of units in the feed-forward network. | `16` |
| `dropout_rate` | `float` | Dropout rate for regularization. | `0.2` |
| `name` | `str` | Name for the layer. | `None` |
Input shape
Tensor with shape: (batch_size, sequence_length, dim_model) or
(batch_size, dim_model) which will be automatically reshaped.
Output shape
Tensor with shape: (batch_size, sequence_length, dim_model) or
(batch_size, dim_model) matching the input shape.
Example
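A short usage sketch (import path assumed):

```python
import numpy as np
from kerasfactory.layers import TransformerBlock  # assumed import path

layer = TransformerBlock(dim_model=32, num_heads=4, ff_units=64, dropout_rate=0.1)

x = np.random.rand(16, 8, 32).astype("float32")  # (batch, sequence_length, dim_model)
y = layer(x)
print(y.shape)  # (16, 8, 32) -- the output shape matches the input shape
```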
Initialize the TransformerBlock layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dim_model` | `int` | Model dimension. | `32` |
| `num_heads` | `int` | Number of attention heads. | `3` |
| `ff_units` | `int` | Feed-forward units. | `16` |
| `dropout_rate` | `float` | Dropout rate. | `0.2` |
| `name` | `str \| None` | Name of the layer. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |
Source code in kerasfactory/layers/TransformerBlock.py
#### Functions
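A sketch of the method signature, reconstructed from the documented parameter and return types; the standard Keras `compute_output_shape` name is inferred from the description below.

```python
def compute_output_shape(self, input_shape: tuple[int, ...]) -> tuple[int, ...]: ...
```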
Compute the output shape of the layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_shape` | `tuple[int, ...]` | Shape of the input tensor. | *required* |
Returns:
| Type | Description |
|---|---|
| `tuple[int, ...]` | Shape of the output tensor. |
Source code in kerasfactory/layers/TransformerBlock.py
## Parameters Deep Dive

### `dim_model` (int)
- Purpose: Dimensionality of the model
- Range: 8 to 512+ (typically 32-128)
- Impact: Determines the size of the feature space
- Recommendation: Start with 32-64, scale based on data complexity
### `num_heads` (int)
- Purpose: Number of attention heads
- Range: 1 to 16+ (typically 2-8)
- Impact: More heads = more attention patterns
- Recommendation: Start with 4-6, adjust based on data complexity
### `ff_units` (int)
- Purpose: Number of units in the feed-forward network
- Range: 16 to 512+ (typically 64-256)
- Impact: Larger values = more complex transformations
- Recommendation: Start with 2x dim_model, scale as needed
### `dropout_rate` (float)
- Purpose: Dropout rate for regularization
- Range: 0.0 to 0.5 (typically 0.1-0.2)
- Impact: Higher values = more regularization
- Recommendation: Start with 0.1 and increase it if the model overfits (a combined starting configuration is sketched below)
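Putting these recommendations together, a reasonable starting configuration might look like this (values are illustrative, import path assumed):

```python
from kerasfactory.layers import TransformerBlock  # assumed import path

DIM_MODEL = 64  # match the (projected) feature dimension of your data

block = TransformerBlock(
    dim_model=DIM_MODEL,
    num_heads=4,               # divides dim_model evenly
    ff_units=2 * DIM_MODEL,    # ~2x dim_model as a starting point
    dropout_rate=0.1,          # raise toward 0.2 if the model overfits
)
```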
## Performance Characteristics

- Speed: Fast for small to medium models; cost scales with the number of attention heads
- Memory: Moderate memory usage due to the attention mechanism
- Accuracy: Excellent for complex relationship modeling
- Best For: Tabular data with complex feature interactions
## Examples

### Example 1: Tabular Data Processing
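A sketch of an end-to-end tabular pipeline: raw features are projected to `dim_model` with a Dense layer before the block. Import path, data, and hyperparameters are illustrative.

```python
import keras
import numpy as np
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

# Synthetic tabular data: 1000 rows with 20 numeric features.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

DIM_MODEL = 64

inputs = keras.Input(shape=(20,))
x = layers.Dense(DIM_MODEL, activation="relu")(inputs)  # project features to dim_model
x = TransformerBlock(dim_model=DIM_MODEL, num_heads=4, ff_units=128, dropout_rate=0.1)(x)
x = layers.Flatten()(x)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```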
### Example 2: Time Series Processing
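A sketch of sequence processing for time series: each time step is embedded into `dim_model`, passed through two stacked blocks, and pooled for a single-value forecast. All names and values are illustrative.

```python
import keras
import numpy as np
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

# Synthetic time series: 500 windows of 24 time steps with 8 channels each.
X = np.random.rand(500, 24, 8).astype("float32")
y = np.random.rand(500, 1).astype("float32")

DIM_MODEL = 32

inputs = keras.Input(shape=(24, 8))
x = layers.Dense(DIM_MODEL)(inputs)  # embed each time step into dim_model
x = TransformerBlock(dim_model=DIM_MODEL, num_heads=4, ff_units=64, dropout_rate=0.1)(x)
x = TransformerBlock(dim_model=DIM_MODEL, num_heads=4, ff_units=64, dropout_rate=0.1)(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1)(x)  # one-step-ahead forecast

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```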
### Example 3: Attention Analysis
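The layer may not expose its attention scores directly, so this sketch analyzes the block indirectly: an intermediate `keras.Model` extracts the block's output representations, and the change relative to the input is measured per position. It uses only standard Keras APIs; names and values are illustrative.

```python
import keras
import numpy as np
from keras import layers
from kerasfactory.layers import TransformerBlock  # assumed import path

DIM_MODEL = 32

inputs = keras.Input(shape=(10, DIM_MODEL))
block_out = TransformerBlock(
    dim_model=DIM_MODEL, num_heads=4, ff_units=64, dropout_rate=0.1, name="transformer_block"
)(inputs)
outputs = layers.Dense(1)(layers.GlobalAveragePooling1D()(block_out))
model = keras.Model(inputs, outputs)

# Intermediate model that exposes the block's output representations.
feature_model = keras.Model(inputs, block_out)

x = np.random.rand(4, 10, DIM_MODEL).astype("float32")
features = feature_model.predict(x, verbose=0)

# How strongly does each sequence position change after attention + feed-forward?
delta = np.linalg.norm(features - x, axis=-1)  # (batch_size, sequence_length)
print(np.round(delta, 3))
```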
## Tips & Best Practices
- Model Dimension: Start with 32-64, scale based on data complexity
- Attention Heads: Use 4-6 heads for most applications
- Feed-Forward Units: Use 2x model dimension as starting point
- Dropout Rate: Use 0.1-0.2 for regularization
- Residual Connections: Built in, so gradients flow well even through deep stacks
- Layer Normalization: Built in, keeping training stable without extra configuration
## Common Pitfalls
- Model Dimension: `dim_model` must match the input feature dimension
- Attention Heads: `num_heads` must divide `dim_model` evenly
- Memory Usage: Scales with attention heads and sequence length
- Overfitting: Monitor for overfitting with complex models
- Gradient Flow: Residual connections help but monitor training
## Related Layers
- TabularAttention - Tabular attention mechanisms
- MultiResolutionTabularAttention - Multi-resolution attention
- GatedResidualNetwork - Gated residual networks
- TabularMoELayer - Mixture of experts
## Further Reading
- Attention Is All You Need - Original transformer paper
- Multi-Head Attention - Multi-head attention mechanism
- Transformer Architecture - Transformer concepts
- KerasFactory Layer Explorer - Browse all available layers
- Feature Engineering Tutorial - Complete guide to feature engineering