πŸ”„ TransformerBlock

πŸ”΄ Advanced βœ… Stable πŸ”₯ Popular

🎯 Overview

The TransformerBlock implements a standard transformer block: multi-head self-attention followed by a feed-forward network, with a residual connection and layer normalization around each sub-layer. It is useful for capturing complex relationships in tabular data and for sequence processing.

It is particularly powerful on tabular data with complex feature interactions, which makes it well suited to sophisticated feature processing and relationship modeling.

πŸ” How It Works

The TransformerBlock processes data through a standard transformer architecture (a Keras sketch of this flow follows the diagram below):

  1. Multi-Head Attention: Applies multi-head self-attention to capture relationships
  2. Residual Connection: Adds input to attention output for gradient flow
  3. Layer Normalization: Normalizes the attention output
  4. Feed-Forward Network: Applies two-layer feed-forward network
  5. Residual Connection: Adds attention output to feed-forward output
  6. Layer Normalization: Normalizes the final output
graph TD
    A[Input Features] --> B[Multi-Head Attention]
    B --> C[Add & Norm]
    A --> C
    C --> D[Feed-Forward Network]
    D --> E[Add & Norm]
    C --> E
    E --> F[Output Features]

    G[Layer Normalization] --> C
    H[Layer Normalization] --> E

    style A fill:#e6f3ff,stroke:#4a86e8
    style F fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style D fill:#f3e5f5,stroke:#9c27b0
    style C fill:#e1f5fe,stroke:#03a9f4
    style E fill:#e1f5fe,stroke:#03a9f4
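
For reference, the post-norm flow above can be sketched with stock Keras layers. This is an illustrative re-implementation of the six steps, not the kerasfactory source; in particular, the key_dim choice (dim_model divided by num_heads) is an assumption made for the sketch.

import keras
from keras import layers

def transformer_block_sketch(x, dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1):
    # Steps 1-3: multi-head self-attention, residual connection, layer normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim_model // num_heads)(x, x)
    attn = layers.Dropout(dropout_rate)(attn)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))

    # Steps 4-6: two-layer feed-forward network, residual connection, layer normalization
    ff = layers.Dense(ff_units, activation="relu")(x)
    ff = layers.Dropout(dropout_rate)(ff)
    ff = layers.Dense(dim_model)(ff)
    return layers.LayerNormalization()(layers.Add()([x, ff]))

inputs = keras.Input(shape=(10, 64))
outputs = transformer_block_sketch(inputs)
print(keras.Model(inputs, outputs).count_params())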

πŸ’‘ Why Use This Layer?

| Challenge | Traditional Approach | TransformerBlock's Solution |
| --- | --- | --- |
| Feature Interactions | Limited interaction modeling | 🎯 Multi-head attention captures complex interactions |
| Sequence Processing | RNN-based processing | ⚑ Parallel processing with attention mechanisms |
| Long Dependencies | Limited by sequence length | 🧠 Self-attention captures long-range dependencies |
| Tabular Data | Simple feature processing | πŸ”— Sophisticated processing for tabular data |

πŸ“Š Use Cases

  • Tabular Data Processing: Complex feature interaction modeling
  • Sequence Processing: Time series and sequential data
  • Feature Engineering: Sophisticated feature transformation
  • Attention Mechanisms: Implementing attention-based processing
  • Deep Learning: Building deep transformer architectures

πŸš€ Quick Start

Basic Usage

import keras
from kerasfactory.layers import TransformerBlock

# Create sample input data
batch_size, seq_len, dim_model = 32, 10, 64
x = keras.random.normal((batch_size, seq_len, dim_model))

# Apply transformer block
transformer = TransformerBlock(
    dim_model=64,
    num_heads=4,
    ff_units=128,
    dropout_rate=0.1
)
output = transformer(x)

print(f"Input shape: {x.shape}")           # (32, 10, 64)
print(f"Output shape: {output.shape}")     # (32, 10, 64)

In a Sequential Model

import keras
from kerasfactory.layers import TransformerBlock

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu'),
    TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1),
    keras.layers.Dense(32, activation='relu'),
    TransformerBlock(dim_model=32, num_heads=2, ff_units=64, dropout_rate=0.1),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In a Functional Model

import keras
from kerasfactory.layers import TransformerBlock

# Define inputs
inputs = keras.Input(shape=(20, 32))  # 20 time steps, 32 features

# Apply transformer block
x = TransformerBlock(
    dim_model=32,
    num_heads=4,
    ff_units=64,
    dropout_rate=0.1
)(inputs)

# Continue processing
x = keras.layers.Dense(64, activation='relu')(x)
x = TransformerBlock(
    dim_model=64,
    num_heads=4,
    ff_units=128,
    dropout_rate=0.1
)(x)
x = keras.layers.Dense(32, activation='relu')(x)
# Without pooling, Dense(1) acts on each time step, so the output shape is (batch, 20, 1)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

Advanced Configuration

import keras
from kerasfactory.layers import TransformerBlock

# Advanced configuration with multiple transformer blocks
def create_transformer_network():
    inputs = keras.Input(shape=(15, 48))  # 15 time steps, 48 features

    # Multiple transformer blocks
    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(x)

    x = TransformerBlock(
        dim_model=48,
        num_heads=6,
        ff_units=96,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(64, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation='relu')(x)

    # Multi-task output
    classification = keras.layers.Dense(3, activation='softmax', name='classification')(x)
    regression = keras.layers.Dense(1, name='regression')(x)

    return keras.Model(inputs, [classification, regression])

model = create_transformer_network()
model.compile(
    optimizer='adam',
    loss={'classification': 'categorical_crossentropy', 'regression': 'mse'},
    loss_weights={'classification': 1.0, 'regression': 0.5}
)

πŸ“– API Reference

kerasfactory.layers.TransformerBlock

This module implements a TransformerBlock layer that applies transformer-style self-attention and feed-forward processing to input tensors. It's particularly useful for capturing complex relationships in tabular data.

Classes

TransformerBlock
TransformerBlock(
    dim_model: int = 32,
    num_heads: int = 3,
    ff_units: int = 16,
    dropout_rate: float = 0.2,
    name: str | None = None,
    **kwargs: Any
)

Transformer block with multi-head attention and feed-forward layers.

This layer implements a standard transformer block with multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim_model | int | Dimensionality of the model. | 32 |
| num_heads | int | Number of attention heads. | 3 |
| ff_units | int | Number of units in the feed-forward network. | 16 |
| dropout_rate | float | Dropout rate for regularization. | 0.2 |
| name | str | Name for the layer. | None |

Input shape

Tensor with shape: (batch_size, sequence_length, dim_model) or (batch_size, dim_model) which will be automatically reshaped.

Output shape

Tensor with shape: (batch_size, sequence_length, dim_model) or (batch_size, dim_model) matching the input shape.

Example
import keras
from kerasfactory.layers import TransformerBlock

# Create sample input data
x = keras.random.normal((32, 10, 64))  # 32 samples, 10 time steps, 64 features

# Apply transformer block
transformer = TransformerBlock(dim_model=64, num_heads=4, ff_units=128, dropout_rate=0.1)
y = transformer(x)
print("Output shape:", y.shape)  # (32, 10, 64)

Initialize the TransformerBlock layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dim_model | int | Model dimension. | 32 |
| num_heads | int | Number of attention heads. | 3 |
| ff_units | int | Feed-forward units. | 16 |
| dropout_rate | float | Dropout rate. | 0.2 |
| name | str \| None | Name of the layer. | None |
| **kwargs | Any | Additional keyword arguments. | {} |

Source code in kerasfactory/layers/TransformerBlock.py
def __init__(
    self,
    dim_model: int = 32,
    num_heads: int = 3,
    ff_units: int = 16,
    dropout_rate: float = 0.2,
    name: str | None = None,
    **kwargs: Any,
) -> None:
    """Initialize the TransformerBlock layer.

    Args:
        dim_model: Model dimension.
        num_heads: Number of attention heads.
        ff_units: Feed-forward units.
        dropout_rate: Dropout rate.
        name: Name of the layer.
        **kwargs: Additional keyword arguments.
    """
    # Set private attributes first
    self._dim_model = dim_model
    self._num_heads = num_heads
    self._ff_units = ff_units
    self._dropout_rate = dropout_rate

    # Validate parameters
    self._validate_params()

    # Set public attributes BEFORE calling parent's __init__
    self.dim_model = self._dim_model
    self.num_heads = self._num_heads
    self.ff_units = self._ff_units
    self.dropout_rate = self._dropout_rate

    # Initialize layers
    self.multihead_attention: layers.MultiHeadAttention | None = None
    self.dropout1: layers.Dropout | None = None
    self.add1: layers.Add | None = None
    self.layer_norm1: layers.LayerNormalization | None = None
    self.ff1: layers.Dense | None = None
    self.dropout2: layers.Dropout | None = None
    self.ff2: layers.Dense | None = None
    self.add2: layers.Add | None = None
    self.layer_norm2: layers.LayerNormalization | None = None

    # Call parent's __init__ after setting public attributes
    super().__init__(name=name, **kwargs)
Functions
compute_output_shape
compute_output_shape(
    input_shape: tuple[int, ...]
) -> tuple[int, ...]

Compute the output shape of the layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_shape | tuple[int, ...] | Shape of the input tensor. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[int, ...] | Shape of the output tensor. |

Source code in kerasfactory/layers/TransformerBlock.py
def compute_output_shape(self, input_shape: tuple[int, ...]) -> tuple[int, ...]:
    """Compute the output shape of the layer.

    Args:
        input_shape: Shape of the input tensor.

    Returns:
        Shape of the output tensor.
    """
    return input_shape

πŸ”§ Parameters Deep Dive

dim_model (int)

  • Purpose: Dimensionality of the model
  • Range: 8 to 512+ (typically 32-128)
  • Impact: Determines the size of the feature space
  • Recommendation: Start with 32-64, scale based on data complexity

num_heads (int)

  • Purpose: Number of attention heads
  • Range: 1 to 16+ (typically 2-8)
  • Impact: More heads = more attention patterns
  • Recommendation: Start with 4-6, adjust based on data complexity

ff_units (int)

  • Purpose: Number of units in the feed-forward network
  • Range: 16 to 512+ (typically 64-256)
  • Impact: Larger values = more complex transformations
  • Recommendation: Start with 2x dim_model, scale as needed

dropout_rate (float)

  • Purpose: Dropout rate for regularization
  • Range: 0.0 to 0.5 (typically 0.1-0.2)
  • Impact: Higher values = more regularization
  • Recommendation: Start with 0.1, adjust based on overfitting
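
Putting the recommendations above together, a reasonable starting configuration derives ff_units from dim_model and keeps dropout modest. The concrete numbers are illustrative starting points, not prescriptions:

from kerasfactory.layers import TransformerBlock

dim_model = 64                # start with 32-64, scale with data complexity
num_heads = 4                 # 4-6 heads cover most applications
ff_units = 2 * dim_model      # feed-forward width of about 2x the model dimension
dropout_rate = 0.1            # move toward 0.2 if the model overfits

block = TransformerBlock(
    dim_model=dim_model,
    num_heads=num_heads,
    ff_units=ff_units,
    dropout_rate=dropout_rate,
)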

πŸ“ˆ Performance Characteristics

  • Speed: ⚑⚑⚑ Fast for small to medium models; self-attention cost grows quadratically with sequence length
  • Memory: πŸ’ΎπŸ’ΎπŸ’Ύ Moderate; attention score storage grows with the number of heads and quadratically with sequence length
  • Accuracy: 🎯🎯🎯🎯 Excellent for complex relationship modeling
  • Best For: Tabular data with complex feature interactions
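
As a rough back-of-envelope for the memory point above: the attention score tensor alone has num_heads * seq_len * seq_len entries per example. The helper below only computes that count; it is an estimate for one intermediate tensor, not a full memory profile:

def attention_score_entries(seq_len: int, num_heads: int = 4) -> int:
    # One (seq_len x seq_len) attention score matrix per head, per example
    return num_heads * seq_len * seq_len

for seq_len in (10, 100, 1000):
    print(seq_len, attention_score_entries(seq_len))
# 10 -> 400, 100 -> 40,000, 1000 -> 4,000,000 entries per example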

🎨 Examples

Example 1: Tabular Data Processing

import keras
from kerasfactory.layers import TransformerBlock

# Create a transformer for tabular data
def create_tabular_transformer():
    inputs = keras.Input(shape=(25, 32))  # 25 features, 32 dimensions

    # Transformer processing
    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(64, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation='relu')(x)

    # Output
    outputs = keras.layers.Dense(1, activation='sigmoid')(x)

    return keras.Model(inputs, outputs)

model = create_tabular_transformer()
model.compile(optimizer='adam', loss='binary_crossentropy')

# Test with sample data
sample_data = keras.random.normal((100, 25, 32))
predictions = model(sample_data)
print(f"Tabular transformer predictions shape: {predictions.shape}")

Example 2: Time Series Processing

import keras
from kerasfactory.layers import TransformerBlock

# Create a transformer for time series data
def create_time_series_transformer():
    inputs = keras.Input(shape=(30, 16))  # 30 time steps, 16 features

    # Multiple transformer blocks
    x = TransformerBlock(
        dim_model=16,
        num_heads=4,
        ff_units=32,
        dropout_rate=0.1
    )(inputs)

    x = TransformerBlock(
        dim_model=16,
        num_heads=4,
        ff_units=32,
        dropout_rate=0.1
    )(x)

    # Global pooling and final processing
    x = keras.layers.GlobalAveragePooling1D()(x)
    x = keras.layers.Dense(32, activation='relu')(x)
    x = keras.layers.Dropout(0.2)(x)

    # Multi-task output
    trend = keras.layers.Dense(1, name='trend')(x)
    seasonality = keras.layers.Dense(1, name='seasonality')(x)
    anomaly = keras.layers.Dense(1, activation='sigmoid', name='anomaly')(x)

    return keras.Model(inputs, [trend, seasonality, anomaly])

model = create_time_series_transformer()
model.compile(
    optimizer='adam',
    loss={'trend': 'mse', 'seasonality': 'mse', 'anomaly': 'binary_crossentropy'},
    loss_weights={'trend': 1.0, 'seasonality': 0.5, 'anomaly': 0.3}
)

Example 3: Inspecting a Transformer Model

import keras
from kerasfactory.layers import TransformerBlock

# Inspect a small transformer model: input/output shapes and parameter count
def inspect_transformer_model():
    # Create model with transformer
    inputs = keras.Input(shape=(10, 32))
    x = TransformerBlock(
        dim_model=32,
        num_heads=4,
        ff_units=64,
        dropout_rate=0.1
    )(inputs)
    outputs = keras.layers.Dense(1, activation='sigmoid')(x)

    model = keras.Model(inputs, outputs)

    # Test with sample data
    sample_data = keras.random.normal((5, 10, 32))
    predictions = model(sample_data)

    print("Attention Analysis:")
    print("=" * 40)
    print(f"Input shape: {sample_data.shape}")
    print(f"Output shape: {predictions.shape}")
    print(f"Model parameters: {model.count_params()}")

    return model

# Inspect the model
# model = inspect_transformer_model()

πŸ’‘ Tips & Best Practices

  • Model Dimension: Start with 32-64, scale based on data complexity
  • Attention Heads: Use 4-6 heads for most applications
  • Feed-Forward Units: Use 2x model dimension as starting point
  • Dropout Rate: Use 0.1-0.2 for regularization
  • Residual Connections: Built-in residual connections for gradient flow
  • Layer Normalization: Built-in layer normalization for stable training

⚠️ Common Pitfalls

  • Model Dimension: Must match input feature dimension
  • Attention Heads: Must divide the model dimension evenly (see the sanity check below)
  • Memory Usage: Attention memory grows quadratically with sequence length and linearly with the number of heads
  • Overfitting: Monitor for overfitting with complex models
  • Gradient Flow: Residual connections help but monitor training
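
The head-divisibility pitfall above can be caught before building a model with a small pre-check. This helper is illustrative and not part of the kerasfactory API:

def check_transformer_config(dim_model: int, num_heads: int) -> None:
    # Raise early if the configuration violates the head-divisibility requirement
    if dim_model % num_heads != 0:
        raise ValueError(
            f"dim_model ({dim_model}) should be divisible by num_heads ({num_heads}); "
            f"each head would otherwise get a fractional width of {dim_model / num_heads:.2f}."
        )

check_transformer_config(dim_model=64, num_heads=4)    # passes silently
# check_transformer_config(dim_model=64, num_heads=5)  # would raise ValueError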

πŸ“š Further Reading