# MultiHeadGraphFeaturePreprocessor
## 🎯 Overview
The MultiHeadGraphFeaturePreprocessor treats each feature as a node in a graph and applies multi-head self-attention to capture and aggregate complex interactions among features. It learns multiple relational views of the feature set, which can significantly boost performance on tabular data.
This makes it a sophisticated preprocessing step for tabular problems where complex feature relationships would otherwise require manual feature engineering.
## 🔍 How It Works

The MultiHeadGraphFeaturePreprocessor processes data through a multi-head, graph-based transformation:
- Feature Embedding: Projects each scalar input into an embedding
- Multi-Head Split: Splits the embedding into multiple heads
- Query-Key-Value: Computes queries, keys, and values for each head
- Scaled Dot-Product Attention: Calculates attention across feature dimension
- Head Concatenation: Concatenates head outputs
- Output Projection: Projects back to original dimension with residual connection
```mermaid
graph TD
    A[Input Features] --> B[Feature Embedding]
    B --> C[Multi-Head Split]
    C --> D[Query-Key-Value]
    D --> E[Scaled Dot-Product Attention]
    E --> F[Head Concatenation]
    F --> G[Output Projection]
    A --> H[Residual Connection]
    G --> H
    H --> I[Transformed Features]

    style A fill:#e6f3ff,stroke:#4a86e8
    style I fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style C fill:#f3e5f5,stroke:#9c27b0
    style D fill:#e1f5fe,stroke:#03a9f4
    style E fill:#fff3e0,stroke:#ff9800
```
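The shape bookkeeping behind these steps can be sketched with Keras ops. This is an illustrative sketch only, not the layer's actual source: it uses a random tensor for the embedding and sets Q = K = V purely to show the shapes (in the real layer, queries, keys, and values come from learned projections):

```python
import numpy as np
from keras import ops

batch, num_features, embed_dim, num_heads = 32, 10, 16, 4
depth = embed_dim // num_heads  # 4

# After the feature-embedding step: (batch, num_features, embed_dim)
emb = np.random.rand(batch, num_features, embed_dim).astype("float32")

# Multi-head split: (batch, num_heads, num_features, depth)
heads = ops.transpose(
    ops.reshape(emb, (batch, num_features, num_heads, depth)), (0, 2, 1, 3)
)

# Scaled dot-product attention across the feature dimension
scores = ops.matmul(heads, ops.transpose(heads, (0, 1, 3, 2))) / np.sqrt(depth)
weights = ops.softmax(scores, axis=-1)   # (batch, heads, features, features)
attended = ops.matmul(weights, heads)    # (batch, heads, features, depth)

# Head concatenation back to (batch, num_features, embed_dim);
# the output projection and residual connection follow in the real layer
concat = ops.reshape(
    ops.transpose(attended, (0, 2, 1, 3)), (batch, num_features, embed_dim)
)
print(concat.shape)  # (32, 10, 16)
```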
## 💡 Why Use This Layer?
| Challenge | Traditional Approach | MultiHeadGraphFeaturePreprocessor's Solution |
|---|---|---|
| Feature Interactions | Manual feature engineering | 🎯 Automatic learning of complex feature interactions |
| Multiple Views | Single perspective | ⚡ Multi-head attention for multiple relational views |
| Graph Structure | No graph structure | 🧠 Graph-based feature preprocessing |
| Complex Relationships | Limited relationship modeling | 📈 Sophisticated relationship learning |
## 📋 Use Cases
- Tabular Data: Complex feature relationship preprocessing
- Graph Neural Networks: Graph-based preprocessing for tabular data
- Feature Engineering: Automatic feature interaction learning
- Multi-Head Attention: Multiple relational views of features
- Complex Patterns: Capturing complex feature relationships
## 🚀 Quick Start

### Basic Usage
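The original snippet was lost in rendering; a minimal sketch of basic usage, assuming the layer is importable from `kerasfactory.layers`:

```python
import numpy as np
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

# A batch of 32 samples with 10 scalar features each
x = np.random.rand(32, 10).astype("float32")

layer = MultiHeadGraphFeaturePreprocessor(embed_dim=16, num_heads=4, dropout_rate=0.1)
y = layer(x)

print(y.shape)  # (32, 10) -- output shape matches the input
```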
### In a Sequential Model
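A hedged sketch of Sequential usage (layer sizes are illustrative):

```python
import keras
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

model = keras.Sequential([
    keras.Input(shape=(10,)),
    MultiHeadGraphFeaturePreprocessor(embed_dim=32, num_heads=4),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```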
### In a Functional Model
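A hedged sketch of functional-API usage (architecture choices are illustrative):

```python
import keras
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

inputs = keras.Input(shape=(10,), name="features")
x = MultiHeadGraphFeaturePreprocessor(embed_dim=32, num_heads=8, dropout_rate=0.1)(inputs)
x = keras.layers.Dense(64, activation="relu")(x)
x = keras.layers.Dense(32, activation="relu")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```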
### Advanced Configuration
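A sketch of a fuller configuration using all documented constructor parameters (the layer name and surrounding architecture are illustrative):

```python
import keras
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

# Wider embeddings and more heads for a dataset with many interacting features
preprocessor = MultiHeadGraphFeaturePreprocessor(
    embed_dim=64,        # must be divisible by num_heads
    num_heads=8,
    dropout_rate=0.2,    # regularizes the attention weights
    name="graph_feature_preprocessor",
)

inputs = keras.Input(shape=(20,))
x = preprocessor(inputs)
x = keras.layers.Dense(128, activation="relu")(x)
x = keras.layers.Dropout(0.3)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```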
## 📖 API Reference
### kerasfactory.layers.MultiHeadGraphFeaturePreprocessor
This module implements a MultiHeadGraphFeaturePreprocessor layer that treats features as nodes in a graph and learns multiple "views" (heads) of the feature interactions via self-attention. This approach is useful for tabular data where complex feature relationships need to be captured.
Classes
MultiHeadGraphFeaturePreprocessor
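The rendered signature block was lost in extraction; a reconstruction from the parameter table below:

```python
MultiHeadGraphFeaturePreprocessor(
    embed_dim: int = 16,
    num_heads: int = 4,
    dropout_rate: float = 0.0,
    name: str | None = None,
    **kwargs: Any,
)
```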
Multi-head graph-based feature preprocessor for tabular data.
This layer treats each feature as a node and applies multi-head self-attention to capture and aggregate complex interactions among features. The process is:
- Project each scalar input into an embedding of dimension `embed_dim`.
- Split the embedding into `num_heads` heads.
- For each head, compute queries, keys, and values and calculate scaled dot-product attention across the feature dimension.
- Concatenate the head outputs, project back to the original feature dimension, and add a residual connection.
This mechanism allows the network to learn multiple relational views among features, which can significantly boost performance on tabular data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embed_dim` | `int` | Dimension of the feature embeddings. | `16` |
| `num_heads` | `int` | Number of attention heads. | `4` |
| `dropout_rate` | `float` | Dropout rate applied to attention weights. | `0.0` |
| `name` | `str \| None` | Optional name for the layer. | `None` |
**Input shape:** 2D tensor with shape `(batch_size, num_features)`

**Output shape:** 2D tensor with shape `(batch_size, num_features)` (same as input)
Example
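The docstring example did not survive extraction; a plausible equivalent, assuming the layer is importable from `kerasfactory.layers`:

```python
import numpy as np
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

# 8 features per sample; output keeps the same (batch, features) shape
data = np.random.rand(64, 8).astype("float32")
layer = MultiHeadGraphFeaturePreprocessor(embed_dim=16, num_heads=4)
output = layer(data)
print(output.shape)  # (64, 8)
```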
Initialize the MultiHeadGraphFeaturePreprocessor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embed_dim` | `int` | Embedding dimension. | `16` |
| `num_heads` | `int` | Number of attention heads. | `4` |
| `dropout_rate` | `float` | Dropout rate. | `0.0` |
| `name` | `str \| None` | Name of the layer. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |
Source code: `kerasfactory/layers/MultiHeadGraphFeaturePreprocessor.py`, lines 60–97
Functions
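The rendered signature block was lost; a plausible reconstruction from the tables below (the exact method name is an assumption):

```python
def split_heads(self, x: KerasTensor, batch_size: KerasTensor) -> KerasTensor: ...
```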
Split the last dimension into (num_heads, depth) and transpose.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `KerasTensor` | Input tensor with shape `(batch_size, num_features, embed_dim)`. | *required* |
| `batch_size` | `KerasTensor` | Batch size tensor. | *required* |

Returns:

| Type | Description |
|---|---|
| `KerasTensor` | Tensor with shape `(batch_size, num_heads, num_features, depth)`. |
Source code: `kerasfactory/layers/MultiHeadGraphFeaturePreprocessor.py`, lines 145–162
## 🔧 Parameters Deep Dive
### `embed_dim` (int)
- Purpose: Dimension of the feature embeddings
- Range: 8 to 128+ (typically 16-64)
- Impact: Larger values = more expressive embeddings but more parameters
- Recommendation: Start with 16-32, scale based on data complexity
### `num_heads` (int)
- Purpose: Number of attention heads
- Range: 1 to 16+ (typically 4-8)
- Impact: More heads = more diverse attention patterns
- Recommendation: Use 4-8 heads for most applications
### `dropout_rate` (float)
- Purpose: Dropout rate applied to attention weights
- Range: 0.0 to 0.5 (typically 0.1-0.2)
- Impact: Higher values = more regularization
- Recommendation: Use 0.1-0.2 for regularization
## 📊 Performance Characteristics
- Speed: ⚡⚡⚡ Fast for small to medium models; scales with the number of heads and features
- Memory: 💾💾💾 Moderate memory usage due to multi-head attention
- Accuracy: 🎯🎯🎯🎯 Excellent for complex feature relationship learning
- Best For: Tabular data with complex feature relationships
## 🎨 Examples
### Example 1: Complex Feature Relationships
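The example code was lost in rendering; a hedged end-to-end sketch on synthetic data whose target is driven by multiplicative feature interactions:

```python
import keras
import numpy as np
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

# Synthetic target driven by pairwise feature interactions
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 8)).astype("float32")
y = ((X[:, 0] * X[:, 1] + X[:, 2] * X[:, 3]) > 0).astype("float32")

inputs = keras.Input(shape=(8,))
h = MultiHeadGraphFeaturePreprocessor(embed_dim=32, num_heads=4, dropout_rate=0.1)(inputs)
h = keras.layers.Dense(32, activation="relu")(h)
outputs = keras.layers.Dense(1, activation="sigmoid")(h)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.2, verbose=0)

_, acc = model.evaluate(X, y, verbose=0)
print(f"accuracy: {acc:.3f}")
```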
### Example 2: Multi-Head Analysis
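A sketch comparing head counts on the same task, to see how multiple relational views affect validation accuracy (training budget and data are illustrative):

```python
import keras
import numpy as np
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8)).astype("float32")
y = ((X[:, 0] * X[:, 1] - X[:, 4] * X[:, 5]) > 0).astype("float32")

def build(num_heads: int) -> keras.Model:
    inputs = keras.Input(shape=(8,))
    h = MultiHeadGraphFeaturePreprocessor(embed_dim=32, num_heads=num_heads)(inputs)
    h = keras.layers.Dense(32, activation="relu")(h)
    outputs = keras.layers.Dense(1, activation="sigmoid")(h)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

for num_heads in (1, 2, 4, 8):  # embed_dim=32 is divisible by each
    model = build(num_heads)
    history = model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2, verbose=0)
    print(f"heads={num_heads}: val_acc={history.history['val_accuracy'][-1]:.3f}")
```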
### Example 3: Attention Head Analysis
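The layer does not document a public accessor for its attention weights, so this sketch probes the learned coupling indirectly: perturb one input feature at a time and measure how much every output feature shifts. Apply it to a trained layer in practice (the one below is randomly initialized, purely for illustration):

```python
import numpy as np
from keras import ops
from kerasfactory.layers import MultiHeadGraphFeaturePreprocessor

layer = MultiHeadGraphFeaturePreprocessor(embed_dim=16, num_heads=4)

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 6)).astype("float32")
baseline = ops.convert_to_numpy(layer(X))

# sensitivity[i, j]: mean shift in output feature j when input feature i is perturbed
sensitivity = np.zeros((6, 6))
for i in range(6):
    X_pert = X.copy()
    X_pert[:, i] += 0.5
    shifted = ops.convert_to_numpy(layer(X_pert))
    sensitivity[i] = np.abs(shifted - baseline).mean(axis=0)

print(np.round(sensitivity, 3))  # off-diagonal mass = cross-feature interaction
```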
## 💡 Tips & Best Practices
- Embedding Dimension: Start with 16-32, scale based on data complexity
- Number of Heads: Use 4-8 heads for most applications
- Dropout Rate: Use 0.1-0.2 for regularization
- Feature Relationships: Works best when features have complex relationships
- Residual Connections: Built-in residual connections for gradient flow
- Attention Patterns: Monitor attention patterns for interpretability
## ⚠️ Common Pitfalls
- Embedding Dimension: `embed_dim` must be divisible by `num_heads` (see the sanity check below)
- Number of Heads: `num_heads` must be a positive integer
- Dropout Rate: `dropout_rate` must be between 0 and 1
- Memory Usage: Scales with the number of heads and features
- Overfitting: Monitor for overfitting with complex configurations
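A quick sanity check for the first three pitfalls before constructing the layer:

```python
embed_dim, num_heads, dropout_rate = 32, 4, 0.1

assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
assert num_heads > 0, "num_heads must be a positive integer"
assert 0.0 <= dropout_rate <= 1.0, "dropout_rate must be between 0 and 1"
```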
## 🔗 Related Layers
- AdvancedGraphFeature - Advanced graph feature layer
- GraphFeatureAggregation - Graph feature aggregation
- TabularAttention - Tabular attention mechanisms
- VariableSelection - Variable selection
## 📚 Further Reading
- Multi-Head Attention - Multi-head attention mechanism
- Graph Neural Networks - Graph neural network concepts
- Feature Relationships - Feature relationship concepts
- KerasFactory Layer Explorer - Browse all available layers
- Feature Engineering Tutorial - Complete guide to feature engineering