# TabularMoELayer
## Overview
The TabularMoELayer implements a Mixture-of-Experts (MoE) architecture for tabular data: input features are routed through multiple expert sub-networks, and their outputs are aggregated via a learnable gating mechanism. Each expert is a small MLP.

Because each expert can specialize in a different set of feature patterns, the layer is well suited to complex tabular datasets with diverse feature types and interactions.
## How It Works
The TabularMoELayer processes data through a mixture-of-experts architecture:
- Expert Networks: Creates multiple expert MLPs for different feature patterns
- Gating Mechanism: Learns to weight expert contributions based on input
- Expert Processing: Each expert processes the input independently
- Weighted Aggregation: Combines expert outputs using learned weights
- Output Generation: Produces final aggregated output
```mermaid
graph TD
    A[Input Features] --> B[Gating Network]
    A --> C1[Expert 1]
    A --> C2[Expert 2]
    A --> C3[Expert N]
    B --> D[Gating Weights]
    C1 --> E1[Expert 1 Output]
    C2 --> E2[Expert 2 Output]
    C3 --> E3[Expert N Output]
    D --> F[Weighted Aggregation]
    E1 --> F
    E2 --> F
    E3 --> F
    F --> G[Final Output]

    style A fill:#e6f3ff,stroke:#4a86e8
    style G fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style C1 fill:#f3e5f5,stroke:#9c27b0
    style C2 fill:#f3e5f5,stroke:#9c27b0
    style C3 fill:#f3e5f5,stroke:#9c27b0
    style F fill:#e1f5fe,stroke:#03a9f4
```
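Conceptually, the layer computes a softmax gate over the experts and mixes their outputs: output = Σ_i g_i(x) · E_i(x). The sketch below illustrates that gating-and-aggregation step in plain Keras 3; it is an illustration of the idea, not the layer's internal implementation:

```python
import keras
from keras import layers, ops

# Illustrative MoE forward pass on a batch of tabular features (not kerasfactory internals).
num_experts, expert_units, num_features = 4, 16, 8

experts = [
    keras.Sequential(
        [layers.Dense(expert_units, activation="relu"), layers.Dense(num_features)]
    )
    for _ in range(num_experts)
]
gate = layers.Dense(num_experts, activation="softmax")  # learns one weight per expert

x = keras.random.normal((32, num_features))                            # batch of 32 rows
gate_weights = gate(x)                                                  # (32, num_experts)
expert_outputs = ops.stack([expert(x) for expert in experts], axis=1)   # (32, num_experts, num_features)

# Weighted aggregation: output = sum_i g_i(x) * E_i(x)
output = ops.sum(expert_outputs * ops.expand_dims(gate_weights, -1), axis=1)  # (32, num_features)
```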
## Why Use This Layer?
| Challenge | Traditional Approach | TabularMoELayer's Solution |
|---|---|---|
| Feature Diversity | Single model for all features | Multiple experts specialize in different patterns |
| Complex Patterns | Limited pattern recognition | Specialized experts for different feature types |
| Ensemble Learning | Separate ensemble models | Integrated ensemble with learned weighting |
| Scalability | Fixed model capacity | Scalable capacity with more experts |
## Use Cases
- Complex Tabular Data: Datasets with diverse feature types
- Feature Specialization: Different experts for different feature patterns
- Ensemble Learning: Integrated ensemble with learned weighting
- Scalable Models: Models that can scale with more experts
- Pattern Recognition: Complex pattern recognition in tabular data
## Quick Start
### Basic Usage
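A minimal sketch of standalone use, assuming the layer is importable as `from kerasfactory.layers import TabularMoELayer` and using the documented defaults:

```python
import numpy as np
from kerasfactory.layers import TabularMoELayer  # import path assumed

# 32 samples with 10 numeric features
x = np.random.rand(32, 10).astype("float32")

# 4 experts with 16 hidden units each (the documented defaults)
moe = TabularMoELayer(num_experts=4, expert_units=16)
y = moe(x)

print(y.shape)  # (32, 10) -- the output shape matches the input shape
```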
### In a Sequential Model
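A sketch of the layer inside `keras.Sequential` (same import assumption as above):

```python
import keras
from keras import layers
from kerasfactory.layers import TabularMoELayer  # import path assumed

model = keras.Sequential([
    keras.Input(shape=(10,)),
    TabularMoELayer(num_experts=4, expert_units=16),  # MoE feature transformation
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```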
### In a Functional Model
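The same idea with the Keras functional API (same import assumption):

```python
import keras
from keras import layers
from kerasfactory.layers import TabularMoELayer  # import path assumed

inputs = keras.Input(shape=(10,), name="features")
x = TabularMoELayer(num_experts=4, expert_units=16)(inputs)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```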
### Advanced Configuration
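One possible advanced configuration: wider experts, two stacked MoE blocks, and dropout in between. Layer names and hyperparameters here are illustrative, not prescribed by the library:

```python
import keras
from keras import layers
from kerasfactory.layers import TabularMoELayer  # import path assumed

inputs = keras.Input(shape=(20,))

# Wider MoE blocks: more experts and larger expert MLPs for a complex dataset
x = TabularMoELayer(num_experts=8, expert_units=64, name="moe_1")(inputs)
x = layers.Dropout(0.2)(x)  # regularization between MoE blocks
x = TabularMoELayer(num_experts=8, expert_units=64, name="moe_2")(x)
x = layers.Dropout(0.2)(x)

x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(3, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```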
## API Reference
### kerasfactory.layers.TabularMoELayer
This module implements a TabularMoELayer (Mixture-of-Experts) that routes input features through multiple expert sub-networks and aggregates their outputs via a learnable gating mechanism. This approach is useful for tabular data where different experts can specialize in different feature patterns.
#### Classes

##### TabularMoELayer
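Reconstructed from the parameter tables below, the constructor signature reads roughly as:

```python
TabularMoELayer(
    num_experts: int = 4,
    expert_units: int = 16,
    name: str | None = None,
    **kwargs,
)
```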
Mixture-of-Experts layer for tabular data.
This layer routes input features through multiple expert sub-networks and aggregates their outputs via a learnable gating mechanism. Each expert is a small MLP, and the gate learns to weight their contributions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_experts | int | Number of expert networks. Default is 4. | 4 |
| expert_units | int | Number of hidden units in each expert network. Default is 16. | 16 |
| name | str \| None | Optional name for the layer. | None |
Input shape
2D tensor with shape: (batch_size, num_features)
Output shape
2D tensor with shape: (batch_size, num_features) (same as input)
Example
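A minimal sketch of the shape contract above (import path assumed):

```python
import numpy as np
from kerasfactory.layers import TabularMoELayer  # import path assumed

x = np.random.rand(8, 5).astype("float32")
layer = TabularMoELayer(num_experts=4, expert_units=16)
y = layer(x)
print(y.shape)  # (8, 5) -- same shape as the input
```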
Initialize the TabularMoELayer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_experts | int | Number of expert networks. | 4 |
| expert_units | int | Number of units in each expert. | 16 |
| name | str \| None | Name of the layer. | None |
| **kwargs | Any | Additional keyword arguments. | {} |
Source code in kerasfactory/layers/TabularMoELayer.py
## Parameters Deep Dive
### num_experts (int)
- Purpose: Number of expert networks
- Range: 2 to 20+ (typically 4-8)
- Impact: More experts = more specialization but more parameters
- Recommendation: Start with 4-6, scale based on data complexity
### expert_units (int)
- Purpose: Number of hidden units in each expert network
- Range: 8 to 128+ (typically 16-64)
- Impact: Larger values = more complex expert transformations
- Recommendation: Start with 16-32, scale based on data complexity
## Performance Characteristics
- Speed: Fast for small to medium models; cost scales with the number of experts
- Memory: Moderate memory usage due to multiple experts
- Accuracy: Excellent for complex pattern recognition
- Best For: Tabular data with diverse feature patterns
## Examples
### Example 1: Feature Specialization
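The sketch below trains a small model on synthetic data whose feature groups behave very differently, the kind of setup where experts can specialize; the dataset and hyperparameters are illustrative:

```python
import keras
import numpy as np
from keras import layers
from kerasfactory.layers import TabularMoELayer  # import path assumed

# Synthetic dataset with two dissimilar feature groups:
# columns 0-4 carry a linear signal, columns 5-9 carry an interaction signal.
rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 10)).astype("float32")
y = ((x[:, :5].sum(axis=1) + x[:, 5] * x[:, 6]) > 0).astype("float32")

inputs = keras.Input(shape=(10,))
h = TabularMoELayer(num_experts=4, expert_units=32)(inputs)
h = layers.Dense(16, activation="relu")(h)
outputs = layers.Dense(1, activation="sigmoid")(h)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```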
### Example 2: Expert Analysis
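The gate and experts are internal sub-layers whose attribute names are not documented here, so this sketch inspects the layer's trainable weights generically rather than assuming internals:

```python
import numpy as np
from kerasfactory.layers import TabularMoELayer  # import path assumed

moe = TabularMoELayer(num_experts=4, expert_units=16)
_ = moe(np.random.rand(16, 10).astype("float32"))  # call once so the layer builds its weights

# List every trainable variable the layer created (gate + experts).
for w in moe.trainable_weights:
    print(w.name, tuple(w.shape))

# Parameter count grows roughly linearly with num_experts.
total_params = sum(int(np.prod(w.shape)) for w in moe.trainable_weights)
print("total trainable parameters:", total_params)
```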
### Example 3: Scalable MoE Architecture
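One way to scale capacity is to stack MoE blocks and widen them, as in this sketch; the helper function and its defaults are illustrative:

```python
import keras
from keras import layers
from kerasfactory.layers import TabularMoELayer  # import path assumed


def build_moe_model(num_features: int, num_experts: int, expert_units: int, depth: int = 2) -> keras.Model:
    """Stack `depth` MoE blocks; capacity scales with num_experts and expert_units."""
    inputs = keras.Input(shape=(num_features,))
    x = inputs
    for i in range(depth):
        x = TabularMoELayer(num_experts=num_experts, expert_units=expert_units, name=f"moe_{i}")(x)
        x = layers.Dropout(0.1)(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)


# Scale capacity by adding experts and widening them, not only by deepening the dense head.
small = build_moe_model(num_features=10, num_experts=4, expert_units=16)
large = build_moe_model(num_features=10, num_experts=12, expert_units=64)
print(small.count_params(), large.count_params())
```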
## Tips & Best Practices
- Number of Experts: Start with 4-6 experts, scale based on data complexity
- Expert Units: Use 16-32 units per expert for most applications
- Gating Mechanism: The layer automatically learns expert weighting
- Specialization: Different experts will specialize in different patterns
- Scalability: Can scale by adding more experts
- Regularization: Consider adding dropout between MoE layers
## Common Pitfalls
- Number of Experts: Must be a positive integer
- Expert Units: Must be a positive integer
- Memory Usage: Scales with the number of experts and units
- Overfitting: Too many experts can overfit on small datasets
- Expert Utilization: Some experts may receive little gating weight and go underused
## Related Layers
- SparseAttentionWeighting - Sparse attention weighting
- GatedFeatureFusion - Gated feature fusion
- VariableSelection - Variable selection
- TransformerBlock - Transformer processing
## Further Reading
- Mixture of Experts - MoE concepts
- Gating Networks - Gating mechanism paper
- Ensemble Learning - Ensemble learning concepts
- KerasFactory Layer Explorer - Browse all available layers
- Feature Engineering Tutorial - Complete guide to feature engineering