# DistributionAwareEncoder
## 🎯 Overview
The DistributionAwareEncoder automatically detects the distribution type of input data and applies appropriate transformations and encodings. It builds upon the DistributionTransformLayer but adds sophisticated distribution detection and specialized encoding for different distribution types.
This layer is particularly powerful for preprocessing data where the distribution characteristics are unknown or vary across features, providing intelligent adaptation to different data patterns.
## 🔍 How It Works
The DistributionAwareEncoder processes data in five steps:
- Distribution Detection: Analyzes input data to identify distribution type
- Transformation Selection: Chooses optimal transformation based on detected distribution
- Specialized Encoding: Applies distribution-specific encoding strategies
- Embedding Generation: Creates rich embeddings with optional distribution information
- Output Generation: Produces encoded features optimized for the detected distribution
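The detection step above can be sketched with a toy moment-based heuristic in NumPy. This is purely illustrative; the layer's actual detector may use different statistics and thresholds:

```python
import numpy as np

def detect_distribution(x: np.ndarray) -> str:
    """Toy moment-based distribution detector (illustrative only)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    mean, std = x.mean(), x.std()
    if std == 0:
        return "unknown"
    z = (x - mean) / std
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0  # excess kurtosis
    if x.min() >= 0 and skew > 1.5:
        return "lognormal"  # strongly right-skewed with positive support
    if kurt > 3.0:
        return "heavy_tailed"
    if abs(skew) < 0.5 and abs(kurt) < 0.5:
        return "normal"
    return "unknown"

rng = np.random.default_rng(0)
print(detect_distribution(rng.normal(size=10_000)))     # typically "normal"
print(detect_distribution(rng.lognormal(size=10_000)))  # typically "lognormal"
```

In the real layer, the selected distribution type then drives the choice of transformation and encoding.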
```mermaid
graph TD
    A[Input Features] --> B[Distribution Detection]
    B --> C{Distribution Type}
    C -->|Normal| D[Normal Encoding]
    C -->|Exponential| E[Exponential Encoding]
    C -->|LogNormal| F[LogNormal Encoding]
    C -->|Uniform| G[Uniform Encoding]
    C -->|Beta| H[Beta Encoding]
    C -->|Bimodal| I[Bimodal Encoding]
    C -->|Heavy Tailed| J[Heavy Tailed Encoding]
    C -->|Mixed| K[Mixed Encoding]
    C -->|Unknown| L[Generic Encoding]
    D --> M[Transformation Layer]
    E --> M
    F --> M
    G --> M
    H --> M
    I --> M
    J --> M
    K --> M
    L --> M
    M --> N[Distribution Embedding]
    N --> O[Final Encoded Features]
    style A fill:#e6f3ff,stroke:#4a86e8
    style O fill:#e8f5e9,stroke:#66bb6a
    style B fill:#fff9e6,stroke:#ffb74d
    style C fill:#f3e5f5,stroke:#9c27b0
```
## 💡 Why Use This Layer?
| Challenge | Traditional Approach | DistributionAwareEncoder's Solution |
|---|---|---|
| Unknown Distributions | One-size-fits-all preprocessing | 🎯 Automatic detection and adaptation to distribution type |
| Mixed Data Types | Uniform processing for all features | ⚡ Specialized encoding for different distribution types |
| Distribution Changes | Static preprocessing strategies | 🔧 Adaptive encoding that adjusts to data characteristics |
| Feature Engineering | Manual distribution analysis | 📊 Automated preprocessing with learned distribution awareness |
## 📋 Use Cases
- Mixed Distribution Data: Datasets with features following different distributions
- Unknown Data Characteristics: When distribution types are not known in advance
- Adaptive Preprocessing: Systems that need to adapt to changing data patterns
- Feature Engineering: Automated creation of distribution-aware features
- Data Quality: Handling datasets with varying distribution quality
## 🚀 Quick Start

### Basic Usage
### Automatic Detection
### Manual Distribution Type
### In a Sequential Model
### In a Functional Model
### Advanced Configuration
## 📖 API Reference

### kerasfactory.layers.DistributionAwareEncoder
This module implements a DistributionAwareEncoder layer that automatically detects the distribution type of input data and applies appropriate transformations and encodings. It builds upon the DistributionTransformLayer but adds more sophisticated distribution detection and specialized encoding for different distribution types.
### Classes

#### DistributionAwareEncoder
Layer that automatically detects and encodes data based on its distribution.
This layer first detects the distribution type of the input data and then applies appropriate transformations and encodings. It builds upon the DistributionTransformLayer but adds more sophisticated distribution detection and specialized encoding for different distribution types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_dim` | `int \| None` | Dimension of the output embedding. If `None`, the output has the same dimension as the input. | `None` |
| `auto_detect` | `bool` | Whether to automatically detect the distribution type. If `False`, the layer uses the specified `distribution_type`. | `True` |
| `distribution_type` | `str` | The distribution type to use if `auto_detect` is `False`. Options are `"normal"`, `"exponential"`, `"lognormal"`, `"uniform"`, `"beta"`, `"bimodal"`, `"heavy_tailed"`, `"mixed"`, `"bounded"`, `"unknown"`. | `'unknown'` |
| `transform_type` | `str` | The transformation type to use. If `"auto"`, the layer automatically selects the best transformation for the detected distribution. See `DistributionTransformLayer` for available options. | `'auto'` |
| `add_distribution_embedding` | `bool` | Whether to add a learned embedding of the distribution type to the output. | `False` |
| `name` | `str \| None` | Optional name for the layer. | `None` |
#### Input shape

N-D tensor with shape: `(batch_size, ..., features)`.

#### Output shape

- If `embedding_dim` is `None`: same shape as the input, `(batch_size, ..., features)`.
- If `embedding_dim` is specified: `(batch_size, ..., embedding_dim)`.
- If `add_distribution_embedding` is `True`, the output has an additional dimension for the distribution embedding.
#### Example
Initialize the DistributionAwareEncoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_dim` | `int \| None` | Embedding dimension. | `None` |
| `auto_detect` | `bool` | Whether to auto-detect distribution type. | `True` |
| `distribution_type` | `str` | Type of distribution. | `'unknown'` |
| `transform_type` | `str` | Type of transformation to apply. | `'auto'` |
| `add_distribution_embedding` | `bool` | Whether to add distribution embedding. | `False` |
| `name` | `str \| None` | Name of the layer. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |
Source code in `kerasfactory/layers/DistributionAwareEncoder.py`
## 🔧 Parameters Deep Dive
### `embedding_dim` (int, optional)
- Purpose: Dimension of the output embedding
- Range: 8 to 256+ (typically 16-64)
- Impact: Higher values = richer representations but more parameters
- Recommendation: Start with 16-32, scale based on data complexity
### `auto_detect` (bool)
- Purpose: Whether to automatically detect distribution type
- Default: True
- Impact: Enables intelligent distribution detection
- Recommendation: Use True for unknown data, False for known distributions
### `distribution_type` (str)
- Purpose: Distribution type to use if auto_detect is False
- Options: "normal", "exponential", "lognormal", "uniform", "beta", "bimodal", "heavy_tailed", "mixed", "bounded", "unknown"
- Default: "unknown"
- Impact: Determines encoding strategy
- Recommendation: Use specific type when you know the distribution
### `add_distribution_embedding` (bool)
- Purpose: Whether to add learned distribution type embedding
- Default: False
- Impact: Includes distribution information in output
- Recommendation: Use True for complex models that benefit from distribution awareness
## 📊 Performance Characteristics
- Speed: ⚡⚡⚡ Fast for small to medium datasets, scales with embedding_dim
- Memory: 💾💾💾 Moderate memory usage due to distribution detection and encoding
- Accuracy: 🎯🎯🎯🎯 Excellent for mixed-distribution data
- Best For: Tabular data with unknown or mixed distribution types
## 🎨 Examples

### Example 1: Mixed Distribution Data
### Example 2: Time Series with Varying Distributions
### Example 3: Multi-Modal Data Processing
## 💡 Tips & Best Practices
- Auto Detection: Use auto_detect=True for unknown data distributions
- Distribution Embedding: Enable add_distribution_embedding for complex models
- Feature Preprocessing: Ensure features are properly scaled before encoding
- Embedding Dimension: Start with 16-32, scale based on data complexity
- Monitoring: Track distribution detection accuracy during training
- Data Quality: Works best with clean, well-preprocessed data
## ⚠️ Common Pitfalls
- Input Shape: designed for 2D tabular input (batch_size, num_features) in the typical case
- Distribution Detection: May not work well with very small datasets
- Memory Usage: Scales with embedding_dim and distribution complexity
- Overfitting: Can overfit on small datasets - use regularization
- Distribution Changes: May need retraining if data distribution changes significantly
## 🔗 Related Layers
- DistributionTransformLayer - Distribution transformation
- AdvancedNumericalEmbedding - Advanced numerical embeddings
- DifferentiableTabularPreprocessor - End-to-end preprocessing
- CastToFloat32Layer - Type casting utility
## 📚 Further Reading
- Distribution Detection in Machine Learning - Distribution testing concepts
- Feature Encoding Techniques - Feature encoding approaches
- Adaptive Preprocessing - Adaptive data preprocessing
- KerasFactory Layer Explorer - Browse all available layers
- Data Preprocessing Tutorial - Complete guide to data preprocessing