CategoricalAnomalyDetectionLayer
Overview
The CategoricalAnomalyDetectionLayer identifies outliers in categorical features by learning the distribution of categorical values and detecting rare or unusual combinations. It uses embedding-based approaches and frequency analysis to detect anomalies in categorical data.
It offers a specialized approach for non-numerical features, which traditional statistical outlier methods often do not handle well.
How It Works
The CategoricalAnomalyDetectionLayer processes categorical inputs in the following stages:
- Categorical Encoding: Encodes categorical features into embeddings
- Frequency Analysis: Analyzes frequency of categorical values
- Rarity Detection: Identifies rare or unusual categorical combinations
- Embedding Learning: Learns embeddings for categorical values
- Anomaly Scoring: Computes anomaly scores based on rarity and embeddings
- Output Generation: Produces anomaly scores for each categorical feature
graph TD
A[Categorical Features] --> B[Categorical Encoding]
B --> C[Frequency Analysis]
C --> D[Rarity Detection]
B --> E[Embedding Learning]
E --> F[Embedding Analysis]
F --> G[Anomaly Scoring]
D --> G
G --> H[Anomaly Scores]
style A fill:#e6f3ff,stroke:#4a86e8
style H fill:#e8f5e9,stroke:#66bb6a
style B fill:#fff9e6,stroke:#ffb74d
style C fill:#f3e5f5,stroke:#9c27b0
style E fill:#e1f5fe,stroke:#03a9f4
style G fill:#fff3e0,stroke:#ff9800
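The frequency-analysis and rarity-detection stages above can be illustrated with a small standalone sketch. It is not the layer's implementation: the function name `rarity_scores` and the scoring rule (one minus relative frequency, with unseen values scored 1.0) are assumptions chosen for illustration.

```python
from collections import Counter


def rarity_scores(train_values, new_values):
    """Score each new value by how rare it was in the training data.

    Assumed rule (illustrative only): score = 1 - relative frequency,
    so unseen values get 1.0 and the most common values score lowest.
    """
    if not train_values:
        raise ValueError("train_values must not be empty")
    counts = Counter(train_values)
    total = len(train_values)
    return [1.0 - counts.get(v, 0) / total for v in new_values]


# Made-up training sample: "red" is common, "blue" is rare.
train = ["red"] * 6 + ["green"] * 3 + ["blue"]
scores = rarity_scores(train, ["red", "blue", "purple"])
print(scores)
```

In this sketch the unseen value `purple` receives the highest score, and `blue` scores higher than `red` because it appeared less often in the training sample.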
Why Use This Layer?
| Challenge | Traditional Approach | CategoricalAnomalyDetectionLayer's Solution |
|---|---|---|
| Categorical Outliers | Limited methods | Specialized approach for categorical data |
| Rarity Detection | Manual frequency analysis | Automatic rarity detection |
| Embedding Learning | No embedding learning | Embedding-based anomaly detection |
| Frequency Analysis | Static frequency analysis | Dynamic frequency analysis |
Use Cases
- Categorical Outlier Detection: Identifying outliers in categorical features
- Data Quality: Ensuring data quality through categorical anomaly detection
- Rarity Analysis: Analyzing rare categorical combinations
- Embedding Learning: Learning embeddings for categorical values
- Frequency Analysis: Analyzing frequency of categorical values
Quick Start
Basic Usage
In a Sequential Model
In a Functional Model
Advanced Configuration
API Reference
kerasfactory.layers.CategoricalAnomalyDetectionLayer
Classes
CategoricalAnomalyDetectionLayer
Backend-agnostic anomaly detection for categorical features.
This layer detects anomalies in categorical features by checking if values belong to a predefined set of valid categories. Values not in this set are considered anomalous.
The layer uses a Keras StringLookup or IntegerLookup layer internally to efficiently map input values to indices, which are then used to determine if a value is valid.
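The membership check described above can be sketched in plain Python. This is a simplified stand-in, not the layer's source: the class name `VocabularyChecker` and its methods are hypothetical, and a dictionary takes the place of the Keras lookup layer. Index 0 is reserved for out-of-vocabulary values, which mirrors how `StringLookup` and `IntegerLookup` behave with their default arguments.

```python
class VocabularyChecker:
    """Hypothetical stand-in for a lookup-based membership check."""

    OOV_INDEX = 0  # index reserved for values outside the vocabulary

    def __init__(self, dtype="string"):
        # Mirrors the documented constraint on dtype.
        if dtype not in ("string", "int32"):
            raise ValueError("dtype must be 'string' or 'int32'")
        self.dtype = dtype
        self.index_of = {}

    def set_vocabulary(self, vocabulary):
        # Valid values get indices 1..N; 0 stays reserved for unknowns.
        self.index_of = {v: i for i, v in enumerate(vocabulary, start=1)}

    def is_anomalous(self, values):
        # A value is anomalous when its lookup falls back to OOV_INDEX.
        return [
            self.index_of.get(v, self.OOV_INDEX) == self.OOV_INDEX
            for v in values
        ]


checker = VocabularyChecker(dtype="string")
checker.set_vocabulary(["cat", "dog", "bird"])
flags = checker.is_anomalous(["cat", "fish", "dog"])
print(flags)
```

Here only `fish` is flagged, because it is the one value absent from the vocabulary.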
Attributes:
| Name | Type | Description |
|---|---|---|
| dtype | Any | The data type of input values ('string' or 'int32'). |
| lookup | StringLookup \| IntegerLookup \| None | A Keras lookup layer for mapping values to indices. |
| vocabulary | StringLookup \| IntegerLookup \| None | List of valid categorical values. |
Example
Initializes the layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dtype | str | Data type of input values ('string' or 'int32'). Defaults to 'string'. | 'string' |
| `**kwargs` | | Additional layer arguments. | {} |
Raises:
| Type | Description |
|---|---|
| ValueError | If dtype is not 'string' or 'int32'. |
Source code in kerasfactory/layers/CategoricalAnomalyDetectionLayer.py (lines 42-56)
Attributes
property
Get the dtype of the layer.
Functions
Set the dtype and initialize the appropriate lookup layer.
Source code in kerasfactory/layers/CategoricalAnomalyDetectionLayer.py (lines 81-97)
Initializes the layer with a vocabulary of valid values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocabulary | list[str \| int] | List of valid categorical values. | required |
Source code in kerasfactory/layers/CategoricalAnomalyDetectionLayer.py (lines 99-115)
Compute the output shape of the layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_shape | tuple[int \| None, int] | Input shape tuple. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, tuple[int \| None, int]] | Dictionary mapping output names to their shapes. |
Source code in kerasfactory/layers/CategoricalAnomalyDetectionLayer.py (lines 117-137)
classmethod
Create layer from configuration.
Source code in kerasfactory/layers/CategoricalAnomalyDetectionLayer.py (lines 202-212)
Parameters Deep Dive
embedding_dim (int, optional)
- Purpose: Dimension of categorical embeddings
- Range: 8 to 64+ (typically 16-32)
- Impact: Larger values = more expressive embeddings but more parameters
- Recommendation: Start with 16-32, scale based on data complexity
frequency_threshold (float, optional)
- Purpose: Threshold for frequency-based anomaly detection
- Range: 0.0 to 1.0 (typically 0.01-0.1)
- Impact: Lower values = more sensitive to rare values
- Recommendation: Use 0.01-0.05 for most applications
embedding_weight (float, optional)
- Purpose: Weight for embedding-based anomaly detection
- Range: 0.0 to 1.0 (typically 0.3-0.7)
- Impact: Higher values = more emphasis on embedding-based detection
- Recommendation: Use 0.3-0.7 based on data characteristics
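One way to read `embedding_weight` is as the mixing factor between two anomaly signals. The sketch below is an assumption about how such a blend could work, not the layer's documented behaviour (the API reference above lists only `dtype` as a constructor parameter): the function `combined_score` and the convex-combination rule are purely illustrative.

```python
def combined_score(freq_score, emb_score, embedding_weight=0.5):
    """Blend a frequency-based and an embedding-based anomaly score.

    Assumed rule: a convex combination, so embedding_weight=0.0 uses
    only the frequency score and 1.0 uses only the embedding score.
    """
    if not 0.0 <= embedding_weight <= 1.0:
        raise ValueError("embedding_weight must be between 0 and 1")
    return embedding_weight * emb_score + (1.0 - embedding_weight) * freq_score


# Equal weighting of a high frequency score and a moderate embedding score.
blended = combined_score(freq_score=0.9, emb_score=0.5, embedding_weight=0.5)
print(blended)
```

Under this assumed rule, raising `embedding_weight` moves the result toward the embedding score, which matches the "more emphasis on embedding-based detection" description.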
Performance Characteristics
- Speed: ⚡⚡⚡ Fast for small to medium models, scales with embedding dimension
- Memory: 💾💾💾 Moderate memory usage due to embeddings
- Accuracy: 🎯🎯🎯🎯 Excellent for categorical anomaly detection
- Best For: Categorical data with potential outliers
Examples
Example 1: Categorical Outlier Detection
Example 2: Rarity Analysis
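As a standalone illustration of rarity analysis (plain Python, not the layer's API), the sketch below counts how often each pair of categorical values co-occurs and lists the pairs whose share of rows falls below a cutoff. The helper name `rare_combinations`, the sample data, and the 0.15 cutoff are assumptions.

```python
from collections import Counter


def rare_combinations(rows, min_share):
    """Return value combinations whose share of rows is below min_share."""
    counts = Counter(rows)
    total = len(rows)
    return sorted(combo for combo, n in counts.items() if n / total < min_share)


# Each row is a (device, country) pair; the data is made up.
rows = (
    [("mobile", "US")] * 5
    + [("desktop", "US")] * 3
    + [("mobile", "DE")] * 1
    + [("tablet", "BR")] * 1
)
rare = rare_combinations(rows, min_share=0.15)
print(rare)
```

Note that `mobile` and `US` are both common on their own; it is the specific pairings that are rare, which is why combinations are counted rather than individual values.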
Example 3: Frequency Analysis
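A standalone sketch of frequency analysis follows. It compares the relative frequencies of categories between a reference sample and a newer batch, which is one way to track frequency changes over time. The functions `relative_frequencies` and `frequency_shift` and the sample data are hypothetical and do not come from the library.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the sample."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


def frequency_shift(reference, current):
    """Return current-minus-reference frequency for every category seen."""
    ref = relative_frequencies(reference)
    cur = relative_frequencies(current)
    categories = sorted(set(ref) | set(cur))
    return {c: cur.get(c, 0.0) - ref.get(c, 0.0) for c in categories}


# Made-up payment-method samples from two time windows.
reference = ["card"] * 8 + ["cash"] * 2
current = ["card"] * 5 + ["cash"] * 2 + ["crypto"] * 3
shift = frequency_shift(reference, current)
print(shift)
```

A positive shift for a category that was absent from the reference sample (here `crypto`) is the kind of change that would merit a closer look.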
Tips & Best Practices
- Embedding Dimension: Start with 16-32, scale based on data complexity
- Frequency Threshold: Use 0.01-0.05 for most applications
- Embedding Weight: Balance embedding and frequency-based detection
- Categorical Encoding: Ensure proper categorical encoding
- Rarity Analysis: Monitor rarity patterns for interpretability
- Frequency Analysis: Track frequency changes over time
Common Pitfalls
- Embedding Dimension: Must be positive integer
- Frequency Threshold: Must be between 0 and 1
- Embedding Weight: Must be between 0 and 1
- Memory Usage: Scales with embedding dimension and vocabulary size
- Categorical Encoding: Ensure proper string tensor handling
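The range constraints in this list can be checked before any model is built. The helper below is a hypothetical pre-flight check (`validate_params` is not part of the library) that enforces the ranges stated above.

```python
def validate_params(embedding_dim, frequency_threshold, embedding_weight):
    """Raise ValueError if any parameter violates the stated ranges."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(embedding_dim, bool) or not isinstance(embedding_dim, int):
        raise ValueError("embedding_dim must be an integer")
    if embedding_dim <= 0:
        raise ValueError("embedding_dim must be positive")
    if not 0.0 <= frequency_threshold <= 1.0:
        raise ValueError("frequency_threshold must be between 0 and 1")
    if not 0.0 <= embedding_weight <= 1.0:
        raise ValueError("embedding_weight must be between 0 and 1")


# Passes silently when every value is within range.
validate_params(embedding_dim=16, frequency_threshold=0.05, embedding_weight=0.5)
```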
Related Layers
- NumericalAnomalyDetection - Numerical anomaly detection
- BusinessRulesLayer - Business rules validation
- FeatureCutout - Feature regularization
- DistributionAwareEncoder - Distribution-aware encoding
Further Reading
- Categorical Data - Categorical data concepts
- Anomaly Detection - Anomaly detection techniques
- Frequency Analysis - Frequency analysis concepts
- KerasFactory Layer Explorer - Browse all available layers
- Feature Engineering Tutorial - Complete guide to feature engineering