🔍 KMR Data Analyzer
The KMR Data Analyzer is an intelligent utility that analyzes your tabular data and automatically recommends the best KMR layers for your specific dataset.
Smart Recommendations
Just provide your CSV file, and the analyzer will suggest the most appropriate layers based on your data characteristics!
✨ Features
- 📊 Automatic Analysis: Analyzes single CSV files or entire directories
- 🎯 Feature Detection: Identifies numerical, categorical, date, and text features
- 🔍 Data Insights: Detects high cardinality, missing values, correlations, and patterns
- 🧩 Layer Recommendations: Suggests the best KMR layers for your data
- 🔧 Extensible: Add custom recommendation rules
- 💻 CLI & API: Command-line interface and Python API
- 📈 Performance Tips: Guidance on layer configuration and optimization
🚀 Installation
The Data Analyzer is included with the Keras Model Registry package.
# Install from PyPI (recommended)
pip install kmr
# Or install from source using Poetry
git clone https://github.com/UnicoLab/keras-model-registry
cd keras-model-registry
poetry install
💻 Usage
🖥️ Command-line Interface
The Data Analyzer can be used from the command line:
# Analyze a single CSV file
python -m kmr.utils.data_analyzer_cli path/to/data.csv
# Analyze a directory of CSV files
python -m kmr.utils.data_analyzer_cli path/to/data_dir/
# Save results to a JSON file
python -m kmr.utils.data_analyzer_cli path/to/data.csv --output results.json
# Get only layer recommendations without detailed statistics
python -m kmr.utils.data_analyzer_cli path/to/data.csv --recommendations-only
🐍 Python API
You can also use the Data Analyzer in your Python code:
from kmr.utils import DataAnalyzer, analyze_data
# Quick usage
results = analyze_data("path/to/data.csv")
recommendations = results["recommendations"]
# Or using the class directly
analyzer = DataAnalyzer()
result = analyzer.analyze_and_recommend("path/to/data.csv")
# Add custom layer recommendations
analyzer.register_recommendation(
    characteristic="continuous_features",
    layer_name="MyCustomLayer",
    description="Custom layer for continuous features",
    use_case="Special continuous feature processing",
)
# Analyze multiple files in a directory
result = analyzer.analyze_and_recommend("path/to/directory", pattern="*.csv")
Data Characteristics
The analyzer identifies the following data characteristics:
continuous_features
: Numerical features
categorical_features
: Categorical features
date_features
: Date and time features
text_features
: Text features
high_cardinality_categorical
: Categorical features with high cardinality
high_missing_value_features
: Features with many missing values
feature_interaction
: Highly correlated feature pairs
time_series
: Date features that may indicate time series data
general_tabular
: General tabular data characteristics
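To give a feel for what heuristic type detection can look like, here is a minimal sketch that sorts the columns of a CSV into four of the characteristics listed above. It is not the analyzer's actual implementation: the function `classify_columns`, the helper names, and the `max_categories` threshold are assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io
from datetime import datetime


def _is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    """True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def classify_columns(csv_text, max_categories=20):
    """Group column names by a guessed characteristic (hypothetical heuristic)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    groups = {
        "continuous_features": [],
        "categorical_features": [],
        "date_features": [],
        "text_features": [],
    }
    if not rows:
        return groups
    for column in rows[0].keys():
        # Ignore missing cells when guessing the type.
        values = [row[column] for row in rows if row[column] not in ("", None)]
        if not values:
            continue  # an all-missing column cannot be typed
        if all(_is_number(v) for v in values):
            groups["continuous_features"].append(column)
        elif all(_is_date(v) for v in values):
            groups["date_features"].append(column)
        elif len(set(values)) <= max_categories:
            groups["categorical_features"].append(column)
        else:
            groups["text_features"].append(column)
    return groups


sample = (
    "age,city,signup\n"
    "34,Paris,2024-01-05\n"
    "29,Lyon,2024-02-11\n"
    "41,Paris,2024-03-02\n"
)
result = classify_columns(sample)
print(result)
```

A rule set this simple misclassifies easily (for example, integer-coded categories look continuous), which is why the detected characteristics are worth reviewing before acting on the recommendations.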
Layer Recommendations
For each data characteristic, the analyzer recommends appropriate KMR layers along with descriptions and use cases.
Example
For continuous features, the following layers might be recommended:
AdvancedNumericalEmbedding
: Embeds continuous features using both MLP and discretization approaches
DifferentialPreprocessingLayer
: Applies various normalizations and transformations to numerical features
Extending Layer Recommendations
You can extend the layer recommendations by registering new layers:
from kmr.utils import DataAnalyzer
analyzer = DataAnalyzer()
analyzer.register_recommendation(
    characteristic="continuous_features",
    layer_name="MyCustomLayer",
    description="Custom layer for continuous features",
    use_case="Special continuous feature processing",
)
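Conceptually, registering a recommendation attaches a (layer name, description, use case) entry to a data characteristic. The toy class below sketches that pattern so the effect of a registration is easier to picture. It is a stand-in written for this page, not KMR's internals: `RecommendationRegistry` and its `recommend` method are hypothetical, and only the `register_recommendation` argument names are taken from the call above.

```python
from collections import defaultdict


class RecommendationRegistry:
    """Toy illustration of the registration pattern (not KMR's internals)."""

    def __init__(self):
        # Maps a characteristic name to a list of (layer, description, use case).
        self._rules = defaultdict(list)

    def register_recommendation(self, characteristic, layer_name, description, use_case):
        self._rules[characteristic].append((layer_name, description, use_case))

    def recommend(self, characteristics):
        """Return rules only for characteristics that have at least one feature."""
        return {
            name: list(self._rules[name])
            for name, features in characteristics.items()
            if features and name in self._rules
        }


registry = RecommendationRegistry()
registry.register_recommendation(
    characteristic="continuous_features",
    layer_name="MyCustomLayer",
    description="Custom layer for continuous features",
    use_case="Special continuous feature processing",
)
registry.register_recommendation(
    characteristic="text_features",
    layer_name="MyTextLayer",  # hypothetical layer name
    description="Custom layer for text features",
    use_case="Special text processing",
)

# Only characteristics with detected features produce recommendations.
detected = {"continuous_features": ["age", "income"], "text_features": []}
recommended = registry.recommend(detected)
print(recommended)
```

In this sketch a rule only surfaces when its characteristic is actually present in the data, so a custom rule for a characteristic your dataset lacks would not appear in the output.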
Example Script
Check out the example script at examples/data_analyzer_example.py for a complete demonstration.
Output Format
The analyzer returns a dictionary with the following structure:
{
    "analysis": {
        "file": "filename.csv",  # For single file analysis
        "stats": {
            "row_count": 1000,
            "column_count": 10,
            "column_types": { ... },
            "characteristics": {
                "continuous_features": ["feature1", "feature2", ...],
                "categorical_features": ["feature3", "feature4", ...],
                ...
            },
            "missing_values": { ... },
            "cardinality": { ... },
            "numeric_stats": { ... }
        }
    },
    "recommendations": {
        "continuous_features": [
            ["LayerName1", "Description1", "UseCase1"],
            ["LayerName2", "Description2", "UseCase2"],
            ...
        ],
        "categorical_features": [ ... ],
        ...
    }
}
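As a sketch of how this structure might be consumed, the snippet below walks a hand-written dictionary that mirrors the single-file layout shown above and builds a short text summary. The `summarize` helper is hypothetical, and the sample values are the placeholders from the structure, not real analyzer output. The layout for directory analysis may differ.

```python
# Hand-written stand-in for the analyzer's return value (single-file layout).
results = {
    "analysis": {
        "file": "filename.csv",
        "stats": {
            "row_count": 1000,
            "column_count": 10,
            "characteristics": {
                "continuous_features": ["feature1", "feature2"],
            },
        },
    },
    "recommendations": {
        "continuous_features": [
            ["LayerName1", "Description1", "UseCase1"],
            ["LayerName2", "Description2", "UseCase2"],
        ],
    },
}


def summarize(results):
    """Return one line per characteristic, followed by its recommended layers."""
    lines = []
    characteristics = results["analysis"]["stats"]["characteristics"]
    for characteristic, layers in results["recommendations"].items():
        features = characteristics.get(characteristic, [])
        lines.append(f"{characteristic} ({len(features)} features)")
        # Each recommendation is a [layer name, description, use case] triple.
        for layer_name, description, use_case in layers:
            lines.append(f"  {layer_name}: {description} [{use_case}]")
    return lines


summary = summarize(results)
print("\n".join(summary))
```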
Caveats
- The analyzer relies on heuristics to identify feature types, which may not always be accurate.
- Recommendations are based on general patterns and may need adjustment for specific use cases.
- Performance may degrade with very large CSV files due to memory constraints.
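One possible workaround for very large files is to analyze a random sample of rows instead of the full file. The sketch below uses reservoir sampling (Algorithm R) to pick at most `k` rows in a single pass without loading the whole file into memory. The `sample_csv_rows` helper is hypothetical and not part of KMR, and whether a sample preserves the characteristics you care about (rare categories, missing-value rates) depends on your data.

```python
import csv
import io
import random


def sample_csv_rows(source, k, seed=0):
    """Return (header, rows) with at most k rows chosen uniformly at random.

    `source` is any iterable of CSV lines, such as an open file.
    """
    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return [], []  # empty input
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(reader):
        if index < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Demonstration on an in-memory CSV with 1000 data rows.
big_csv = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
header, rows = sample_csv_rows(big_csv, k=50)
print(header, len(rows))
```

The sampled rows could then be written to a smaller CSV file with `csv.writer` and passed to the analyzer in place of the original.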