Transformation Tools
Data transformation and preprocessing tools for preparing datasets for machine learning.
Overview
- Scaling: StandardScaler, MinMaxScaler, RobustScaler
- Encoding: OneHot, Label, Target encoding
- Feature Engineering: Polynomial features, binning
- Missing Data: Imputation strategies
- Image Transforms: Augmentation, resizing, normalization
Scaling Methods
| Scaler | Formula | Use Case | Outlier Sensitive |
|---|---|---|---|
| Standard | (x - mean) / std | Most algorithms | Yes |
| MinMax | (x - min) / (max - min) | Neural networks | Yes |
| MaxAbs | x / max(abs(x)) | Sparse data | Yes |
| Robust | (x - median) / IQR | Outlier presence | No |
| Normalizer | x / norm(x) | Text/sparse | No |
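The formulas in the table can be checked directly in plain NumPy (this illustrates the math only, not the cyxwiz API):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one feature with an outlier

# Standard: (x - mean) / std -- the outlier inflates both mean and std
standard = (x - x.mean()) / x.std()

# MinMax: (x - min) / (max - min) -- the outlier maps to 1.0 and
# compresses all inliers close to 0
minmax = (x - x.min()) / (x.max() - x.min())

# Robust: (x - median) / IQR -- median and IQR are unaffected by the outlier
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(minmax)  # inliers compressed near 0; outlier at 1.0
print(robust)  # inliers stay within +/-1; outlier far outside
```

This is why the table marks Standard and MinMax as outlier-sensitive: a single extreme value dominates their scale parameters, while the median and IQR used by Robust do not move.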
Scaling Usage
```python
from cyxwiz.transforms import StandardScaler, MinMaxScaler, RobustScaler

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Transform new data (use the same fitted parameters)
X_new_scaled = scaler.transform(X_new)

# Inverse transform
X_original = scaler.inverse_transform(X_scaled)
```
Encoding
One-Hot Encoding
```python
from cyxwiz.transforms import OneHotEncoder

encoder = OneHotEncoder(
    drop='first',
    handle_unknown='ignore'
)
X_encoded = encoder.fit_transform(
    X[['category', 'color']]
)
```
Target Encoding
```python
from cyxwiz.transforms import TargetEncoder

target_enc = TargetEncoder(
    smoothing=1.0
)
X_encoded = target_enc.fit_transform(
    X['category'], y
)
```
Imputation
```python
from cyxwiz.transforms import SimpleImputer, KNNImputer

# Simple imputation
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)
```
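What `strategy='median'` computes can be sketched in plain NumPy (illustrative only, not the cyxwiz implementation): each missing cell is replaced with its column's median over the observed values.

```python
import numpy as np

X = np.array([[1.0,    10.0],
              [2.0,    np.nan],
              [np.nan, 30.0],
              [4.0,    20.0]])

# Per-column medians, ignoring NaNs ("fit")
medians = np.nanmedian(X, axis=0)          # [2., 20.]

# Replace each NaN with its column's median ("transform")
X_filled = np.where(np.isnan(X), medians, X)
```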
Image Transforms
```python
from cyxwiz.transforms.image import (
    Compose, Resize, RandomCrop, RandomHorizontalFlip,
    ColorJitter, Normalize, ToTensor
)

transform = Compose([
    Resize(256),
    RandomCrop(224),
    RandomHorizontalFlip(p=0.5),
    ColorJitter(brightness=0.2, contrast=0.2),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

transformed_image = transform(image)
```
Best Practices
Scaling Guidelines
- Always scale features before distance-based algorithms (k-NN, SVM, k-means)
- Fit the scaler on training data only; transform validation and test data with the fitted parameters
- Use RobustScaler when outliers are present
- Prefer MinMaxScaler for neural networks
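The fit-on-training-data-only rule can be sketched in plain NumPy (illustrative, not the cyxwiz API): statistics come from the training split, and both splits are transformed with those same statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# "Fit": compute statistics from the training split only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# "Transform" both splits with the *training* statistics --
# computing mean/std on X_test would leak test-set information
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

Only the training split is guaranteed to end up with zero mean and unit variance; the test split will be close but not exact, which is expected and correct.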
Encoding Guidelines
- One-Hot for low cardinality (< 20 categories)
- Target encoding for high-cardinality features
- Drop the first category for linear models to avoid collinearity
- Handle unknown categories in production (e.g. handle_unknown='ignore')
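To see why target encoding suits high cardinality, here is one common smoothing scheme sketched in pandas (the exact formula TargetEncoder uses may differ): each category's mean target is blended toward the global mean, with rare categories pulled harder toward it.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'c'],
    'y':        [1,   1,   0,   0,   0,   1],
})

global_mean = df['y'].mean()                       # 0.5
stats = df.groupby('category')['y'].agg(['mean', 'count'])

# Blend category mean with the global mean; a small count gives
# the global mean more weight, reducing overfitting on rare categories
smoothing = 1.0
encoded = (stats['count'] * stats['mean'] + smoothing * global_mean) \
          / (stats['count'] + smoothing)

print(encoded)
# 'c' has a single sample, so its encoding lands halfway between
# its raw mean (1.0) and the global mean (0.5), i.e. 0.75
```

The result is a single numeric column regardless of how many categories exist, which is exactly what makes the technique practical when one-hot encoding would explode the feature count.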