
Transformation Tools

Data transformation and preprocessing tools for preparing datasets for machine learning.

Overview

Scaling: StandardScaler, MinMaxScaler, RobustScaler
Encoding: One-hot, label, and target encoding
Feature Engineering: Polynomial features, binning
Missing Data: Imputation strategies
Image Transforms: Augmentation, resizing, normalization

Scaling Methods

Scaler       Formula                    Use Case           Outlier Sensitive
Standard     (x - mean) / std           Most algorithms    Yes
MinMax       (x - min) / (max - min)    Neural networks    Yes
MaxAbs       x / max(abs(x))            Sparse data        Yes
Robust       (x - median) / IQR         Outlier presence   No
Normalizer   x / norm(x)                Text/sparse        No
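The formulas above can be sketched directly in NumPy to see why the robust scaler is less outlier-sensitive than the standard scaler (a minimal illustration, not the cyxwiz implementation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier

# Standard: (x - mean) / std -- the outlier drags the mean and inflates std,
# compressing the inliers into a narrow band near -0.5.
standard = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR -- median and IQR are barely affected by the
# outlier, so the spacing between inliers is preserved.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standard[:4])  # inliers squeezed close together near -0.5
print(robust[:4])    # inliers keep a spread of 0.5 between neighbors
```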

Scaling Usage

from cyxwiz.transforms import StandardScaler, MinMaxScaler, RobustScaler

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Transform new data (use same parameters)
X_new_scaled = scaler.transform(X_new)

# Inverse transform
X_original = scaler.inverse_transform(X_scaled)

Encoding

One-Hot Encoding
from cyxwiz.transforms import OneHotEncoder

encoder = OneHotEncoder(
    drop='first',
    handle_unknown='ignore'
)
X_encoded = encoder.fit_transform(
    X[['category', 'color']]
)
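What drop='first' and handle_unknown='ignore' do can be sketched in plain NumPy (the category names here are illustrative, and this is not cyxwiz's internal code):

```python
import numpy as np

def one_hot(values, categories, drop_first=True):
    """Encode values against a fixed category list learned during fit.

    drop_first removes the first category's column (avoids collinearity
    for linear models); a value not in the category list maps to an
    all-zero row, mirroring handle_unknown='ignore'."""
    cats = categories[1:] if drop_first else categories
    return np.array([[1.0 if v == c else 0.0 for c in cats] for v in values])

categories = ['red', 'green', 'blue']  # learned during fit
rows = one_hot(['green', 'red', 'purple'], categories)
# 'green' -> [1, 0]; the dropped category 'red' -> [0, 0]; the unseen
# 'purple' -> [0, 0] as well. Note the dropped category and unknowns
# become indistinguishable, which is worth knowing in production.
```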
Target Encoding
from cyxwiz.transforms import TargetEncoder

target_enc = TargetEncoder(
    smoothing=1.0
)
X_encoded = target_enc.fit_transform(
    X['category'], y
)
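Target encoding replaces each category with a blend of that category's mean target and the global mean, with smoothing controlling the blend. A common smoothed-mean formula is shown below in plain NumPy; this is an assumption about how the weighting works, not cyxwiz's exact implementation:

```python
import numpy as np

def target_encode(categories, y, smoothing=1.0):
    """Smoothed target encoding:
    enc(cat) = (n * cat_mean + smoothing * global_mean) / (n + smoothing)
    Small categories are shrunk toward the global mean."""
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoding = {}
    for cat in set(categories):
        mask = np.array([c == cat for c in categories])
        n = mask.sum()
        cat_mean = y[mask].mean()
        encoding[cat] = (n * cat_mean + smoothing * global_mean) / (n + smoothing)
    return np.array([encoding[c] for c in categories])

cats = ['a', 'a', 'b']
y = [1.0, 0.0, 1.0]
enc = target_encode(cats, y, smoothing=1.0)
# global mean = 2/3; 'a' (n=2, mean=0.5) -> 5/9; 'b' (n=1, mean=1.0) -> 5/6
```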

Imputation

from cyxwiz.transforms import SimpleImputer, KNNImputer

# Simple imputation
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)
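Under the hood, median imputation simply fills NaNs column by column with each column's median. A minimal NumPy sketch of that behavior (not the cyxwiz internals):

```python
import numpy as np

def median_impute(X):
    """Replace NaNs in each column with that column's median."""
    X = np.array(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j]                       # view into X
        col[np.isnan(col)] = np.nanmedian(col)
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])
X_filled = median_impute(X)
# column 0: median of [1, 3] is 2; column 1: median of [4, 6] is 5
```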

Image Transforms

from cyxwiz.transforms.image import (
    Compose, Resize, RandomCrop, RandomHorizontalFlip,
    ColorJitter, Normalize, ToTensor
)

transform = Compose([
    Resize(256),
    RandomCrop(224),
    RandomHorizontalFlip(p=0.5),
    ColorJitter(brightness=0.2, contrast=0.2),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

transformed_image = transform(image)
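Compose simply chains callables, feeding each transform the previous transform's output. A stripped-down version of the pattern (illustrative; cyxwiz's own class may differ):

```python
class Compose:
    """Apply a list of transforms in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x

# Toy numeric transforms stand in for image ops:
pipeline = Compose([lambda x: x + 1, lambda x: x * 2])
print(pipeline(3))  # (3 + 1) * 2 = 8
```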

Best Practices

Scaling Guidelines
  1. Always scale before distance-based algorithms
  2. Fit on training data only
  3. Use RobustScaler when outliers present
  4. MinMaxScaler for neural networks
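Guideline 2 means the scaler's parameters come from the training split alone and are reused verbatim on validation and test data. A NumPy sketch of the correct pattern:

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Fit: compute parameters from the training split only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# Transform both splits with the *training* parameters.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # never refit on the test split
```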
Encoding Guidelines
  1. One-Hot for low cardinality (< 20)
  2. Target Encoding for high cardinality
  3. Drop first category for linear models
  4. Handle unknowns for production