
Job Execution

Understanding how Server Nodes execute training and inference jobs.

Job Lifecycle

Job Assignment           Execution                    Completion
      │                      │                             │
      v                      v                             v
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ Received │─>│ Validate │─>│ Download │─>│ Execute  │─>│ Upload   │
│          │  │          │  │ Data     │  │ Training │  │ Results  │
└──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘
                  │              │              │              │
                  v              v              v              v
              Reject if     Cache data    Progress       Report
              invalid       locally       updates        completion
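The lifecycle above can be sketched as a small state machine. This is a minimal illustration; the state names mirror the diagram, but the enum and the `advance` function are assumptions for this sketch, not the node's actual internal types.

```rust
// Job lifecycle as a state machine. State names follow the diagram;
// the types themselves are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Received,
    Validated,
    DataReady,
    Executing,
    Uploading,
    Completed,
    Rejected,
}

// Advance a job one step along the happy path; a failed validation
// moves it to Rejected instead. Terminal states stay put.
fn advance(state: JobState, valid: bool) -> JobState {
    use JobState::*;
    match state {
        Received if !valid => Rejected,
        Received => Validated,
        Validated => DataReady, // dataset downloaded and cached locally
        DataReady => Executing, // training/inference runs, emitting progress
        Executing => Uploading, // results uploaded to storage
        Uploading => Completed, // completion reported
        other => other,         // Completed / Rejected are terminal
    }
}
```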

Job Types

Training Jobs
  • Full model training from scratch
  • Fine-tuning pre-trained models
  • Transfer learning
  • Distributed training (multi-GPU)
Inference Jobs
  • Batch inference on datasets
  • Real-time inference API
  • Model evaluation/testing
  • Feature extraction

Job Assignment

Jobs are assigned based on a scoring algorithm:

fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;

    // Device match (required GPU/memory)
    if node.has_device(job.required_device) {
        score += 100.0;
    }

    // Memory availability
    if node.available_memory >= job.estimated_memory {
        score += 50.0;
    }

    // Reputation bonus (0.0 - 1.0)
    score += node.reputation * 30.0;

    // Penalize busy nodes
    score -= node.active_jobs as f64 * 10.0;

    score
}
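For context, a node-selection pass might apply this score across candidates and take the maximum. The sketch below restates `score_node` so it is self-contained; the `Job`/`Node` struct definitions and the `pick_node` helper are assumptions for illustration, not the actual scheduler API.

```rust
// Sketch: choose the highest-scoring node for a job.
// Struct fields mirror what score_node touches; everything else
// (names, tie-breaking) is assumed for this example.
#[derive(Clone, Copy, PartialEq)]
enum DeviceType { Cuda, OpenCl, Cpu }

struct Job {
    required_device: DeviceType,
    estimated_memory: i64,
}

struct Node {
    devices: Vec<DeviceType>,
    available_memory: i64,
    reputation: f64, // 0.0 - 1.0
    active_jobs: u32,
}

impl Node {
    fn has_device(&self, d: DeviceType) -> bool {
        self.devices.contains(&d)
    }
}

fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;
    if node.has_device(job.required_device) {
        score += 100.0;
    }
    if node.available_memory >= job.estimated_memory {
        score += 50.0;
    }
    score += node.reputation * 30.0;
    score -= node.active_jobs as f64 * 10.0;
    score
}

// Select the best candidate; returns None if no nodes are registered.
fn pick_node<'a>(job: &Job, nodes: &'a [Node]) -> Option<&'a Node> {
    nodes.iter().max_by(|a, b| {
        score_node(job, a)
            .partial_cmp(&score_node(job, b))
            .unwrap_or(std::cmp::Ordering::Equal)
    })
}
```

Note how the penalty for `active_jobs` steers work toward idle nodes: a high-reputation node running five jobs scores below an average idle one.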

Job Configuration

message JobConfig {
    string job_id = 1;
    JobType job_type = 2;              // TRAINING, INFERENCE
    string model_config = 3;           // Serialized model definition
    string dataset_uri = 4;            // IPFS/S3/HTTP URL

    // Training parameters
    int32 epochs = 5;
    int32 batch_size = 6;
    double learning_rate = 7;

    // Resource requirements
    DeviceType required_device = 8;    // CUDA, OPENCL, CPU
    int64 min_memory_mb = 9;
    int32 priority = 10;               // 1-10

    // Limits
    int64 timeout_seconds = 11;
    int64 max_output_size_mb = 12;
}
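For reference, a filled-out JobConfig in protobuf text format might look like this (all values are illustrative, not defaults):

```protobuf
job_id: "job-7f3a"
job_type: TRAINING
model_config: "{\"layers\": [128, 64, 10]}"
dataset_uri: "ipfs://QmExampleCid/train.parquet"
epochs: 10
batch_size: 64
learning_rate: 0.001
required_device: CUDA
min_memory_mb: 8192
priority: 5
timeout_seconds: 86400
max_output_size_mb: 2048
```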

Progress Reporting

Nodes report progress to the Central Server periodically:

message ProgressUpdate {
    string job_id = 1;
    double progress = 2;          // 0.0 to 1.0
    int32 current_epoch = 3;
    int32 total_epochs = 4;

    // Training metrics
    map<string, double> metrics = 5;  // loss, accuracy, etc.

    // Resource usage
    double gpu_utilization = 6;
    int64 memory_used_mb = 7;

    // Timing
    int64 elapsed_seconds = 8;
    int64 eta_seconds = 9;
}
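The eta_seconds field can be derived from progress and elapsed time. A sketch, assuming simple linear extrapolation (the node's actual estimator may differ):

```rust
// Estimate remaining time by linear extrapolation: if `progress`
// of the work took `elapsed_seconds`, the remainder takes roughly
// elapsed * (1 - progress) / progress. Returns None before any
// measurable progress to avoid dividing by zero.
fn eta_seconds(progress: f64, elapsed_seconds: i64) -> Option<i64> {
    if progress <= 0.0 || progress > 1.0 {
        return None;
    }
    let remaining = elapsed_seconds as f64 * (1.0 - progress) / progress;
    Some(remaining.round() as i64)
}
```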

Execution Environment

Feature          Description
Isolation        Jobs run in sandboxed environments
Resource Limits  CPU, memory, and GPU constraints enforced
Network Access   Limited to dataset downloads and result uploads
Checkpointing    Automatic model checkpoints during training
Cleanup          Temporary files removed after completion


Completion Report

message CompletionReport {
    string job_id = 1;
    CompletionStatus status = 2;   // SUCCESS, FAILED, CANCELLED

    // Results
    string model_hash = 3;         // SHA-256 of model weights
    string model_uri = 4;          // IPFS/storage location

    // Final metrics
    map<string, double> final_metrics = 5;

    // Compute summary
    int64 compute_time_ms = 6;
    int64 gpu_time_ms = 7;
    double average_gpu_util = 8;

    // Proof of compute (optional)
    bytes proof_of_compute = 9;

    // Error info (if failed)
    string error_message = 10;
    string error_stack = 11;
}
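A report's result fields and error fields should track its status: model fields on success, error fields on failure. A minimal sketch of that invariant, using a subset of the fields above (the constructor API and `Option` wrapping are assumptions, not the actual node code):

```rust
// Constructors that keep a completion report consistent with its
// status: success populates the model fields, failure the error
// fields. Field names follow the message above; the API is illustrative.
#[derive(Debug, PartialEq)]
enum CompletionStatus { Success, Failed, Cancelled }

struct CompletionReport {
    job_id: String,
    status: CompletionStatus,
    model_uri: Option<String>,     // storage location, set on success
    error_message: Option<String>, // populated only on failure
}

fn report_success(job_id: &str, model_uri: &str) -> CompletionReport {
    CompletionReport {
        job_id: job_id.to_string(),
        status: CompletionStatus::Success,
        model_uri: Some(model_uri.to_string()),
        error_message: None,
    }
}

fn report_failure(job_id: &str, error: &str) -> CompletionReport {
    CompletionReport {
        job_id: job_id.to_string(),
        status: CompletionStatus::Failed,
        model_uri: None,
        error_message: Some(error.to_string()),
    }
}
```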

Error Handling

Recoverable Errors
  • Network timeouts - automatic retry
  • GPU memory overflow - reduce batch size and retry
  • Checkpoint failures - continue from the last successful checkpoint
Fatal Errors (job aborted and reported as FAILED)
  • Invalid model configuration
  • Dataset corruption
  • Hardware failures
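The recovery policy above can be sketched as a single decision function. The error variants, backoff schedule, and `Recovery` type are assumptions for this sketch, not the node's actual error types:

```rust
use std::time::Duration;

// Sketch of the recovery policy: retry transient failures with
// exponential backoff, halve the batch size on GPU OOM, abort on
// fatal errors. All names here are illustrative.
#[derive(Debug)]
enum JobError {
    NetworkTimeout,
    GpuOutOfMemory,
    Fatal(String), // invalid config, corrupt dataset, hardware failure
}

#[derive(Debug)]
enum Recovery {
    Retry { after: Duration },
    ReduceBatch { new_batch_size: i32 },
    Abort(String),
}

fn recover(err: &JobError, attempt: u32, batch_size: i32) -> Recovery {
    match err {
        // Exponential backoff: 1s, 2s, 4s, ... capped at 60s.
        JobError::NetworkTimeout => Recovery::Retry {
            after: Duration::from_secs((1u64 << attempt.min(6)).min(60)),
        },
        // Halve the batch, but never below 1.
        JobError::GpuOutOfMemory => Recovery::ReduceBatch {
            new_batch_size: (batch_size / 2).max(1),
        },
        JobError::Fatal(msg) => Recovery::Abort(msg.clone()),
    }
}
```

Checkpoint failures are not shown: recovery there is simply resuming training from the most recent checkpoint on disk.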

Job Handling Configuration

[jobs]
# Maximum concurrent jobs
max_concurrent = 3

# Auto-accept jobs matching criteria
auto_accept = true
auto_accept_min_payment = 10.0  # CYXWIZ

# Job types to accept
accept_training = true
accept_inference = true

# Resource allocation per job
default_gpu_memory_percent = 80
default_cpu_threads = 4

# Checkpointing
checkpoint_interval_epochs = 5
max_checkpoints = 3

# Timeouts
heartbeat_interval_ms = 5000
progress_report_interval_ms = 10000