
Job Execution

Understanding how Server Nodes execute training and inference jobs.

Job Lifecycle

Job Assignment           Execution                    Completion
      │                      │                             │
      v                      v                             v
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ Received │─>│ Validate │─>│ Download │─>│ Execute  │─>│ Upload   │
│          │  │          │  │ Data     │  │ Training │  │ Results  │
└──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘
                  │              │              │              │
                  v              v              v              v
              Reject if     Cache data    Progress       Report
              invalid       locally       updates        completion
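The lifecycle above can be sketched as a small state machine. This is a minimal illustration; the state names mirror the diagram, but the enum and the `advance` function are assumptions for this sketch, not the node's actual internal types.

```rust
// Job lifecycle as a state machine. State names follow the diagram;
// the types themselves are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Received,
    Validated,
    DataReady,
    Executing,
    Uploading,
    Completed,
    Rejected,
}

// Advance a job one step along the happy path; a failed validation
// moves it to Rejected instead. Terminal states stay put.
fn advance(state: JobState, valid: bool) -> JobState {
    use JobState::*;
    match state {
        Received if !valid => Rejected,
        Received => Validated,
        Validated => DataReady, // dataset downloaded and cached locally
        DataReady => Executing, // training/inference runs, emitting progress
        Executing => Uploading, // results uploaded to storage
        Uploading => Completed, // completion reported
        other => other,         // Completed / Rejected are terminal
    }
}
```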

Job Types

Training Jobs
  • Full model training from scratch
  • Fine-tuning pre-trained models
  • Transfer learning
  • Distributed training (multi-GPU)
Inference Jobs
  • Batch inference on datasets
  • Real-time inference API
  • Model evaluation/testing
  • Feature extraction

Job Assignment

Jobs are assigned based on a scoring algorithm:

fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;

    // Device match (required GPU/memory)
    if node.has_device(job.required_device) {
        score += 100.0;
    }

    // Memory availability
    if node.available_memory >= job.estimated_memory {
        score += 50.0;
    }

    // Reputation bonus (0.0 - 1.0)
    score += node.reputation * 30.0;

    // Penalize busy nodes
    score -= node.active_jobs as f64 * 10.0;

    score
}
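For context, a node-selection pass might apply this score across candidates and take the maximum. The sketch below restates `score_node` so it is self-contained; the `Job`/`Node` struct definitions and the `pick_node` helper are assumptions for illustration, not the actual scheduler API.

```rust
// Sketch: choose the highest-scoring node for a job.
// Struct fields mirror what score_node touches; everything else
// (names, tie-breaking) is assumed for this example.
#[derive(Clone, Copy, PartialEq)]
enum DeviceType { Cuda, OpenCl, Cpu }

struct Job {
    required_device: DeviceType,
    estimated_memory: i64,
}

struct Node {
    devices: Vec<DeviceType>,
    available_memory: i64,
    reputation: f64, // 0.0 - 1.0
    active_jobs: u32,
}

impl Node {
    fn has_device(&self, d: DeviceType) -> bool {
        self.devices.contains(&d)
    }
}

fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;
    if node.has_device(job.required_device) {
        score += 100.0;
    }
    if node.available_memory >= job.estimated_memory {
        score += 50.0;
    }
    score += node.reputation * 30.0;
    score -= node.active_jobs as f64 * 10.0;
    score
}

// Select the best candidate; returns None if no nodes are registered.
fn pick_node<'a>(job: &Job, nodes: &'a [Node]) -> Option<&'a Node> {
    nodes.iter().max_by(|a, b| {
        score_node(job, a)
            .partial_cmp(&score_node(job, b))
            .unwrap_or(std::cmp::Ordering::Equal)
    })
}
```

Note how the penalty for `active_jobs` steers work toward idle nodes: a high-reputation node running five jobs scores below an average idle one.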

Job Configuration

message JobConfig {
    string job_id = 1;
    JobType job_type = 2;              // TRAINING, INFERENCE
    string model_config = 3;           // Serialized model definition
    string dataset_uri = 4;            // IPFS/S3/HTTP URL

    // Training parameters
    int32 epochs = 5;
    int32 batch_size = 6;
    double learning_rate = 7;

    // Resource requirements
    DeviceType required_device = 8;    // CUDA, OPENCL, CPU
    int64 min_memory_mb = 9;
    int32 priority = 10;               // 1-10

    // Limits
    int64 timeout_seconds = 11;
    int64 max_output_size_mb = 12;
}
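For reference, a filled-out JobConfig in protobuf text format might look like this (all values are illustrative, not defaults):

```protobuf
job_id: "job-7f3a"
job_type: TRAINING
model_config: "{\"layers\": [128, 64, 10]}"
dataset_uri: "ipfs://QmExampleCid/train.parquet"
epochs: 10
batch_size: 64
learning_rate: 0.001
required_device: CUDA
min_memory_mb: 8192
priority: 5
timeout_seconds: 86400
max_output_size_mb: 2048
```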

Progress Reporting

Nodes report progress to the Central Server periodically:

message ProgressUpdate {
    string job_id = 1;
    double progress = 2;          // 0.0 to 1.0
    int32 current_epoch = 3;
    int32 total_epochs = 4;

    // Training metrics
    map<string, double> metrics = 5;  // loss, accuracy, etc.

    // Resource usage
    double gpu_utilization = 6;
    int64 memory_used_mb = 7;

    // Timing
    int64 elapsed_seconds = 8;
    int64 eta_seconds = 9;
}
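The eta_seconds field can be derived from progress and elapsed time. A sketch, assuming simple linear extrapolation (the node's actual estimator may differ):

```rust
// Estimate remaining time by linear extrapolation: if `progress`
// of the work took `elapsed_seconds`, the remainder takes roughly
// elapsed * (1 - progress) / progress. Returns None before any
// measurable progress to avoid dividing by zero.
fn eta_seconds(progress: f64, elapsed_seconds: i64) -> Option<i64> {
    if progress <= 0.0 || progress > 1.0 {
        return None;
    }
    let remaining = elapsed_seconds as f64 * (1.0 - progress) / progress;
    Some(remaining.round() as i64)
}
```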

Execution Environment

Feature          Description
Isolation        Jobs run in sandboxed environments
Resource Limits  CPU, memory, and GPU constraints enforced
Network Access   Limited to dataset downloads and result uploads
Checkpointing    Automatic model checkpoints during training
Cleanup          Temporary files removed after completion


Completion Report

message CompletionReport {
    string job_id = 1;
    CompletionStatus status = 2;   // SUCCESS, FAILED, CANCELLED

    // Results
    string model_hash = 3;         // SHA-256 of model weights
    string model_uri = 4;          // IPFS/storage location

    // Final metrics
    map<string, double> final_metrics = 5;

    // Compute summary
    int64 compute_time_ms = 6;
    int64 gpu_time_ms = 7;
    double average_gpu_util = 8;

    // Proof of compute (optional)
    bytes proof_of_compute = 9;

    // Error info (if failed)
    string error_message = 10;
    string error_stack = 11;
}
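A report's result fields and error fields should track its status: model fields on success, error fields on failure. A minimal sketch of that invariant, using a subset of the fields above (the constructor API and `Option` wrapping are assumptions, not the actual node code):

```rust
// Constructors that keep a completion report consistent with its
// status: success populates the model fields, failure the error
// fields. Field names follow the message above; the API is illustrative.
#[derive(Debug, PartialEq)]
enum CompletionStatus { Success, Failed, Cancelled }

struct CompletionReport {
    job_id: String,
    status: CompletionStatus,
    model_uri: Option<String>,     // storage location, set on success
    error_message: Option<String>, // populated only on failure
}

fn report_success(job_id: &str, model_uri: &str) -> CompletionReport {
    CompletionReport {
        job_id: job_id.to_string(),
        status: CompletionStatus::Success,
        model_uri: Some(model_uri.to_string()),
        error_message: None,
    }
}

fn report_failure(job_id: &str, error: &str) -> CompletionReport {
    CompletionReport {
        job_id: job_id.to_string(),
        status: CompletionStatus::Failed,
        model_uri: None,
        error_message: Some(error.to_string()),
    }
}
```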

Error Handling

Recoverable Errors
  • Network timeouts - automatic retry
  • GPU memory overflow - reduce batch size and retry
  • Checkpoint failures - continue from the last successful checkpoint
Fatal Errors (job aborted and reported as FAILED)
  • Invalid model configuration
  • Dataset corruption
  • Hardware failures
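The recovery policy above can be sketched as a single decision function. The error variants, backoff schedule, and `Recovery` type are assumptions for this sketch, not the node's actual error types:

```rust
use std::time::Duration;

// Sketch of the recovery policy: retry transient failures with
// exponential backoff, halve the batch size on GPU OOM, abort on
// fatal errors. All names here are illustrative.
#[derive(Debug)]
enum JobError {
    NetworkTimeout,
    GpuOutOfMemory,
    Fatal(String), // invalid config, corrupt dataset, hardware failure
}

#[derive(Debug)]
enum Recovery {
    Retry { after: Duration },
    ReduceBatch { new_batch_size: i32 },
    Abort(String),
}

fn recover(err: &JobError, attempt: u32, batch_size: i32) -> Recovery {
    match err {
        // Exponential backoff: 1s, 2s, 4s, ... capped at 60s.
        JobError::NetworkTimeout => Recovery::Retry {
            after: Duration::from_secs((1u64 << attempt.min(6)).min(60)),
        },
        // Halve the batch, but never below 1.
        JobError::GpuOutOfMemory => Recovery::ReduceBatch {
            new_batch_size: (batch_size / 2).max(1),
        },
        JobError::Fatal(msg) => Recovery::Abort(msg.clone()),
    }
}
```

Checkpoint failures are not shown: recovery there is simply resuming training from the most recent checkpoint on disk.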

Job Handling Configuration

[jobs]
# Maximum concurrent jobs
max_concurrent = 3

# Auto-accept jobs matching criteria
auto_accept = true
auto_accept_min_payment = 10.0  # CYXWIZ

# Job types to accept
accept_training = true
accept_inference = true

# Resource allocation per job
default_gpu_memory_percent = 80
default_cpu_threads = 4

# Checkpointing
checkpoint_interval_epochs = 5
max_checkpoints = 3

# Timeouts
heartbeat_interval_ms = 5000
progress_report_interval_ms = 10000