Job Execution
Understanding how Server Nodes execute training and inference jobs.
Job Lifecycle

```
 Assignment                                 Execution      Completion
      │                                         │              │
      v                                         v              v
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ Received │─>│ Validate │─>│ Download │─>│ Execute  │─>│ Upload   │
│          │  │          │  │   Data   │  │ Training │  │ Results  │
└──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘
                   │             │             │             │
                   v             v             v             v
               Reject if    Cache data     Progress      Report
                invalid      locally       updates     completion
```

Job Types
Training Jobs
- Full model training from scratch
- Fine-tuning pre-trained models
- Transfer learning
- Distributed training (multi-GPU)
Inference Jobs
- Batch inference on datasets
- Real-time inference API
- Model evaluation/testing
- Feature extraction
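The lifecycle stages and job categories above can be modeled as plain enums with a transition function. This is only an illustrative sketch; the enum and function names are assumptions, not the project's actual types:

```rust
// Illustrative only: mirrors the lifecycle diagram and job types above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobType {
    Training,  // full training, fine-tuning, transfer, distributed
    Inference, // batch, real-time, evaluation, feature extraction
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Received,
    Validated,
    Downloading,
    Executing,
    Uploading,
    Completed,
    Rejected, // validation failed
}

// Advance a job one stage along the happy path of the pipeline.
fn next_state(state: JobState) -> JobState {
    use JobState::*;
    match state {
        Received => Validated,
        Validated => Downloading,
        Downloading => Executing,
        Executing => Uploading,
        Uploading => Completed,
        s => s, // Completed and Rejected are terminal
    }
}

fn main() {
    let kind = JobType::Training;
    let mut state = JobState::Received;
    while state != JobState::Completed {
        state = next_state(state);
    }
    println!("{:?} job reached {:?}", kind, state);
}
```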
Job Assignment
Jobs are assigned based on a scoring algorithm:
```rust
fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;

    // Device match (required GPU/memory)
    if node.has_device(job.required_device) {
        score += 100.0;
    }

    // Memory availability
    if node.available_memory >= job.estimated_memory {
        score += 50.0;
    }

    // Reputation bonus (0.0 - 1.0)
    score += node.reputation * 30.0;

    // Penalize busy nodes
    score -= node.active_jobs as f64 * 10.0;

    score
}
```

Job Configuration
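As a quick aside before the configuration details: the scoring function above can be run end-to-end by filling in the types it depends on. The `Job` and `Node` structs below are hypothetical; only the fields the snippet uses are included, and the field names follow the snippet rather than any confirmed API:

```rust
#[derive(Clone, Copy, PartialEq)]
enum DeviceType { Cuda, OpenCl, Cpu }

// Hypothetical shapes for Job and Node, holding just the fields scored above.
struct Job {
    required_device: DeviceType,
    estimated_memory: u64, // MB
}

struct Node {
    devices: Vec<DeviceType>,
    available_memory: u64, // MB
    reputation: f64,       // 0.0 - 1.0
    active_jobs: u32,
}

impl Node {
    fn has_device(&self, d: DeviceType) -> bool {
        self.devices.contains(&d)
    }
}

fn score_node(job: &Job, node: &Node) -> f64 {
    let mut score = 0.0;
    if node.has_device(job.required_device) {
        score += 100.0; // device match
    }
    if node.available_memory >= job.estimated_memory {
        score += 50.0; // enough free memory
    }
    score += node.reputation * 30.0; // reputation bonus
    score -= node.active_jobs as f64 * 10.0; // busy-node penalty
    score
}

fn main() {
    let job = Job { required_device: DeviceType::Cuda, estimated_memory: 8192 };
    let idle = Node { devices: vec![DeviceType::Cuda], available_memory: 16384, reputation: 0.9, active_jobs: 0 };
    let busy = Node { devices: vec![DeviceType::Cuda], available_memory: 16384, reputation: 0.9, active_jobs: 2 };
    // idle: 100 + 50 + 27 - 0 = 177; busy: 177 - 20 = 157
    println!("{} {}", score_node(&job, &idle), score_node(&job, &busy));
}
```

Note how the device match dominates: a node without the required device loses 100 points, so even a highly reputable busy node with the right GPU outranks an idle node without it.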
```protobuf
message JobConfig {
  string job_id = 1;
  JobType job_type = 2;            // TRAINING, INFERENCE
  string model_config = 3;         // Serialized model definition
  string dataset_uri = 4;          // IPFS/S3/HTTP URL

  // Training parameters
  int32 epochs = 5;
  int32 batch_size = 6;
  double learning_rate = 7;

  // Resource requirements
  DeviceType required_device = 8;  // CUDA, OPENCL, CPU
  int64 min_memory_mb = 9;
  int32 priority = 10;             // 1-10

  // Limits
  int64 timeout_seconds = 11;
  int64 max_output_size_mb = 12;
}
```

Progress Reporting
Nodes report progress to the Central Server periodically:
```protobuf
message ProgressUpdate {
  string job_id = 1;
  double progress = 2;              // 0.0 to 1.0
  int32 current_epoch = 3;
  int32 total_epochs = 4;

  // Training metrics
  map<string, double> metrics = 5;  // loss, accuracy, etc.

  // Resource usage
  double gpu_utilization = 6;
  int64 memory_used_mb = 7;

  // Timing
  int64 elapsed_seconds = 8;
  int64 eta_seconds = 9;
}
```

Execution Environment
| Feature | Description |
|---|---|
| Isolation | Jobs run in sandboxed environments |
| Resource Limits | CPU, memory, and GPU constraints enforced |
| Network Access | Limited to dataset downloads and result uploads |
| Checkpointing | Automatic model checkpoints during training |
| Cleanup | Temporary files removed after completion |
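To make the checkpointing row concrete: with a fixed epoch interval and a cap on how many checkpoints are retained (both appear in the job-handling configuration later on this page), the bookkeeping might look like the sketch below. The function names are hypothetical:

```rust
// Hypothetical checkpoint bookkeeping: save every `interval` epochs and
// retain only the `max_keep` most recent checkpoints.
fn should_checkpoint(epoch: u32, interval: u32) -> bool {
    interval > 0 && epoch % interval == 0
}

// `saved` holds the epoch numbers of checkpoints currently on disk.
fn prune_checkpoints(mut saved: Vec<u32>, max_keep: usize) -> Vec<u32> {
    saved.sort_unstable();
    let excess = saved.len().saturating_sub(max_keep);
    saved.drain(..excess); // drop the oldest checkpoints
    saved
}

fn main() {
    let interval = 5; // checkpoint_interval_epochs
    let mut saved = Vec::new();
    for epoch in 1..=20 {
        if should_checkpoint(epoch, interval) {
            saved.push(epoch);
            saved = prune_checkpoints(saved, 3); // max_checkpoints
        }
    }
    println!("{:?}", saved); // checkpoints from epochs 10, 15, 20 remain
}
```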
Completion Report
```protobuf
message CompletionReport {
  string job_id = 1;
  CompletionStatus status = 2;  // SUCCESS, FAILED, CANCELLED

  // Results
  string model_hash = 3;        // SHA-256 of model weights
  string model_uri = 4;         // IPFS/storage location

  // Final metrics
  map<string, double> final_metrics = 5;

  // Compute summary
  int64 compute_time_ms = 6;
  int64 gpu_time_ms = 7;
  double average_gpu_util = 8;

  // Proof of compute (optional)
  bytes proof_of_compute = 9;

  // Error info (if failed)
  string error_message = 10;
  string error_stack = 11;
}
```

Error Handling
Recoverable Errors
- Network timeouts: retried automatically
- GPU memory overflow: the batch size is reduced and the step retried
- Checkpoint failures: training continues from the last successful checkpoint
Fatal Errors
- Invalid model configuration
- Dataset corruption
- Hardware failures
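One way to encode the recoverable/fatal split above is an error enum with a retry policy. This is a sketch with hypothetical names; the recovery actions mirror the lists above:

```rust
#[derive(Debug, PartialEq)]
enum JobError {
    // Recoverable
    NetworkTimeout,
    GpuOutOfMemory,
    CheckpointFailed,
    // Fatal
    InvalidModelConfig,
    DatasetCorrupted,
    HardwareFailure,
}

impl JobError {
    fn is_recoverable(&self) -> bool {
        matches!(
            self,
            JobError::NetworkTimeout | JobError::GpuOutOfMemory | JobError::CheckpointFailed
        )
    }
}

// Returns the batch size to retry with, or None if the job must be aborted.
fn recover(err: &JobError, batch_size: u32) -> Option<u32> {
    if !err.is_recoverable() {
        return None; // fatal: abort the job and report failure
    }
    match err {
        JobError::GpuOutOfMemory => Some((batch_size / 2).max(1)), // shrink the batch
        _ => Some(batch_size), // retry as-is, or resume from the last checkpoint
    }
}

fn main() {
    println!("{:?}", recover(&JobError::GpuOutOfMemory, 64));   // Some(32)
    println!("{:?}", recover(&JobError::DatasetCorrupted, 64)); // None
}
```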
Job Handling Configuration
```toml
[jobs]
# Maximum concurrent jobs
max_concurrent = 3

# Auto-accept jobs matching criteria
auto_accept = true
auto_accept_min_payment = 10.0  # CYXWIZ

# Job types to accept
accept_training = true
accept_inference = true

# Resource allocation per job
default_gpu_memory_percent = 80
default_cpu_threads = 4

# Checkpointing
checkpoint_interval_epochs = 5
max_checkpoints = 3

# Timeouts
heartbeat_interval_ms = 5000
progress_report_interval_ms = 10000
```
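Putting the settings above to work, a node's auto-accept decision might look like the sketch below. The `JobsConfig` fields mirror the `[jobs]` TOML section; the `JobOffer` struct and its `payment` field are assumptions introduced for illustration:

```rust
// Mirrors the [jobs] TOML section above.
struct JobsConfig {
    max_concurrent: usize,
    auto_accept: bool,
    auto_accept_min_payment: f64, // in CYXWIZ
    accept_training: bool,
    accept_inference: bool,
}

// Hypothetical incoming job offer.
struct JobOffer {
    is_training: bool, // training vs. inference
    payment: f64,
}

fn should_auto_accept(cfg: &JobsConfig, offer: &JobOffer, active_jobs: usize) -> bool {
    cfg.auto_accept
        && active_jobs < cfg.max_concurrent        // capacity left
        && offer.payment >= cfg.auto_accept_min_payment
        && if offer.is_training { cfg.accept_training } else { cfg.accept_inference }
}

fn main() {
    let cfg = JobsConfig {
        max_concurrent: 3,
        auto_accept: true,
        auto_accept_min_payment: 10.0,
        accept_training: true,
        accept_inference: true,
    };
    let offer = JobOffer { is_training: true, payment: 12.5 };
    println!("{}", should_auto_accept(&cfg, &offer, 2)); // true
    println!("{}", should_auto_accept(&cfg, &offer, 3)); // false: at capacity
}
```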