Problem Statement

📖 5 min read 📄 Part 1 of 10

Distributed Task Scheduler - Problem Statement

Overview

Design a distributed task scheduling system similar to Cron that can schedule and execute tasks across multiple machines with reliability, fault tolerance, and exactly-once execution guarantees. The system should handle millions of scheduled tasks with precise timing and automatic failure recovery.

Functional Requirements

Core Scheduling Features

  • Schedule Tasks: Create tasks with cron expressions or specific timestamps
  • Execute Tasks: Run tasks at scheduled times across distributed workers
  • Task Types: One-time tasks, recurring tasks, cron-based tasks
  • Task Priority: High, medium, low priority execution
  • Task Dependencies: Execute tasks in dependency order
  • Task Cancellation: Cancel scheduled or running tasks

Scheduling Patterns

  • Cron Expression: 0 0 * * * (daily at midnight)
  • Fixed Rate: Every 5 minutes
  • Fixed Delay: 5 minutes after previous completion
  • One-Time: Execute once at specific time
  • Date Range: Execute between start and end dates
  • Custom Calendar: Skip holidays, weekends

Task Execution

  • Exactly-Once: Guarantee task executes exactly once
  • At-Least-Once: Guarantee task executes at least once (may retry)
  • At-Most-Once: Best effort, may skip on failure
  • Timeout Handling: Kill tasks exceeding timeout
  • Retry Logic: Configurable retry with exponential backoff
  • Failure Handling: Dead letter queue for failed tasks

Task Management

  • Task Status: Scheduled, running, completed, failed, cancelled
  • Task History: Track all executions with timestamps
  • Task Logs: Capture stdout/stderr from task execution
  • Task Metrics: Duration, success rate, failure rate
  • Task Search: Query tasks by status, schedule, tags
  • Bulk Operations: Schedule/cancel multiple tasks

Non-Functional Requirements

Performance Requirements

  • Scheduling Precision: ±1 second accuracy
  • Task Throughput: 10,000 tasks executed per second
  • Scheduling Latency: <100ms to schedule a task
  • Execution Latency: <1 second from scheduled time to execution start
  • Query Performance: <100ms to query task status

Scalability Requirements

  • Total Tasks: Support 100 million scheduled tasks
  • Concurrent Executions: 100,000 tasks running simultaneously
  • Workers: Scale to 10,000 worker nodes
  • Task Rate: Handle 1 million task schedules per day
  • Geographic Distribution: Multi-region deployment

Reliability Requirements

  • System Uptime: 99.99% availability
  • Execution Guarantee: 99.99% tasks execute as scheduled
  • No Duplicate Execution: <0.01% duplicate task executions
  • Fault Tolerance: Survive multiple worker failures
  • Data Durability: No task loss on system failure
  • Recovery Time: <30 seconds after failure

Timing Requirements

  • Precision: ±1 second for most tasks
  • High Precision: ±100ms for critical tasks
  • Clock Skew Tolerance: Handle ±5 seconds clock skew
  • Timezone Support: All timezones, DST handling
  • Leap Second Handling: Graceful handling of leap seconds

Use Cases

1. Batch Job Scheduling

Scenario: Daily data processing jobs
Schedule: 0 2 * * * (2 AM daily)
Duration: 30 minutes
Requirements:
- Exactly-once execution
- Dependency management
- Failure notifications

2. Periodic Health Checks

Scenario: Monitor service health every 5 minutes
Schedule: */5 * * * *
Duration: 10 seconds
Requirements:
- High precision (±10s)
- Fast execution
- Alert on failure

3. Reminder Notifications

Scenario: Send user reminders at specific times
Schedule: One-time at user-specified time
Duration: 1 second
Requirements:
- Timezone aware
- Exactly-once delivery
- Low latency

4. Report Generation

Scenario: Generate monthly reports
Schedule: 0 0 1 * * (1st of month at midnight)
Duration: 2 hours
Requirements:
- Dependency on data availability
- Retry on failure
- Result storage

Edge Cases and Constraints

Timing Edge Cases

  • DST Transitions: Handle spring forward/fall back
  • Leap Seconds: 23:59:60 handling
  • Timezone Changes: User changes timezone
  • Clock Skew: Servers with different times
  • Missed Schedules: System down during scheduled time

Execution Edge Cases

  • Long-Running Tasks: Tasks exceeding timeout
  • Resource Exhaustion: Too many concurrent tasks
  • Worker Failures: Worker crashes during execution
  • Network Partitions: Split brain scenarios
  • Cascading Failures: One failure triggers others

Scheduling Edge Cases

  • Overlapping Executions: Previous run still running
  • Rapid Rescheduling: Task rescheduled while running
  • Bulk Scheduling: Schedule 1M tasks at once
  • Past Schedules: Schedule task in the past
  • Far Future: Schedule task years in advance

Success Metrics

Performance Metrics

  • Scheduling Latency P99: <100ms
  • Execution Latency P99: <1 second
  • Task Throughput: 10K+ tasks/second
  • Query Latency P99: <100ms
  • Worker Utilization: 70-80% average

Reliability Metrics

  • Execution Success Rate: 99.99%
  • Timing Accuracy: 99% within ±1 second
  • Duplicate Rate: <0.01%
  • System Uptime: 99.99%
  • Recovery Time: <30 seconds

Business Metrics

  • Cost per Task: <$0.0001 per execution
  • Infrastructure Cost: <$10K per billion tasks/month
  • Operational Overhead: <1 FTE per 1000 workers
  • Customer Satisfaction: >4.5/5 rating

This problem statement provides the foundation for designing a robust, scalable, and reliable distributed task scheduling system.