Distributed Task Scheduler - Interview Tips

Interview Approach (45-60 minutes)

Time Management

1. Requirements (5-8 min)
   - Clarify task types (one-time, recurring, cron)
   - Understand scale (tasks, executions)
   - Define timing precision

2. High-Level Design (10-15 min)
   - Draw system architecture
   - Explain scheduler, workers, queue
   - Discuss data flow

3. Deep Dives (20-25 min)
   - Time wheel algorithm
   - Distributed locking
   - Leader election
   - Failure handling

4. Trade-offs (5-10 min)
   - Execution guarantees
   - Timing precision
   - Consistency models

5. Follow-ups (5-10 min)
   - Task dependencies
   - Resource constraints
   - Multi-region deployment

Essential Questions to Ask

Functional Requirements

Critical Questions:
1. "What types of tasks do we need to support?"
   - One-time, recurring, cron-based?
   
2. "What execution guarantees are needed?"
   - Exactly-once, at-least-once, at-most-once?
   
3. "What timing precision is required?"
   - ±100ms, ±1s, ±10s?
   
4. "Do we need task dependencies?"
   - Sequential execution, DAG workflows?
   
5. "How should we handle failures?"
   - Retry logic, dead letter queue?

Good Follow-ups:
- "Should tasks be cancellable?"
- "Do we need task priorities?"
- "What about long-running tasks?"
- "How to handle timezone and DST?"

Scale Requirements

Key Metrics:
1. "How many scheduled tasks?"
   - 1K → Single server
   - 1M → Small cluster
   - 100M+ → Distributed system
   
2. "How many executions per second?"
   - 10 → Simple cron
   - 1K → Distributed scheduler
   - 10K+ → Optimized architecture
   
3. "What's the task duration?"
   - <1 minute → Lightweight workers
   - 1-10 minutes → Standard workers
   - >10 minutes → Checkpointing needed
   
4. "How many concurrent executions?"
   - 100 → Single worker pool
   - 10K → Distributed workers
   - 100K+ → Massive scale

Common Pitfalls to Avoid

1. Not Discussing Distributed Locking

❌ Bad: "We'll just execute the task when time comes"

✅ Good: "We need distributed locking to ensure exactly-once 
execution. When a task is ready, the scheduler acquires a 
lock in Redis using SET NX EX. Only the lock holder can 
execute the task. The lock expires after 5 minutes to handle 
worker failures. This prevents duplicate execution across 
multiple workers."

Key Points:
- Explain locking mechanism
- Discuss lock expiration
- Address failure scenarios

2. Ignoring Clock Skew

❌ Bad: "We'll use server time for scheduling"

✅ Good: "Clock skew is a real issue in distributed systems. 
We'll use NTP to synchronize clocks across all servers with 
±10ms accuracy. We'll also implement tolerance windows (±5s) 
to handle minor drift. For critical tasks, we'll use logical 
clocks or vector clocks for ordering."

Key Points:
- Acknowledge clock skew problem
- Provide synchronization solution
- Discuss tolerance mechanisms

3. Not Addressing Missed Schedules

❌ Bad: "Tasks will always execute on time"

✅ Good: "If the system is down during a scheduled time, we 
need a policy:

1. Catch-up: Execute all missed tasks immediately
2. Skip: Skip missed executions, wait for next schedule
3. Execute Once: Execute only the most recent missed task

For recurring tasks, I'd recommend 'Execute Once' to avoid 
overwhelming the system. For critical one-time tasks, I'd 
use 'Catch-up' with rate limiting."

Key Points:
- Enumerate options
- Provide recommendations
- Consider system impact

4. Overlooking Task Dependencies

❌ Bad: "Tasks execute independently"

✅ Good: "For task dependencies, we need a DAG (Directed 
Acyclic Graph) structure. Store dependencies in a separate 
table. When Task A completes, check for dependent tasks. 
If all dependencies are satisfied, schedule the dependent 
task. We must validate for cycles before accepting 
dependencies to prevent deadlocks."

Key Points:
- Explain dependency model
- Discuss cycle detection
- Address execution order

Strong Talking Points

1. Time Wheel Algorithm

Strong Answer:
"For efficient task scheduling, I'll use a hierarchical 
time wheel algorithm:

Structure:
- Level 1: 60 slots (seconds)
- Level 2: 60 slots (minutes)
- Level 3: 24 slots (hours)
- Level 4: 30 slots (days)

How it works:
1. Task scheduled for 2h 15m 30s from now
2. Place in Level 4, slot 0 (days)
3. When day advances, cascade to Level 3, slot 2 (hours)
4. When hour advances, cascade to Level 2, slot 15 (minutes)
5. When minute advances, cascade to Level 1, slot 30 (seconds)
6. When second arrives, execute task

Benefits:
- O(1) insertion and deletion
- O(1) tick processing
- Memory efficient (handles millions of tasks)
- No database scanning needed

Alternative: Priority Queue
- O(log n) insertion
- O(log n) deletion
- Simpler but slower at scale

I'd use time wheel for better performance at scale."

Why This Works:
- Shows advanced algorithm knowledge
- Explains clearly with example
- Discusses complexity
- Compares alternatives

2. Exactly-Once Execution

Strong Answer:
"For exactly-once execution, I'll use distributed locking:

Implementation:
1. Task becomes ready for execution
2. Scheduler acquires lock in Redis:
   SET task:lock:{task_id} {worker_id} NX EX 300
3. If lock acquired, dispatch to worker
4. Worker executes task
5. Worker releases lock on completion
6. If worker fails, lock expires after 5 minutes
7. Task becomes available for retry

Edge Cases:
- Worker crashes: Lock expires, task retried
- Network partition: Lock prevents duplicate execution
- Clock skew: Use Redis time, not local time

Idempotency:
- Tasks should be idempotent
- Use idempotency keys
- Detect and skip duplicates

This guarantees exactly-once execution even with failures."

Why This Works:
- Complete solution
- Addresses edge cases
- Discusses idempotency
- Practical implementation

3. Handling Task Dependencies

Strong Answer:
"For task dependencies, I'll implement a DAG-based system:

Data Model:
- Tasks table: Task metadata
- Dependencies table: (task_id, depends_on_task_id)

Execution Flow:
1. Task A completes successfully
2. Query dependent tasks: SELECT * FROM dependencies 
   WHERE depends_on_task_id = task_a_id
3. For each dependent task, check if all dependencies met
4. If yes, schedule task
5. If no, keep waiting

Cycle Detection:
- Run topological sort on dependency graph
- Reject if cycle detected
- Validate before accepting dependencies

Failure Handling:
- If Task A fails, mark Task B as blocked
- Option 1: Skip Task B
- Option 2: Retry Task A, then execute Task B
- Configurable per task

This provides flexible workflow orchestration while 
preventing deadlocks."

Why This Works:
- Clear data model
- Explains execution flow
- Addresses cycle detection
- Handles failures

How to Handle Follow-ups

"How would you handle a task that takes 10 hours to complete?"

Strong Answer:
"Great question. Long-running tasks need special handling:

1. Checkpointing:
   - Save progress every 10 minutes
   - Store checkpoint in Redis
   - Resume from checkpoint on failure

2. Lock Extension:
   - Initial lock: 5 minutes
   - Extend every 2 minutes while running
   - Prevent lock expiration

3. Heartbeat:
   - Worker sends heartbeat every 30 seconds
   - Scheduler monitors heartbeats
   - Reassign if no heartbeat for 2 minutes

4. Graceful Cancellation:
   - Check cancellation flag periodically
   - Cleanup and exit gracefully
   - Save final checkpoint

5. Resource Management:
   - Dedicated worker pool for long tasks
   - Prevent blocking short tasks
   - Better resource utilization

Implementation:
class LongRunningTask {
    void execute() {
        State state = loadCheckpoint();
        
        for (int i = state.progress; i < totalWork; i++) {
            doWork(i);
            
            if (i % 1000 == 0) {
                saveCheckpoint(i);
                extendLock();
                sendHeartbeat();
            }
            
            if (cancelled) {
                saveCheckpoint(i);
                throw new CancelledException();
            }
        }
    }
}

This ensures long-running tasks can survive failures and 
be cancelled gracefully."

Why This Works:
- Comprehensive solution
- Multiple techniques
- Practical implementation
- Addresses all concerns

"How would you prevent task starvation?"

Strong Answer:
"Task starvation happens when low-priority tasks never 
execute because high-priority tasks keep coming. Here's 
my solution:

1. Priority Aging:
   - Increase priority based on wait time
   - Low → Medium after 1 hour
   - Medium → High after 2 hours

2. Guaranteed Capacity:
   - Reserve 20% capacity for low-priority
   - Ensures they eventually execute
   - Prevents complete starvation

3. Weighted Fair Queuing:
   - High: 50% of capacity
   - Medium: 30% of capacity
   - Low: 20% of capacity (guaranteed)

4. Starvation Detection:
   - Alert if task waits >4 hours
   - Automatic priority boost
   - Manual intervention option

Implementation:
class StarvationPrevention {
    void agePriorities() {
        List<Task> waiting = getWaitingTasks();
        
        for (Task task : waiting) {
            long waitTime = now() - task.queuedAt;
            
            if (waitTime > 3600 && task.priority == LOW) {
                task.priority = MEDIUM;
                updateTask(task);
            }
            
            if (waitTime > 7200 && task.priority == MEDIUM) {
                task.priority = HIGH;
                updateTask(task);
            }
        }
    }
}

This ensures fairness while maintaining priority semantics."

Why This Works:
- Identifies the problem
- Multiple solutions
- Practical implementation
- Balances priorities

Red Flags to Avoid

❌ Don't Say:

1. "We'll use cron on a single server"
   → Not distributed, no fault tolerance

2. "Tasks will never fail"
   → Unrealistic, shows lack of experience

3. "We don't need locking"
   → Duplicate execution guaranteed

4. "Clock skew isn't a problem"
   → Shows lack of distributed systems knowledge

5. "We'll handle dependencies later"
   → Avoiding complexity

✅ Do Say:

1. "Let me clarify the execution guarantees needed"
   → Shows structured thinking

2. "Here are three approaches with trade-offs"
   → Demonstrates options analysis

3. "We'll use distributed locking for exactly-once"
   → Shows distributed systems knowledge

4. "The time wheel algorithm provides O(1) operations"
   → Knows efficient algorithms

5. "We'll handle failures with retry and DLQ"
   → Comprehensive failure handling

Closing Strong

Good Summary

"To summarize, I've designed a distributed task scheduler that:

1. Uses time wheel algorithm for efficient scheduling
2. Implements distributed locking for exactly-once execution
3. Supports cron expressions and flexible scheduling
4. Handles task dependencies with DAG validation
5. Provides fault tolerance with leader election
6. Scales horizontally with worker pool

Key design decisions:
- At-least-once execution (with idempotency)
- Normal precision (±1s) for cost efficiency
- Push model for lower latency
- Checkpointing for long-running tasks

Trade-offs made:
- Eventual consistency for scalability
- Timing precision for cost
- Complexity for features

Areas for future improvement:
- ML-based task optimization
- Predictive scaling
- Advanced dependency workflows

I'm happy to dive deeper into any specific area."

Why This Works:
- Concise summary
- Highlights key decisions
- Acknowledges trade-offs
- Shows forward thinking

Remember: Focus on demonstrating systematic thinking, understanding of distributed systems, and practical engineering judgment. Good luck!