Distributed Counter - Interview Tips

Interview Approach (45-60 minutes)

Time Management

1. Requirements (5-8 min)
   - Clarify use case (page views, likes, analytics)
   - Understand scale (ops/second, total counters)
   - Define accuracy requirements

2. High-Level Design (10-15 min)
   - Draw system architecture
   - Explain sharding strategy
   - Discuss storage choices

3. Deep Dives (20-25 min)
   - Counter implementation (simple vs sharded)
   - Synchronization strategy
   - Time-windowed counters
   - Failure handling

4. Trade-offs (5-10 min)
   - Accuracy vs performance
   - Consistency models
   - Storage options

5. Follow-ups (5-10 min)
   - Scaling strategies
   - Unique counting
   - Real-time analytics

Essential Questions to Ask

Functional Requirements

Critical Questions:
1. "What are we counting?"
   - Page views, likes, API requests, unique visitors?
   
2. "What accuracy do we need?"
   - 100% exact, 99% eventual, 95% approximate?
   
3. "What operations are needed?"
   - Increment only, or also decrement, reset?
   
4. "Do we need time windows?"
   - Last hour, last day, all-time?
   
5. "What about unique counting?"
   - Total counts or unique items?

Good Follow-ups:
- "Should counters expire automatically?"
- "Do we need historical data?"
- "What about aggregations across counters?"

Scale Requirements

Key Metrics:
1. "How many increments per second?"
   - 1K → Single server
   - 100K → Small cluster
   - 1M+ → Sharded counters
   
2. "How many total counters?"
   - 1K → In-memory only
   - 1M → Redis + database
   - 100M+ → Distributed storage
   
3. "What's the acceptable latency?"
   - <1ms → In-memory, eventual consistency
   - <10ms → Hybrid approach
   - <100ms → Database-backed
   
4. "How long to retain data?"
   - 1 day → Memory only
   - 30 days → Database
   - 1 year → Archival storage

Common Pitfalls to Avoid

1. Not Discussing Sharding

❌ Bad: "We'll use a single counter in Redis"

✅ Good: "For high-traffic counters (>10K ops/s), we'll use 
sharded counters. Split the counter into N shards, increment 
a random shard, and aggregate on read. This eliminates 
contention and scales linearly."

Key Points:
- Explain when to shard (traffic threshold)
- Discuss shard count selection
- Address aggregation overhead

2. Ignoring Consistency Trade-offs

❌ Bad: "The counter will always be accurate"

✅ Good: "We have three options:

1. Strong consistency (100% accurate):
   - Synchronous updates, distributed locks
   - 5-10ms latency, 100K ops/s
   - Use for: Financial, voting

2. Eventual consistency (99%+ accurate):
   - Async replication, CRDT merging
   - <1ms latency, 1M+ ops/s
   - Use for: Page views, analytics

3. Approximate (95%+ accurate):
   - HyperLogLog, probabilistic
   - <0.5ms latency, 10M+ ops/s
   - Use for: Unique counts, dashboards

I'd recommend eventual consistency for most use cases."

Key Points:
- Present multiple options
- Explain trade-offs clearly
- Provide specific recommendations

3. Not Addressing Hot Counters

❌ Bad: "All counters are treated the same"

✅ Good: "We need to handle hot counters differently:

Detection:
- Monitor ops/second per counter
- Alert if >10K ops/s

Mitigation:
1. Increase shard count (4 → 16 shards)
2. Dedicated server for hot counter
3. Use approximate counting (HyperLogLog)
4. Rate limiting if abuse detected

This prevents a single hot counter from becoming a 
bottleneck for the entire system."

Key Points:
- Show awareness of hot spots
- Provide detection mechanism
- Offer multiple solutions

4. Overlooking Time-Windowed Counters

❌ Bad: "We'll just store the total count"

✅ Good: "For time-windowed counters (e.g., requests per hour), 
we'll use a sliding window approach:

1. Store counts in 1-minute buckets
2. Use Redis sorted set with timestamp as score
3. On read, sum buckets in the window
4. Auto-expire old buckets

This provides accurate sliding windows without boundary 
issues that fixed windows have."

Key Points:
- Explain sliding vs fixed windows
- Discuss bucket granularity
- Address automatic cleanup

Strong Talking Points

1. Sharded Counter Implementation

Strong Answer:
"For high-throughput counters, I'll implement sharded counters:

Structure:
counter:page_views:article_123
├── shard:0 → 123,456
├── shard:1 → 234,567
├── shard:2 → 345,678
└── shard:3 → 456,789
Total: 1,160,490

Increment:
1. Hash counter_id to determine server
2. Select random shard (0-3)
3. Increment shard atomically
4. Return immediately (no aggregation)

Read:
1. Query all shards in parallel
2. Sum values
3. Cache result for 1 second
4. Return total

Benefits:
- No contention (random shard selection)
- Linear scaling (add more shards)
- Fast writes (<1ms)
- Acceptable read latency (<5ms)

Shard Count Selection:
- Start with 4 shards
- Increase if ops/s > 10K per shard
- Max 64 shards (diminishing returns)

Why This Works:
- Shows deep understanding
- Provides implementation details
- Discusses scaling strategy
- Addresses trade-offs

2. CRDT for Distributed Counters

Strong Answer:
"For multi-region deployment, I'll use CRDTs 
(Conflict-free Replicated Data Types):

G-Counter (Grow-only):
{
  "us-east": 1000,
  "us-west": 2000,
  "eu-west": 3000
}
Total: 6000

Operations:
- Increment (us-east): us-east += delta
- Read: sum(all regions)
- Merge: for each region, take max(local, remote)

PN-Counter (Positive-Negative):
{
  "positive": {"us-east": 1000, "us-west": 2000},
  "negative": {"us-east": 100, "us-west": 200}
}
Total: (1000 + 2000) - (100 + 200) = 2700

Benefits:
- No coordination needed
- Partition-tolerant
- Eventually consistent
- Guaranteed convergence

Trade-offs:
- Memory grows with regions
- Cannot enforce limits (e.g., non-negative)
- Eventual consistency only

Use When:
- Multi-region deployment
- High availability required
- Eventual consistency acceptable

Why This Works:
- Advanced technique
- Shows distributed systems knowledge
- Explains limitations
- Practical application

3. HyperLogLog for Unique Counts

Strong Answer:
"For counting unique items (e.g., unique visitors), 
I'll use HyperLogLog:

Problem:
- Billions of unique users
- Cannot store all user IDs
- Need approximate count

HyperLogLog Solution:
- Fixed memory: 12KB per counter
- Accuracy: 0.81% standard error
- Capacity: Billions of unique items

How it works:
1. Hash user ID
2. Count leading zeros in hash
3. Update register with max value
4. Estimate cardinality from registers

Operations:
ADD(user_id):
  hash = hash(user_id)
  register = hash & 0xFFFF
  value = leading_zeros(hash >> 16)
  registers[register] = max(registers[register], value)

COUNT():
  return 2^(average(registers)) × correction_factor

Benefits:
- Constant memory (12KB)
- Very fast operations
- Mergeable (combine multiple HLLs)
- Good enough accuracy (0.81% error)

Trade-offs:
- Approximate (not exact)
- Cannot decrement
- Cannot list items
- Cannot check membership

Use Cases:
- Unique page visitors
- Unique video viewers
- Daily/Monthly active users
- Cardinality estimation

Why This Works:
- Solves specific problem
- Explains algorithm clearly
- Discusses trade-offs
- Provides use cases

How to Handle Follow-ups

"How would you count unique visitors?"

Strong Answer:
"Great question. For unique counting, I have two approaches:

Approach 1: Exact Count (Small Scale)
- Store user IDs in Redis Set
- SADD unique_visitors:article_123 user_id
- SCARD to get count
- Memory: 8 bytes per user
- Limit: ~10M unique users per counter

Approach 2: Approximate Count (Large Scale)
- Use HyperLogLog
- PFADD unique_visitors:article_123 user_id
- PFCOUNT to get estimate
- Memory: 12KB fixed
- Capacity: Billions of users
- Error: 0.81%

Recommendation:
- Use exact for <10M unique items
- Use HyperLogLog for >10M unique items
- Trade 0.81% accuracy for 1000x memory savings

For most use cases (analytics, dashboards), the 0.81% 
error is acceptable and the memory savings are huge."

Why This Works:
- Two solutions for different scales
- Clear recommendation
- Quantifies trade-offs
- Practical guidance

"How would you handle counter overflow?"

Strong Answer:
"Counter overflow is rarely an issue in practice:

64-bit Integer Max: 9,223,372,036,854,775,807
At 1M increments/second: 292,471 years to overflow

However, if we need arbitrary precision:

Solution 1: BigInteger
- Use arbitrary precision arithmetic
- Store as string in Redis
- Slower but no overflow

Solution 2: Hierarchical Counters
- counter_low: 0 to 2^32-1
- counter_high: overflow count
- total = counter_high × 2^32 + counter_low

Solution 3: Scientific Notation
- Store as mantissa × 10^exponent
- Example: 1.23 × 10^15

Recommendation:
- 64-bit is sufficient for 99.99% of use cases
- Only use arbitrary precision if truly needed
- Monitor for approaching overflow (>2^60)

The key insight is that overflow is a theoretical concern 
but rarely a practical problem."

Why This Works:
- Addresses concern directly
- Provides context (time to overflow)
- Offers solutions if needed
- Practical recommendation

Red Flags to Avoid

❌ Don't Say:

1. "We'll use a database counter"
   → Too slow for high throughput

2. "All counters will be 100% accurate"
   → Ignores distributed systems challenges

3. "We don't need to worry about hot counters"
   → Shows lack of practical experience

4. "We'll lock the counter for updates"
   → Terrible for performance

5. "Overflow isn't a problem"
   → Should at least acknowledge it

✅ Do Say:

1. "Let me clarify the accuracy requirements"
   → Shows structured thinking

2. "Here are three approaches with trade-offs"
   → Demonstrates options analysis

3. "For high-traffic counters, we'll use sharding"
   → Shows scalability awareness

4. "We'll use CRDTs for eventual consistency"
   → Advanced distributed systems knowledge

5. "HyperLogLog for unique counts"
   → Knows appropriate algorithms

Closing Strong

Good Summary

"To summarize, I've designed a distributed counter system that:

1. Uses sharded counters for high throughput (1M+ ops/s)
2. Implements eventual consistency with CRDTs
3. Supports time-windowed counting with sliding windows
4. Uses HyperLogLog for unique counting
5. Handles hot counters with dynamic sharding

Key design decisions:
- Eventual consistency (99%+ accuracy)
- Redis for hot data, database for persistence
- Sharding threshold at 10K ops/s per counter
- 1-minute buckets for time windows

Trade-offs made:
- Accuracy for performance (99% vs 100%)
- Read latency for write throughput
- Memory for speed (caching)

I'm happy to dive deeper into any specific area."

Why This Works:
- Concise summary
- Highlights key decisions
- Acknowledges trade-offs
- Invites discussion

Remember: Focus on demonstrating systematic thinking, understanding of trade-offs, and practical engineering judgment. Good luck!