Production-Grade Error Handling: Retries, Backoff, and Alerting
Your integration will fail. The question isn't if, but when—and how gracefully you handle it.
In TurfDrive's first month, we saw every type of failure imaginable: API rate limits, network timeouts, validation errors, duplicate records, and the occasional "503 Service Unavailable" from Pipedrive during maintenance.
Here's how we built error handling that keeps syncing even when things go wrong.
The Failure Modes
Before building recovery logic, we cataloged how TurfDrive could fail:
1. Transient Network Errors
- Timeout connecting to Pipedrive API
- Connection reset mid-request
- DNS resolution failures
- SSL handshake errors
Characteristic: Temporary. Will succeed if retried.
2. API Rate Limits
- Pipedrive: 100 requests/10 seconds
- Ostendo SOAP API: 50 concurrent connections
Characteristic: Temporary. Need backoff and queue management.
3. Validation Errors
- Missing required fields
- Invalid data format (phone numbers, emails)
- Business rule violations (can't create deal without customer)
Characteristic: Permanent. Will fail again unless data is fixed.
4. Duplicate Records
- Customer already exists in Ostendo (different ID)
- Deal already synced (idempotency check failed)
Characteristic: Recoverable. Needs deduplication logic.
5. Conflict Errors
- Record modified by another process
- Version mismatch (stale data)
- Concurrent update detected
Characteristic: Recoverable. Needs retry with fresh data.
6. Service Downtime
- Pipedrive maintenance window
- Ostendo server restart
- Network infrastructure issue
Characteristic: Extended. Needs patient retry with exponential backoff.
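To make the catalog concrete, here's a rough sketch of how these failure modes could map to exception classes and handling strategies. The class names are illustrative stand-ins (the Pipedrive and Ostendo ones mirror errors mentioned later in this post), not TurfDrive's actual error hierarchy:
```ruby
# Illustrative only: the Pipedrive/Ostendo classes are assumptions standing in
# for whatever your API clients actually raise; the standard-library ones are real.
ERROR_CATEGORIES = {
  transient:   [Net::OpenTimeout, Errno::ECONNRESET, SocketError],             # retry with backoff
  rate_limit:  [Pipedrive::RateLimitError],                                    # respect Retry-After
  permanent:   [Ostendo::ValidationError],                                     # skip and alert a human
  recoverable: [ActiveRecord::StaleObjectError],                               # refetch fresh data, then retry
  critical:    [Ostendo::AuthenticationError, Pipedrive::InvalidApiKeyError]   # page immediately
}.freeze

def categorize(error)
  category, _classes = ERROR_CATEGORIES.find do |_name, classes|
    classes.any? { |klass| error.is_a?(klass) }
  end
  category || :unknown
end
```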
The Solution: Error Categorization + State Machine
We built a state machine with four sync states:
```ruby
class Job < ApplicationRecord
  SYNC_STATUSES = %w[pending synced error skipped].freeze

  # pending = queued for sync
  # synced  = successfully synced
  # error   = failed, will retry
  # skipped = permanent failure, manual intervention needed
end
```
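Since the retry queue, digest, and dashboard described later all query on these states, it's worth guarding them at the model level. A small sketch of what could sit on top (the scope names are illustrative):
```ruby
class Job < ApplicationRecord
  # Reject anything outside the four known states
  validates :sync_status, inclusion: { in: SYNC_STATUSES }

  scope :pending_sync, -> { where(sync_status: 'pending') }
  scope :retryable,    -> { where(sync_status: 'error').where(retry_at: ..Time.current) }
  scope :needs_review, -> { where(sync_status: 'skipped') }
end
```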
Each sync attempt categorizes the error and transitions accordingly:
```ruby
def sync_job(job)
  push_to_pipedrive(job)
  job.update!(sync_status: 'synced', last_synced_at: Time.current)
rescue RateLimitError
  retry_with_backoff(job, delay: calculate_backoff(job))
rescue ValidationError => e
  job.update!(sync_status: 'skipped', error_message: e.message)
  alert_validation_error(job, e)
rescue NetworkError => e
  job.update!(sync_status: 'error', error_message: e.message)
  retry_later(job) if job.sync_attempts.to_i < MAX_ATTEMPTS
rescue => e
  Sentry.capture_exception(e)
  job.update!(sync_status: 'error', error_message: e.message)
end
```
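retry_later and alert_validation_error are small helpers not shown in full here. As a hedged guess at the first one, it's essentially a thin wrapper over the backoff helpers defined in the next section:
```ruby
# Assumed implementation; the original helper isn't shown in this post.
def retry_later(job)
  retry_with_backoff(job, delay: calculate_backoff(job))
end
```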
Exponential Backoff: The Retry Strategy
For transient errors, we retry with exponential backoff:
Attempt 1: Immediate
Attempt 2: 1 minute later
Attempt 3: 5 minutes later
Attempt 4: 15 minutes later
Attempt 5: 1 hour later
Attempt 6+: 6 hours later (then give up)
```ruby
def calculate_backoff(job)
  # sync_attempts counts attempts already made (nil before the first failure),
  # so add one to get the attempt that just failed.
  failed_attempts = (job.sync_attempts || 0) + 1
  base_delay = 60 # seconds

  case failed_attempts
  when 1 then base_delay * 1   # attempt 2 runs 1 minute later
  when 2 then base_delay * 5   # 5 minutes
  when 3 then base_delay * 15  # 15 minutes
  when 4 then base_delay * 60  # 1 hour
  else base_delay * 360        # 6 hours
  end
end

def retry_with_backoff(job, delay:)
  job.increment!(:sync_attempts)
  SyncJobWorker.perform_in(delay.seconds, job.id)
end
```
Why exponential?
- Fast retry for quick blips (network hiccup)
- Slower retry for extended outages (API maintenance)
- Avoids hammering a degraded service
- Eventually gives up if permanently broken
Handling Rate Limits
APIs have limits. Hitting them breaks your sync. Here's how we stay under the cap:
1. Detect Rate Limit Errors
Pipedrive returns 429 Too Many Requests with a Retry-After header:
```ruby
rescue Pipedrive::RateLimitError => e
  retry_after = e.response.headers['Retry-After'].to_i
  retry_after = 60 if retry_after.zero? # Default to 1 minute

  job.update!(
    sync_status: 'error',
    retry_at: Time.current + retry_after.seconds
  )
  SyncJobWorker.perform_at(job.retry_at, job.id)
end
```
2. Throttle Proactively
Don't wait to hit the limit. Track request count and throttle:
```ruby
class PipedriveClient
  RATE_LIMIT  = 100 # requests
  RATE_WINDOW = 10  # seconds

  def initialize
    @request_timestamps = []
  end

  def request(method, path, body = nil)
    enforce_rate_limit
    response = HTTP.send(method, path, json: body)
    @request_timestamps << Time.current
    response
  end

  private

  def enforce_rate_limit
    # Drop timestamps that have fallen outside the rate window
    cutoff = Time.current - RATE_WINDOW.seconds
    @request_timestamps.reject! { |ts| ts < cutoff }

    # If we're at the limit, sleep until the oldest request leaves the window
    if @request_timestamps.size >= RATE_LIMIT
      sleep_duration = RATE_WINDOW - (Time.current - @request_timestamps.first)
      sleep(sleep_duration) if sleep_duration > 0
    end
  end
end
```
3. Batch + Delay
For bulk operations, batch and delay:
```ruby
def sync_all_jobs(jobs)
  jobs.in_batches(of: 50) do |batch|
    batch.each { |job| sync_job(job) }
    sleep(10) # 10-second pause between batches
  end
end
```
Processing 500 jobs in batches of 50 with 10-second pauses adds roughly 100 seconds of deliberate delay on top of the sync calls themselves. Slow, but respectful of API limits.
Validation Errors: Skip, Don't Retry
Some errors are permanent. Retrying won't fix them.
```ruby
rescue Ostendo::ValidationError => e
  job.update!(
    sync_status: 'skipped',
    error_message: e.message,
    needs_manual_review: true
  )

  SlackNotifier.alert(
    channel: '#turfdrive-errors',
    message: "Job #{job.id} skipped: #{e.message}",
    link: admin_job_url(job)
  )
end
```
Key insight: Don't let permanent failures clog the retry queue. Mark them as skipped, alert a human, and move on.
Logging Every Error (Structured)
We log every error with structured JSON:
```ruby
def log_sync_error(job, error)
  SyncLog.create!(
    syncable: job,
    action: 'sync_to_pipedrive',
    status: 'error',
    error_class: error.class.name,
    error_message: error.message,
    error_backtrace: error.backtrace&.first(10), # backtrace can be nil
    retry_count: job.sync_attempts,
    metadata: {
      job_id: job.id,
      ostendo_id: job.ostendo_id,
      pipedrive_deal_id: job.pipedrive_deal_id,
      sync_status: job.sync_status
    }
  )
end
```
This gives us:
- Searchable logs (query by error class, job ID, date range)
- Error trends (which errors are most common?)
- Debugging context (what was the state when it failed?)
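Because the error class, retry count, and context live in real columns plus a metadata payload, the queries stay simple. Two illustrative examples (the job id is just an example):
```ruby
# Most common error classes over the last 7 days
SyncLog.where(status: 'error', created_at: 7.days.ago..)
       .group(:error_class)
       .count
       .sort_by { |_error_class, count| -count }

# Full sync history for one job, newest first
SyncLog.where(syncable_type: 'Job', syncable_id: 4523)
       .order(created_at: :desc)
```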
Daily Error Digest
Instead of alerting on every error (noisy), we send a daily digest:
Email sent at 8 AM:
```
TurfDrive Error Digest - Feb 17, 2026
Summary:
- 12 jobs synced successfully
- 3 transient errors (retrying)
- 1 validation error (needs review)
- 0 critical failures
Details:
[Transient Errors - Will Retry]
- Job #4532: Network timeout (attempt 2/6)
- Job #4541: Rate limit hit (retry in 5 min)
- Job #4555: Connection reset (attempt 1/6)
[Validation Errors - Action Required]
- Job #4523: Missing customer email
→ Fix in Ostendo or skip permanently
→ Dashboard: https://turfdrive.app/admin/jobs/4523
[Performance]
- Avg sync time: 2.3 seconds
- API success rate: 97.5%
- Jobs pending: 5
```
Code:
```ruby
class DailyErrorDigestJob
  def perform
    date = Date.current

    transient_errors = SyncLog
      .where(created_at: date.all_day, status: 'error')
      .where(syncable_type: 'Job')
      .where("retry_count < ?", MAX_RETRIES)

    validation_errors = SyncLog
      .where(created_at: date.all_day, status: 'skipped')

    send_digest_email(
      transient: transient_errors,
      validation: validation_errors,
      stats: calculate_stats(date)
    )
  end
end
```
This balances awareness (team knows what's failing) with signal-to-noise (not 50 Slack pings per day).
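send_digest_email and calculate_stats are left out above. A minimal sketch of the stats half, assuming successful syncs are also written to SyncLog with status: 'success' (only the error path was shown earlier):
```ruby
# Sketch of calculate_stats. Assumes successes are logged to SyncLog with
# status: 'success'; the earlier snippet only showed error logging.
def calculate_stats(date)
  logs      = SyncLog.where(created_at: date.all_day, syncable_type: 'Job')
  attempts  = logs.count
  successes = logs.where(status: 'success').count

  {
    synced:       successes,
    errors:       logs.where(status: 'error').count,
    skipped:      logs.where(status: 'skipped').count,
    success_rate: attempts.zero? ? 1.0 : (successes.to_f / attempts).round(3),
    jobs_pending: Job.where(sync_status: 'pending').count
  }
end
```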
Alerting on Critical Failures
Some errors need immediate attention:
```ruby
CRITICAL_ERRORS = [
  'Ostendo::AuthenticationError',
  'Pipedrive::InvalidApiKeyError',
  'Database::ConnectionLost'
].freeze

rescue => e
  if CRITICAL_ERRORS.include?(e.class.name)
    PagerDuty.trigger(
      service: 'turfdrive',
      message: "Critical: #{e.message}",
      severity: 'critical'
    )
  end
  raise
end
```
Critical = system-wide failure. Authentication errors, database crashes, or API key invalidation. These need immediate action.
Health Checks
We expose a health endpoint (GET /health) for monitoring:
```json
{
  "status": "healthy",
  "last_successful_sync": "2026-02-17T06:00:00Z",
  "jobs_pending": 5,
  "jobs_with_errors": 2,
  "api_status": {
    "pipedrive": "ok",
    "ostendo": "ok"
  }
}
```
If last_successful_sync is >1 hour ago, monitoring alerts us.
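A sketch of how such an endpoint might look in Rails. The route and controller are assumptions, and the api_status values are placeholders for a cheap, cached ping against each API:
```ruby
# config/routes.rb (assumed): get '/health', to: 'health#show'
class HealthController < ApplicationController
  def show
    last_sync = Job.where(sync_status: 'synced').maximum(:last_synced_at)

    render json: {
      status: (last_sync && last_sync > 1.hour.ago) ? 'healthy' : 'degraded',
      last_successful_sync: last_sync&.iso8601,
      jobs_pending: Job.where(sync_status: 'pending').count,
      jobs_with_errors: Job.where(sync_status: 'error').count,
      api_status: { pipedrive: 'ok', ostendo: 'ok' } # placeholder checks
    }
  end
end
```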
Edge Case: Cascading Failures
Scenario: Pipedrive goes down for 2 hours. 500 jobs pile up in the queue.
Problem: When Pipedrive comes back up, we hammer it with 500 requests instantly, hit rate limits, and fail again.
Solution: Rate-limit the retry queue:
```ruby
def process_retry_queue
  jobs_to_retry = Job.where(sync_status: 'error', retry_at: ..Time.current)

  jobs_to_retry.in_batches(of: 20) do |batch|
    batch.each { |job| SyncJobWorker.perform_async(job.id) }
    sleep(30) # 30-second pause between batches
  end
end
```
Drip-feed retries instead of flooding the API.
Observability: The Dashboard
We built a simple dashboard showing:
- Error Rate Over Time: line chart of successful vs. failed syncs by hour
- Top Errors (Last 7 Days): bar chart of the most common error classes
- Jobs Needing Review: table of validation errors requiring manual intervention
- Recent Activity: feed of the last 50 sync operations with status
Non-technical users can see system health at a glance.
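The panels are backed by plain queries over the same SyncLog and Job tables. Roughly, for two of them (assuming Postgres for the hourly grouping):
```ruby
# "Error Rate Over Time": hourly counts by status
SyncLog.where(created_at: 24.hours.ago..)
       .group(Arel.sql("date_trunc('hour', created_at)"), :status)
       .count

# "Jobs Needing Review": validation failures awaiting a human
Job.where(sync_status: 'skipped', needs_manual_review: true)
   .order(updated_at: :desc)
```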
Lessons Learned
1. Categorize Errors Early
Not all errors are equal. Separate transient (retry), permanent (skip), and critical (alert).
2. Exponential Backoff Is Your Friend
Fast retry for quick blips, slow retry for extended outages. Don't give up immediately, but don't retry forever.
3. Rate Limits Will Hit You
Throttle proactively. Batch operations. Respect Retry-After headers. Don't hammer degraded services.
4. Log Everything (Structured)
When debugging, you'll need context. Structured logs make errors searchable and trends visible.
5. Daily Digests > Real-Time Alerts
Most errors aren't urgent. Batch them into daily summaries. Only alert immediately on critical failures.
6. Visibility Builds Trust
Users need to see that errors are handled, retried, and resolved. A dashboard showing "3 errors, all retrying" is reassuring.
The Result
One year in:
- 97.8% success rate on first attempt
- 99.8% eventual success after retries
- ~10 errors/week requiring manual review
- Zero production outages due to error handling failures
Errors happen. The system doesn't break.
Takeaways
- Errors are inevitable. Design for them from day one.
- Categorize, don't treat all failures the same. Transient, permanent, and critical need different strategies.
- Exponential backoff prevents cascading failures. Fast retry for blips, slow retry for outages.
- Rate limits are real. Throttle proactively, respect headers, batch operations.
- Observability is critical. Log everything, surface errors in dashboards, send digests.
- Don't alert on everything. Separate signal (critical failures) from noise (transient errors).
Building error handling isn't glamorous. But it's the difference between a flaky integration and a reliable system users trust.
Next in this series: Data integrity checks and automated audits to keep your systems in sync.
Contact us if you need help building reliable integrations with production-grade error handling.