Production-Grade Error Handling: Retries, Backoff, and Alerting
Your integration will fail. The question isn't if, but when—and how gracefully you handle it.
In TurfDrive's first month, we saw every type of failure imaginable: API rate limits, network timeouts, validation errors, duplicate records, and the occasional "503 Service Unavailable" from Pipedrive during maintenance.
Here's how we built error handling that keeps syncing even when things go wrong.
The Failure Modes
Before building recovery logic, we cataloged how TurfDrive could fail:
1. Transient Network Errors
- Timeout connecting to Pipedrive API
- Connection reset mid-request
- DNS resolution failures
- SSL handshake errors
Characteristic: Temporary. Will succeed if retried.
2. API Rate Limits
- Pipedrive: 100 requests/10 seconds
- Ostendo SOAP API: 50 concurrent connections
Characteristic: Temporary. Need backoff and queue management.
3. Validation Errors
- Missing required fields
- Invalid data format (phone numbers, emails)
- Business rule violations (can't create deal without customer)
Characteristic: Permanent. Will fail again unless data is fixed.
4. Duplicate Records
- Customer already exists in Ostendo (different ID)
- Deal already synced (idempotency check failed)
Characteristic: Recoverable. Needs deduplication logic.
5. Conflict Errors
- Record modified by another process
- Version mismatch (stale data)
- Concurrent update detected
Characteristic: Recoverable. Needs retry with fresh data.
6. Service Downtime
- Pipedrive maintenance window
- Ostendo server restart
- Network infrastructure issue
Characteristic: Extended. Needs patient retry with exponential backoff.
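To make the catalog concrete, here's a rough sketch of how these failure modes could map to exception classes and handling strategies. The class names are illustrative stand-ins (the Pipedrive and Ostendo ones mirror errors mentioned later in this post), not TurfDrive's actual error hierarchy:
```ruby
# Illustrative only: the Pipedrive/Ostendo classes are assumptions standing in
# for whatever your API clients actually raise; the standard-library ones are real.
ERROR_CATEGORIES = {
  transient:   [Net::OpenTimeout, Errno::ECONNRESET, SocketError],             # retry with backoff
  rate_limit:  [Pipedrive::RateLimitError],                                    # respect Retry-After
  permanent:   [Ostendo::ValidationError],                                     # skip and alert a human
  recoverable: [ActiveRecord::StaleObjectError],                               # refetch fresh data, then retry
  critical:    [Ostendo::AuthenticationError, Pipedrive::InvalidApiKeyError]   # page immediately
}.freeze

def categorize(error)
  category, _classes = ERROR_CATEGORIES.find do |_name, classes|
    classes.any? { |klass| error.is_a?(klass) }
  end
  category || :unknown
end
```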
The Solution: Error Categorization + State Machine
We built a state machine with four sync states:
```ruby
class Job < ApplicationRecord
  SYNC_STATUSES = %w[pending synced error skipped].freeze

  # pending = queued for sync
  # synced  = successfully synced
  # error   = failed, will retry
  # skipped = permanent failure, manual intervention needed
end
```
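Since the retry queue, digest, and dashboard described later all query on these states, it's worth guarding them at the model level. A small sketch of what could sit on top (the scope names are illustrative):
```ruby
class Job < ApplicationRecord
  # Reject anything outside the four known states
  validates :sync_status, inclusion: { in: SYNC_STATUSES }

  scope :pending_sync, -> { where(sync_status: 'pending') }
  scope :retryable,    -> { where(sync_status: 'error').where(retry_at: ..Time.current) }
  scope :needs_review, -> { where(sync_status: 'skipped') }
end
```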
Each sync attempt categorizes the error and transitions accordingly:
```ruby
def sync_job(job)
  push_to_pipedrive(job)
  job.update!(sync_status: 'synced', last_synced_at: Time.current)
rescue RateLimitError
  retry_with_backoff(job, delay: calculate_backoff(job))
rescue ValidationError => e
  job.update!(sync_status: 'skipped', error_message: e.message)
  alert_validation_error(job, e)
rescue NetworkError => e
  job.update!(sync_status: 'error', error_message: e.message)
  retry_later(job) if job.sync_attempts.to_i < MAX_ATTEMPTS
rescue => e
  Sentry.capture_exception(e)
  job.update!(sync_status: 'error', error_message: e.message)
end
```
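retry_later and alert_validation_error are small helpers not shown in full here. As a hedged guess at the first one, it's essentially a thin wrapper over the backoff helpers defined in the next section:
```ruby
# Assumed implementation; the original helper isn't shown in this post.
def retry_later(job)
  retry_with_backoff(job, delay: calculate_backoff(job))
end
```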
Exponential Backoff: The Retry Strategy
For transient errors, we retry with exponential backoff:
Attempt 1: Immediate
Attempt 2: 1 minute later
Attempt 3: 5 minutes later
Attempt 4: 15 minutes later
Attempt 5: 1 hour later
Attempt 6+: 6 hours later (then give up)
```ruby
def calculate_backoff(job)
  # sync_attempts counts attempts already made (nil before the first failure),
  # so add one to get the attempt that just failed.
  failed_attempts = (job.sync_attempts || 0) + 1
  base_delay = 60 # seconds

  case failed_attempts
  when 1 then base_delay * 1   # attempt 2 runs 1 minute later
  when 2 then base_delay * 5   # 5 minutes
  when 3 then base_delay * 15  # 15 minutes
  when 4 then base_delay * 60  # 1 hour
  else base_delay * 360        # 6 hours
  end
end

def retry_with_backoff(job, delay:)
  job.increment!(:sync_attempts)
  SyncJobWorker.perform_in(delay.seconds, job.id)
end
```
Why exponential?
- Fast retry for quick blips (network hiccup)
- Slower retry for extended outages (API maintenance)
- Avoids hammering a degraded service
- Eventually gives up if permanently broken
Handling Rate Limits
APIs have limits. Hitting them breaks your sync. Here's how we stay under the cap:
1. Detect Rate Limit Errors
Pipedrive returns 429 Too Many Requests with a Retry-After header:
```ruby
rescue Pipedrive::RateLimitError => e
  retry_after = e.response.headers['Retry-After'].to_i
  retry_after = 60 if retry_after.zero? # Default to 1 minute

  job.update!(
    sync_status: 'error',
    retry_at: Time.current + retry_after.seconds
  )
  SyncJobWorker.perform_at(job.retry_at, job.id)
end
```
2. Throttle Proactively
Don't wait to hit the limit. Track request count and throttle:
```ruby
class PipedriveClient
  RATE_LIMIT  = 100 # requests
  RATE_WINDOW = 10  # seconds

  def initialize
    @request_timestamps = []
  end

  def request(method, path, body = nil)
    enforce_rate_limit
    response = HTTP.send(method, path, json: body)
    @request_timestamps << Time.current
    response
  end

  private

  def enforce_rate_limit
    # Drop timestamps that have fallen outside the rate window
    cutoff = Time.current - RATE_WINDOW.seconds
    @request_timestamps.reject! { |ts| ts < cutoff }

    # If we're at the limit, sleep until the oldest request leaves the window
    if @request_timestamps.size >= RATE_LIMIT
      sleep_duration = RATE_WINDOW - (Time.current - @request_timestamps.first)
      sleep(sleep_duration) if sleep_duration > 0
    end
  end
end
```
3. Batch + Delay
For bulk operations, batch and delay:
```ruby
def sync_all_jobs(jobs)
  jobs.in_batches(of: 50) do |batch|
    batch.each { |job| sync_job(job) }
    sleep(10) # 10-second pause between batches
  end
end
```
Processing 500 jobs in batches of 50 with 10-second pauses adds roughly 100 seconds of deliberate delay on top of the sync calls themselves. Slow, but respectful of API limits.
Validation Errors: Skip, Don't Retry
Some errors are permanent. Retrying won't fix them.
```ruby
rescue Ostendo::ValidationError => e
  job.update!(
    sync_status: 'skipped',
    error_message: e.message,
    needs_manual_review: true
  )

  SlackNotifier.alert(
    channel: '#turfdrive-errors',
    message: "Job #{job.id} skipped: #{e.message}",
    link: admin_job_url(job)
  )
end
```
Key insight: Don't let permanent failures clog the retry queue. Mark them as skipped, alert a human, and move on.
Logging Every Error (Structured)
We log every error with structured JSON:
```ruby
def log_sync_error(job, error)
  SyncLog.create!(
    syncable: job,
    action: 'sync_to_pipedrive',
    status: 'error',
    error_class: error.class.name,
    error_message: error.message,
    error_backtrace: error.backtrace&.first(10), # backtrace can be nil
    retry_count: job.sync_attempts,
    metadata: {
      job_id: job.id,
      ostendo_id: job.ostendo_id,
      pipedrive_deal_id: job.pipedrive_deal_id,
      sync_status: job.sync_status
    }
  )
end
```
This gives us:
- Searchable logs (query by error class, job ID, date range)
- Error trends (which errors are most common?)
- Debugging context (what was the state when it failed?)
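Because the error class, retry count, and context live in real columns plus a metadata payload, the queries stay simple. Two illustrative examples (the job id is just an example):
```ruby
# Most common error classes over the last 7 days
SyncLog.where(status: 'error', created_at: 7.days.ago..)
       .group(:error_class)
       .count
       .sort_by { |_error_class, count| -count }

# Full sync history for one job, newest first
SyncLog.where(syncable_type: 'Job', syncable_id: 4523)
       .order(created_at: :desc)
```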
Daily Error Digest
Instead of alerting on every error (noisy), we send a daily digest:
Email sent at 8 AM:
```
TurfDrive Error Digest - Feb 17, 2026
Summary:
- 12 jobs synced successfully
- 3 transient errors (retrying)
- 1 validation error (needs review)
- 0 critical failures
Details:
[Transient Errors - Will Retry]
- Job #4532: Network timeout (attempt 2/6)
- Job #4541: Rate limit hit (retry in 5 min)
- Job #4555: Connection reset (attempt 1/6)
[Validation Errors - Action Required]
- Job #4523: Missing customer email
→ Fix in Ostendo or skip permanently
→ Dashboard: https://turfdrive.app/admin/jobs/4523
[Performance]
- Avg sync time: 2.3 seconds
- API success rate: 97.5%
- Jobs pending: 5
```
Code:
```ruby
class DailyErrorDigestJob
  def perform
    date = Date.current

    transient_errors = SyncLog
      .where(created_at: date.all_day, status: 'error')
      .where(syncable_type: 'Job')
      .where("retry_count < ?", MAX_RETRIES)

    validation_errors = SyncLog
      .where(created_at: date.all_day, status: 'skipped')

    send_digest_email(
      transient: transient_errors,
      validation: validation_errors,
      stats: calculate_stats(date)
    )
  end
end
```
This balances awareness (team knows what's failing) with signal-to-noise (not 50 Slack pings per day).
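send_digest_email and calculate_stats are left out above. A minimal sketch of the stats half, assuming successful syncs are also written to SyncLog with status: 'success' (only the error path was shown earlier):
```ruby
# Sketch of calculate_stats. Assumes successes are logged to SyncLog with
# status: 'success'; the earlier snippet only showed error logging.
def calculate_stats(date)
  logs      = SyncLog.where(created_at: date.all_day, syncable_type: 'Job')
  attempts  = logs.count
  successes = logs.where(status: 'success').count

  {
    synced:       successes,
    errors:       logs.where(status: 'error').count,
    skipped:      logs.where(status: 'skipped').count,
    success_rate: attempts.zero? ? 1.0 : (successes.to_f / attempts).round(3),
    jobs_pending: Job.where(sync_status: 'pending').count
  }
end
```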
Alerting on Critical Failures
Some errors need immediate attention:
```ruby
CRITICAL_ERRORS = [
  'Ostendo::AuthenticationError',
  'Pipedrive::InvalidApiKeyError',
  'Database::ConnectionLost'
].freeze

rescue => e
  if CRITICAL_ERRORS.include?(e.class.name)
    PagerDuty.trigger(
      service: 'turfdrive',
      message: "Critical: #{e.message}",
      severity: 'critical'
    )
  end
  raise
end
```
Critical = system-wide failure. Authentication errors, database crashes, or API key invalidation. These need immediate action.
Health Checks
We expose a health endpoint (GET /health) for monitoring:
```json
{
  "status": "healthy",
  "last_successful_sync": "2026-02-17T06:00:00Z",
  "jobs_pending": 5,
  "jobs_with_errors": 2,
  "api_status": {
    "pipedrive": "ok",
    "ostendo": "ok"
  }
}
```
If last_successful_sync is >1 hour ago, monitoring alerts us.
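A sketch of how such an endpoint might look in Rails. The route and controller are assumptions, and the api_status values are placeholders for a cheap, cached ping against each API:
```ruby
# config/routes.rb (assumed): get '/health', to: 'health#show'
class HealthController < ApplicationController
  def show
    last_sync = Job.where(sync_status: 'synced').maximum(:last_synced_at)

    render json: {
      status: (last_sync && last_sync > 1.hour.ago) ? 'healthy' : 'degraded',
      last_successful_sync: last_sync&.iso8601,
      jobs_pending: Job.where(sync_status: 'pending').count,
      jobs_with_errors: Job.where(sync_status: 'error').count,
      api_status: { pipedrive: 'ok', ostendo: 'ok' } # placeholder checks
    }
  end
end
```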
Edge Case: Cascading Failures
Scenario: Pipedrive goes down for 2 hours. 500 jobs pile up in the queue.
Problem: When Pipedrive comes back up, we hammer it with 500 requests instantly, hit rate limits, and fail again.
Solution: Rate-limit the retry queue:
```ruby
def process_retry_queue
  jobs_to_retry = Job.where(sync_status: 'error', retry_at: ..Time.current)

  jobs_to_retry.in_batches(of: 20) do |batch|
    batch.each { |job| SyncJobWorker.perform_async(job.id) }
    sleep(30) # 30-second pause between batches
  end
end
```
Drip-feed retries instead of flooding the API.
Observability: The Dashboard
We built a simple dashboard showing:
- Error Rate Over Time: line chart of successful vs. failed syncs by hour
- Top Errors (Last 7 Days): bar chart of the most common error classes
- Jobs Needing Review: table of validation errors requiring manual intervention
- Recent Activity: feed of the last 50 sync operations with status
Non-technical users can see system health at a glance.
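The panels are backed by plain queries over the same SyncLog and Job tables. Roughly, for two of them (assuming Postgres for the hourly grouping):
```ruby
# "Error Rate Over Time": hourly counts by status
SyncLog.where(created_at: 24.hours.ago..)
       .group(Arel.sql("date_trunc('hour', created_at)"), :status)
       .count

# "Jobs Needing Review": validation failures awaiting a human
Job.where(sync_status: 'skipped', needs_manual_review: true)
   .order(updated_at: :desc)
```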
Lessons Learned
1. Categorize Errors Early
Not all errors are equal. Separate transient (retry), permanent (skip), and critical (alert).
2. Exponential Backoff Is Your Friend
Fast retry for quick blips, slow retry for extended outages. Don't give up immediately, but don't retry forever.
3. Rate Limits Will Hit You
Throttle proactively. Batch operations. Respect Retry-After headers. Don't hammer degraded services.
4. Log Everything (Structured)
When debugging, you'll need context. Structured logs make errors searchable and trends visible.
5. Daily Digests > Real-Time Alerts
Most errors aren't urgent. Batch them into daily summaries. Only alert immediately on critical failures.
6. Visibility Builds Trust
Users need to see that errors are handled, retried, and resolved. A dashboard showing "3 errors, all retrying" is reassuring.
The Result
One year in:
- 97.8% success rate on first attempt
- 99.8% eventual success after retries
- ~10 errors/week requiring manual review
- Zero production outages due to error handling failures
Errors happen. The system doesn't break.
Takeaways
- Errors are inevitable. Design for them from day one.
- Categorize, don't treat all failures the same. Transient, permanent, and critical need different strategies.
- Exponential backoff prevents cascading failures. Fast retry for blips, slow retry for outages.
- Rate limits are real. Throttle proactively, respect headers, batch operations.
- Observability is critical. Log everything, surface errors in dashboards, send digests.
- Don't alert on everything. Separate signal (critical failures) from noise (transient errors).
Building error handling isn't glamorous. But it's the difference between a flaky integration and a reliable system users trust.
Next in this series: Data integrity checks and automated audits to keep your systems in sync.
Contact us if you need help building reliable integrations with production-grade error handling.