# Mastering Error Handling in Sidekiq: Techniques and Strategies for Rescuing Jobs

### TLDR

If you're looking for a quick way to handle annoying or expected errors in Sidekiq, just try the [sidekiq-rescue gem](https://github.com/moofkit/sidekiq-rescue):

```ruby
class MyJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue Faraday::ConnectionFailed, delay: 60, limit: 5

  def perform(*)
    # ...
  end
end
```

If you'd like to learn some tips and tricks on this topic, welcome to the article!

## Handling errors in Sidekiq

[Sidekiq](https://github.com/sidekiq/sidekiq/) is one of the most popular background processors for Ruby. It's very easy to integrate with Rails or use it as a standalone without any frameworks.

One of the major aspects of software development is error observability and error handling. [It's highly recommended](https://www.mikeperham.com/2013/08/25/please-use-an-error-service/) to use one of the error services like [Rollbar](https://rollbar.com), [Sentry](https://sentry.io/), [Honeybadger](https://www.honeybadger.io/), [Bugsnag](https://www.bugsnag.com/), etc.

Most of these services have integration with Sidekiq and it's very easy to set up and have all your errors in one place. Thus, you and your team will always be able to see bugs that have occurred.

But while you are fixing the next bug, Sidekiq does one important thing for you: it keeps the failed job in a special queue and **retries it**.

The standard way to retry Sidekiq jobs that fail with errors is [the automatic job retry mechanism](https://github.com/sidekiq/sidekiq/wiki/Error-Handling#automatic-job-retry):

```ruby
class RetryableJob
  include Sidekiq::Job
  sidekiq_options retry: 25

  def perform(...)
  end
end
```

It's the [default setting](https://github.com/sidekiq/sidekiq/blob/80f5f73f8e74a5775866a016fe42446dfc1b861e/lib/sidekiq/job_retry.rb#L68-L72), so specify it only if you want to customize the number of retries:

```ruby
class RetryableJob
  include Sidekiq::Job

  def perform(...)
  end
end
```

Sidekiq uses a clever exponential backoff strategy with the formula `(retry_count ** 4) + 15 + (rand(10) * (retry_count + 1))` (i.e. 15, 16, 31, 96, 271, ... seconds plus a [random amount of time](https://github.com/sidekiq/sidekiq/issues/480)). With the default settings it performs 25 retries over approximately 20 days.
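
To get a feel for how quickly these delays grow, the schedule can be computed directly. This is only a minimal sketch that re-implements the formula quoted above; `base_delay` is a hypothetical helper name, and the random jitter term is left out so the numbers are reproducible.

```ruby
# Deterministic part of the backoff formula quoted above.
# The rand(10) * (retry_count + 1) jitter is intentionally omitted.
def base_delay(retry_count)
  (retry_count**4) + 15
end

# Sidekiq passes retry_count values 0..24 for the default 25 retries.
delays = (0...25).map { |retry_count| base_delay(retry_count) }

puts delays.first(5).inspect # => [15, 16, 31, 96, 271]
puts "last delay: #{(delays.last / 86_400.0).round(1)} days"
puts "total wait: #{(delays.sum / 86_400.0).round(1)} days"
```

The last few retries account for most of the total time, which is why a job can sit in the retry queue for weeks.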

This works very well when a job fails because of a bug: a developer has plenty of time to ship a fix before the job stops retrying. Once the retries are exhausted, the job goes to the "morgue", stays there for 6 months, and is then discarded. A dead job can still be retried from the UI or the console after the bug has been fixed.

But in real-world applications, not all exceptions are bugs. It can also be a network issue or third-party service downtime. Or your application relies heavily on [distributed locks](https://github.com/leandromoreira/redlock-rb) and the resource is busy right now and needs to be accessed a bit later.

The Sidekiq retry mechanism works fine in all of these cases, but there is a downside: the error is still reported to the error service. This noise reduces the visibility of real bugs, and some errors can be quite annoying.

So there are two types of errors that devs usually deal with:

* ***unexpected errors*** that we want to see in the error tracker as soon as possible because they are probably bugs
    
* ***expected errors*** that can occur from time to time and should be reported only when they get out of control
    

The default Sidekiq retry mechanism works perfectly for the first type, while the second one requires some effort to handle.

## Handling expected errors

Let's consider an example for a clear picture:

```ruby
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  end
end
```

In the given example, `HardJob` fails with approximately a 1% probability. Let's assume that we don't control how often this error occurs. Naturally, we could ignore such errors, but this might lead to a situation where the error spirals out of control and we remain oblivious to it.

However, several techniques can help!

### #1 Technique: Ignore expected errors

The [first technique](https://www.mikeperham.com/2017/09/29/retries-and-exceptions/), proposed by [Mike Perham](https://ruby.social/@getajobmike), the author of Sidekiq, is clever and elegant. The idea is to patch the `Exception` class with an ignore flag:

```ruby
# config/initializers/exceptions.rb
class Exception
  attr_accessor :ignore_please
end
```

Then rely fully on the Sidekiq retry mechanism:

```ruby
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end
```

And ignore errors with a flag in the error service config:

```ruby
# https://docs.rollbar.com/docs/ruby#section-ignoring-items
# config/initializers/rollbar.rb
handler = proc do |options|
  # options[:exception] is nil for message-only reports, hence the safe navigation
  raise Rollbar::Ignore if options[:exception]&.ignore_please
end
Rollbar.configure do |config|
  config.before_process << handler
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-programmatically
# config/initializers/honeybadger.rb
Honeybadger.configure do |config|
  config.before_notify do |notice|
    notice.halt! if notice.exception&.ignore_please
  end
end
```
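
The whole flow can be checked without Sidekiq or a real tracker. Below is a minimal sketch in plain Ruby: `flaky_call` mirrors the `rescue` logic of the job above, and `report?` is a hypothetical stand-in for the tracker hooks, not a real Rollbar or Honeybadger API.

```ruby
class Exception
  attr_accessor :ignore_please
end

class AnnoyingError < StandardError; end

# Mirrors the job above: flag the expected error, then re-raise it.
def flaky_call
  raise AnnoyingError, "connection reset"
rescue AnnoyingError => e
  e.ignore_please = true
  raise e
end

# Hypothetical stand-in for a tracker's "ignore" hook.
def report?(exception)
  !exception.ignore_please
end

flagged =
  begin
    flaky_call
  rescue StandardError => e
    e
  end

puts report?(flagged)                 # => false
puts report?(RuntimeError.new("bug")) # => true
```

The flag travels with the exception object itself, so it survives the re-raise and is still set when the tracker inspects the error.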

**Pros:**

* Easy to implement
    
* Works with almost all error trackers
    

**Cons:**

* There is no limit on ignored exceptions. Even if the error is raised many times in a row for the same job, it will never be reported to the error tracker
    
* Not all trackers can filter errors dynamically. For example, [NewRelic](https://docs.newrelic.com/docs/apm/agents/ruby-agent/configuration/ruby-agent-configuration/#error_collector-ignore_classes) can only filter by error class name, so implementing such a filter requires deep knowledge of the tracker's internals or a feature request to the maintainers
    
* It requires custom code in every job: each error to be ignored means altering the `#perform` method of each affected job
    

We can improve it a bit by using `sidekiq_retries_exhausted` within the job

```ruby
class HardJob
  include Sidekiq::Job

  sidekiq_retries_exhausted do |job, ex|
    ex.ignore_please = false
    # ErrorTracker is a wrapper around your tracker
    ErrorTracker.notify(ex, "#{job['class']} #{job["jid"]} just died with error #{ex.message}.")
  end

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end
```

Or globally with [`death_handlers`](https://www.rubydoc.info/gems/sidekiq/Sidekiq/Config#death_handlers-instance_method)

```ruby
# this goes in your initializer
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, ex) do
    if ex.ignore_please # report only errors that were ignored before, to avoid a double notification
      ex.ignore_please = false
      ErrorTracker.notify(ex, "#{job['class']} #{job["jid"]} just died with error #{ex.message}.")
    end
  end
end
```
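
`ErrorTracker` in these examples is not a real library. A minimal sketch of such a wrapper, with an in-memory list standing in for the real tracker client (an assumption made purely for illustration), shows how the death handler reports the error exactly once:

```ruby
class Exception
  attr_accessor :ignore_please
end

# Hypothetical wrapper: the in-memory list stands in for Rollbar, Honeybadger, etc.
module ErrorTracker
  def self.reports
    @reports ||= []
  end

  def self.notify(exception, message = exception.message)
    return if exception.ignore_please # same rule as the tracker filters above

    reports << message
  end
end

# Same logic as the death handler above, called by hand instead of by Sidekiq.
death_handler = ->(job, ex) do
  if ex.ignore_please
    ex.ignore_please = false
    ErrorTracker.notify(ex, "#{job["class"]} #{job["jid"]} just died with error #{ex.message}.")
  end
end

error = RuntimeError.new("lock is busy")
error.ignore_please = true

ErrorTracker.notify(error) # silenced while the job is still retrying
death_handler.call({ "class" => "HardJob", "jid" => "abc123" }, error)

puts ErrorTracker.reports.size # => 1
```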

With this improvement, we can see that some jobs have been retried too many times, have gone to the dead queue, and there are probably some bugs worth checking.

### #2 Technique: Retry once before erroring

[The second](https://www.mikecoutermarsh.com/silencing-errors-from-noisy-sidekiq-jobs/) technique, by [Mike Coutermarsh](https://twitter.com/mscccc), is more complex but provides exactly one silent retry. It can be illustrated with this code example:

```ruby
# inherits from ApplicationJob (defined below), which provides the retry helper
class HardJob < ApplicationJob

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  rescue AnnoyingError => e
    # retry the job once without notifying the error tracker;
    # if the error occurs again, it will be reported
    retry_once_before_raising_error(e)
  end
end
```

You can go through Mike's article to see how it's implemented in detail, but I will outline the main ideas here to give a clear picture of the final solution.

First of all, we need to add a `retry_count` accessor to the job class:

```ruby
class ApplicationJob
  include Sidekiq::Job
  attr_writer :retry_count

  # Job instances don't have direct access to the job payload,
  # so we have to implement an accessor
  def retry_count
    @retry_count ||= 0
  end
end
```

Then introduce a [custom server middleware](https://github.com/sidekiq/sidekiq/wiki/Middleware) and add it to the Sidekiq config. This is necessary because the retry counter lives in the job payload and has to be forwarded to the job instance.

```ruby
module SidekiqMiddleware
  class RetryCount
    def call(job_instance, job_payload, _queue)
      if job_instance.respond_to?(:retry_count=)
        # Sidekiq omits "retry_count" on the first run and stores 0 for the
        # first retry, so shift it to get the number of retries performed so far
        retries = job_payload["retry_count"]
        job_instance.retry_count = retries ? retries + 1 : 0
      end

      yield
    end
  end
end

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add(SidekiqMiddleware::RetryCount)
  end
end
```

Implement the `#retry_once_before_raising_error` method:

```ruby
class ApplicationJob
  # omit previous code for clarity
  RetryError = Class.new(StandardError)

  def retry_once_before_raising_error(exception)
    if retry_count < 1
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end
```
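
It helps to see which exception actually leaves the job on each attempt. The sketch below is plain Ruby with no Sidekiq involved: `FakeJob` and `outcome` are hypothetical names, and `retry_count` is set by hand instead of by the middleware.

```ruby
class AnnoyingError < StandardError; end

# Hypothetical stand-in for a job; retry_count is assigned manually here.
class FakeJob
  RetryError = Class.new(StandardError)
  attr_accessor :retry_count

  def initialize(retry_count)
    @retry_count = retry_count
  end

  def perform
    raise AnnoyingError, "service unavailable"
  rescue AnnoyingError => e
    retry_once_before_raising_error(e)
  end

  def retry_once_before_raising_error(exception)
    if retry_count < 1
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end

# Runs the job and returns the exception it raised.
def outcome(job)
  job.perform
rescue StandardError => e
  e
end

first_run = outcome(FakeJob.new(0))
after_retry = outcome(FakeJob.new(1))

puts first_run.class       # => FakeJob::RetryError
puts first_run.cause.class # => AnnoyingError
puts after_retry.class     # => AnnoyingError
```

Because `RetryError` is raised inside the `rescue` clause, Ruby keeps the original error in `cause`, so nothing is lost even on the silenced attempt.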

And add `RetryError` to the tracker's ignore list:

```ruby
# https://docs.rollbar.com/docs/ruby#section-exception-level-filters
# config/initializers/rollbar.rb
Rollbar.configure do |config|
  config.exception_level_filters.merge!({
    'ApplicationJob::RetryError' => 'ignore',
  })
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-by-class
# config/honeybadger.yml
exceptions:
  ignore:
    - 'ApplicationJob::RetryError'
```

**Pros:**

* lets a flaky error be retried once and reports it only if it occurs again. When the situation gets out of control, the error shows up in the error tracker and the developer will see it
    
* easier to reuse: only one exception, `ApplicationJob::RetryError`, has to be added to the tracker's ignore list
    

**Cons:**

* more complex to implement due to custom server middleware and altered job class
    
* it retries only once. In some cases, a job needs two or more retries
    
* requires altering the `#perform` method of every job that can raise such an error
    

Let's alter this technique so that it can retry several times:

```ruby
class HardJob < ApplicationJob

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  rescue AnnoyingError => e
    # retry the job up to `limit` times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end
```

The only thing we need to change is the `#retry_before_raising_error` method

```ruby
class ApplicationJob
  # omit previous code for clarity
  RetryError = Class.new(StandardError)

  def retry_before_raising_error(exception, limit: 1)
    if retry_count < limit
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end
```
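
To see which attempts stay silent for a given `limit`, the decision can be tabulated. This is a small sketch, assuming `retry_count` holds the number of retries already performed; `error_to_raise` is a hypothetical helper that returns the error class instead of raising it.

```ruby
RetryError = Class.new(StandardError)
class AnnoyingError < StandardError; end

# Same decision as retry_before_raising_error, returned instead of raised.
def error_to_raise(retry_count, limit:)
  retry_count < limit ? RetryError : AnnoyingError
end

timeline = (0..4).map { |retry_count| error_to_raise(retry_count, limit: 3) }

puts timeline.inspect
# => [RetryError, RetryError, RetryError, AnnoyingError, AnnoyingError]
```

With `limit: 3` the first three failures are hidden from the tracker and the fourth one is reported.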

**Pros:**

* all pros from technique #2
    
* can retry as many times as it needs to
    

**Cons:**

* more complex to implement due to custom server middleware and altered job class
    
* requires altering the `#perform` method of every job that can raise such an error
    

Generally, this is good enough for most cases, but what if we need a custom delay for this specific error? Then we can use the `sidekiq_retry_in` block!

```ruby
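class HardJob < ApplicationJob
  sidekiq_retry_in do |count, exception, _jobhash|
    case exception
    # silenced attempts raise RetryError, the reported one raises AnnoyingError
    when AnnoyingError, ApplicationJob::RetryError
      10 * (count + 1) # (10, 20, 30, 40, 50)
    end
  end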

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(10) == 0 # simulate error that occurs time to time
  rescue AnnoyingError => e
    # retry the job up to `limit` times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end
```
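
The delay logic itself is easy to check in isolation. In this sketch the body of the `sidekiq_retry_in` block is copied into a plain lambda (`retry_in` is a hypothetical name) and evaluated for the first five retry counts:

```ruby
class AnnoyingError < StandardError; end

# The delay rule from the block above, as a plain lambda.
retry_in = lambda do |count, exception|
  case exception
  when AnnoyingError
    10 * (count + 1)
  end
end

delays = (0..4).map { |count| retry_in.call(count, AnnoyingError.new) }
other = retry_in.call(0, RuntimeError.new("bug"))

puts delays.inspect # => [10, 20, 30, 40, 50]
puts other.inspect  # => nil
```

When the block returns `nil`, Sidekiq falls back to its default exponential backoff, so unexpected errors keep the standard schedule.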

**Pros:**

* all pros from technique #2
    
* can retry as many times as it needs to
    
* can have a custom delay
    

**Cons:**

* the final solution is quite complex - a lot of moving parts
    

### #3 Technique: Use the [sidekiq-rescue](https://github.com/moofkit/sidekiq-rescue) gem

In my opinion, the easiest way to solve the problem is to use the [sidekiq-rescue](https://github.com/moofkit/sidekiq-rescue) gem. It's a tiny plugin with zero dependencies (besides Sidekiq itself!) that provides a handy DSL and is very easy to set up.

Install the gem in your project:

```bash
bundle add sidekiq-rescue
```

Add the middleware to your Sidekiq configuration:

```ruby
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Sidekiq::Rescue::ServerMiddleware
  end
end
```

Let's consider our example:

```ruby
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  end
end
```

All we need is to include the `Sidekiq::Rescue::Dsl` module and use `sidekiq_rescue`:

```ruby
class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  end
end
```

And that's all! You can configure the number of retries and the delay (in seconds) between retries:

```ruby
class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError, delay: 60, limit: 5

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate error that occurs time to time
  end
end
```

The `delay` is not the exact time between retries; it's a minimum delay. The actual delay is calculated based on the retries counter and `delay` value. The formula is `delay + retries * rand(10)` seconds. Randomization is used to avoid retry storms.  
The default values are:

* `delay`: 60 seconds
    
* `limit`: 5 retries
    

Delay and limit can be configured globally:

```ruby
# config/initializers/sidekiq.rb
Sidekiq::Rescue.configure do |config|
  config.delay = 65
  config.limit = 10
end
```

The delay for a job can also be a proc that receives the retry counter:

```ruby
sidekiq_rescue ExpectedError, delay: ->(counter) { counter * 60 }
```
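
The effect of the jitter is easier to see with numbers. The sketch below only evaluates the documented formula `delay + retries * rand(10)`, and it does not call the gem: `delay_range` is a hypothetical helper, and the counter values passed to the proc are sample inputs chosen for illustration.

```ruby
# Bounds of the documented formula: delay + retries * rand(10),
# where rand(10) returns an integer from 0 to 9.
def delay_range(delay, retries)
  delay..(delay + retries * 9)
end

(1..5).each do |retries|
  range = delay_range(60, retries)
  puts "retry #{retries}: #{range.min}-#{range.max} seconds"
end

# The proc form from the example above, evaluated for sample counter values.
delay_proc = ->(counter) { counter * 60 }
proc_delays = (1..3).map { |counter| delay_proc.call(counter) }

puts proc_delays.inspect # => [60, 120, 180]
```

With the default `delay` of 60 seconds, the spread grows with every retry, which keeps jobs that failed together from retrying at the same moment.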

Under the hood, this gem uses [Sidekiq's perform\_at](https://github.com/moofkit/sidekiq-rescue/blob/6353c0c16eaabebed26989a953c4108f813e7a60/lib/sidekiq/rescue/server_middleware.rb#L39) and doesn't rely on the standard retry mechanism. Thus you have independent retry strategies for both types of errors: **unexpected** and **expected**.
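
The idea of rescheduling instead of failing can be modeled in a few lines. This is a simplified, hypothetical model and not the gem's actual code: `FakeScheduler`, `RescueSketch`, and the `"rescue_counter"` payload key are all invented for illustration.

```ruby
class ExpectedError < StandardError; end

# Collects rescheduled jobs in memory instead of talking to Redis.
class FakeScheduler
  attr_reader :scheduled

  def initialize
    @scheduled = []
  end

  def perform_at(time, payload)
    @scheduled << [time, payload]
  end
end

# Hypothetical middleware-like object: expected errors are rescheduled,
# everything else propagates to the regular retry mechanism.
class RescueSketch
  def initialize(scheduler, errors:, delay:, limit:)
    @scheduler = scheduler
    @errors = errors
    @delay = delay
    @limit = limit
  end

  def call(payload)
    yield
  rescue *@errors => e
    counter = payload.fetch("rescue_counter", 0) + 1 # hypothetical key
    raise e if counter > @limit # limit reached: let the error surface

    @scheduler.perform_at(Time.now + @delay, payload.merge("rescue_counter" => counter))
  end
end

scheduler = FakeScheduler.new
sketch = RescueSketch.new(scheduler, errors: [ExpectedError], delay: 60, limit: 2)

sketch.call({ "class" => "HardJob" }) { raise ExpectedError, "busy" }
puts scheduler.scheduled.size # => 1
```

Because the expected error is swallowed and the job is simply scheduled again, the standard retry counter is never touched and the error tracker is not notified until the limit is reached.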

**Pros:**

* easy to use
    
* independent retry strategy from the default Sidekiq error handling
    
* can retry as many times as it needs to
    
* can have a custom delay, even with proc
    

**Cons:**

* one more gem in a Gemfile
    

This gem is still under active development, but it has good coverage and has been tested in production.

As the author, I would be very grateful for any feedback, issues, and PRs.

## Conclusion

In conclusion, error handling in Sidekiq is a critical aspect of software development. While the default retry mechanism works well for unexpected errors, handling expected errors requires additional strategies.

Techniques such as ignoring errors, retrying once before erroring, or using the [sidekiq-rescue gem](https://github.com/moofkit/sidekiq-rescue) can be employed to handle these errors more effectively. Each method has its pros and cons, and the choice depends on the specific needs of your application.

By mastering these techniques, developers can significantly improve error observability and handling in their Sidekiq jobs, leading to more robust and reliable applications.

## Sources:

* [https://www.mikeperham.com/2013/08/25/please-use-an-error-service/](https://www.mikeperham.com/2013/08/25/please-use-an-error-service/)
    
* [https://github.com/sidekiq/sidekiq/wiki/Error-Handling](https://github.com/sidekiq/sidekiq/wiki/Error-Handling)
    
* [https://github.com/leandromoreira/redlock-rb](https://github.com/leandromoreira/redlock-rb)
    
* [https://www.mikeperham.com/2017/09/29/retries-and-exceptions/](https://www.mikeperham.com/2017/09/29/retries-and-exceptions/)
    
* [https://www.mikecoutermarsh.com/silencing-errors-from-noisy-sidekiq-jobs/](https://www.mikecoutermarsh.com/silencing-errors-from-noisy-sidekiq-jobs/)
    
* [https://github.com/moofkit/sidekiq-rescue](https://github.com/moofkit/sidekiq-rescue)
