Mastering Error Handling in Sidekiq: Techniques and Strategies for Rescuing Jobs

TLDR

If you're looking for a quick way to handle annoying or expected errors in Sidekiq, try the sidekiq-rescue gem:

class MyJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue Faraday::ConnectionFailed, delay: 60, limit: 5

  def perform(*)
    # ...
  end
end

If you'd like to learn some tips and tricks on this topic, welcome to the article!

Handling errors in Sidekiq

Sidekiq is one of the most popular background processors for Ruby. It's very easy to integrate with Rails or to use standalone, without any framework.

One of the major aspects of software development is error observability and error handling. It's highly recommended to use one of the error services like Rollbar, Sentry, Honeybadger, Bugsnag, etc.

Most of these services have integration with Sidekiq and it's very easy to set up and have all your errors in one place. Thus, you and your team will always be able to see bugs that have occurred.

But while you are fixing the next bug, Sidekiq does one important thing for you: it keeps the failed job in a special queue and retries it.

There is a standard way to retry Sidekiq jobs that fail with errors: the automatic job retry mechanism.

class RetryableJob
  include Sidekiq::Job
  sidekiq_options retry: 25

  def perform(...)
  end
end

This is the default setting, so specify it only if you want to change the number of retries:

class RetryableJob
  include Sidekiq::Job

  def perform(...)
  end
end

It uses a very clever strategy with an exponential backoff using the formula (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1)) (i.e. 15, 16, 31, 96, 271, ... seconds + a random amount of time). It will perform 25 retries over approximately 20 days.
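To make the schedule concrete, here is a minimal sketch in plain Ruby (no Sidekiq required) that evaluates only the deterministic part of the formula above. The names base_delay, BASE_DELAYS and TOTAL_DAYS are illustrative, and the random jitter term is deliberately left out so the numbers are reproducible.

```ruby
# Deterministic part of the backoff formula quoted above:
#   (retry_count ** 4) + 15
# The jitter term, rand(10) * (retry_count + 1), is omitted on purpose.
MAX_RETRIES = 25

def base_delay(retry_count)
  (retry_count**4) + 15
end

# One base delay (in seconds) per retry, for retry_count 0 through 24.
BASE_DELAYS = (0...MAX_RETRIES).map { |count| base_delay(count) }

# Total waiting time across all 25 retries, converted to days.
TOTAL_DAYS = BASE_DELAYS.sum / 86_400.0

puts BASE_DELAYS.first(5).inspect
puts format("%.1f days without jitter", TOTAL_DAYS)
```

The first five base delays come out as 15, 16, 31, 96 and 271 seconds, and the total is about 20.4 days before jitter, which matches the figures quoted above.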

It works very well when a job encounters a bug and fails because a developer has plenty of time before the job stops retrying. After that time it will go to the "morgue" and will be there for 6 months, and then will be discarded. It is possible to retry the job from the UI or console after a bug has been fixed.

But in real-world applications, not all exceptions are bugs. An exception can also be caused by a network issue or third-party service downtime. Or your application may rely heavily on distributed locks, and the resource is busy right now and needs to be accessed a little later.

The Sidekiq retry mechanism works fine in all of these cases, but there is a downside: the error is still reported to the error service. This reduces the visibility of real bugs, and some of these errors can be quite annoying.

So there are two types of errors that devs usually deal with:

  • unexpected errors that we want to see in the error tracker as soon as possible because they are probably bugs

  • expected errors that can occur from time to time and should be reported only when they get out of control

The default Sidekiq retry mechanism works perfectly for the first type, but the second type requires some effort to handle.

Handling expected errors

Let's consider an example for a clear picture:

class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end

In the given example, HardJob fails with approximately a 1% probability. Let's assume that we don't control how often this error occurs. Naturally, we could ignore such errors, but this might lead to a situation where the error spirals out of control and we remain oblivious to it.

However, several techniques can help!

#1 Technique: Ignore expected errors

The first technique, proposed by Mike Perham, the author of Sidekiq, is clever and elegant. The idea is to patch the Exception class with an ignore flag:

# config/initializers/exceptions.rb
class Exception
  attr_accessor :ignore_please
end

Then rely fully on the Sidekiq retry mechanism:

class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end

Finally, ignore the flagged errors in the error service config:

# https://docs.rollbar.com/docs/ruby#section-ignoring-items
# config/initializers/rollbar.rb
handler = proc do |options|
  # &. guards against items reported without an exception object
  raise Rollbar::Ignore if options[:exception]&.ignore_please
end
Rollbar.configure do |config|
  config.before_process << handler
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-programmatically
# config/initializers/honeybadger.rb
Honeybadger.configure do |config|
  config.before_notify do |notice|
    # &. guards against notices reported without an exception object
    notice.halt! if notice.exception&.ignore_please
  end
end
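
Sentry, mentioned earlier, can be filtered in a similar way. The following is a minimal sketch, assuming the sentry-ruby gem and its before_send callback, which receives the event and a hint hash and drops the event when the callback returns nil. The constant name DROP_IGNORED_EVENTS is hypothetical, and the actual Sentry wiring is shown only as a comment so the filter itself can be run in plain Ruby.

```ruby
# Same patch as in config/initializers/exceptions.rb above.
class Exception
  attr_accessor :ignore_please
end

# Hypothetical filter: returns nil to drop the event, or the event to send it.
# hint[:exception] may be absent (for example, for message-only events),
# so the safe navigation operator is used.
DROP_IGNORED_EVENTS = lambda do |event, hint|
  exception = hint[:exception]
  exception&.ignore_please ? nil : event
end

# Assumed wiring in config/initializers/sentry.rb:
#
#   Sentry.init do |config|
#     config.before_send = DROP_IGNORED_EVENTS
#   end

flagged = RuntimeError.new("expected hiccup")
flagged.ignore_please = true

puts DROP_IGNORED_EVENTS.call(:event, { exception: flagged }).inspect
puts DROP_IGNORED_EVENTS.call(:event, { exception: RuntimeError.new("bug") }).inspect
```

Keeping the filter in a standalone lambda makes it easy to unit-test without booting the tracker.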

Pros:

  • Easy to implement

  • Works with almost all error trackers

Cons:

  • There is no limit on ignored exceptions. If an error is raised several times in a row within one job, it will never be reported to the error tracker

  • Not all trackers can filter errors dynamically. For example, NewRelic can only filter by error class name, so implementing such a filter requires deep knowledge of the tracker's internals or a feature request to its maintainers

  • Requires custom code in each job: if several jobs raise such errors, each job's #perform method has to be altered

We can improve it a bit by using sidekiq_retries_exhausted within the job:

class HardJob
  include Sidekiq::Job

  sidekiq_retries_exhausted do |job, ex|
    ex.ignore_please = false
    # ErrorTracker is a wrapper around your tracker
    ErrorTracker.notify(ex, "#{job['class']} #{job["jid"]} just died with error #{ex.message}.")
  end

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end

Or globally with death_handlers

# this goes in your initializer
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, ex) do
    # notify only about errors that were ignored before, to avoid a double notification
    if ex.ignore_please
      ex.ignore_please = false
      ErrorTracker.notify(ex, "#{job['class']} #{job["jid"]} just died with error #{ex.message}.")
    end
  end
end

With this improvement, we can see that some jobs have been retried too many times, have gone to the dead queue, and there are probably some bugs worth checking.

#2 Technique: Retry once before erroring

The second technique, by Mike Coutermarsh, is more complex but provides exactly one silent retry. It can be illustrated with this code example:

class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # retry the job only once without notifying the error tracker;
    # if the error occurs again, it will be reported
    retry_once_before_raising_error(e)
  end
end

You can go through Mike's article to see how it's implemented in detail, but I will outline the main ideas here to give a clear picture of the final solution.

First, add an attribute accessor to the job class:

class ApplicationJob
  include Sidekiq::Job
  attr_writer :retry_count

  # Job instances don't have direct access to the job payload,
  # so we have to implement an accessor
  def retry_count
    @retry_count ||= 0
  end
end

Then introduce custom server middleware and add it to the sidekiq config. It needs to be done this way because we have to fetch the retry counter and forward this data to the job instance.

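module SidekiqMiddleware
  class RetryCount
    def call(job_instance, job_payload, _queue)
      if job_instance.respond_to?(:retry_count=)
        # Sidekiq adds "retry_count" to the payload only after the first failure
        # and starts it at 0, so a missing key means "first attempt" and 0 means
        # "first retry". Expose the number of previous attempts instead:
        # 0 on the first run, 1 on the first retry, and so on.
        previous = job_payload["retry_count"]
        job_instance.retry_count = previous ? previous + 1 : 0
      end

      yield
    end
  end
end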

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add(SidekiqMiddleware::RetryCount)
  end
end

Implement a #retry_once_before_raising_error method:

class ApplicationJob
  # previous code omitted for clarity
  RetryError = Class.new(StandardError)

  def retry_once_before_raising_error(exception)
    if retry_count < 1
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end

And add RetryError to the tracker's ignore list:

# https://docs.rollbar.com/docs/ruby#section-exception-level-filters
# config/initializers/rollbar.rb
Rollbar.configure do |config|
  config.exception_level_filters.merge!({
    'ApplicationJob::RetryError' => 'ignore',
  })
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-by-class
# config/honeybadger.yml
 exceptions:
  ignore:
    - 'ApplicationJob::RetryError'

Pros:

  • retries a flaky error once and reports it only if it occurs again. When the situation gets out of control, the error shows up in the error tracker and the developer will see it

  • easier to reuse: only one exception, ApplicationJob::RetryError, has to be added to the tracker's ignore list

Cons:

  • more complex to implement due to custom server middleware and altered job class

  • it retries only once, while some cases need two or more retries

  • the #perform method of each job that raises such an error has to be altered

Let's alter the last technique so that it can retry several times:

class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # retry the job up to `limit` times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end

The only thing we need to change is the retry method, which becomes #retry_before_raising_error and accepts a limit:

class ApplicationJob
  # previous code omitted for clarity
  RetryError = Class.new(StandardError)

  def retry_before_raising_error(exception, limit: 1)
    if retry_count < limit
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end
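
The branching in #retry_before_raising_error can be checked without Sidekiq. Below is a minimal sketch under stated assumptions: FakeJob is a hypothetical stand-in for ApplicationJob, raised_class is an illustrative helper, and retry_count is set by hand instead of by the middleware.

```ruby
# Hypothetical stand-in for ApplicationJob so the logic runs in plain Ruby.
class FakeJob
  RetryError = Class.new(StandardError)
  AnnoyingError = Class.new(StandardError)

  attr_accessor :retry_count

  def initialize(retry_count)
    @retry_count = retry_count
  end

  # Same logic as the method above: raise the ignorable RetryError while
  # retry_count is below the limit, then raise the original exception.
  def retry_before_raising_error(exception, limit: 1)
    raise RetryError, exception.message if retry_count < limit

    raise exception
  end
end

# Illustrative helper: returns the class of the exception that was raised.
def raised_class(retry_count, limit:)
  job = FakeJob.new(retry_count)
  job.retry_before_raising_error(FakeJob::AnnoyingError.new("boom"), limit: limit)
rescue StandardError => e
  e.class
end

(0..3).each do |count|
  puts "retry_count=#{count} -> #{raised_class(count, limit: 3)}"
end
```

With limit: 3, the counter values 0, 1 and 2 produce the ignorable FakeJob::RetryError, and the value 3 lets the original FakeJob::AnnoyingError through to the error tracker.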

Pros:

  • all pros from technique #2

  • can retry as many times as it needs to

Cons:

  • more complex to implement due to custom server middleware and altered job class

  • needs to alter #perform method of each job with such an error

Generally, this is good enough for most cases, but what if we need a custom delay for this specific error? Then we can use the sidekiq_retry_in block!

class HardJob
  include Sidekiq::Job

  sidekiq_retry_in do |count, exception, _jobhash|
    case exception
    when AnnoyingError
      10 * (count + 1) # (10, 20, 30, 40, 50)
    end
  end

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(10) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # retry the job up to `limit` times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end
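
To see which delays the block above produces, its body can be extracted into a lambda and evaluated in plain Ruby. This is only a sketch: RETRY_IN, DELAYS and FALLBACK are illustrative names, and AnnoyingError is redefined locally so the snippet is self-contained. When the block returns nil for an exception it does not recognise, Sidekiq falls back to its default backoff.

```ruby
AnnoyingError = Class.new(StandardError)

# Same body as the sidekiq_retry_in block above, extracted for inspection.
RETRY_IN = lambda do |count, exception, _jobhash|
  case exception
  when AnnoyingError
    10 * (count + 1)
  end
end

# Delays (in seconds) for the first five retries of AnnoyingError.
DELAYS = (0..4).map { |count| RETRY_IN.call(count, AnnoyingError.new, {}) }

# Any other exception yields nil, meaning "use the default backoff".
FALLBACK = RETRY_IN.call(0, RuntimeError.new, {})

puts DELAYS.inspect
puts FALLBACK.inspect
```

The delays for AnnoyingError are 10, 20, 30, 40 and 50 seconds, as in the comment inside the block.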

Pros:

  • all pros from technique #2

  • can retry as many times as it needs to

  • can have a custom delay

Cons:

  • the final solution is quite complex - a lot of moving parts

#3 Technique: Use the sidekiq-rescue gem

In my opinion, the easiest way to solve the problem is to use the sidekiq-rescue gem. It's a tiny plugin with zero dependencies (besides Sidekiq itself!) that provides a handy DSL and is very easy to set up.

Add the gem to your project:

bundle add sidekiq-rescue && bundle install

Add the middleware to your Sidekiq configuration:

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Sidekiq::Rescue::ServerMiddleware
  end
end

Let's consider our example:

class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end

All we need to do is include the Sidekiq::Rescue::Dsl module and call sidekiq_rescue:

class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end

And that's all! You can configure the number of retries and the delay (in seconds) between retries:

class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError, delay: 60, limit: 5

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end

The delay is not the exact time between retries; it's a minimum delay. The actual delay is calculated from the retry counter and the delay value, using the formula delay + retries * rand(10) seconds. Randomization is used to avoid retry storms.

The default values are:

  • delay: 60 seconds

  • limit: 5 retries
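
The delay formula quoted above, delay + retries * rand(10), can be turned into a range of possible delays. Here is a minimal sketch in plain Ruby; delay_bounds is an illustrative helper, and treating the retry counter as 1 for the first retry is an assumption made only for the printout.

```ruby
# rand(10) returns an integer from 0 to 9, so for a given retry counter
# the actual delay lies between delay and delay + retries * 9 seconds.
def delay_bounds(delay:, retries:)
  (delay..(delay + retries * 9))
end

# Assumption for illustration: the counter is 1 on the first retry.
(1..5).each do |retries|
  range = delay_bounds(delay: 60, retries: retries)
  puts "retry #{retries}: #{range.min}..#{range.max} seconds"
end
```

With the default delay of 60 seconds, a counter of 5 gives a delay somewhere between 60 and 105 seconds.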

Delay and limit can be configured globally:

# config/initializers/sidekiq.rb
Sidekiq::Rescue.configure do |config|
  config.delay = 65
  config.limit = 10
end

You can also pass a proc as the delay for a job:

sidekiq_rescue ExpectedError, delay: ->(counter) { counter * 60 }
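
Because the delay is an ordinary proc, it can be evaluated on its own before it is handed to sidekiq_rescue. The sketch below compares the linear proc from the line above with a hypothetical capped variant; LINEAR_DELAY and CAPPED_DELAY are illustrative names, and the counter values 1 to 5 are an assumption about what the gem passes in.

```ruby
# The proc from the example above: delay grows linearly with the counter.
LINEAR_DELAY = ->(counter) { counter * 60 }

# Hypothetical variant: doubles on each retry but never exceeds 10 minutes.
CAPPED_DELAY = ->(counter) { [30 * (2**counter), 600].min }

(1..5).each do |counter|
  puts "counter=#{counter}: linear=#{LINEAR_DELAY.call(counter)}s " \
       "capped=#{CAPPED_DELAY.call(counter)}s"
end

# Usage would follow the same shape as above, for example:
#   sidekiq_rescue ExpectedError, delay: CAPPED_DELAY
```

A cap like this keeps long retry chains from pushing a job too far into the future.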

Under the hood, this gem uses Sidekiq's perform_at and doesn't rely on the standard retry mechanism. Thus you have independent retry strategies for both types of errors: unexpected and expected.

Pros:

  • easy to use

  • independent retry strategy from the default Sidekiq error handling

  • can retry as many times as it needs to

  • can have a custom delay, even with proc

Cons:

  • one more gem in a Gemfile

This gem is still under active development, but it has good coverage and has been tested in production.

As the author, I would be very grateful for any feedback, issues, and PRs.

Conclusion

In conclusion, error handling in Sidekiq is a critical aspect of software development. While the default retry mechanism works well for unexpected errors, handling expected errors requires additional strategies.

Techniques such as ignoring errors, retrying once before erroring, or using the sidekiq-rescue gem can be employed to handle these errors more effectively. Each method has its pros and cons, and the choice depends on the specific needs of your application.

By mastering these techniques, developers can significantly improve error observability and handling in their Sidekiq jobs, leading to more robust and reliable applications.

Sources: