Mastering Error Handling in Sidekiq: Techniques and Strategies for Rescuing Jobs
TL;DR
If you're looking for a quick solution to handle annoying or expected errors with Sidekiq, just try the sidekiq-rescue gem:
class MyJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue Faraday::ConnectionFailed, delay: 60, limit: 5

  def perform(*)
    # ...
  end
end
If you are interested in learning some tips and tricks on this topic, welcome to the article!
Handling errors in Sidekiq
Sidekiq is one of the most popular background processors for Ruby. It's very easy to integrate with Rails or use it as a standalone without any frameworks.
One of the major aspects of software development is error observability and error handling. It's highly recommended to use an error-tracking service like Rollbar, Sentry, Honeybadger, Bugsnag, etc.
Most of these services integrate with Sidekiq, and it's very easy to set them up and have all your errors in one place. Thus, you and your team will always be able to see the bugs that have occurred.
But while you are fixing the next bug, Sidekiq does one important thing for you: it keeps the failed job in a special queue and retries it.
There is a standard way to retry Sidekiq jobs that have failed with errors: the automatic job retry mechanism.
class RetryableJob
  include Sidekiq::Job

  sidekiq_options retry: 25

  def perform(...)
  end
end
This is the default setting, so specify it only if you want to customize the number of retries:
class RetryableJob
  include Sidekiq::Job

  def perform(...)
  end
end
It uses a very clever exponential backoff strategy with the formula (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1))
(i.e. 15, 16, 31, 96, 271, ... seconds plus a random amount of time). It will perform 25 retries over approximately 20 days.
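The deterministic part of that formula can be sanity-checked in plain Ruby (the jitter term rand(10) * (retry_count + 1) is left out here so the result is stable):

```ruby
# Deterministic part of Sidekiq's default backoff formula;
# the real delay also adds rand(10) * (retry_count + 1) seconds of jitter.
def base_backoff(retry_count)
  (retry_count ** 4) + 15
end

delays = (0..4).map { |count| base_backoff(count) }
# delays == [15, 16, 31, 96, 271]
```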
It works very well when a job encounters a bug and fails, because a developer has plenty of time before the job stops retrying. After that, the job will go to the "morgue" (the dead queue), stay there for 6 months, and then be discarded. It is possible to retry the job from the UI or console after the bug has been fixed.
But in real-world applications, not all exceptions are bugs. It can also be a network issue or third-party service downtime. Or your application uses distributed locks heavily, and the resource is busy right now and needs to be accessed later.
Sidekiq's retry mechanism works fine in all of these cases, but there is a downside: the error is still reported to the error service. Such behavior decreases the visibility of real bugs, and some errors can be quite annoying.
So there are two types of errors that devs usually deal with:
unexpected errors that we want to see in the error tracker as soon as possible because they are probably bugs
expected errors that can occur from time to time and should be reported only when they get out of control
While Sidekiq's default retry mechanism handles the first type perfectly, the second one requires some effort.
Handling expected errors
Let's consider an example for a clear picture:
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end
In the given example, HardJob fails with approximately a 1% probability. Let's assume that we don't control how often this error occurs. Naturally, we could ignore such errors, but this might lead to a situation where the error spirals out of control while we remain oblivious to it.
However, several techniques can help!
#1 Technique: Ignore expected errors
The first technique, proposed by Mike Perham, the author of Sidekiq, is pretty clever and elegant. The idea is to patch the Exception class with an ignore flag:
# config/initializers/exceptions.rb
class Exception
  attr_accessor :ignore_please
end
It fully relies on the Sidekiq retry mechanism:
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end
And ignore errors with a flag in the error service config:
# https://docs.rollbar.com/docs/ruby#section-ignoring-items
# config/initializers/rollbar.rb
handler = proc do |options|
  raise Rollbar::Ignore if options[:exception].ignore_please
end

Rollbar.configure do |config|
  config.before_process << handler
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-programmatically
# config/initializers/honeybadger.rb
Honeybadger.configure do |config|
  config.before_notify do |notice|
    notice.halt! if notice.exception.ignore_please
  end
end
Pros:
Easy to implement
Works with almost all error trackers
Cons:
There is no limit on ignored exceptions. Thus, even if an error is raised many times in a row within one job, it will never be reported to the error tracker
Not all trackers can filter errors dynamically; e.g. NewRelic can only filter by error class names, and implementing such a filter requires deep knowledge of the tracker's internals or requesting a feature from the maintainers
requires custom code in each job's #perform method; if there are several such errors, every affected job has to be altered
We can improve it a bit by using sidekiq_retries_exhausted within the job:
class HardJob
  include Sidekiq::Job

  sidekiq_retries_exhausted do |job, ex|
    ex.ignore_please = false
    # ErrorTracker is a wrapper around your tracker
    ErrorTracker.notify(ex, "#{job['class']} #{job['jid']} just died with error #{ex.message}.")
  end

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # flag it to be ignored
    e.ignore_please = true
    # re-raise it so Sidekiq will retry
    raise e
  end
end
Or globally with death_handlers:
# this goes in your initializer
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, ex) do
    if ex.ignore_please # check that the error has been ignored before, to avoid a double notify
      ex.ignore_please = false
      ErrorTracker.notify(ex, "#{job['class']} #{job['jid']} just died with error #{ex.message}.")
    end
  end
end
With this improvement, we can see that some jobs have been retried too many times, have gone to the dead queue, and there are probably some bugs worth checking.
#2 Technique: Retry once before erroring
The second technique, by Mike Coutermarsh, is more complex but provides exactly one retry. It can be illustrated with this code example:
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # here we retry the job only once without notifying the error tracker;
    # if the error appears again, it will be reported
    retry_once_before_raising_error(e)
  end
end
You can go through Mike's article to see how it's implemented in detail but I will provide here some ideas to have a clear picture of the final solution.
First of all, we need to introduce an attribute accessor on the job instance:
class ApplicationJob
  include Sidekiq::Job

  attr_writer :retry_count

  # Job instances don't have direct access to the job payload,
  # so we have to implement an accessor
  def retry_count
    @retry_count ||= 0
  end
end
Then introduce a custom server middleware and add it to the Sidekiq config. It needs to be done this way because we have to fetch the retry counter and forward it to the job instance.
module SidekiqMiddleware
  class RetryCount
    def call(job_instance, job_payload, _queue)
      if job_instance.respond_to?(:retry_count)
        # assign the retry count to the job instance to have access to it there
        job_instance.retry_count = job_payload.fetch("retry_count", 0)
      end
      yield
    end
  end
end

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add(SidekiqMiddleware::RetryCount)
  end
end
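As a quick sanity check, this middleware can be exercised outside Sidekiq with a stubbed job instance and payload (the class is re-declared here so the snippet runs standalone; FakeJob is just an illustration, not part of the article's setup):

```ruby
module SidekiqMiddleware
  class RetryCount
    def call(job_instance, job_payload, _queue)
      # forward the retry counter from the payload to the job instance
      if job_instance.respond_to?(:retry_count)
        job_instance.retry_count = job_payload.fetch("retry_count", 0)
      end
      yield
    end
  end
end

# Stub exposing the same accessor as ApplicationJob above
FakeJob = Struct.new(:retry_count)

job = FakeJob.new
performed = false
SidekiqMiddleware::RetryCount.new.call(job, { "retry_count" => 3 }, "default") { performed = true }
# job.retry_count is now 3 and the wrapped block was executed
```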
Implement a #retry_once_before_raising_error method:
class ApplicationJob
  # omit previous code for clarity
  RetryError = Class.new(StandardError)

  def retry_once_before_raising_error(exception)
    if retry_count < 1
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end
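The behavior of this method can be sketched outside of Sidekiq with a stubbed retry counter (FakeJob here is illustrative, not part of the article's setup):

```ruby
RetryError = Class.new(StandardError)
AnnoyingError = Class.new(StandardError)

# Minimal stand-in for ApplicationJob with the retry counter stubbed
class FakeJob
  attr_reader :retry_count

  def initialize(retry_count)
    @retry_count = retry_count
  end

  def retry_once_before_raising_error(exception)
    if retry_count < 1
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end

# First attempt (retry_count == 0) raises the ignorable RetryError;
# the second attempt (retry_count == 1) re-raises the original error.
```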
And add RetryError to the tracker's ignore list:
# https://docs.rollbar.com/docs/ruby#section-exception-level-filters
# config/initializers/rollbar.rb
Rollbar.configure do |config|
  config.exception_level_filters.merge!({
    'ApplicationJob::RetryError' => 'ignore',
  })
end

# https://docs.honeybadger.io/lib/ruby/getting-started/ignoring-errors/#ignore-by-class
# config/honeybadger.yml
exceptions:
  ignore:
    - 'ApplicationJob::RetryError'
Pros:
provides the ability to retry a flaky error once without reporting it, and to report it if it occurs again. When the situation gets out of control, the error will appear in the error tracker and the developer will see it
easier to reuse: only the single ApplicationJob::RetryError exception needs to be added to the tracker's ignore list
Cons:
more complex to implement due to custom server middleware and altered job class
it retries only once; in some cases, more than one retry is needed
needs to alter the #perform method of each job with such an error
#3 Technique: Retry several times before erroring
Let's alter the last technique to have an option to retry several times:
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # here we retry the job up to limit times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end
The only thing we need to change is the #retry_before_raising_error method:
class ApplicationJob
  # omit previous code for clarity
  RetryError = Class.new(StandardError)

  def retry_before_raising_error(exception, limit: 1)
    if retry_count < limit
      raise RetryError, exception.message
    else
      raise exception
    end
  end
end
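With limit: 3, the branch above raises the ignorable RetryError on the first three attempts and only surfaces the original error on the fourth. A simplified model of that decision:

```ruby
RetryError = Class.new(StandardError)    # stand-in for ApplicationJob::RetryError
AnnoyingError = Class.new(StandardError) # stand-in for the expected error

# Which error class the method raises at a given retry_count
# (a simplified model of the branch in #retry_before_raising_error)
def raised_class(retry_count, limit:)
  retry_count < limit ? RetryError : AnnoyingError
end

classes = (0..3).map { |count| raised_class(count, limit: 3) }
# classes == [RetryError, RetryError, RetryError, AnnoyingError]
```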
Pros:
all pros from technique #2
can retry as many times as it needs to
Cons:
more complex to implement due to custom server middleware and altered job class
needs to alter the #perform method of each job with such an error
Generally, this is good enough for most cases, but what if we need a custom delay for this specific error? Then we can use the sidekiq_retry_in block!
class HardJob
  include Sidekiq::Job

  sidekiq_retry_in do |count, exception, _jobhash|
    case exception
    when AnnoyingError
      10 * (count + 1) # (10, 20, 30, 40, 50)
    end
  end

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(10) == 0 # simulate an error that occurs from time to time
  rescue AnnoyingError => e
    # here we retry the job up to limit times without notifying the error tracker
    retry_before_raising_error(e, limit: 3)
  end
end
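A quick check of the delay sequence produced by that block (plain Ruby, independent of Sidekiq):

```ruby
# Delay in seconds for retry number `count`, per the sidekiq_retry_in block above
retry_in = ->(count) { 10 * (count + 1) }

delays = (0..4).map { |count| retry_in.call(count) }
# delays == [10, 20, 30, 40, 50]
```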
Pros:
all pros from technique #2
can retry as many times as it needs to
can have a custom delay
Cons:
the final solution is quite complex: a lot of moving parts
#4 Technique: Use the sidekiq-rescue gem
The easiest way, in my opinion, to solve the problem is to use the sidekiq-rescue gem. It's a tiny plugin with zero dependencies (besides Sidekiq itself!) that provides a handy DSL and is very easy to set up.
Install the gem in your project:
bundle add sidekiq-rescue
Add the middleware to your Sidekiq configuration:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Sidekiq::Rescue::ServerMiddleware
  end
end
Let's consider our example:
class HardJob
  include Sidekiq::Job

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end
All we need is to include the Sidekiq::Rescue::Dsl module and use sidekiq_rescue:
class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end
And that's all! You can configure the number of retries and the delay (in seconds) between retries:
class HardJob
  include Sidekiq::Job
  include Sidekiq::Rescue::Dsl

  sidekiq_rescue AnnoyingError, delay: 60, limit: 5

  def perform(...)
    # do some important stuff
    raise AnnoyingError if rand(99) == 0 # simulate an error that occurs from time to time
  end
end
The delay is not the exact time between retries; it's a minimum delay. The actual delay is calculated based on the retry counter and the delay value with the formula delay + retries * rand(10) seconds. Randomization is used to avoid retry storms.
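With the jitter made explicit, the bounds of that formula can be checked in plain Ruby (a sketch of the described behavior, not the gem's internal code):

```ruby
# delay + retries * rand(10), with the jitter passed in explicitly
# so the example stays deterministic; rand(10) returns 0..9
def rescue_delay(delay, retries, jitter:)
  delay + retries * jitter
end

min = rescue_delay(60, 4, jitter: 0) # lower bound: 60 seconds
max = rescue_delay(60, 4, jitter: 9) # upper bound: 60 + 4 * 9 = 96 seconds
```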
The default values are:
delay: 60 seconds
limit: 5 retries
Delay and limit can be configured globally:
# config/initializers/sidekiq.rb
Sidekiq::Rescue.configure do |config|
  config.delay = 65
  config.limit = 10
end
You can also configure a job's delay to be a proc:
sidekiq_rescue ExpectedError, delay: ->(counter) { counter * 60 }
Under the hood, this gem uses Sidekiq's perform_at and doesn't rely on the standard retry mechanism. Thus, you have independent retry strategies for the two types of errors: unexpected and expected.
Pros:
easy to use
independent retry strategy from the default Sidekiq error handling
can retry as many times as it needs to
can have a custom delay, even with proc
Cons:
one more gem in the Gemfile
This gem is still under active development, but it has good test coverage and has been tested in production.
As the author, I will be very grateful for any feedback, issues, and PRs!
Conclusion
In conclusion, error handling in Sidekiq is a critical aspect of software development. While the default retry mechanism works well for unexpected errors, handling expected errors requires additional strategies.
Techniques such as ignoring errors, retrying once before erroring, or using the sidekiq-rescue gem can be employed to handle these errors more effectively. Each method has its pros and cons, and the choice depends on the specific needs of your application.
By mastering these techniques, developers can significantly improve error observability and handling in their Sidekiq jobs, leading to more robust and reliable applications.