Readme — net.modulolotus/truegrit 1.0.12

True Grit

For when you need a function that won’t give up at the first sign of failure.

A data-driven, functionally-oriented, idiomatic Clojure wrapper library for Resilience4j. Start in the net.modulolotus.truegrit namespace, and see the individual policy namespaces if you have more advanced needs. The main namespace contains all-in-one functions that take a fn and a config map, and return a wrapped fn with the resilience policy attached. Wrapped fns return the same results as the original fns (except for thread-pool-based bulkheads, which make fns that return Futures).

Docs are available here.

Background reading

Before using circuit breakers and bulkheads, be sure to understand how they operate. I highly recommend Release It! to understand the ways distributed systems can fail and how to compensate for them.

See:

Resilience4j docs - https://resilience4j.readme.io/
Circuit breaker pattern - https://www.martinfowler.com/bliki/CircuitBreaker.html
Release It! - https://pragprog.com/titles/mnee2/release-it-second-edition/
Hystrix wiki - https://github.com/Netflix/Hystrix/wiki

Examples

Basic usage

(require '[net.modulolotus.truegrit :as tg])

(def resilient-fn
  (-> flaky-fn
      ;; Give each individual call up to 10s to complete
      (tg/with-time-limiter {:timeout-duration 10000})

      ;; Try up to 5 times, waiting 1s between failures
      ;; Will retry if an exception is thrown by default, but will also retry if
      ;; the return value is nil
      (tg/with-retry {:name            "my-retry"
                      :max-attempts    5
                      :wait-duration   1000
                      :retry-on-result nil?})

      ;; If it still fails after 5 tries, record it as a failure in the CB
      ;; CB will go into OPEN status if 20% of calls end up failures
      ;; CB will wait for at least 40 calls before considering a change in status,
      ;; giving it time to warm up.
      ;; Ignores UserCanceledExceptions, since if the user hit "Cancel", it's not a
      ;; problem in the underlying service
      (tg/with-circuit-breaker {:name                    "my-circuit-breaker"
                                :failure-rate-threshold  20
                                :minimum-number-of-calls 40
                                :ignore-exceptions       [UserCanceledException]})))

Use a shared circuit breaker to track an underlying service called by many fns

(require '[net.modulolotus.truegrit.circuit-breaker :as cb])

(def rest-service-cb (cb/circuit-breaker "shared-rest-service"
                                         {:failure-rate-threshold 30
                                          :minimum-number-of-calls 10}))

(def resil-get (cb/wrap flaky-get rest-service-cb))
(def resil-post (cb/wrap flaky-post rest-service-cb))
(def resil-put (cb/wrap flaky-put rest-service-cb))
(def resil-patch (cb/wrap flaky-patch rest-service-cb))
(def resil-delete (cb/wrap flaky-delete rest-service-cb))

Check circuit breaker to choose an alternative method if status is OPEN

(require '[net.modulolotus.truegrit.circuit-breaker :as cb])

(if (-> resilient-fn
        (cb/retrieve)           ; retrieve associated CircuitBreaker
        (cb/call-allowed?))     ; is a call allowed right now?
  (resilient-fn)                ; if so, make the call
  (some-fallback-fn))           ; if not, we can't wait, try a fallback

Use semaphore-based bulkheads to limit database access, keep 20% capacity in reserve, and log reserved metrics

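(require '[net.modulolotus.truegrit :as tg]
         '[net.modulolotus.truegrit.bulkhead :as bh]
         ;; Assumption: `log` below is clojure.tools.logging; substitute your own logger
         '[clojure.tools.logging :as log])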

(defn database-query-fn
  "Some database fn that we've determined can only handle 100 simultaneous queries"
  [user]
  ;; do some db stuff
  )

;; Make a default version that can use up to 80% of the database's capacity
(def default-database-query (tg/with-bulkhead database-query-fn
                                              {:name "default-db-bulkhead"
                                               :max-concurrent-calls 80}))

;; Make a version that reserves 20% for special needs
(def reserved-database-query (tg/with-bulkhead database-query-fn
                                               {:name "reserved-db-bulkhead"
                                                :max-concurrent-calls 20}))

;; Usage
(defn some-handler-fn
  [user]
  (if (user-is-special-somehow user)   ; Is the user a VIP, sysadmin, etc?
    (reserved-database-query user)     ; Make reserved call - the default bulkhead being full has no impact here
    (default-database-query user)))    ; Make standard call, blocking if unavailable

;; Log reserved bulkhead metrics every 10s
(future
  (loop []
    (-> reserved-database-query
        (bh/retrieve)
        (bh/metrics)
        (log/debug))
    (Thread/sleep 10000)
    (recur)))

Guidelines

Circuit breaker status shorthand: CLOSED is good, OPEN is bad. Think of electricity flowing.
Make sure you’ve read up on bulkheads and circuit breakers before using them. Seriously.
Retries only make sense if there’s a reasonable expectation the fn will succeed within an acceptable time frame. They’re better-suited for temporary glitches in the matrix, not a service being down all day. If the fn won’t succeed in time, retries will make things worse, which is why pairing them with circuit breakers works well.
Be mindful of interactions at different levels of the system. E.g., wrapping a high-level fn with a retry policy of 3 attempts that calls an AWS client lower down that also has its own retry policy of 3 attempts can result in up to 3x3=9 calls under failure modes, exacerbating things. Another common example is having multiple timeouts; it’s pointless, since the shortest timeout will trigger first.
You still need to handle errors. No amount of resilience policies can ensure a function will always succeed.
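
To make the 3x3=9 arithmetic concrete, here is a small sketch in plain Clojure that nests two hand-rolled retry wrappers and counts the calls that reach the underlying fn. `with-naive-retry`, `always-failing-call`, `low-level-client`, and `high-level-fn` are hypothetical names written for this illustration; they are not part of True Grit or Resilience4j, and the wrapper does no waiting between attempts.

```clojure
(defn with-naive-retry
  "Hypothetical stand-in for a retry policy: calls f up to max-attempts times,
  rethrowing the last exception if every attempt fails."
  [f max-attempts]
  (fn [& args]
    (loop [attempt 1]
      ;; recur is not allowed inside try, so the try returns a marker instead
      (let [result (try
                     {:value (apply f args)}
                     (catch Exception e
                       (if (< attempt max-attempts) ::retry (throw e))))]
        (if (= result ::retry)
          (recur (inc attempt))
          (:value result))))))

(def call-count (atom 0))

(defn always-failing-call
  "Simulates a service that is down: counts the call, then throws."
  []
  (swap! call-count inc)
  (throw (ex-info "service unavailable" {})))

;; Lower layer (think: an SDK client) retries 3 times on its own...
(def low-level-client (with-naive-retry always-failing-call 3))

;; ...and the higher layer wraps that in another 3 attempts
(def high-level-fn (with-naive-retry low-level-client 3))

(try (high-level-fn) (catch Exception _ nil))

@call-count
;; => 9
```

One logical request turned into nine calls against a service that was already struggling.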

Order of wrapping matters. E.g.:

(-> my-fn
    (with-retry some-retry-config)
    (with-time-limiter some-timeout-config))

will retry several times, but if the time limit is up before the tries succeed, it will return failure. This is probably not what you want. On the other hand:

(-> my-fn
    (with-time-limiter some-timeout-config)
    (with-retry some-retry-config))

will make calls with a certain time limit, and only if they return failure or exceed their time limit, will it attempt a retry. If you want a canonical "good" ordering, see the robustify example fn in the source.
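
The difference between the two orderings can be sketched in plain Clojure. Everything below is an assumption made for illustration: `with-naive-retry` and `with-naive-timeout` are simplified, hypothetical stand-ins rather than True Grit's implementation, and `fails-twice-then-succeeds` simulates a fn that needs about 400ms of failed attempts before it works.

```clojure
(defn with-naive-retry
  "Hypothetical stand-in for a retry policy: up to max-attempts calls, no waiting."
  [f max-attempts]
  (fn [& args]
    (loop [attempt 1]
      (let [result (try
                     {:value (apply f args)}
                     (catch Exception e
                       (if (< attempt max-attempts) ::retry (throw e))))]
        (if (= result ::retry)
          (recur (inc attempt))
          (:value result))))))

(defn with-naive-timeout
  "Hypothetical stand-in for a time limiter: runs f on another thread and throws
  if no result arrives within timeout-ms. Unlike a real limiter, it does not
  cancel the abandoned call."
  [f timeout-ms]
  (fn [& args]
    (let [result (deref (future (apply f args)) timeout-ms ::timed-out)]
      (if (= result ::timed-out)
        (throw (ex-info "timed out" {:timeout-ms timeout-ms}))
        result))))

(def calls (atom 0))

(defn fails-twice-then-succeeds
  "Takes 200ms to fail on each of the first two calls, then succeeds immediately."
  []
  (if (<= (swap! calls inc) 2)
    (do (Thread/sleep 200)
        (throw (ex-info "temporary glitch" {})))
    :ok))

(defn outcome
  "Calls f and returns its result, or :failed if it throws."
  [f]
  (try (f) (catch Exception _ :failed)))

;; Time limit applied to each call, retry on the outside:
;; every attempt gets its own 300ms budget
(def timeout-then-retry
  (-> fails-twice-then-succeeds
      (with-naive-timeout 300)
      (with-naive-retry 3)))

(def good-order-result (outcome timeout-then-retry))   ; :ok

;; Retry on the inside, one 300ms budget shared by all attempts:
;; the budget runs out during the second attempt
(reset! calls 0)
(def retry-then-timeout
  (-> fails-twice-then-succeeds
      (with-naive-retry 3)
      (with-naive-timeout 300)))

(def bad-order-result (outcome retry-then-timeout))    ; :failed
```

The fn and the limits are identical in both cases; only the wrapping order decides whether the third attempt ever gets a chance.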

TrueGrit architecture

Design goals and constraints

Each resilience policy is implemented as a light-weight functional facade over dozens of underlying Resilience4j (r4j) Java objects. It tries to ease the pain of working directly with the r4j classes while still offering the same level of functionality. The only exposed r4j classes are the main policies. Where possible, r4j classes that mostly exist to hold properties are replaced with maps for Clojure usage.

No registries

Registries are r4j collections of the same policy. (Time-limiters do not offer registries, but the rest do.) A previous version of this library used registries, but I removed them. They offer too little over existing Clojure data containers to be worth the overhead. You are better off using standard fns to store them in a map in an atom.

If you choose to use r4j registries anyway, there are some quirks to how they work that you should be aware of. The registries combine retrieval and creation under the hood. The first time you request an object with a certain name and config, it will create a new one. The second time you request it with the same name, it will return the existing one. This means you can efficiently use r4j on the fly without creating a lot of extra objects, but it also means that a) if you make a typo in the policy name when creating/retrieving, you will get a new object, and b) you cannot update the config of an existing policy in a registry. (R4j policy objects are immutable, but the r4j registry interface doesn’t make it clear that a new config will be effectively ignored.)
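
For comparison, here is a minimal sketch of the map-in-an-atom approach suggested above. The names `policies`, `register-policy!`, and `lookup-policy` are hypothetical, and plain maps stand in for real policy objects so the snippet runs without any dependencies. In real use the stored value would be a policy object, such as the result of `cb/circuit-breaker`.

```clojure
(def policies
  "Hypothetical application-level store: policy key -> policy object."
  (atom {}))

(defn register-policy!
  "Stores policy under k, explicitly replacing whatever was there before."
  [k policy]
  (swap! policies assoc k policy)
  policy)

(defn lookup-policy
  "Returns the policy stored under k. Throws on an unknown key, so a typo fails
  loudly instead of quietly creating a fresh policy."
  [k]
  (if-some [policy (get @policies k)]
    policy
    (throw (ex-info "Unknown policy" {:key k :known (keys @policies)}))))

;; Stand-in value; in real use this would be a policy object, e.g. the result of
;; (cb/circuit-breaker "shared-rest-service" {...})
(register-policy! :rest-service {:name "shared-rest-service"
                                 :failure-rate-threshold 30})

(lookup-policy :rest-service)
;; => {:name "shared-rest-service", :failure-rate-threshold 30}

;; Changing the config is an explicit replacement, not a silently ignored request
(register-policy! :rest-service {:name "shared-rest-service"
                                 :failure-rate-threshold 50})

(:failure-rate-threshold (lookup-policy :rest-service))
;; => 50

;; A misspelled key is an error rather than a brand-new policy
(try
  (lookup-policy :rest-servcie)
  (catch clojure.lang.ExceptionInfo e
    (:key (ex-data e))))
;; => :rest-servcie
```

Separating creation from retrieval avoids both quirks described above. Keep in mind that a fn wrapped earlier would presumably keep using the policy object it was wrapped with; replacing the map entry only affects later lookups.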

No protocols

I chose not to use protocols here. At first glance, this seems like an odd decision: each namespace has many similarly-named fns, in some cases with similar bodies; it seems natural to use protocols.

However, the benefit of protocols is in dealing with abstractions and using polymorphism. With protocols, we can ignore irrelevant underlying details and swap concrete implementations without changing calling code. E.g., if I were coding to a collection abstraction, I could add/remove items without knowing the specific data structure used. Unfortunately, r4j does not have these properties. (Not even the two bulkhead implementations are swappable, since one is async.)
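
The collection example can be made concrete with core Clojure. `conj` is one abstraction over several concrete structures, and the caller below (`add-item`, a name invented for this illustration) never needs to know which one it was given.

```clojure
(defn add-item
  "Adds x to any collection without knowing its concrete type."
  [coll x]
  (conj coll x))

(add-item [1 2] 3)    ;; => [1 2 3]   vectors add at the end
(add-item '(1 2) 3)   ;; => (3 1 2)   lists add at the front
(add-item #{1 2} 3)   ;; a set containing 1, 2 and 3 (printed order may vary)
```

The calling code is identical for all three, and each result is still a sensible "collection with one more item". That is the kind of substitutability the resilience policies lack.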

At a superficial level, the r4j resilience strategies do have common behaviors, such as wrapping a fn and adding event listeners. However, they are completely non-interchangeable in behavior and usage (e.g., you can’t meaningfully swap a time-limiter for a circuit breaker). There’s no useful shared abstraction to code to.

On top of that, the polymorphism is limited by functions that have the same name, but very different sets of options. Enough differences in params exist between similar structures (e.g., configurations, event handlers) that the params are not swappable, even if the fn name is identical. Many functions can’t safely be called polymorphically; you’d have to know the underlying type to supply the correct options, and then you don’t have polymorphism. Even in a case where meaningful-but-limited polymorphism could be obtained, it would still be hampered by the non-substitutability of the underlying strategies. This is all reflected in the interfaces/classes of Resilience4j itself, which has the exact same issue; there are fewer common interfaces/superclasses than you’d expect.

But surely protocols wouldn’t hurt, right? Well, they would suggest misleading polymorphism. They would add a bit of extra clutter to the namespaces. But mostly, there’s almost no advantage to using them here, so I didn’t.

But what about all the almost-duplicate fn bodies? Regrettable, but better than the alternatives. If they were exact duplicates, I could rely on automatic reflection, but sadly, r4j likes to name methods like getAllRetries instead of a more generic getAll. I could use some funky reflection or macros to DRY it up, but it would be more complex and error-prone than a bit of copying.

Non-goals

The r4j cache module is currently unsupported, since many Clojure/Java caching libraries already exist. However, it could be included, if people are interested.

Supporting all the Java frameworks that r4j interoperates with is also a non-goal for now.

Future directions

The r4j registries add virtually nothing over standard Clojure mutable containers, but the code I wrote for them still exists, so I could add them back if people really need them.

Metric module support may be added, if anyone expresses a need for it.