Checking and Specs — com.fulcrologic/guardrails-analyzer 1.0.0-alpha1

Checking and Specs

The purpose of guardrails pro is to analyze the flow of information through the body of a function. The vast majority of algorithmic expression within a function is a combination of local bindings and expression evaluations. Clojure encourages the use of simple data as primary carrier of information, and functions calls are the primary means of transformation.

Specs of input are sufficient for checks that are trying to determine if an input is sufficient, but as the information flows through the body of a function this is only useful if you have an accurate picture of what the information is.

Take the following example

(let [a 1
      b 2
      c (+ a b)]
  ...)

It is trivial for an analysis tool to know that a and b are numbers, and a spec for + that indicates acceptable arguments are numbers is sufficient to indicate if the overall expression is correct.

Now consider this:

(let [a 1
      b 2
      c (+ a b)]
  (if (> c 1) ... ...))

A casual reader (or sufficiently complex checker) could indicate that the "else" branch is unreachable. Let’s talk about how a static analysis tool might come to this conclusion.

One obvious way is simply to hand-build checking logic into the checker that "understands" addition. While this works, it is certainly not very general or useful on user-written code.

Approaches with Pure Code

Consider another approach: what if the original program could somehow mark + as a pure function? Now a checker could do something more interesting: it could run the operation using the known data and come up with an answer. It can see that a and b are constants, so the inference is now generalized to a wider set of cases:

(let [a "this is a test"
      b "of thing"
      c (str a " " b)] ; assume `str` and `strlen` are known to be side-effect free
  (if (> (strlen c) 1) ... ...))

One of the strengths of Copilot is that it runs in the full runtime of the actual application. It is not a static analysis before compilation, but one that uses the source within the runtime environment to find problems. Therefore it can walk the expression and actually run the user’s code.

Now, let’s expand this to things that have specifications. Remember that we’re not running these checks during the execution of the real program, but instead in our own checker environment that simply has access to the user’s source and compiled code.

Consider this expression:

(let [a (rand-int 100) ; assume spec says output is pos-int?
      b (rand-int 10)
      c (+ a b]
  (if (< c 0) ... ...))

There are a few things that GRP could know about a: First, it would know that it conforms to pos-int?. Next, if rand-int were known to be side-effect free it could gather one or more samples of output by running it (it has an input spec, so generating possible inputs is also something that can be done).

So, assuming the spec of + was simply that it outputs a number we cannot really get enough useful information from just the specs to know how to analyze the condition of the if; however, if all functions are known to be pure we can do stochastic analysis:

Get 10 samples of a
Get 10 samples of b
Get 10 samples of c using combinations of the samples of a and b
Get 10 samples of < using the generated values of c

At this point if our output from the last step always goes one direction we have an indication that the is possibly unreachable code. Now, since this is a statement of probability (and it could end up being quite computationally expensive) it is a future research problem to determine how useful this specific analysis is; however, it becomes a lot more useful if we look for cases where we have things we can prove:

(let [a (rand-int 100) ; assume spec says output is pos-int?
      b (rand-int 10)
      c (+ a b]
  (assert (pos-int? c) ...))

now the stochastic approach is definitely useful: if it finds even one false value for the assertion, then it has a case it can use to prove a problem.

Now, this is exactly what generative testing is supposed to do for you: exercise code and tell you cases where your logic fails: but instead of applying it at just the functional definition boundaries we are expanding it into a tool that analyzes the code line-by-line to give more targeted understanding of where/why a particular expression is incorrect.

Spec Flowing

Specs work such that it is possible to combine them at runtime:

(s/def :person/name string?)
(s/def :person/id int?)

(defn f [] (s/keys :req [:person/name]))
(defn h [] (s/keys :req [:person/id]))

(s/explain (s/and (f) (h)) {:person/name "boo"})
;; => {:person/name "boo"} - failed: (contains? % :person/id)

One of the primary problems with using the "return" spec from an fspec is that it does not indicate how the data flows through the function. This flow-through is important in many cases.

Take range and reverse in the following code:

(>defn range [a b] [int? int? => (s/coll-of int?)] ...)
(>defn reverse [a] [seq? => seq?] ...)
(>defn f [a] [(s/coll-of int?) => int?] ...)

(let [a (range 1 10)
      b (reverse a)]
  (f b))

The specification of reverse has no way to clearly indicate the relationship of the parameter input spec with the output spec. Thus, when we reach f and attempt to do analysis we find that the spec related to b is just seq?. This is insufficient information to correctly check f, and applying a generator from b to get a sample of data for f will result in an incorrect error.

There are several ways we can deal with this:

If the functions are known to be pure, then we can use the generator of range’s return value to generate some samples, pass those through a real call to `reverse, and use the resulting samples to check f.
We could allow the user to indicate the spec "flow" through the function. For example:

(>defn my-reverse [s]
  ^{:spec-flow (fn [return-spec arg1-spec] arg1-spec)}
  [seq? => seq?]
  ...)

Now we can programmatically propagate the specs through the logic:

(let [a (range 1 10) ; a's spec is (s/coll-of int?)
      b (reverse a)] ; b's spec is now *also* (s/coll-of int?)
  (f b)) ; f's argument is now known to be of the correct type

This simple addition of metadata to the overall function signature allows a quite powerful mechanism: The function can now tell the type-checking system how a particular function propagates information. This is particularly important for collection types.

For example, consider a function that combines information in a map:

(>defn with-full-name [{:person/keys [first-name last-name] :as person}]
  [(s/keys :req [:person/first-name :person/last-name]) => (s/keys :req [:person/first-name
                                                                         :person/last-name
                                                                         :person/full-name])]
  (assoc person :person/full-name (str first-name " " last-name)))

The open model of Clojure indicates that maps are open, and the reader of this function can clearly see that whatever was passed as the first argument is the same as the return value, but with one additional key. Unfortunately, spec only describe this vith verification (i.e. a :fn spec), but our extension easily allows it:

(>defn with-full-name [{:person/keys [first-name last-name] :as person}]
  ^{:pure?     false
    :spec-flow (fn [return-spec real-arg1-spec] (s/and real-arg1-spec return-spec))}
  [(s/keys :req [:person/first-name :person/last-name]) => (s/keys :req [:person/first-name
                                                                         :person/last-name
                                                                         :person/full-name])]
  (assoc person :person/full-name (str first-name " " last-name)))

Now anything that was known about the incoming arg is just augmented.

Unfortunately, we potentially lose useful generators when we do this: the additional constraints may lead to a case where automatic generation fails. This can be solved by adding a generator that leverages the incoming spec’s generator, and augments (or filters) the result. There then remain only the cases where the output of the function aggressively narrows an input parameter so that generation is very costly. At the moment we have no great solution for that.

The type flow above, though, does solve a number of quite useful cases. Say we have a logged-in? function that requires :account/id be in the map it is passed. The following code will now properly check:

(let [p {:person/id 1 :person/first-name "Bob" :person/last-name "Jones" :account/id 42}
      ;; spec of a literal is just the base type (e.g. `map?) with a single-case generator
      p2 (with-full-name p) ; p2 is (s/and map? (s/keys ...)), but generates value of `p` w/addl key
      logged-in? (account/logged-in? p2) ; incoming spec is not exact match, uses generator for sample. Passes
      ...]

FIXME: This isn’t good enough. The spec flow needs to be data we can interpret, since we need to combine both the specs and the generators.

The operations needed are:

Enforcement of specified output type (i.e. boolean always outputs a boolean?).
Pure pass-through of type. E.g. reverse
The ability to take an input sample and refine it to an output sample. The function itself could be used to do that if known to be pure.

`Ctrl`+`k`	Jump to recent docs
`←`	Move to previous article
`→`	Move to next article
`Ctrl`+`/`	Jump to the search field