wreck - the "Whacky Regular Expression Construction Kit"

A micro-library for Clojure(Script) that provides a selection of regular expression (regex) functions, mostly focused on ease of composition. It has no dependencies other than Clojure itself and emits standard Clojure regex objects, so it is fully compatible with Clojure's built-in regex functions (re-matches, re-find, re-seq, etc.). It also makes no use of JVM-specific or JavaScript-specific regex syntax, though it is fully compatible with platform-specific regexes if you're using those.

The library is not intended to provide a comprehensive functional alternative for constructing regexes - knowledge of regex syntax remains necessary. Instead it is intended to assist in constructing syntactically valid large regexes by composing smaller regexes together in well-defined ways.

It also pairs very nicely with rencg - that library adds first class support for named capturing groups to Clojure (albeit the JVM flavour only).
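
For context, here is a minimal sketch of what working with a named capturing group looks like on the JVM using only clojure.core and Java interop. It does not use rencg or wreck; the helper name `year-of` and the sample pattern are made up for illustration.

```clojure
;; Hypothetical helper, for illustration only: pulls the "year" named group
;; out of the first match, using java.util.regex.Matcher interop directly.
(defn year-of
  [s]
  (let [m (re-matcher #"(?<year>\d{4})-(?<month>\d{2})" s)]
    (when (.find m)
      (.group m "year"))))

(year-of "released 2024-05")  ;=> "2024"
(year-of "no date here")      ;=> nil
```

Clojure's built-in regex functions return groups by position only, which is why the interop call is needed here.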

Why?

I have other projects that perform complex text processing, and in some cases I have ended up writing very large regexes (as large as ~10KB). Writing and maintaining huge regexes as Clojure regex literals, while keeping them syntactically and functionally correct, is... "challenging". As a result I wrote some helper functions that let me modularise those regexes and test and construct them in pieces, and before long I realised that these functions were independently useful, despite not being complex or novel. Hence this library.

Installation

wreck is available as a Maven artifact from Clojars.

Usage

API documentation is available here, or here on cljdoc, and the unit tests are also worth perusing to see worked examples. I'm also active on the Clojure Discord server if you'd like to chat.

Warning: JavaScript's RegExp class fundamentally doesn't support lossless round-tripping of RegExp objects to Strings and back, something this library relies upon and does extensively. The library makes a best effort to correct JavaScript's problematic implementation, but because it's fundamentally lossy there are some cases that (on ClojureScript only) may change your regexes in unexpected (though probably not semantically significant) ways. See the unit tests for specific examples.

Important: wreck is primarily intended to construct long-lived regex objects once (e.g. at load time), and YMMV if you're constructing large regexes dynamically. This is because it repeatedly round-trips regex objects to Strings and back during construction, since Clojure regex objects don't natively support concatenation. This can generate a substantial number of short-lived objects on the heap, which can have garbage collection implications (though generational garbage collectors, such as the JVM's, tend to handle this case well).
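
To make the round-tripping concrete, here is a minimal sketch of string-based regex concatenation using only clojure.core on the JVM. `naive-join` is a hypothetical helper written for this illustration; it is not wreck's implementation and does none of wreck's flag handling.

```clojure
;; Hypothetical helper (not wreck's implementation): concatenates regexes by
;; converting each one to its String form and recompiling the result.
;; Every call allocates intermediate Strings and a new Pattern. JVM only.
(defn naive-join
  [& res]
  (re-pattern (apply str res)))

;; Build the regex once, at namespace load time, then reuse it.
(def word-then-digits-re
  (naive-join #"[a-z]+" #"\d+"))

(re-matches word-then-digits-re "abc123")  ;=> "abc123"
(re-seq word-then-digits-re "ab1 cd22")    ;=> ("ab1" "cd22")
```

Binding the result with `def` means the construction cost is paid once, which is the usage pattern recommended above.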

Regex flags

Regex flags are a thorny corner case: they're highly platform specific, and (in their usual usage) they don't compose properly because of their global nature (and regex composition is the entire point of wreck). As a result wreck makes the opinionated design choice to automatically and aggressively convert all flags it finds to embedded flag groups (e.g. (?i)[abc]+ becomes (?i:[abc]+)), as these groups scope the effect of the flag(s) and are therefore far easier to reason about during regex composition. This is done using the embed-flags function, which you can also call explicitly if you have a third-party regex (e.g. from a library), don't know whether it has flags, and want to apply the same logic wreck uses to embed any flags it might have.

However, when constructing regexes from scratch, it is strongly recommended that you only use the flags-grp function. It directly creates an embedded flag group, avoiding any guesswork about wreck's automatic embedding logic, or the scope of the flag(s).
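
To see why scoped flag groups compose more predictably than unscoped flags, here is a minimal sketch using only clojure.core on the JVM. It deliberately avoids wreck's own functions, and the fragments are made up for illustration.

```clojure
;; Two made-up fragments are concatenated; only "abc" is meant to be
;; case-insensitive.
(def leaky-re  (re-pattern (str "(?i)abc"  "def")))  ; unscoped flag: stays on for "def" too
(def scoped-re (re-pattern (str "(?i:abc)" "def")))  ; flag group: applies to "abc" only

(re-matches leaky-re  "ABCDEF")  ;=> "ABCDEF" (the flag leaked into "def")
(re-matches scoped-re "ABCDEF")  ;=> nil
(re-matches scoped-re "ABCdef")  ;=> "ABCdef"
```

With the unscoped form, the meaning of a fragment depends on whatever happens to precede it; with the flag group, each fragment carries its own flags.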

Important: The JVM has 2 flags that can only be set globally (they cannot be embedded), and JavaScript has 5. wreck very deliberately does not provide functionality to set global flags on regexes, because of the difficulties they create during regex composition. If you have a use case that cannot be satisfied any way except with one of these non-embeddable flags, you can fall back on interop to set them (Clojure itself does not provide such a mechanism either). Such flags must be applied at the very end of regex composition, after all wreck functions have been applied (since wreck functions aggressively remove such flags).
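
As a rough sketch of the interop fallback on the JVM, the following recompiles a finished regex with `Pattern/CANON_EQ`, one of the JVM flags that has no embedded form. `with-canon-eq` is a hypothetical helper name, not part of wreck, and the regex literal stands in for whatever your composition produced.

```clojure
(import 'java.util.regex.Pattern)

;; Hypothetical helper (not part of wreck): recompiles a finished regex with
;; the non-embeddable CANON_EQ flag added to any flags it already has.
;; JVM only. Apply it last, after all composition is done.
(defn with-canon-eq
  [^Pattern re]
  (Pattern/compile (.pattern re)
                   (int (bit-or (.flags re) Pattern/CANON_EQ))))

;; Stand-in for the output of your regex composition.
(def final-re (with-canon-eq #"(?i:cafe)"))

(pos? (bit-and (.flags final-re) Pattern/CANON_EQ))  ;=> true
(re-matches final-re "CAFE")                         ;=> "CAFE"
```

Because the result carries a global flag, it should not be passed back into further composition.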

Warning: ClojureScript appears to have a JS code generation bug in the logic that emulates support for JVM-style embedded flags such as (?i) (JavaScript does not natively support embedded flags of this form). This bug manifests as a JavaScript syntax error at runtime (the JavaScript code generated by ClojureScript is syntactically invalid). Be aware of this if you construct regexes using JVM-style embedded flags (and better yet, use the flags-grp function instead).

Trying it Out

Clojure CLI

$ clj -Sdeps '{:deps {com.github.pmonks/wreck {:mvn/version "RELEASE"}}}'

Leiningen

$ lein try com.github.pmonks/wreck

deps-try

$ deps-try com.github.pmonks/wreck

Demo

(require '[wreck.api :as re])


;; Basics

(re/esc ".*")
;=> "\\.\\*"  ; Note: a String - most other fns return regexes

(re/qot ".*")
;=> #"\Q.*\E"

(re/join #"a" #"b")
;=> #"ab"

(re/join "[" #"\p{Punct}" #"\p{Space}" "]+")  ; join also supports strings (and other data
                                              ; types), allowing syntactically invalid
                                              ; fragments to be used to build up a valid
                                              ; expression
;=> #"[\p{Punct}\p{Space}]+"

; Because equality isn't defined for regexes in Clojure
(re/=' #"ab" (re/join #"a" #"b"))
;=> true


;; Groups

(re/grp #"a" #"b")
;=> #"(?:ab)"  ; Default group is non-capturing

(re/cg #"a" #"b")
;=> #"(ab)"  ; But we can also do capturing groups

(re/ncg "ab" #"a" #"b")
;=> #"(?<ab>ab)"  ; And named capturing groups (much more useful, especially with rencg!)

(re/grp "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" 0 1 2 3 4 5 6 7 8 9)
;=> #"(?:abcdefghijklmnopqrstuvwxyz0123456789)"  ; Group functions are variadic, including
                                                 ; most of the variants shown next. They also
                                                 ; (like join) support regexes, strings, and
                                                 ; other data types


;; Cardinality

(re/opt #"foo")  ; opt = optional (i.e. zero or one)
;=> #"foo?"  ; Probably not what we want, so...

(re/opt-grp #"foo")
;=> #"(?:foo)?"  ; That's more like it!

(re/zom-grp #"foo")  ; zom = zero or more
;=> #"(?:foo)*"

(re/oom-grp #"foo")  ; oom = one or more
;=> #"(?:foo)+"

(re/exn-grp 2 #"foo")  ; exn = exactly n
;=> #"(?:foo){2}"

(re/nom-grp 4 #"foo")  ; nom = n or more
;=> #"(?:foo){4,}"

(re/n2m-grp 12 17 #"foo")  ; n2m = n to m
;=> #"(?:foo){12,17}"

; There are -cg and -ncg variants of all of these fns as well, and all are variadic


;; Alternation

(re/alt #"foo" #"bar")  ; Be careful using this fn as alternation has the lowest
;=> #"foo|bar"          ; precedence in regexes

(re/alt-grp #"foo" #"bar")
;=> #"(?:foo|bar)"

; There are -cg and -ncg variants of this fn as well, and all are variadic


;; Logical operators

(re/and-grp #"foo" #"bar")
;=> #"(?:foobar|barfoo)"

(re/or-grp #"foo" #"bar")
;=> #"(?:foobar|barfoo|foo|bar)"

(re/or-grp #"foo" #"bar" #"\s+")  ; Logical operators also support separators
;=> #"(?:foo\s+bar|bar\s+foo|foo|bar)"

(re/xor-grp #"foo" #"bar")  ; The same as alt-grp, but provided for ease of comprehension
;=> #"(?:foo|bar)"          ; in lengthy regex composition expressions that use the logical
                            ; operators


; There are -cg and -ncg variants of all of these fns as well, but note that unlike the other
; variants, none of the logical operator grouping variants are variadic


;; A more complex example that composes a longer regex from just a few easy-to-read statements
;; (from the unit tests)

; "Lesser" or "Library", but in any order, or either word by itself, with either a forward
; slash or the word "or" as a separator
(def lorl-re (re/or-grp "Lesser" "Library" (re/alt-grp #"\s*/\s*" #"\s+or\s+")))
;=> #"(?:Lesser(?:\s*/\s*|\s+or\s+)Library|Library(?:\s*/\s*|\s+or\s+)Lesser|Lesser|Library)"

(def lgpl-re (re/join
               #"(?<!\w)"                                       ; No word character before
               (re/flags-grp "i"                                ; Flags (in a group)
                 (re/alt-ncg "lgpl"                             ; Alternations, in a NCG
                   "LGPL"                                       ; LGPL literal (string)
                   (re/join "GNU" #"\s+" lorl-re #"\s+" "GPL")  ; GNU <lorl regex> GPL
                   (re/join "GNU" #"\s+" lorl-re)               ; GNU <lorl regex>
                   (re/join lorl-re #"\s+" "GPL")))             ; <lorl regex> GPL
               #"(?!\w)"))                                      ; No word character after
;=> #"(?<!\w)(?i:(?<lgpl>LGPL|GNU\s+(?:Lesser(?:\s*/\s*|\s+or\s+)Library|Library(?:\s*/\s*|
;=>   \s+or\s+)Lesser|Lesser|Library)\s+GPL|GNU\s+(?:Lesser(?:\s*/\s*|\s+or\s+)Library|Library
;=>   (?:\s*/\s*|\s+or\s+)Lesser|Lesser|Library)|(?:Lesser(?:\s*/\s*|\s+or\s+)Library|Library
;=>   (?:\s*/\s*|\s+or\s+)Lesser|Lesser|Library)\s+GPL))(?!\w)"

; Which would you rather maintain?  😉
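
The demo's remark that equality isn't defined for regexes can be checked directly on the JVM with clojure.core alone. The helper `same-pattern-text?` below is hypothetical and only compares pattern text; it is not how wreck's `='` is implemented and it ignores flags.

```clojure
;; Hypothetical helper, for illustration only: treats two regexes as equal
;; when their String forms are equal. JVM only.
(defn same-pattern-text?
  [re1 re2]
  (= (str re1) (str re2)))

(= #"ab" #"ab")                   ;=> false (two distinct Pattern objects)
(same-pattern-text? #"ab" #"ab")  ;=> true
(same-pattern-text? #"ab" #"ba")  ;=> false
```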

Contributor Information

Contributing Guidelines

Bug Tracker

Code of Conduct

Developer Workflow

This project uses the git-flow branching strategy, and the permanent branches are called release and dev. Any changes to the release branch are considered a release and auto-deployed (JARs to Clojars, API docs to GitHub Pages, etc.).

For this reason, all development must occur either in branch dev, or (preferably) in temporary branches off of dev. All PRs from forked repos must also be submitted against dev; the release branch is only updated from dev via PRs created by the core development team. All other changes submitted to release will be rejected.

Build Tasks

wreck uses tools.build. You can get a list of available tasks by running:

clojure -A:deps -T:build help/doc

Of particular interest are:

clojure -T:build test - run the unit tests
clojure -T:build lint - run the linters (clj-kondo and eastwood)
clojure -T:build ci - run the full CI suite (check for outdated dependencies, run the unit tests, run the linters)
clojure -T:build install - build the JAR and install it locally (e.g. so you can test it with downstream code)

Please note that the release and deploy tasks are restricted to the core development team (and will not function if you run them yourself).

License

Distributed under the Mozilla Public License, version 2.0.

SPDX-License-Identifier: MPL-2.0
