Pure Clojure implementations of the wcwidth and wcswidth POSIX functions, plus some other, more useful non-POSIX functions related to this use case.
When Unicode text is sent to a Unicode-capable fixed-width device (e.g. a terminal, monospaced printer, etc.), the "characters" that make up that text each have a well-defined "notional width" of either 0, 1, or 2 columns (where a typical ASCII character takes up 1 column). This is standardised in Unicode Technical Report #11, and implemented as the POSIX C functions wcwidth and wcswidth. However, the JVM doesn't provide these functions, so applications that need to know these display widths (e.g. for terminal output formatting purposes) are left to their own devices. While there are Java libraries that have implemented this (notably JLine), pulling in a large dependency when one only uses a very small part of it is sometimes overkill.
clj-wcwidth provides a small, zero-dependency-by-default, pure Clojure implementation of this functionality (and more).
This library addresses various inconveniences in both POSIX and JLine:
* The POSIX wcswidth function returns -1 if a string contains any non-printing characters. In practice this means that Unicode text needs to be pre-processed before being passed to this function.
* wcwidth (the POSIX function that returns the display width of a single code point) only handles one code point at a time, but what we think of as a "character" is actually a "Unicode grapheme cluster" and, critically, some grapheme clusters are made up of multiple code points (see below).
UTR11 defines display widths for most Unicode code points, but a code point is not necessarily the same thing as a grapheme cluster (a "character"). This library (and others like it) works by breaking strings up into their grapheme clusters, determining the width of each cluster based on the display width rules of the code point(s) that comprise that cluster (i.e. using the rules in UTR11), then summing the cluster widths to arrive at the string's overall display width.
At the grapheme cluster level this manifests in several ways, including:
* a is defined by a single code point (U+0061), and takes up 1 display column.
* ☕️ is defined by a single code point (U+2615), and takes up 2 display columns.
* é is defined by 2 code points (U+0065 and U+0341), but only takes up 1 display column.
* 🏳️⚧️ is defined by 5 code points (U+1F3F3, U+FE0F, U+200D, U+26A7, and U+FE0F), and takes up 2 display columns.
caution
There is a common misconception that the JVM's char and Character types represent a Unicode code point, but that is not the case. Instead, due to an epically shortsighted decision by Sun in the early 2000s, they represent a UTF-16 "code unit", a footgun that spawns bugs throughout JVM / Clojure code when surrogate pairs aren't properly handled during processing of sequences of chars (including strings). This is why, for example, calling count on the string "🏳️⚧️" returns 6, instead of the expected 5 - the leading code point (U+1F3F3) cannot be represented by a single JVM char, and is instead represented as two chars containing the equivalent UTF-16 surrogate pair ([0xD83C, 0xDFF3]).
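The difference between code units and code points can be checked at a REPL using only JDK interop. The following is a minimal sketch, not part of this library: code-points->str is a hypothetical helper defined here purely for illustration, so the snippet runs without any dependencies.

```clojure
;; Hypothetical helper (JDK interop only): build a string from a seq of code points.
(defn code-points->str [cps]
  (apply str (map #(String. (Character/toChars (int %))) cps)))

;; The transgender flag: 5 code points, the first of which lies outside the BMP.
(def flag (code-points->str [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]))

;; count sees UTF-16 code units (JVM chars), not code points.
(count flag)
; ==> 6

;; codePointCount treats each surrogate pair as a single code point.
(.codePointCount ^String flag 0 (count flag))
; ==> 5

;; U+1F3F3 needs two chars: the surrogate pair [0xD83C 0xDFF3].
(mapv int (Character/toChars 0x1F3F3))
; ==> [55356 57331]

;; Iterating by code point rather than by char recovers the original values.
(vec (.toArray (.codePoints ^String flag)))
; ==> [127987 65039 8205 9895 65039]
```

When walking a string manually, stepping by code point (for example via .codePoints) rather than by char avoids splitting surrogate pairs.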
note
This library fundamentally depends on being able to break strings into grapheme clusters, and the rules for doing so evolve with each version of the Unicode specification. The JVM provides this capability via the java.text.BreakIterator class, but unfortunately the implementation of this class tends to lag behind the latest version of the Unicode specification, especially in JVM versions prior to 20. For that reason, this library will check at runtime whether the ICU4J library is on the classpath, and if so use its implementation of the BreakIterator class instead of the JDK's. This gives downstream users of the library the ability to choose whether to consume this library in a lightweight, zero-dependency, "best effort of the JVM" form (the default), or whether to introduce the large (14MB) ICU4J dependency in order to ensure correct behaviour across a wider range of JVM versions and Unicode inputs.
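To make the approach concrete, here is a rough sketch of grapheme cluster breaking with the JDK's java.text.BreakIterator, a runtime classpath probe for ICU4J, and the "sum the cluster widths" step. It is an illustration under stated assumptions, not this library's actual code: grapheme-clusters, icu4j-available? and sum-cluster-widths are hypothetical names, and the per-cluster width function is passed in as a parameter rather than implementing the UTR11 rules.

```clojure
(import 'java.text.BreakIterator)

(defn grapheme-clusters
  "Splits s into a vector of grapheme clusters using the JDK's BreakIterator.
  Results for complex emoji sequences depend on the JVM version."
  [^String s]
  (let [it (doto (BreakIterator/getCharacterInstance) (.setText s))]
    (loop [start (.first it)
           end   (.next it)
           acc   []]
      (if (= end BreakIterator/DONE)
        acc
        (recur end (.next it) (conj acc (subs s start end)))))))

(defn icu4j-available?
  "Hypothetical probe: true if ICU4J's BreakIterator class can be loaded."
  []
  (try
    (Class/forName "com.ibm.icu.text.BreakIterator")
    true
    (catch ClassNotFoundException _ false)))

(defn sum-cluster-widths
  "Sums the width of each grapheme cluster in s, as reported by cluster-width-fn
  (an assumed caller-supplied function from cluster string to column count)."
  [cluster-width-fn s]
  (reduce + 0 (map cluster-width-fn (grapheme-clusters s))))

;; "Jérôme" written with combining diacritics: 8 chars, 6 grapheme clusters.
(count (grapheme-clusters "Je\u0301ro\u0302me"))
; ==> 6

;; Naive stand-in width function: every cluster is 1 column wide.
(sum-cluster-widths (constantly 1) "Je\u0301ro\u0302me")
; ==> 6
```

Passing the width function as an argument keeps the sketch independent of any particular width table; a real implementation would derive each cluster's width from its code points.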
clj-wcwidth is available as a Maven artifact from Clojars.
API documentation is available here. The unit tests provide comprehensive usage examples.
$ clj -Sdeps '{:deps {com.github.pmonks/clj-wcwidth {:mvn/version "RELEASE"}}}'
$ lein try com.github.pmonks/clj-wcwidth
$ deps-try com.github.pmonks/clj-wcwidth
(require '[clojure.string :as s])
(require '[wcwidth.api :as wcw])
;; POSIX-compliant wcwidth / wcswidth
(def ascii-esc \u001B)
(wcw/wcwidth \A)
; ==> 1
(wcw/wcwidth \©)
; ==> 1
(wcw/wcwidth 0x0000) ; ASCII NUL (zero width)
; ==> 0
(wcw/wcwidth ascii-esc) ; ASCII ESC (non printing)
; ==> -1
(wcw/wcwidth 0x1F921) ; 🤡 (double width)
; ==> 2
(wcw/wcswidth "hello, world")
; ==> 12
(wcw/wcswidth "hello, 🌏")
; ==> 9
;; wcswidth (POSIX) vs display-width (non-POSIX, but more practical)
(wcw/wcswidth (str "hello, " ascii-esc))
; ==> -1
(wcw/display-width (str "hello, " ascii-esc))
; ==> 7
;; ANSI escape code support
(def ansi-hide-cursor (str ascii-esc "[25l"))
(wcw/display-width (str "hello, " ansi-hide-cursor))
; ==> 7
;; Examples showing how clojure.core/count doesn't work for this use case
(def jerome (wcw/code-points->string [\J \e 0x0341 \r \o 0x0302 \m \e])) ; Jérôme, using combining diacritics
(wcw/display-width jerome)
; ==> 6
(count jerome)
; ==> 8
(def deseret-capital-long-i (wcw/code-point->string 0x10400)) ; 𐐀
(wcw/display-width deseret-capital-long-i)
; ==> 1
(count deseret-capital-long-i)
; ==> 2
(def zalgo-text "Ẓ̌á̲l͔̝̞̄̑͌g̖̘̘̔̔͢͞͝o̪̔T̢̙̫̈̍͞e̬͈͕͌̏͑x̺̍ṭ̓̓ͅ")
(wcw/display-width zalgo-text)
; ==> 9
(count zalgo-text)
; ==> 44 ; lol 🤡
(def lots-of-escapes (s/join (repeat 1000 ascii-esc)))
(wcw/display-width lots-of-escapes)
; ==> 0
(count lots-of-escapes)
; ==> 1000 ; lol 🤡
(def transgender-flag (wcw/code-points->string [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F])) ; 🏳️⚧️
(wcw/display-width transgender-flag)
; ==> 2
(count transgender-flag)
; ==> 6 ; lol 🤡
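For terminal output formatting, a width function such as display-width can drive column padding. The helper below is a hypothetical sketch, not part of this library's API: pad-right takes the width function as an argument, so it can be tried with count on plain ASCII and then swapped to wcw/display-width once the library is loaded.

```clojure
(defn pad-right
  "Appends spaces to text until it occupies at least n columns, as measured
  by width-fn. Text that is already n columns or wider is returned unchanged."
  [width-fn text n]
  (let [padding (max 0 (- n (width-fn text)))]
    (apply str text (repeat padding \space))))

;; With plain ASCII, count happens to equal the display width.
(pad-right count "ab" 5)
; ==> "ab   "

;; Text that is already wide enough is left alone.
(pad-right count "hello" 3)
; ==> "hello"
```

With the library required as wcw, (pad-right wcw/display-width some-text 20) would pad based on display columns rather than on the number of chars.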
This project uses the git-flow branching strategy, with the caveat that the permanent branches are called release and dev. Any changes to the release branch are considered a release and auto-deployed (JARs to Clojars, API docs to GitHub Pages, etc.).
For this reason, all development must occur either in branch dev, or (preferably) in temporary branches off of dev. All PRs from forked repos must also be submitted against dev; the release branch is only updated from dev via PRs created by the core development team. All other changes submitted to release will be rejected.
wcwidth uses tools.build. You can get a list of available tasks by running:
clojure -A:deps -T:build help/doc
Of particular interest are:
* clojure -T:build test - run the unit tests
* clojure -T:build lint - run the linters (clj-kondo and eastwood)
* clojure -T:build ci - run the full CI suite (check for outdated dependencies, run the unit tests, run the linters)
* clojure -T:build install - build the JAR and install it locally (e.g. so you can test it with downstream code)
Please note that the deploy task is restricted to the core development team (and will not function if you run it yourself).
Copyright © 2022 Peter Monks
Distributed under the Mozilla Public License, version 2.0.
SPDX-License-Identifier: MPL-2.0