Pure Clojure implementations of the wcwidth and wcswidth POSIX functions (plus some other useful Unicode functions).
When Unicode grapheme clusters ("characters") are sent to a fixed-width device (e.g. a terminal or monospaced editor), many have a well-defined "notional width", expressed in units of columns (where a typical ASCII character takes up 1 column). This is partially standardised in Unicode Technical Report #11, which is implemented as the POSIX functions wcwidth and wcswidth.
The JVM doesn't provide these functions, however, so applications that need to know these widths (e.g. for terminal screen formatting purposes) are left to their own devices. While there are Java libraries that implement this themselves (notably ICU4J and JLine), pulling in a large dependency in order to use only a very small part of it is sometimes overkill.
This library provides a small, zero-dependency-by-default, pure Clojure implementation of this functionality and goes further by (optionally) also taking ANSI escape codes into account (as these are also zero width on an ANSI-capable terminal).
Why not just use count?
When supplied with a sequence of textual data (i.e. a String or char[]), count simply counts the number of Java chars in that sequence, which is not the same as the number of Unicode grapheme clusters (since a grapheme cluster may be made up of multiple Unicode code points). What's worse is that, due to a historical oddity of the JVM, a Java char isn't even necessarily the same thing as a Unicode code point. Specifically, a Java char is a 16-bit UTF-16 "code unit", and Unicode code points in the supplementary planes are represented by 2 such code units (and therefore as 2 chars on the JVM).
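The difference between chars and code points can be seen with plain JVM interop, without this library. The following is a minimal sketch; the var names are chosen only for illustration.

```clojure
;; U+10400 (DESERET CAPITAL LETTER LONG I) lies in a supplementary plane,
;; so UTF-16 encodes it as a surrogate pair, i.e. two Java chars.
(def deseret (String. (Character/toChars 0x10400)))

;; clojure.core/count reports UTF-16 code units (Java chars)
(def char-count (count deseret))

;; String.codePointCount reports Unicode code points
(def code-point-count (.codePointCount ^String deseret 0 (count deseret)))

(println char-count code-point-count)
;; prints: 2 1
```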
Furthermore, count doesn't account for combining, non-printing, or zero-width Unicode code points; it counts them as chars regardless of whether they get displayed on Unicode-enabled devices or not. Similarly it has no awareness of the non-printing nature of ANSI escape codes.
Technically, UTR11 defines display widths for every Unicode code point, which is not necessarily the same thing as a grapheme cluster (a "character"). So the way this library (and others like it) works is to break strings up into their grapheme clusters, determine the width of each cluster based on the display width rules of the code point(s) that comprise it, then add the cluster widths together to arrive at the string's overall display width.
In many cases this is a simple 1:1 correspondence: the Latin character "a", for example, is a single grapheme cluster (a) defined by a single code point (U+0061), and takes up a single display column. At the other end of complexity, the transgender flag emoji is a single grapheme cluster (🏳️‍⚧️), defined by 5 code points (U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F), and takes up 2 display columns. It also (due to the historical Java issue mentioned above) takes up 6 (!) JVM chars, further complicating the situation for Clojure developers.
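The code point and char counts quoted above can be checked with plain JVM interop. This is a minimal sketch; code-points->string is a hypothetical helper written for this illustration, not part of this library's API.

```clojure
(defn code-points->string
  "Hypothetical helper: builds a String from a sequence of Unicode code points."
  [code-points]
  (let [sb (StringBuilder.)]
    (doseq [cp code-points]
      (.appendCodePoint sb (int cp)))
    (str sb)))

;; white flag, variation selector, zero-width joiner, transgender symbol, variation selector
(def trans-flag (code-points->string [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]))

(println (.codePointCount ^String trans-flag 0 (count trans-flag)))
;; prints: 5
(println (count trans-flag)) ; U+1F3F3 needs a surrogate pair, hence one extra char
;; prints: 6
```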
This library fundamentally depends on being able to break strings into Unicode grapheme clusters, which the JVM supports via the java.text.BreakIterator class. Unfortunately the implementation of this class tends to lag behind the latest Unicode specification, especially in JVM versions prior to 20 (see JDK-8291660 for some specifics).
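As a rough illustration of what grapheme cluster breaking looks like with the JDK class, here is a minimal sketch. The grapheme-clusters function is a hypothetical helper, not this library's implementation, and its results for complex emoji sequences will vary with the JVM version for the reasons given above.

```clojure
(import 'java.text.BreakIterator)

(defn grapheme-clusters
  "Hypothetical helper: splits s into a vector of substrings, one per
  grapheme cluster, as determined by the JDK's character BreakIterator."
  [^String s]
  (let [it (doto (BreakIterator/getCharacterInstance)
             (.setText s))]
    (loop [start (long (.first it))
           end   (long (.next it))
           acc   []]
      (if (= end BreakIterator/DONE)
        acc
        (recur end (long (.next it)) (conj acc (subs s start end)))))))

;; "e" followed by U+0301 (combining acute accent) is a single cluster,
;; even though it is two code points and two chars
(println (count (grapheme-clusters "e\u0301a")))
;; prints: 2
```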
For that reason, this library will check at runtime whether the ICU4J library is on the classpath, and if so use its implementation of the BreakIterator class instead of the JDK's. This gives downstream users of the library the ability to choose whether to consume this library in a lightweight, zero-dependency "best effort" form, or whether to introduce the (large) ICU4J library and thereby ensure correct behaviour across a wider range of esoteric Unicode inputs.
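One common way to perform this kind of runtime check on the JVM is to attempt to load a class by name. The sketch below shows that general technique only; it is an assumption that this resembles what the library does internally, and class-available? is a hypothetical helper name.

```clojure
(defn class-available?
  "Hypothetical helper: returns true if the named class can be loaded
  from the current classpath, false otherwise."
  [^String class-name]
  (try
    (Class/forName class-name)
    true
    (catch ClassNotFoundException _
      false)))

;; com.ibm.icu.text.BreakIterator is ICU4J's counterpart to java.text.BreakIterator
(def icu4j-available? (class-available? "com.ibm.icu.text.BreakIterator"))

(println (if icu4j-available?
           "ICU4J found on the classpath"
           "ICU4J not found, falling back to the JDK"))
```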
Note that the unit tests are run using the ICU4J library only, since they are run on a CI matrix of JVM versions, and include some tests that are known to fail on JVM versions prior to v24.
clj-wcwidth is available as a Maven artifact from Clojars.
API documentation is available here. The unit tests provide comprehensive usage examples.
$ clj -Sdeps '{:deps {com.github.pmonks/clj-wcwidth {:mvn/version "RELEASE"}}}'
$ lein try com.github.pmonks/clj-wcwidth
$ deps-try com.github.pmonks/clj-wcwidth
(require '[clojure.string :as s])
(require '[wcwidth.api :as wcw])
;; POSIX-compliant wcwidth / wcswidth
(def ascii-esc \u001B)
(wcw/wcwidth \A)
; ==> 1
(wcw/wcwidth \©)
; ==> 1
(wcw/wcwidth 0x0000) ; ASCII NUL (zero width)
; ==> 0
(wcw/wcwidth ascii-esc) ; ASCII ESC (non printing)
; ==> -1
(wcw/wcwidth 0x1F921) ; 🤡 (double width)
; ==> 2
(wcw/wcswidth "hello, world")
; ==> 12
(wcw/wcswidth "hello, 🌏")
; ==> 9
;; wcswidth (POSIX) vs display-width (non-POSIX, but more practical)
(wcw/wcswidth (str "hello, " ascii-esc))
; ==> -1
(wcw/display-width (str "hello, " ascii-esc)) ; ASCII ESC
; ==> 7
;; ANSI escape code support
(def ansi-hide-cursor (str ascii-esc "[25l"))
(wcw/display-width (str "hello, " ansi-hide-cursor))
; ==> 7
;; Examples showing how clojure.core/count doesn't work
(def jerome (wcw/code-points-to-string [\J \e 0x0341 \r \o 0x0302 \m \e])) ; Jérôme, using combining diacritics
(wcw/display-width jerome)
; ==> 6
(count jerome)
; ==> 8
(def deseret-capital-long-i (wcw/code-point-to-string 0x10400)) ; 𐐀
(wcw/display-width deseret-capital-long-i)
; ==> 1
(count deseret-capital-long-i)
; ==> 2
(def lots-of-escapes (s/join (repeat 1000 ascii-esc)))
(wcw/display-width lots-of-escapes)
; ==> 0
(count lots-of-escapes)
; ==> 1000 ; lol 🤡
(def trans-flag (wcw/code-points-to-string [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F])) ; 🏳️‍⚧️
(wcw/display-width trans-flag)
; ==> 2
(count trans-flag)
; ==> 6 ; lol 🤡
This project uses the git-flow branching strategy, with the caveat that the permanent branches are called release and dev. Any changes to the release branch are considered a release and auto-deployed (JARs to Clojars, API docs to GitHub Pages, etc.).
For this reason, all development must occur either in branch dev, or (preferably) in temporary branches off of dev. All PRs from forked repos must also be submitted against dev; the release branch is only updated from dev via PRs created by the core development team. All other changes submitted to release will be rejected.
wcwidth uses tools.build. You can get a list of available tasks by running:
clojure -A:deps -T:build help/doc
Of particular interest are:
clojure -T:build test - run the unit tests
clojure -T:build lint - run the linters (clj-kondo and eastwood)
clojure -T:build ci - run the full CI suite (check for outdated dependencies, run the unit tests, run the linters)
clojure -T:build install - build the JAR and install it locally (e.g. so you can test it with downstream code)
Please note that the deploy task is restricted to the core development team (and will not function if you run it yourself).
Copyright © 2022 Peter Monks
Distributed under the Mozilla Public License, version 2.0.
SPDX-License-Identifier: MPL-2.0