clj-wcwidth

Pure Clojure implementations of the wcwidth and wcswidth POSIX functions (plus some other useful Unicode functions).

Why?

When Unicode grapheme clusters ("characters") are sent to a fixed-width device (e.g. a terminal or monospaced editor), many have a well-defined "notional width", expressed in units of columns (where a typical ASCII character takes up 1 column). This is partially standardised in Unicode Technical Report #11, which is implemented as the POSIX functions wcwidth and wcswidth.

However, the JVM doesn't provide these functions, so applications that need to know these widths (e.g. for terminal screen formatting purposes) are left to their own devices. While there are Java libraries that implement this functionality themselves (notably ICU4J and JLine), pulling in a large dependency when one only uses a very small part of it is sometimes overkill.

This library provides a small, zero-dependency-by-default, pure Clojure implementation of this functionality and goes further by (optionally) also taking ANSI escape codes into account (as these are also zero width on an ANSI-capable terminal).

Why not `count`?

When supplied with textual data (i.e. a String or char[]), count simply returns the number of Java chars it contains, which is not the same as the number of Unicode grapheme clusters (since a single grapheme cluster may be made up of multiple Unicode code points). What's worse is that, due to a historical oddity of the JVM, a Java char isn't even necessarily the same thing as a Unicode code point. Specifically, a Java char is a 16-bit "code unit" from UTF-16, and each Unicode code point in the supplementary planes is represented by 2 such code units (and therefore by 2 chars on the JVM).

Furthermore, count doesn't account for combining, non-printing, or zero-width Unicode code points; it counts them as chars regardless of whether they get displayed on Unicode-enabled devices or not. Similarly it has no awareness of the non-printing nature of ANSI escape codes.
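
The distinctions described above can be seen with nothing but JDK interop. The following is a minimal sketch; `code-point-count` is a hypothetical helper written for this illustration and is not part of this library's API.

```clojure
;; Illustrative only: plain JDK interop, no clj-wcwidth functions involved.
(defn code-point-count
  "Number of Unicode code points in s (a surrogate pair counts once)."
  [^String s]
  (.codePointCount s 0 (.length s)))

;; U+10400 lives in a supplementary plane, so UTF-16 needs a surrogate pair.
(def deseret (String. (Character/toChars (int 0x10400))))

(count deseret)             ; 2 chars (UTF-16 code units)
(code-point-count deseret)  ; 1 code point

;; "e" followed by U+0301 COMBINING ACUTE ACCENT displays as a single "é".
(def e-acute "e\u0301")

(count e-acute)             ; 2 chars
(code-point-count e-acute)  ; 2 code points, yet only 1 grapheme cluster
```

Neither number answers the question "how many columns will this occupy?", which is why a width-aware function is needed.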

How does it work?

Technically, UTR11 defines display widths for every Unicode code point, which is not necessarily the same thing as a grapheme cluster (a "character"). So the way this library (and others like it) works is to break strings up into their grapheme clusters, then determine the width of each cluster based on the display width rules of the code point(s) that comprise that cluster, and finally add the cluster widths together to arrive at the string's overall display width.

In many cases this is a simple 1:1 correspondence - the Latin character "a", for example, is a single grapheme cluster (a) defined by a single code point (U+0061), and takes up a single display column. At the other end of complexity, the transgender flag emoji is a single grapheme cluster (🏳️‍⚧️), defined by 5 code points (U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F), and takes up 2 display columns. It also (due to the historical Java issue mentioned above) takes up 6 (!) JVM chars, further complicating the situation for Clojure developers.
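
The first step (breaking a string into grapheme clusters) can be sketched with the JDK's own java.text.BreakIterator. This is an illustration under stated assumptions, not this library's actual implementation: `grapheme-clusters` is a hypothetical helper, and the per-cluster width rules are deliberately left out.

```clojure
(import '(java.text BreakIterator))

;; Hypothetical helper for illustration; not part of the clj-wcwidth API.
(defn grapheme-clusters
  "Splits s into a vector of grapheme cluster substrings, as determined by
  the JDK's character BreakIterator (results may vary by JVM version)."
  [^String s]
  (let [^BreakIterator it (doto (BreakIterator/getCharacterInstance)
                            (.setText s))]
    ;; Walk consecutive boundary pairs until the iterator reports DONE.
    (loop [start (.first it)
           end   (.next it)
           acc   []]
      (if (= end BreakIterator/DONE)
        acc
        (recur end (.next it) (conj acc (subs s start end)))))))

;; "a", then "e" + U+0301 (combining acute accent), then "b":
(count "ae\u0301b")                      ; 4 chars
(count (grapheme-clusters "ae\u0301b"))  ; 3 clusters
```

For more complex input, such as multi-code-point emoji sequences, the clusters returned by the JDK's iterator depend on the JVM version, so no output is shown for those here.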

A note about JVM Unicode support

This library fundamentally depends on being able to break strings into Unicode grapheme clusters, which the JVM supports via the java.text.BreakIterator class. Unfortunately the implementation of this class tends to lag behind the latest Unicode specification, especially in JVM versions prior to 20 (see JDK-8291660 for some specifics).

For that reason, this library will check at runtime whether the ICU4J library is on the classpath, and if so use its implementation of the BreakIterator class instead of the JDK's. This gives downstream users of the library the ability to choose whether to consume this library in a lightweight, zero-dependency "best effort" form, or whether to introduce the (large) ICU4J library and thereby ensure correct behaviour across a wider range of esoteric Unicode inputs.
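
One common way to perform this kind of runtime check is to attempt to load a class by name and treat failure as "not present". The sketch below assumes that approach; it is not taken from this library's source, and `class-available?` is a hypothetical helper. The only external name it relies on is com.ibm.icu.text.BreakIterator, ICU4J's counterpart to java.text.BreakIterator.

```clojure
;; Hypothetical helper for illustration; not part of the clj-wcwidth API.
(defn class-available?
  "Returns true if the named class can be loaded from the current classpath,
  false if it cannot be found."
  [^String class-name]
  (try
    (Class/forName class-name)
    true
    (catch ClassNotFoundException _
      false)))

;; True only when the ICU4J JAR has been added to the classpath.
(def icu4j-available?
  (class-available? "com.ibm.icu.text.BreakIterator"))
```

Because the lookup is by name, code written this way needs no compile-time dependency on ICU4J, which is what allows a zero-dependency default.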

Note that the unit tests are run using the ICU4J library only, since they are run on a CI matrix of JVM versions, and include some tests that are known to fail on JVM versions prior to v24.

Installation

clj-wcwidth is available as a Maven artifact from Clojars.

API Documentation

API documentation is available here. The unit tests provide comprehensive usage examples.

Trying it Out

Clojure CLI

$ clj -Sdeps '{:deps {com.github.pmonks/clj-wcwidth {:mvn/version "RELEASE"}}}'

Leiningen

$ lein try com.github.pmonks/clj-wcwidth

deps-try

$ deps-try com.github.pmonks/clj-wcwidth

Demo

(require '[clojure.string :as s])
(require '[wcwidth.api :as wcw])

;; POSIX-compliant wcwidth / wcswidth

(def ascii-esc \u001B)

(wcw/wcwidth \A)
; ==> 1
(wcw/wcwidth \©)
; ==> 1
(wcw/wcwidth 0x0000)     ; ASCII NUL (zero width)
; ==> 0
(wcw/wcwidth ascii-esc)  ; ASCII ESC (non printing)
; ==> -1
(wcw/wcwidth 0x1F921)    ; 🤡 (double width)
; ==> 2

(wcw/wcswidth "hello, world")
; ==> 12
(wcw/wcswidth "hello, 🌏")
; ==> 9

;; wcswidth (POSIX) vs display-width (non-POSIX, but more practical)

(wcw/wcswidth (str "hello, " ascii-esc))
; ==> -1
(wcw/display-width (str "hello, " ascii-esc))  ; ASCII ESC
; ==> 7

;; ANSI escape code support

(def ansi-hide-cursor (str ascii-esc "[25l"))
(wcw/display-width (str "hello, " ansi-hide-cursor))
; ==> 7

;; Examples showing how clojure.core/count doesn't work

(def jerome (wcw/code-points-to-string [\J \e 0x0341 \r \o 0x0302 \m \e]))  ; Jérôme, using combining diacritics
(wcw/display-width jerome)
; ==> 6
(count jerome)
; ==> 8

(def deseret-capital-long-i (wcw/code-point-to-string 0x10400))  ; 𐐀
(wcw/display-width deseret-capital-long-i)
; ==> 1
(count deseret-capital-long-i)
; ==> 2

(def lots-of-escapes (s/join (repeat 1000 ascii-esc)))
(wcw/display-width lots-of-escapes)
; ==> 0
(count lots-of-escapes)
; ==> 1000                  ; lol 🤡

(def trans-flag (wcw/code-points-to-string [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]))  ; 🏳️‍⚧️
(wcw/display-width trans-flag)
; ==> 2
(count trans-flag)
; ==> 6                     ; lol 🤡

Contributor Information

Contributing Guidelines

Bug Tracker

Code of Conduct

Developer Workflow

This project uses the git-flow branching strategy, with the caveat that the permanent branches are called release and dev. Any changes to the release branch are considered a release and auto-deployed (JARs to Clojars, API docs to GitHub Pages, etc.).

For this reason, all development must occur either in branch dev, or (preferably) in temporary branches off of dev. All PRs from forked repos must also be submitted against dev; the release branch is only updated from dev via PRs created by the core development team. All other changes submitted to release will be rejected.

Build Tasks

clj-wcwidth uses tools.build. You can get a list of available tasks by running:

clojure -A:deps -T:build help/doc

Of particular interest are:

clojure -T:build test - run the unit tests
clojure -T:build lint - run the linters (clj-kondo and eastwood)
clojure -T:build ci - run the full CI suite (check for outdated dependencies, run the unit tests, run the linters)
clojure -T:build install - build the JAR and install it locally (e.g. so you can test it with downstream code)

Please note that the deploy task is restricted to the core development team (and will not function if you run it yourself).

License

Distributed under the Mozilla Public License, version 2.0.

SPDX-License-Identifier: MPL-2.0
