
UUID Generation Benchmarks: clj-uuid-old vs clj-uuid

Performance comparison of UUID generation across all RFC 9562 versions, measuring clj-uuid-old (bitmop, shift/mask loops) against clj-uuid (bitmop2, ByteBuffer primitives + JVM intrinsic mask operations + ThreadLocal MessageDigest).

Test Environment

Hardware

  • CPU: Intel Core i9-9880H @ 2.30 GHz (8 cores / 16 threads)
  • RAM: 32 GB
  • Architecture: x86_64

Software

  • OS: macOS 26.2 (Darwin 25.2.0)
  • JVM: OpenJDK 64-Bit Server VM 25.0.1 (Homebrew, mixed mode, sharing)
  • Clojure: 1.12.0
  • Leiningen: 2.12.0

Benchmark Parameters

  • Iterations: 500,000 per benchmark
  • Warmup: 50,000 iterations (JIT compilation)
  • Reflection warnings: none (verified via lein check and *warn-on-reflection*)
  • Source: test/clj_uuid/bench.clj
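
The harness shape below is an illustrative sketch of this setup: warm up, then time a tight loop and report mean ns/op. The real harness lives in test/clj_uuid/bench.clj and may differ; the bench-ns macro name and the volatile sink are assumptions, not the actual source.

  (def ^:private sink (volatile! nil))   ; defeat dead-code elimination

  (defmacro bench-ns
    "Run expr `warmup` times to let the JIT compile it, then time `iters`
    further runs and return the mean nanoseconds per operation."
    [warmup iters expr]
    `(do
       (dotimes [_# ~warmup] (vreset! sink ~expr))
       (let [t0# (System/nanoTime)]
         (dotimes [_# ~iters] (vreset! sink ~expr))
         (/ (double (- (System/nanoTime) t0#)) ~iters))))

  ;; e.g. (bench-ns 50000 500000 (clj-uuid/v4))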

1. UUID Generation (Pure Construction)

Measures only the time to call the constructor and return a java.util.UUID value. No serialization.

  UUID Version              clj-uuid-old (ns)   clj-uuid (ns)   Speedup
  v1 (time-based)                       120.3           100.1     1.20x
  v3 (MD5, namespace)                  1409.1           165.9     8.49x
  v4 (random)                           326.3           334.8     0.97x
  v5 (SHA1, namespace)                 1531.4           264.8     5.78x
  v6 (time-based, sorted)               106.3           100.0     1.06x
  v7 (unix time, crypto)                408.8           340.0     1.20x
  v7nc (unix time, fast)                   --            38.5        --
  v8 (custom)                            46.4             7.9     5.87x

Analysis

v3 and v5 show 5-9x generation speedup. These are the only versions where the optimized implementation changes the generation path itself. Three optimizations compound:

  1. ThreadLocal MessageDigest: clj-uuid reuses a per-thread MessageDigest instance via ThreadLocal/withInitial, avoiding the ~200 ns MessageDigest/getInstance allocation on every call. clj-uuid-old received the same ThreadLocal treatment in this pass, so the measured difference between the two comes mainly from steps 2 and 3.
  2. Byte serialization: Serialize the namespace UUID via to-byte-array (bitmop: 2x 8-iteration ldb+sb8 loops; bitmop2: 2x ByteBuffer.putLong).
  3. Digest extraction: Read back the digest result via bytes->long (bitmop: 2x 8-iteration dpb loops; bitmop2: 2x ByteBuffer.getLong).

In clj-uuid-old, byte manipulation overhead adds ~1200 ns on top of the ~200 ns digest. In clj-uuid, that overhead is nearly eliminated, leaving the digest as the dominant cost.
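
A minimal sketch of the optimized digest path follows. The function and var names are illustrative, version/variant bit handling is omitted, and the real bitmop2 internals may differ:

  (import '(java.nio ByteBuffer)
          '(java.security MessageDigest))

  ;; 1. One MessageDigest per thread, reused across calls.
  (def ^ThreadLocal tl-md5
    (ThreadLocal/withInitial
      (reify java.util.function.Supplier
        (get [_] (MessageDigest/getInstance "MD5")))))

  (defn v3-digest-sketch
    "Namespace UUID -> 16 bytes via two putLong calls, digest with the
    name bytes, then read the result back via two getLong calls."
    ^java.util.UUID [^java.util.UUID ns-uuid ^bytes name-bytes]
    (let [^MessageDigest md (.get tl-md5)
          buf (ByteBuffer/allocate 16)]
      (.putLong buf (.getMostSignificantBits ns-uuid))   ; 2. serialize
      (.putLong buf (.getLeastSignificantBits ns-uuid))
      (.reset md)
      (.update md (.array buf))
      (.update md name-bytes)
      (let [out (ByteBuffer/wrap (.digest md))]          ; 3. extract
        (java.util.UUID. (.getLong out) (.getLong out)))))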

v7 shows a 1.20x speedup from the mask-offset optimization. The v7 constructor calls dpb #=(mask 2 62) to set the variant bits in the LSB. The #=(mask 2 62) is a compile-time constant, but dpb calls mask-offset at runtime to find the lowest set bit. Previously, mask-offset used an O(offset) loop -- for offset=62, that meant 62 iterations per call. Now mask-offset uses Long/numberOfTrailingZeros, a JVM intrinsic that compiles to a single TZCNT instruction. This eliminates the v7 regression that was visible in earlier benchmarks. SecureRandom.nextLong() still dominates total latency.
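
In code, the change looks roughly like this (paraphrased; the mask/dpb argument order follows this document's mask(width, offset) usage, not necessarily the exact source):

  (defn mask ^long [^long width ^long offset]
    (bit-shift-left (dec (bit-shift-left 1 width)) offset))

  ;; Before (paraphrased): walk toward the lowest set bit -- O(offset).
  (defn mask-offset-loop ^long [^long m]
    (loop [c 0, v m]
      (if (or (zero? v) (odd? v))
        c
        (recur (inc c) (unsigned-bit-shift-right v 1)))))

  ;; After: one JVM intrinsic, compiled to a single TZCNT on x86.
  (defn mask-offset ^long [^long m]
    (if (zero? m) 0 (Long/numberOfTrailingZeros m)))

  ;; dpb deposits `value` into the field of `num` selected by `mask`:
  (defn dpb ^long [^long mask ^long num ^long value]
    (bit-or (bit-and num (bit-not mask))
            (bit-and (bit-shift-left value (mask-offset mask)) mask)))

  ;; v7 variant: (dpb (mask 2 62) lsb 2r10) sets the top two bits to 10.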

v8 shows a 5.9x speedup from the mask-offset optimization. The v8 constructor is just two dpb calls (mask(4,12) and mask(2,62)), so the mask-offset cost was a significant fraction of the total. With O(1) Long/numberOfTrailingZeros, the two dpb calls drop from ~46 ns to ~8 ns.

v1 shows ~1.2x and v6 shows ~1.1x relative to clj-uuid-old. In addition to the O(1) mask-offset improvement, v1 and v6 now inline the AtomicLong CAS loop, bit-field packing, and pre-captured node LSBs directly in the constructor closure, eliminating var lookup and function dispatch overhead. The System/currentTimeMillis + CAS cost remains the dominant factor.

v7nc is new in 0.2.5 and has no clj-uuid-old equivalent. It uses ThreadLocalRandom and a per-thread monotonic counter, achieving 38.5 ns/op -- faster than JUG 5.2's TimeBasedEpochGenerator (~50 ns).
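
A sketch of how such a generator can be built (the actual v7nc implementation may differ; names and layout details here are assumptions, and 12-bit counter overflow handling is omitted):

  (import '(java.util.concurrent ThreadLocalRandom))

  ;; Per-thread [last-millis counter] state -- no CAS or locking needed.
  (def ^ThreadLocal v7-state
    (ThreadLocal/withInitial
      (reify java.util.function.Supplier
        (get [_] (long-array 2)))))

  (defn v7nc-sketch ^java.util.UUID []
    (let [^longs st (.get v7-state)
          now (System/currentTimeMillis)
          ctr (if (= now (aget st 0)) (inc (aget st 1)) 0)]
      (aset st 0 now)
      (aset st 1 ctr)
      (java.util.UUID.
       (bit-or (bit-shift-left now 16)        ; 48-bit unix-ms timestamp
               0x7000                         ; version 7
               (bit-and ctr 0xFFF))           ; 12-bit monotonic counter
       (bit-or Long/MIN_VALUE                 ; variant bits 10 in the MSBs
               (bit-and (.nextLong (ThreadLocalRandom/current))
                        0x3FFFFFFFFFFFFFFF)))))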

v4 shows ~1x. The 0-arity form delegates directly to UUID/randomUUID (JVM built-in, dominated by SecureRandom).

2. Post-Generation Operations

Measures operations on a pre-existing UUID value. These are the operations where bitmop2's ByteBuffer approach has the most impact.

  Operation          clj-uuid-old (ns)   clj-uuid (ns)   Speedup
  to-byte-array                  803.6            14.0    57.40x
  to-hex-string                 5840.4           126.1    46.32x
  to-string                       22.1            17.6     1.26x
  to-urn-string                  110.1           110.9     0.99x
  get-version                      7.1             7.5     0.95x
  get-node-id                     11.9             9.8     1.21x

Analysis

to-byte-array: 57x faster. This is the biggest win. bitmop requires two 8-iteration loops (each doing ldb + sb8 per byte, 16 iterations total). bitmop2 does two ByteBuffer.putLong calls -- single JVM intrinsics.

to-hex-string: 46x faster. bitmop builds two separate hex strings via (hex msb) and (hex lsb), each involving long->bytes (8-iteration loop), map ub8, map octet-hex, and apply str (lazy sequence materialization + string concatenation). bitmop2 uses uuid->buf + buf-hex: a single StringBuilder with direct byte iteration over a ByteBuffer.
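
A sketch of that single-pass StringBuilder approach (illustrative; the actual uuid->buf / buf-hex internals may differ):

  (import '(java.nio ByteBuffer))

  (def ^:private ^String hex-chars "0123456789abcdef")

  (defn hex-sketch
    "One StringBuilder pass over a 16-byte buffer: no intermediate
    seqs, maps, or per-octet strings."
    ^String [^java.util.UUID u]
    (let [buf (doto (ByteBuffer/allocate 16)
                (.putLong (.getMostSignificantBits u))
                (.putLong (.getLeastSignificantBits u))
                (.flip))
          sb  (StringBuilder. 32)]
      (dotimes [_ 16]
        (let [b (bit-and (.get ^ByteBuffer buf) 0xFF)]  ; unsigned octet
          (.append sb (.charAt hex-chars (bit-shift-right b 4)))
          (.append sb (.charAt hex-chars (bit-and b 0xF)))))
      (.toString sb)))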

to-string and to-urn-string: ~1x (no meaningful change). Both delegate to UUID.toString(), a JVM-native method that neither bitmop layer touches.

Field extraction (get-version, get-node-id): ~1x. These use ldb/dpb on the UUID's long words, which are identical between the two implementations. The slight variation is measurement noise.
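
For reference, the version field is just a 4-bit ldb on the MSB, equivalent to this sketch (the function name is illustrative):

  ;; version = bits 12-15 of the most-significant word,
  ;; i.e. (ldb (mask 4 12) msb) in bitmop terms:
  (defn get-version-sketch ^long [^java.util.UUID u]
    (bit-and (bit-shift-right (.getMostSignificantBits u) 12) 0xF))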

3. Combined: Generate + Serialize

The real-world pattern -- generate a UUID and immediately serialize it for storage, transmission, or indexing.
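
In library terms, assuming the clj-uuid namespace is aliased as uuid, the measured pattern is simply:

  (require '[clj-uuid :as uuid])

  (def row-key (uuid/to-byte-array (uuid/v1)))   ; 16-byte database key
  (def log-id  (uuid/to-hex-string (uuid/v4)))   ; 32-char correlation id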

  Operation               clj-uuid-old (ns)   clj-uuid (ns)   Speedup
  v1 + to-byte-array                  925.9           112.8     8.21x
  v3 + to-byte-array                 2223.6           157.1    14.15x
  v3 + to-hex-string                 6425.8           297.8    21.58x
  v4 + to-byte-array                 1170.8           350.4     3.34x
  v4 + to-hex-string                 5366.4           476.8    11.25x
  v5 + to-byte-array                 2327.4           269.0     8.65x
  v5 + to-hex-string                 6509.5           409.0    15.92x
  v7 + to-byte-array                 1303.1           324.8     4.01x

Analysis

The combined numbers reflect the sum of generation and serialization gains.

v3 + to-hex-string: 21.6x. This is the largest combined win. v3 benefits from faster generation (~9x, from the ThreadLocal digest plus ByteBuffer serialization and extraction in the digest path) AND faster serialization (~26x on the v3 hex output path; see section 5). The two effects compound.

v5 + to-hex-string: 15.9x. Same compounding effect as v3, but SHA-1 is slightly slower than MD5, so the digest fraction is larger and the byte manipulation speedup contributes proportionally less.

v3 + to-byte-array: 14.1x / v5 + to-byte-array: 8.6x. Byte-array serialization speeds up more than hex (57x vs 46x) but takes less absolute time, so the generation speedup contributes more to the total ratio.

v1 + to-byte-array: 8.2x. Generation is ~1.2x but serialization is 57x. Serialization dominates total time in clj-uuid-old (~87% of 926 ns) but becomes negligible in clj-uuid (~14 ns of ~113 ns).

v4 + to-byte-array: 3.4x. UUID/randomUUID is the bottleneck (~340 ns), so the serialization savings (804 ns -> 14 ns) yield a 3.4x total win.

v4 + to-hex-string: 11.3x. The hex path has even more overhead in clj-uuid-old (~5840 ns), so the combined win is larger than in the byte-array case.

v7 + to-byte-array: 4.0x. Similar profile to v4 -- crypto RNG dominates, but the O(1) mask-offset now contributes a generation speedup on top of the serialization win.

4. Absolute Throughput

UUIDs generated per second (generation only, single thread).

  UUID Version              clj-uuid-old (ops/s)   clj-uuid (ops/s)
  v1 (time-based)                      6,353,878          9,385,662
  v3 (MD5, namespace)                    668,818          3,239,594
  v4 (random)                          2,715,643          2,680,676
  v5 (SHA1, namespace)                   648,846          3,024,212
  v6 (time-based, sorted)              8,569,979         10,024,839
  v7 (unix time, crypto)               2,492,207          2,960,438
  v7nc (unix time, fast)                      --         25,974,026
  v8 (custom)                         19,381,529        149,993,940
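
Throughput is the reciprocal of per-operation latency; for example, the 38.5 ns/op measured for v7nc in section 1 converts to the figure above:

  (Math/round (/ 1e9 38.5))   ;=> 25974026 ops/s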

5. v3/v5 Detailed Breakdown

Since v3 and v5 are the UUID types with the largest generation-time improvements, this section breaks down the per-operation costs.

v3 (MD5, Namespace)

  Operation          clj-uuid-old (ns)   clj-uuid (ns)   Speedup
  v3 generation                 1412.0           156.9     9.00x
  v3 to-byte-array               804.7            13.7    58.81x
  v3 to-hex-string              5225.1           197.7    26.43x
  v3 to-string                    23.7            24.3     0.98x

v5 (SHA1, Namespace)

  Operation          clj-uuid-old (ns)   clj-uuid (ns)   Speedup
  v5 generation                 1667.6           279.0     5.98x
  v5 to-byte-array               849.0            16.4    51.71x
  v5 to-hex-string              5128.9           185.4    27.66x
  v5 to-string                    22.9            22.0     1.04x

v3/v5 Generation Path Breakdown

The v3/v5 generation path consists of four steps. The following shows where time is spent in each implementation:

                      clj-uuid-old               clj-uuid
  to-byte-array:      ~800 ns (2x long->bytes    ~14 ns (2x putLong)
                        8-iter ldb+sb8 loop)
  digest (MD5/SHA1):  ~200-300 ns                ~200-300 ns
  bytes->long:        ~800 ns (2x 8-iter dpb)    ~14 ns (2x getLong)
  dpb:                ~5 ns (2 calls)            ~3 ns (2 calls, O(1) mask-offset)
  ────────────────────────────────────────────────────────
  Total (v3):         ~1400 ns                   ~160 ns
  Total (v5):         ~1670 ns                   ~280 ns

In clj-uuid-old, byte manipulation overhead (~1600 ns) dominates over the digest (~200-300 ns). In clj-uuid, byte manipulation is eliminated (~28 ns total), leaving the digest as the dominant cost.

6. Summary

Where clj-uuid wins

  Category                         Speedup
  to-byte-array                    57x
  to-hex-string                    46x
  v3 + to-hex-string (combined)    21.6x
  v5 + to-hex-string (combined)    15.9x
  v3 + to-byte-array (combined)    14.1x
  v4 + to-hex-string (combined)    11.3x
  v3 generation                    9.0x
  v5 + to-byte-array (combined)    8.6x
  v1 + to-byte-array (combined)    8.2x
  v5 generation                    6.0x
  v8 generation                    5.9x
  v7 + to-byte-array (combined)    4.0x
  v4 + to-byte-array (combined)    3.4x
  v1 generation                    1.2x
  v7 generation                    1.2x
  v6 generation                    1.1x

Where they are equal

  Category                            Speedup
  v4 generation (0-arity)             ~1.0x
  to-string / to-urn-string           ~1.0x
  Field extraction (version, node)    ~1.0x

Where clj-uuid has no impact

Operations that delegate entirely to the JVM (UUID.toString(), UUID/randomUUID, UUID.version()) see no change, as expected. The bitmop2 layer only affects byte-level serialization, the bytes->long / long->bytes paths, and ldb/dpb calls (which now benefit from O(1) mask-offset via Long/numberOfTrailingZeros).

New in 0.2.5: v7nc

v7nc is a non-cryptographic v7 variant that uses ThreadLocalRandom and a per-thread monotonic counter. At 38.5 ns/op it is faster than JUG 5.2's TimeBasedEpochGenerator (~50 ns), making it the fastest time-based UUID generator available on the JVM from Clojure.

Key Takeaway

The largest gains appear in serialization (to-byte-array, to-hex-string) and in v3/v5 generation (which serialize the namespace UUID internally as part of the digest computation). Additional gains come from O(1) mask-offset using JVM intrinsics, which particularly benefits v8 (5.9x), where dpb calls with high-offset masks were previously bottlenecked by an O(offset) loop. The new v7nc constructor provides the fastest time-based UUID generation at ~39 ns, beating JUG 5.2. For applications that generate UUIDs and immediately serialize them -- the common case for database keys, wire protocols, and log correlation IDs -- clj-uuid delivers 3-22x end-to-end improvement depending on the UUID version and serialization format.
