Evaluations

This page records the first public long-context eval run for fractal-engine v17. The harness lives under evals/ as an external consumer of the engine. It does not add model-facing functions or ship in the uberjar.

Setup

Engine:

  • commit: 78444eedd7261ef344998911f66a1918bfbe6e6e
  • prompt: repl, prompt-version 17
  • runtime change under test: leaf provider calls capped at 50 concurrent calls per run tree
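The concurrency cap listed above can be pictured with a counting semaphore. The following is a minimal sketch, not the engine's implementation: `make-capped-caller` and `fake-leaf-call` are hypothetical names, and a single shared semaphore stands in for whatever per-run-tree accounting the engine actually uses.

```clojure
(import '(java.util.concurrent Semaphore))

(defn make-capped-caller
  "Wraps f so that at most `limit` invocations run at the same time.
   Hypothetical helper for illustration only."
  [limit f]
  (let [permits (Semaphore. limit)]
    (fn [x]
      (.acquire permits)
      (try
        (f x)
        (finally (.release permits))))))

;; Counters used only to observe the peak number of simultaneous calls.
(def in-flight (atom 0))
(def peak (atom 0))

(defn fake-leaf-call
  "Stand-in for a provider call: sleeps briefly, then squares its input."
  [x]
  (let [n (swap! in-flight inc)]
    (swap! peak max n)
    (Thread/sleep 5)
    (swap! in-flight dec)
    (* x x)))

(def capped-leaf (make-capped-caller 50 fake-leaf-call))

;; Launch 200 calls at once; the semaphore lets only 50 proceed at a time.
;; Dereferencing the futures in launch order preserves input order.
(def results
  (->> (range 200)
       (mapv #(future (capped-leaf %)))
       (mapv deref)))
```

After this runs, `@peak` never exceeds 50 and `results` holds the squares in input order. A real script would call `(shutdown-agents)` when finished so the future thread pool does not delay JVM exit.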

The result manifests record engine/git-dirty? true because the eval harness and result files were still untracked while the benchmarks ran. The engine commit under test is the commit listed above, which the manifests record as engine/git-sha.

Models:

| role | provider | model |
|---|---|---|
| root | vertex-gemini | gemini-3.5-flash |
| child | vertex-gemini | gemini-3.5-flash |
| leaf | vertex-gemini | gemini-3.1-flash-lite-preview |

Limits:

  • --budget-usd 50
  • --max-turns 1000000
  • --call-timeout-ms 180000
  • engine-only mode, one benchmark at a time

Validation before spend:

  • clojure -M:evals-test: green
  • clojure -M:test: green
  • live provider auth checked before the canary

The aggregate result files are tracked under evals/results/*-v17/results.*. Raw runs/ session journals and logs are intentionally not tracked: they are large, noisy, and reproducible from the recorded commands in each result file.

Results

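| benchmark | n | headline | strict / exact | cost | cost / q | tokens | mean wall |
|---|---|---|---|---|---|---|---|
| OOLONG-synth smart subset | 15 | exact 0.867 | 13 / 15 | $3.3989 | $0.2266 | 6,721,220 | 95,714 ms |
| FanOutQA smart subset | 15 | loose 0.695 | 8 / 15 | $4.5360 | $0.3024 | 7,129,764 | 91,578 ms |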

For FanOutQA, the benchmark headline is loose accuracy. Strict accuracy requires every gold string to be present in the final answer, so it is deliberately harsher.
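To make the difference between the two FanOutQA metrics concrete, here is a minimal sketch of substring-based scoring. It assumes the loose score is the fraction of gold strings found in the answer and that matching is case-insensitive. The official scorer may normalize text differently. The function names and the toy answer are hypothetical and not taken from the dataset.

```clojure
(require '[clojure.string :as str])

(defn gold-hits
  "Counts the gold strings that occur, case-insensitively, in the answer."
  [answer golds]
  (let [a (str/lower-case answer)]
    (count (filter #(str/includes? a (str/lower-case %)) golds))))

(defn loose-score
  "Fraction of gold strings present in the answer (assumed definition)."
  [answer golds]
  (/ (gold-hits answer golds) (count golds)))

(defn strict-correct?
  "True only when every gold string is present in the answer."
  [answer golds]
  (= (gold-hits answer golds) (count golds)))

;; Toy row: two of three gold strings appear in the answer.
(def answer "The rivers are the Nile and the Amazon.")
(def golds ["Nile" "Amazon" "Yangtze"])

(loose-score answer golds)      ;; => 2/3
(strict-correct? answer golds)  ;; => false
```

Under these assumptions a row can earn partial loose credit while still counting as a strict miss, which is why the two columns in the results table differ.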

Human audit of FanOutQA final answers:

| measure | value |
|---|---|
| official strict rows | 8 / 15 = 0.533 |
| official loose accuracy | 0.695 |
| semantic row correctness | 11 / 15 = 0.733 |

The semantic audit is not a replacement benchmark metric; it explains where the string scorer or stale gold under-counted the engine's final answer.
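The audit figures can be reconciled with a few lines of arithmetic. This sketch uses only counts reported on this page: 8 strict rows, 3 rows the audit judged semantically right despite the scorer, and 4 rows that were wrong. The variable names are illustrative.

```clojure
(def n-rows 15)
(def strict-rows 8)          ; rows the official scorer accepted
(def overturned-by-audit 3)  ; marked wrong by the scorer, right on audit
(def genuinely-wrong 4)      ; wrong after audit

(def semantic-rows (+ strict-rows overturned-by-audit))

;; Every row is either semantically right or genuinely wrong.
(assert (= n-rows (+ semantic-rows genuinely-wrong)))

(defn rate
  "k / n rounded to three decimal places."
  [k n]
  (/ (Math/round (double (* 1000.0 (/ k n)))) 1000.0))

(rate strict-rows n-rows)    ;; => 0.533
(rate semantic-rows n-rows)  ;; => 0.733
```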

OOLONG

Command:

clojure -M:evals run --benchmark oolong --data evals/data/oolong-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --budget-usd 50 --max-turns 1000000 --call-timeout-ms 180000 \
  --runs-dir evals/results/oolong-v17/runs --out evals/results/oolong-v17

Result:

  • exact accuracy: 13 / 15 = 0.867
  • numeric accuracy mean: 0.998
  • spend: $3.3989
  • errors: 0

The two misses were genuine:

| id | final | gold | note |
|---|---|---|---|
| 218020027 | 377 | 382 | 262K-token sentiment count; off by 5 |
| 211020009 | same frequency | more common | small date comparison; counted equal positive before/after |

The important success signal is not just the headline number. The six long OOLONG examples include 262K-token contexts; the engine decomposed them into Clojure parsing, map-lm chunks, and deterministic reductions instead of relying on one flat read. The result was high exact accuracy with near-perfect count accuracy on the numeric rows.
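The decomposition pattern described above (parse in Clojure, judge per chunk, reduce deterministically) can be sketched as follows. This is an illustration under stated assumptions, not the code the model wrote during the run: `classify-chunk` is a hypothetical stand-in for a leaf model call, implemented with a keyword check so the sketch runs offline, and plain `map` takes the place of map-lm.

```clojure
(require '[clojure.string :as str])

(defn classify-chunk
  "Hypothetical stand-in for a leaf call: returns one label per line."
  [lines]
  (mapv #(if (str/includes? % "good") :positive :negative) lines))

(defn count-labels
  "Splits the context into non-blank lines, labels them chunk by chunk,
   and reduces the labels to deterministic counts."
  [context chunk-size]
  (->> (str/split-lines context)
       (remove str/blank?)
       (partition-all chunk-size)   ; bounded chunks instead of one flat read
       (map classify-chunk)         ; the engine would fan this out to leaves
       (apply concat)
       frequencies))                ; counting happens in code, not in the model

(def context "good service\nbad food\n\ngood price\nslow, bad\ngood view")

(count-labels context 2)
;; => {:positive 3, :negative 2}
```

The point of the shape is that the probabilistic step is confined to small, bounded inputs, while splitting and counting stay exact regardless of context length.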

FanOutQA

Command:

clojure -M:evals run --benchmark fanoutqa --data evals/data/fanoutqa-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --budget-usd 50 --max-turns 1000000 --call-timeout-ms 180000 \
  --runs-dir evals/results/fanoutqa-v17/runs --out evals/results/fanoutqa-v17

Result:

  • strict accuracy: 8 / 15 = 0.533
  • loose accuracy: 0.695
  • semantic row correctness after audit: 11 / 15 = 0.733
  • spend: $4.5360
  • errors: 0

Rows the official scorer marked wrong although the final answer was semantically right:

| id | reason |
|---|---|
| 71552a38345f892e | all codons were present; the scorer missed labels and comma/and formatting |
| ff866ee3e2bf4820 | all Ivy League acre values were present; the scorer missed comma-normalized numbers and extra detail |
| 146e74771fcf6a30 | founder ages matched the provided evidence; the gold was stale |
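The comma-formatting misses above suggest why string scorers benefit from light normalization. The sketch below shows one possible approach, assuming a substring scorer. It is not the official FanOutQA scorer or the harness's own code, and the function names and sample answer are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn strip-thousands-commas
  "Removes commas used as thousands separators, e.g. 1,234 becomes 1234."
  [s]
  (str/replace s #"(?<=\d),(?=\d{3}\b)" ""))

(defn normalize
  "Lower-cases the text and removes thousands separators."
  [s]
  (-> s str/lower-case strip-thousands-commas))

(defn contains-gold?
  "Substring match after normalizing both the answer and the gold string."
  [answer gold]
  (str/includes? (normalize answer) (normalize gold)))

;; Toy example, not a row from the dataset.
(def answer "Campus size: 1,234 acres (main campus).")

(str/includes? answer "1234")   ;; => false, raw substring match misses
(contains-gold? answer "1234")  ;; => true, normalized match succeeds
```

Normalization of this kind only addresses formatting misses. It would not help with the stale-gold row, which needs a corrected reference answer.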

Genuinely wrong rows:

| id | reason |
|---|---|
| 29242cc91b49e88e | returned only Samuel L. Jackson; incomplete for the cast-wide Academy Award question |
| ae1c3cec94b75e55 | age answer used a stale/as-of date and several ages were wrong |
| 00065f204bddb94d | missed J. K. Rowling's 1965 birth year |
| 585ead607ef66fb1 | included the United Kingdom via the wrong EGOT span interpretation |

FanOutQA is useful as a fan-out and join stress test, but this run shows why it should not be treated as a clean headline benchmark without auditing. Some golds are time-sensitive, and loose substring scoring can under-credit correct structured answers.

Runtime Notes

The v17 leaf concurrency cap did its main job on this run: there were no terminal provider overloads, rate-limit failures, or fan-out limit failures. Across the two benchmarks:

  • terminal errors: 0
  • terminal leaf-batch-failed: 0
  • recovered intermediate leaf-batch-failed: 2 unique OOLONG batches, both caused by parse failures in a single leaf output and recovered by subsequent model work

That last point is still a real runtime weakness. map-lm is currently all-or-nothing at the batch boundary; one malformed leaf result can throw the whole batch, even though the model can often recover in the next step. The next hardening pass should preserve retryable provider/error metadata and make batch retry/failure semantics less brittle.
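One way to picture the less brittle semantics is to record a result per item instead of letting a single parse error escape the batch. The sketch below is an assumption about what such a hardening pass could look like, not the engine's current or planned map-lm code. `parse-count` and both batch functions are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn parse-count
  "Hypothetical leaf-output parser: expects a string holding an integer."
  [raw]
  (Long/parseLong (str/trim raw)))

(defn parse-batch-all-or-nothing
  "Current-style behaviour: one malformed item throws for the whole batch."
  [raws]
  (mapv parse-count raws))

(defn parse-batch-per-item
  "Keeps good items and records failures with retry metadata."
  [raws]
  (vec
   (map-indexed
    (fn [i raw]
      (try
        {:index i :status :ok :value (parse-count raw)}
        (catch NumberFormatException e
          {:index i :status :failed :retryable? true
           :error (.getMessage e)})))
    raws)))

(def raws ["12" " 7 " "seven" "3"])

(def results (parse-batch-per-item raws))

;; Only the malformed item needs another attempt.
(def to-retry (mapv :index (remove #(= :ok (:status %)) results)))
;; to-retry => [2]

;; The successfully parsed values remain usable in the meantime.
(def partial-sum (reduce + (keep :value results)))
;; partial-sum => 22
```

With per-item results the caller can retry index 2 alone, whereas `parse-batch-all-or-nothing` would discard three good parses along with the bad one.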

What This Establishes

This run supports the engine thesis on long-context aggregation:

  • The model used ordinary Clojure for parsing, partitioning, counting, and reducing.
  • Leaf calls handled bounded probabilistic judgments.
  • The host preserved order, costs, event journals, and reproducibility manifests.
  • The engine completed both long-context suites under modest spend with no terminal runtime failures.

It does not yet establish a same-model flat baseline comparison. The harness can run that with --mode both or --mode all; this public v17 report is engine-only.
