
Evaluations

This page records the public long-context eval evidence for fractal-engine. The harness lives under evals/ as an external consumer of the engine. It does not add model-facing functions and does not ship in the uberjar.

Two result sets matter:

  • Tracked v17 baseline. The first public run whose aggregate files are committed under evals/results/*-v17/results.*.
  • Release-branch v22 validation. A fresh live validation run for the canonical session/storage/prompt branch. Its raw stores were local validation artifacts, not committed benchmark output.

Model Split

Both runs used Vertex Gemini:

| role | provider | model |
|---|---|---|
| root | vertex-gemini | gemini-3.5-flash |
| child | vertex-gemini | gemini-3.5-flash |
| leaf | vertex-gemini | gemini-3.1-flash-lite-preview |

The root and child roles use the same model because both are full recursive sessions. Leaves are cheaper non-recursive provider calls over bounded inputs.

Results

| run | benchmark | n | headline | strict / exact | cost | tokens | mean wall |
|---|---|---|---|---|---|---|---|
| tracked v17 | OOLONG-synth smart subset | 15 | exact 0.867 | 13 / 15 | $3.3989 | 6,721,220 | 95,714 ms |
| tracked v17 | FanOutQA smart subset | 15 | loose 0.695 | strict 8 / 15 | $4.5360 | 7,129,764 | 91,578 ms |
| release v22 validation | OOLONG-synth smart subset | 15 | exact 0.933 | 14 / 15 | $3.3086 | 6,243,085 | 116,802 ms |
| release v22 validation | FanOutQA smart subset | 15 | loose 0.704 | strict 7 / 15 | $2.8873 | 4,805,633 | 54,115 ms |
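
To compare runs per example rather than per suite, the aggregate cost and token figures can be divided by n. The sketch below does only that arithmetic on the numbers reported in the table; the map keys and the `per-example` helper are illustrative names, not part of the harness.

```clojure
;; Aggregates copied from the results table; every run has n = 15 examples.
;; The keys (:cost-usd, :tokens, ...) are illustrative, not the harness schema.
(def runs
  [{:run "tracked v17"            :benchmark "OOLONG-synth" :n 15 :cost-usd 3.3989 :tokens 6721220}
   {:run "tracked v17"            :benchmark "FanOutQA"     :n 15 :cost-usd 4.5360 :tokens 7129764}
   {:run "release v22 validation" :benchmark "OOLONG-synth" :n 15 :cost-usd 3.3086 :tokens 6243085}
   {:run "release v22 validation" :benchmark "FanOutQA"     :n 15 :cost-usd 2.8873 :tokens 4805633}])

(defn per-example
  "Adds per-example cost and token averages to one aggregate row."
  [{:keys [n cost-usd tokens] :as row}]
  (assoc row
         :cost-per-example   (/ cost-usd n)
         :tokens-per-example (/ tokens (double n))))

;; Print one summary line per run.
(doseq [{:keys [run benchmark cost-per-example tokens-per-example]}
        (map per-example runs)]
  (println (format "%-24s %-13s $%.4f/example  %.0f tokens/example"
                   run benchmark cost-per-example tokens-per-example)))
```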

For OOLONG, the headline is exact accuracy. For FanOutQA, the headline is loose accuracy: fraction of gold reference strings present in the answer after SQuAD-style normalization. FanOutQA strict accuracy means every gold string is present exactly after normalization; it is useful as a diagnostic, but it under-credits some semantically correct structured answers.
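
The difference between loose and strict FanOutQA scoring can be sketched in a few lines. This is a minimal illustration that assumes the usual SQuAD normalization steps (lowercase, strip punctuation, drop articles, collapse whitespace) and plain substring matching; the function names are hypothetical, and the harness scorer may differ, for example in how it handles repeated gold strings.

```clojure
(require '[clojure.string :as str])

(defn normalize
  "SQuAD-style normalization: lowercase, drop punctuation, drop the
  articles a/an/the, collapse whitespace."
  [s]
  (-> (str/lower-case s)
      (str/replace #"\p{Punct}" "")
      (str/replace #"\b(a|an|the)\b" " ")
      (str/replace #"\s+" " ")
      str/trim))

(defn gold-hits
  "Returns one boolean per gold string: is it present in the answer
  after both sides are normalized?"
  [answer golds]
  (let [a (normalize answer)]
    (map #(str/includes? a (normalize %)) golds)))

(defn loose-accuracy
  "Fraction of gold strings found in the answer."
  [answer golds]
  (/ (count (filter true? (gold-hits answer golds)))
     (double (count golds))))

(defn strict-match?
  "True only when every gold string is found in the answer."
  [answer golds]
  (every? true? (gold-hits answer golds)))

;; Toy data, not taken from the benchmark.
(def answer "The Red team and the Blue team.")
(def golds  ["red" "blue" "green"])

(loose-accuracy answer golds) ; two of three gold strings are present
(strict-match? answer golds)  ; false, because "green" is missing
```

A row like this counts as a strict miss but still earns partial loose credit, which is why the two numbers diverge.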

Tracked v17 Baseline

Engine:

  • commit: 78444eedd7261ef344998911f66a1918bfbe6e6e
  • prompt: repl, prompt-version 17
  • runtime change under test: leaf provider calls capped at 50 concurrent calls per run tree
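
A cap of this kind can be sketched with a counting semaphore. This is an illustration of the general technique, not the engine's implementation; `with-cap`, `leaf-cap`, and `call-leaf` are hypothetical names, and `call-leaf` only simulates a provider call.

```clojure
(import 'java.util.concurrent.Semaphore)

(defn with-cap
  "Wraps f so that no more calls run at once than the semaphore has
  permits. Callers beyond the cap block until a permit frees up."
  [^Semaphore sem f]
  (fn [& args]
    (.acquire sem)
    (try
      (apply f args)
      (finally
        ;; Always return the permit, even when the call throws.
        (.release sem)))))

;; One semaphore shared by the whole run tree; 50 matches the cap above.
(def leaf-cap (Semaphore. 50))

;; Hypothetical stand-in for a real leaf provider call.
(defn call-leaf [prompt]
  (Thread/sleep 10)
  (str "answer:" prompt))

(def capped-call-leaf (with-cap leaf-cap call-leaf))

;; Fan out 200 calls; at most 50 are in flight at any moment,
;; and results come back in input order.
(def answers
  (->> (range 200)
       (mapv #(future (capped-call-leaf %)))
       (mapv deref)))
```

Sharing one semaphore across root, child, and leaf fan-outs is what makes the cap apply per run tree rather than per call site.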

The result manifests record engine/git-dirty? true because the eval harness and result files were still untracked while the benchmarks ran. The engine commit under test is the engine/git-sha above.

Commands:

clojure -M:evals run --benchmark oolong --data evals/data/oolong-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --budget-usd 50 --max-turns 1000000 --call-timeout-ms 180000 \
  --runs-dir evals/results/oolong-v17/runs --out evals/results/oolong-v17

clojure -M:evals run --benchmark fanoutqa --data evals/data/fanoutqa-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --budget-usd 50 --max-turns 1000000 --call-timeout-ms 180000 \
  --runs-dir evals/results/fanoutqa-v17/runs --out evals/results/fanoutqa-v17

OOLONG v17:

  • exact accuracy: 13 / 15 = 0.867
  • numeric accuracy mean: 0.998
  • spend: $3.3989
  • terminal errors: 0

The two exact-unmatched rows were real answer errors:

| id | final | gold | note |
|---|---|---|---|
| 218020027 | 377 | 382 | 262K-token sentiment count; off by 5 |
| 211020009 | same frequency | more common | small date comparison; counted equal positive before/after |

FanOutQA v17:

  • strict accuracy: 8 / 15 = 0.533
  • loose accuracy: 0.695
  • semantic row correctness after audit: 11 / 15 = 0.733
  • spend: $4.5360
  • terminal errors: 0

Rows where strict scoring under-credited a semantically correct final answer:

| id | reason |
|---|---|
| 71552a38345f892e | all codons were present; strict string matching under-credited labels and comma/and formatting |
| ff866ee3e2bf4820 | all Ivy League acre values were present; strict string matching under-credited comma-normalized numbers and extra detail |
| 146e74771fcf6a30 | founder ages matched the provided evidence; the gold was stale |

Rows that were semantically wrong:

| id | reason |
|---|---|
| 29242cc91b49e88e | returned only Samuel L. Jackson; incomplete for the cast-wide Academy Award question |
| ae1c3cec94b75e55 | age answer used a stale/as-of date and several ages were wrong |
| 00065f204bddb94d | omitted J. K. Rowling's 1965 birth year |
| 585ead607ef66fb1 | included the United Kingdom via the wrong EGOT span interpretation |

Release-Branch v22 Validation

Engine:

  • prompt: repl, prompt-version 22
  • storage: canonical SQLite facts + BlobStore payloads; derived Datahike query index
  • eval runner: cost derived from canonical session call facts, not from old file-based usage summaries
  • fanout: partial-failure tolerant map-lm / map-rlm
  • parallelism: 5
  • call timeout: 600000 ms
  • max turns: 1000000

Commands:

clojure -M:evals run --benchmark oolong --data evals/data/oolong-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --parallelism 5 --budget-usd 50 --max-turns 1000000 --call-timeout-ms 600000 \
  --runs-dir evals/results/oolong-v22/runs --out evals/results/oolong-v22

clojure -M:evals run --benchmark fanoutqa --data evals/data/fanoutqa-smart.jsonl --mode engine \
  --provider vertex-gemini --model gemini-3.5-flash \
  --child-provider vertex-gemini --child-model gemini-3.5-flash \
  --leaf-provider vertex-gemini --leaf-model gemini-3.1-flash-lite-preview \
  --parallelism 5 --budget-usd 50 --max-turns 1000000 --call-timeout-ms 600000 \
  --runs-dir evals/results/fanoutqa-v22/runs --out evals/results/fanoutqa-v22

The validation run used temporary local stores and did not commit raw session data. Store consistency checks passed with :status :ok and :issue-count 0 for both benchmarks.

OOLONG v22:

  • exact accuracy: 14 / 15 = 0.933
  • numeric accuracy mean: 0.998
  • spend: $3.3086
  • tokens: 6,243,085
  • mean wall: 116,802 ms

FanOutQA v22:

  • strict accuracy: 7 / 15 = 0.467
  • loose accuracy: 0.704
  • spend: $2.8873
  • tokens: 4,805,633
  • mean wall: 54,115 ms

FanOutQA strict-unmatched rows need semantic audit before being called failures. The observed strict-unmatched set mixed:

  • real semantic issues or incomplete answers,
  • scorer under-credit for grouping/formatting/aliases,
  • time-sensitive or stale-gold disagreements,
  • multiplicity issues where the answer was correct as a set but not in the expected repeated-string form.

Runtime Notes

The v17 leaf concurrency cap did its main job: no terminal provider overloads, rate-limit failures, or fan-out limit failures. Since then, map-lm and map-rlm have also become partial-failure tolerant: failed slots return {:fractal/failed true :index i :error ...} sentinels in an input-aligned vector while successful slots remain usable.
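
A caller can separate usable slots from failed ones by checking for the sentinel key. The sketch below assumes only the `:fractal/failed`, `:index`, and `:error` keys named above; the helper names and the shape of the `:error` value are hypothetical.

```clojure
(defn failed?
  "True when a result slot is a failure sentinel."
  [slot]
  (and (map? slot) (true? (:fractal/failed slot))))

(defn split-results
  "Splits an input-aligned result vector into [index value] pairs for
  successful slots and the sentinel maps for failed slots."
  [results]
  (let [indexed (map-indexed vector results)]
    {:ok     (vec (remove (fn [[_ v]] (failed? v)) indexed))
     :failed (filterv failed? results)}))

;; Hypothetical result vector; the :error payload here is made up.
(def results
  ["positive"
   {:fractal/failed true :index 1 :error "timeout"}
   "negative"])

(split-results results)
;; => {:ok     [[0 "positive"] [2 "negative"]]
;;     :failed [{:fractal/failed true :index 1 :error "timeout"}]}
```

Keeping the original index on both sides lets a caller retry only the failed inputs and merge the retried values back into place.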

The v22 run validates the newer canonical store path: per-example sessions are stored as SQLite facts and BlobStore payloads, Datahike is a rebuildable query index, and the eval runner computes cost/tokens from canonical call facts.
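
Deriving totals from call facts amounts to a fold over per-call records. The fact schema is not shown on this page, so the keys below (`:role`, `:cost-usd`, `:tokens`) and the sample values are assumptions made purely for illustration.

```clojure
;; Hypothetical call facts; the real schema and values will differ.
(def call-facts
  [{:role :root  :cost-usd 0.02   :tokens 1200}
   {:role :child :cost-usd 0.01   :tokens 800}
   {:role :leaf  :cost-usd 0.0005 :tokens 300}
   {:role :leaf  :cost-usd 0.0005 :tokens 250}])

(defn totals
  "Sums cost and tokens over a collection of call facts."
  [facts]
  (reduce (fn [acc {:keys [cost-usd tokens]}]
            (-> acc
                (update :cost-usd + cost-usd)
                (update :tokens + tokens)))
          {:cost-usd 0.0 :tokens 0}
          facts))

(defn totals-by-role
  "Same totals, broken down by the role that made the call."
  [facts]
  (into {}
        (map (fn [[role fs]] [role (totals fs)]))
        (group-by :role facts)))

(totals call-facts)
(totals-by-role call-facts)
```

Because the totals are recomputed from stored facts, they can be re-derived after the fact without relying on a separately maintained usage summary.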

What This Establishes

These runs support the engine thesis on long-context aggregation:

  • The model used ordinary Clojure for parsing, partitioning, counting, and reducing.
  • Leaf calls handled bounded probabilistic judgments.
  • Recursive sessions handled decomposed sub-problems.
  • The host preserved order, costs, canonical state, and reproducibility manifests.
  • The engine completed both long-context suites under modest spend with no terminal runtime failures.

These runs do not yet establish a same-model flat baseline comparison. The harness can run one with --mode both or --mode all; the public reports above are engine-only.

