bots-bench / AI SOC evals / Splunk BOTSv3

Benchmarking AI models and agents for SOC and IR investigations.


Made with love by Team Graphistry, makers of Louie.ai

Benchmark Program

What this benchmark is, and what it refuses to fake.

Frontier Leaderboard

Who is solving the work, and how fast?

One point per model line and reasoning level. We keep the best published config in each bucket, then plot first-pass pass rate against total time spent on solved questions so both quality and spend stay visible.
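As a rough sketch of that bucketing (the record fields below, such as first_pass_rate and solved_time_s, are illustrative names, not the real schema), each (model line, reasoning level) bucket keeps its best published config and becomes one plotted point:

```python
# Hypothetical run records; field names are illustrative, not the real schema.
runs = [
    {"model": "model-a", "reasoning": "high", "first_pass_rate": 0.62, "solved_time_s": 5400},
    {"model": "model-a", "reasoning": "high", "first_pass_rate": 0.58, "solved_time_s": 4100},
    {"model": "model-b", "reasoning": "low",  "first_pass_rate": 0.41, "solved_time_s": 2300},
]

# One bucket per (model line, reasoning level); keep the best published config.
best = {}
for run in runs:
    key = (run["model"], run["reasoning"])
    if key not in best or run["first_pass_rate"] > best[key]["first_pass_rate"]:
        best[key] = run

# One point per bucket: first-pass pass rate vs total time spent on solved questions.
for (model, reasoning), cfg in sorted(best.items(), key=lambda kv: -kv[1]["first_pass_rate"]):
    print(f"{model} ({reasoning}): pass={cfg['first_pass_rate']:.0%}, "
          f"time_on_solved={cfg['solved_time_s'] / 60:.0f} min")
```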

Benchmark Atlas

The corpus underneath the scoreboards

The point is representative investigation work, not toy prompts. The atlas makes that breadth visible: 100+ overlapping log and alert providers, incident families grounded in the BOTSv3 IR corpus, and question-level hardness.

Track coverage

ATT&CK coverage

Chart Wall

Question hardness

Each dot is a benchmark question. Left means fewer configs solve it. Up means it burns more time or more attempts. Bigger dots mean more repeated attempts, and color stays tied to track.
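A minimal sketch of how each dot could be derived from per-attempt records, assuming illustrative field names (question, config, track, passed, time_s) rather than the real schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-attempt records; fields are illustrative, not the real schema.
attempts = [
    {"question": "q1", "config": "c1", "track": "endpoint", "passed": True,  "time_s": 40},
    {"question": "q1", "config": "c2", "track": "endpoint", "passed": False, "time_s": 95},
    {"question": "q2", "config": "c1", "track": "cloud",    "passed": False, "time_s": 120},
]

by_question = defaultdict(list)
for a in attempts:
    by_question[a["question"]].append(a)

dots = []
for q, rows in by_question.items():
    configs = {r["config"] for r in rows}
    solved  = {r["config"] for r in rows if r["passed"]}
    dots.append({
        "question":   q,
        "track":      rows[0]["track"],                  # color channel
        "solve_rate": len(solved) / len(configs),        # x: left = fewer configs solve it
        "mean_time":  mean(r["time_s"] for r in rows),   # y: time burned per attempt
        "attempts":   len(rows),                         # dot size: repeated attempts
    })
print(dots)
```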

Benchmark Hygiene

Could the model already know the answers?

This asks whether a model can answer the benchmark questions from prior exposure or memory, before it uses the investigation tools.

No-tools priors versus scored runs

We keep the same score axis as the main benchmark so you can see how much of the corpus a model reaches before any tool use.
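As a hedged sketch of that probe, with stand-in callables and invented example questions rather than the real harness API, the no-tools pass scores each question with every investigation tool disabled so the result lands on the same score axis as the main benchmark:

```python
def no_tools_prior_rate(questions, ask_model, grade):
    """Share of questions answered correctly with all investigation tools disabled."""
    hits = sum(grade(ask_model(q["prompt"]), q["expected"]) for q in questions)
    return hits / len(questions)

# Toy usage with stand-in callables and invented questions; a real harness would
# call the model API with tools switched off.
questions = [
    {"prompt": "Which host beaconed to the C2 domain first?", "expected": "host-17"},
    {"prompt": "Which IAM user created the access key?",      "expected": "web_admin"},
]
prior = no_tools_prior_rate(
    questions,
    ask_model=lambda prompt: "host-17",            # pretends the model answers from memory
    grade=lambda got, want: int(got.strip() == want),
)
print(f"no-tools prior: {prior:.0%}")              # compare against the scored, tool-using run
```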

Can we trace answers back to evidence?

Loading literal-answer coverage and traceability statuses…

MCP vs CLI

How much does MCP actually buy right now?

Looking for publishable same-model MCP-versus-curl comparisons…

Prompt Effects

Cross-validation and planning loops

Looking for clean cross-validation and OODA/OSCAR-style planning-loop pairs…

Latest Runs

Pass-rate snapshot cards for published benchmark executions

Looking for local run artifacts…

Retry Lab

What happens when a failed config gets more tries?

These rows show configs where the evaluator allows extra retry passes after a miss. That can improve score, but it also blows out time and token budgets.
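A minimal sketch of that accounting, assuming per-attempt fields (passed, time_s, tokens) rather than the evaluator's real schema: every pass is charged even when only the last one converts the miss into a pass.

```python
def score_with_retries(question_attempts, max_passes=3):
    """Score one question under retry passes; all attempted passes count toward spend."""
    passed, time_s, tokens = False, 0, 0
    for attempt in question_attempts[:max_passes]:
        time_s += attempt["time_s"]
        tokens += attempt["tokens"]
        if attempt["passed"]:
            passed = True
            break                      # stop retrying once the question is solved
    return passed, time_s, tokens

# A miss followed by a successful retry: the score improves, but cost roughly doubles.
attempts = [
    {"passed": False, "time_s": 90, "tokens": 12_000},
    {"passed": True,  "time_s": 80, "tokens": 11_000},
]
print(score_with_retries(attempts))    # (True, 170, 23000)
```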

Config Microscope

Drill into one configuration, one question at a time

The map and microscope share one selection. Pick a config here, or click a point in the solve-versus-time map, and inspect its question-level pattern instead of only looking at rollups.

Config Map

Solve rate versus time

Human-readable config labels, not raw run IDs. Pick the score dimension and the time dimension you care about, then click a point to sync the microscope below.

Supporting Boards

Secondary cuts of the same snapshot

These stay here as supporting diagnostics, not the headline benchmark verdict.

Experiment Surface

What we are actually varying

This stays on the page for auditability: it documents the dimensions being varied, but it is not the headline result.