Evaluation dashboard

Regression suite that runs on the same access → retrieval → answer code path as /ask. It validates status, expected sources, leakage, and restricted-match signals. The code path is deterministic; this build includes no generative model.

Total cases

10

Passed

10 / 10

Failed

0

Restricted-behavior cases

3

Expected blocked or no citation of off-limits docs

Citation coverage

60%

6 runs with ≥1 citation

Avg support % · relevance

95% · 16.54

Among runs that returned citations
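The summary figures above can be recomputed from the per-case records. The following is a minimal sketch, not the dashboard's actual code: `RunResult` and `summarize` are hypothetical names, and the citation counts and support percentages are copied from the ten cases on this page. The relevance figure (16.54) is left out because per-case relevance scores are not shown here.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    case_id: str
    passed: bool
    citations: int      # number of citations the run returned
    support_pct: float  # support % reported for the run


def summarize(runs):
    """Aggregate per-case results into the headline numbers."""
    total = len(runs)
    # Support is averaged only over runs that returned citations.
    cited = [r for r in runs if r.citations > 0]
    return {
        "total": total,
        "passed": sum(r.passed for r in runs),
        "failed": sum(not r.passed for r in runs),
        "citation_coverage_pct": round(100 * len(cited) / total) if total else 0,
        "avg_support_pct": round(sum(r.support_pct for r in cited) / len(cited)) if cited else 0,
    }


# (citations, support %) for ev-1 .. ev-10, as listed on this page.
observed = [(3, 95), (0, 0), (3, 95), (3, 95), (0, 0),
            (3, 95), (0, 0), (3, 95), (0, 0), (3, 95)]
runs = [RunResult(f"ev-{i}", True, c, s) for i, (c, s) in enumerate(observed, start=1)]
summary = summarize(runs)
print(summary)
```

Six of the ten runs return at least one citation, which gives the 60% coverage figure; the four blocked or insufficient-evidence runs are excluded from the support average.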

How to read this: each case asserts behavior (status, which documents may appear, restricted signals), not natural-language quality. Asserting behavior this way lets you regression-test the trust layer before adding a generative model to the stack.
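A behavioral assertion of this kind might look like the sketch below. It is an illustration under assumptions, not the suite's real implementation: `EvalCase`, `check_case`, and the `forbidden_sources` field are hypothetical, and "Vendor Contract" is a placeholder title. The expected-source check is written as a subset test, on the assumption (suggested by ev-1) that extra readable documents in the citations do not fail a case.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    expected_status: str
    expected_sources: set = field(default_factory=set)   # must all be cited
    forbidden_sources: set = field(default_factory=set)  # must never be cited


def check_case(case, actual_status, cited_docs):
    """Return a list of failure messages; an empty list means PASS."""
    failures = []
    cited = set(cited_docs)
    if actual_status != case.expected_status:
        failures.append(f"status: expected {case.expected_status}, got {actual_status}")
    missing = case.expected_sources - cited
    if missing:
        failures.append(f"missing expected sources: {sorted(missing)}")
    leaked = case.forbidden_sources & cited
    if leaked:
        failures.append(f"leaked restricted sources: {sorted(leaked)}")
    return failures


# ev-1: an extra readable document in the citations still passes.
ev1 = EvalCase("ev-1", "answered", {"Field Equipment Handling SOP"})
ev1_failures = check_case(
    ev1, "answered",
    ["Construction Site Safety SOP", "Field Equipment Handling SOP"],
)

# ev-2: blocked, no citations; the placeholder title must never appear.
ev2 = EvalCase("ev-2", "blocked", forbidden_sources={"Vendor Contract"})
ev2_failures = check_case(ev2, "blocked", [])

# A leak would be reported as a failure.
leak_failures = check_case(ev2, "answered", ["Vendor Contract"])
```

Because the checks compare statuses and document sets only, the same cases can keep running unchanged once a generative model is added behind the trust layer.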

ev-1

Field Equipment Handling SOP (baseline).

PASS

Q: What should I do if equipment is damaged on site?

Role
field_worker
Status (expected → actual)
answered → answered
Expected source(s)
Field Equipment Handling SOP
Actual cited documents
Construction Site Safety SOP, Field Equipment Handling SOP
Citations / support %
3 · 95%
Restricted matches (signal)
26

Verdict: All checks passed for this case.

ev-2

Vendor contract not readable by Field Worker.

PASS

Q: What does the vendor contract say about termination fees or notice?

Role
field_worker
Status (expected → actual)
blocked → blocked
Expected source(s)
— (blocked or none)
Actual cited documents
(none)
Citations / support %
0 · 0%
Restricted matches (signal)
26

Verdict: All checks passed for this case.

ev-3

IT Onboarding Checklist.

PASS

Q: How is MFA required for remote access during IT onboarding?

Role
it
Status (expected → actual)
answered → answered
Expected source(s)
IT Onboarding Checklist
Actual cited documents
IT Onboarding Checklist
Citations / support %
3 · 95%
Restricted matches (signal)
25

Verdict: All checks passed for this case.

ev-4

Pack v2 — Lost or Stolen Device Procedure (Field Worker allowed).

PASS

Q: What should I do immediately if my company laptop or phone is lost or stolen?

Role
field_worker
Status (expected → actual)
answered → answered
Expected source(s)
Lost or Stolen Device Procedure
Actual cited documents
Lost or Stolen Device Procedure
Citations / support %
3 · 95%
Restricted matches (signal)
26

Verdict: All checks passed for this case.

ev-5

Pack v2 — breach playbook matched but Field Worker cannot read it.

PASS

Q: According to the data breach response playbook, how do we contain a suspected breach and preserve evidence?

Role
field_worker
Status (expected → actual)
blocked → blocked
Expected source(s)
— (blocked or none)
Actual cited documents
(none)
Citations / support %
0 · 0%
Restricted matches (signal)
26

Verdict: All checks passed for this case.

ev-6

Pack v2 — same question, IT may cite breach playbook.

PASS

Q: According to the data breach response playbook, how do we contain a suspected breach and preserve evidence?

Role
it
Status (expected → actual)
answered → answered
Expected source(s)
Data Breach Response Playbook
Actual cited documents
Data Breach Response Playbook
Citations / support %
3 · 95%
Restricted matches (signal)
25

Verdict: All checks passed for this case.

ev-7

Pack v2 — payroll policy restricted for Field Worker.

PASS

Q: How do I submit a payroll change request to update my direct deposit banking information?

Role
field_worker
Status (expected → actual)
blocked → blocked
Expected source(s)
— (blocked or none)
Actual cited documents
(none)
Citations / support %
0 · 0%
Restricted matches (signal)
25

Verdict: All checks passed for this case.

ev-8

Pack v2 — Project Closeout Checklist.

PASS

Q: What closeout documents and signoff steps are required when handing off a completed project?

Role
manager
Status (expected → actual)
answered → answered
Expected source(s)
Project Closeout Document Checklist
Actual cited documents
Project Closeout Document Checklist
Citations / support %
3 · 95%
Restricted matches (signal)
0

Verdict: All checks passed for this case.

ev-9

No token overlap with corpus — insufficient evidence.

PASS

Q: qqqqqqq zzzzz kkkkk yyyyy vvvvv

Role
admin
Status (expected → actual)
insufficient_evidence → insufficient_evidence
Expected source(s)
— (blocked or none)
Actual cited documents
(none)
Citations / support %
0 · 0%
Restricted matches (signal)
0

Verdict: All checks passed for this case.

ev-10

Incident procedure (cross-pack regression).

PASS

Q: What is the incident reporting process after a near miss?

Role
manager
Status (expected → actual)
answered → answered
Expected source(s)
Incident Reporting Procedure
Actual cited documents
Incident Reporting Procedure
Citations / support %
3 · 95%
Restricted matches (signal)
0

Verdict: All checks passed for this case.

Responsible use (demo)

  • Public demo uses synthetic documents only. Do not upload private files in v1.
  • Restricted documents are never revealed to roles that cannot access them.
  • Grounded answers require cited sources; support scores are retrieval heuristics, not guarantees.
  • When in doubt, have a subject-matter expert review the answer in production.
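
One way to keep restricted documents out of answers is to filter retrieval matches by role before anything is cited, and to expose only a count of what was withheld. The sketch below is a simplified assumption about how such a gate could work, not a description of this demo's internals: `Match`, `gate`, the role sets, the scores, and the "Vendor Contract" title are all hypothetical, and a real system may apply score thresholds that this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    doc_title: str
    allowed_roles: frozenset  # roles permitted to read this document
    score: float              # retrieval heuristic, not a guarantee


def gate(role, matches):
    """Split matches into readable and restricted, then pick a status."""
    readable = [m for m in matches if role in m.allowed_roles]
    restricted = len(matches) - len(readable)
    if readable:
        status = "answered"
    elif restricted:
        status = "blocked"  # something matched, but nothing the role may read
    else:
        status = "insufficient_evidence"
    # Only readable titles are ever returned; restricted ones surface as a count.
    cited = sorted({m.doc_title for m in readable})
    return {"status": status, "cited": cited, "restricted_matches": restricted}


matches = [
    Match("Vendor Contract", frozenset({"manager", "admin"}), 0.9),
    Match("Field Equipment Handling SOP", frozenset({"field_worker", "manager", "admin"}), 0.7),
]
field_view = gate("field_worker", matches)
blocked_view = gate("field_worker", matches[:1])
empty_view = gate("admin", [])
```

Returning a restricted-match count rather than titles lets the evaluation suite assert that a restricted document matched without revealing which one.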