Evaluation dashboard
Regression suite that runs on the same access → retrieval → answer code path as /ask. Each case validates status, expected sources, leakage, and restricted-match signals. The code path is deterministic; this build uses no generative model.
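The per-case checks could look roughly like the following. This is a minimal sketch, not the build's actual code: `CaseResult`, `check_case`, and the `allowed_docs` set are hypothetical names, and the suite may define leakage differently. The restricted-match signal is left out because this page does not say how it is evaluated.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    """One evaluation run (hypothetical shape, not the build's real schema)."""
    expected_status: str
    actual_status: str
    expected_sources: set[str] = field(default_factory=set)
    cited_docs: set[str] = field(default_factory=set)


def check_case(result: CaseResult, allowed_docs: set[str]) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    # Status check: expected and actual outcome must match exactly.
    if result.actual_status != result.expected_status:
        failures.append(
            f"status: expected {result.expected_status}, got {result.actual_status}"
        )
    # Expected-source check: every expected document must be cited.
    # Extra citations of permitted documents are tolerated (assumption).
    missing = result.expected_sources - result.cited_docs
    if missing:
        failures.append(f"missing sources: {sorted(missing)}")
    # Leakage check: no cited document may fall outside the role's allowed set.
    leaked = result.cited_docs - allowed_docs
    if leaked:
        failures.append(f"leaked: {sorted(leaked)}")
    return failures
```

Returning a list of messages rather than a single boolean lets a verdict line report every failed check at once.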
Total cases
10
Passed
10 / 10
Failed
0
Restricted-behavior cases
3
Expected blocked or no citation of off-limits docs
Citation coverage
60%
6 runs with ≥1 citation
Avg support % · relevance
95% · 16.54
Among runs that returned citations
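The summary figures can be reproduced from the per-case rows on this page. A minimal sketch, assuming citation coverage is the share of runs with at least one citation and that average support is taken over cited runs only, as the caption says. The tuple layout and `summarize` are illustrative names; relevance is omitted because per-case relevance scores are not shown here.

```python
# (case id, actual status, citation count, support %) copied from the case rows.
CASES = [
    ("ev-1", "answered", 3, 95),
    ("ev-2", "blocked", 0, 0),
    ("ev-3", "answered", 3, 95),
    ("ev-4", "answered", 3, 95),
    ("ev-5", "blocked", 0, 0),
    ("ev-6", "answered", 3, 95),
    ("ev-7", "blocked", 0, 0),
    ("ev-8", "answered", 3, 95),
    ("ev-9", "insufficient_evidence", 0, 0),
    ("ev-10", "answered", 3, 95),
]


def summarize(cases):
    """Compute citation coverage and average support over cited runs."""
    cited = [c for c in cases if c[2] >= 1]
    coverage = 100 * len(cited) / len(cases)
    # Average only over runs that returned citations; zero-citation runs are excluded.
    avg_support = sum(c[3] for c in cited) / len(cited) if cited else 0.0
    return {
        "total": len(cases),
        "cited_runs": len(cited),
        "coverage_pct": coverage,
        "avg_support_pct": avg_support,
    }


print(summarize(CASES))
# {'total': 10, 'cited_runs': 6, 'coverage_pct': 60.0, 'avg_support_pct': 95.0}
```

Excluding zero-citation runs from the support average is what keeps the blocked and insufficient-evidence cases from dragging the figure down.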
ev-1
Field Equipment Handling SOP (baseline).
Q: What should I do if equipment is damaged on site?
- Role: field_worker
- Status (expected → actual): answered → answered
- Expected source(s): Field Equipment Handling SOP
- Actual cited documents: Construction Site Safety SOP, Field Equipment Handling SOP
- Citations / support %: 3 · 95%
- Restricted matches (signal): 26
Verdict: All checks passed for this case.
ev-2
Vendor contract not readable by Field Worker.
Q: What does the vendor contract say about termination fees or notice?
- Role: field_worker
- Status (expected → actual): blocked → blocked
- Expected source(s): — (blocked or none)
- Actual cited documents: —
- Citations / support %: 0 · 0%
- Restricted matches (signal): 26
Verdict: All checks passed for this case.
ev-3
IT Onboarding Checklist.
Q: How is MFA required for remote access during IT onboarding?
- Role: it
- Status (expected → actual): answered → answered
- Expected source(s): IT Onboarding Checklist
- Actual cited documents: IT Onboarding Checklist
- Citations / support %: 3 · 95%
- Restricted matches (signal): 25
Verdict: All checks passed for this case.
ev-4
Pack v2 — Lost or Stolen Device Procedure (Field Worker allowed).
Q: What should I do immediately if my company laptop or phone is lost or stolen?
- Role: field_worker
- Status (expected → actual): answered → answered
- Expected source(s): Lost or Stolen Device Procedure
- Actual cited documents: Lost or Stolen Device Procedure
- Citations / support %: 3 · 95%
- Restricted matches (signal): 26
Verdict: All checks passed for this case.
ev-5
Pack v2 — breach playbook matched but Field Worker cannot read it.
Q: According to the data breach response playbook, how do we contain a suspected breach and preserve evidence?
- Role: field_worker
- Status (expected → actual): blocked → blocked
- Expected source(s): — (blocked or none)
- Actual cited documents: —
- Citations / support %: 0 · 0%
- Restricted matches (signal): 26
Verdict: All checks passed for this case.
ev-6
Pack v2 — same question, IT may cite breach playbook.
Q: According to the data breach response playbook, how do we contain a suspected breach and preserve evidence?
- Role: it
- Status (expected → actual): answered → answered
- Expected source(s): Data Breach Response Playbook
- Actual cited documents: Data Breach Response Playbook
- Citations / support %: 3 · 95%
- Restricted matches (signal): 25
Verdict: All checks passed for this case.
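Cases ev-5 and ev-6 ask the same question under two roles. One way such role-gated behavior could work is sketched below. The ACL table, `answer_status`, and the rule that a query is blocked when everything it matched is off-limits are assumptions for illustration; this page does not show the real policy, and the grant listed here only mirrors what these two cases imply.

```python
# Hypothetical ACL: document title -> roles allowed to read it.
# Only the grant implied by ev-5/ev-6 is listed; other roles are not shown on this page.
ACL = {
    "Data Breach Response Playbook": {"it"},
}


def answer_status(role: str, matched_docs: list[str]) -> tuple[str, list[str]]:
    """Filter retrieval matches by role, then decide the response status."""
    readable = [d for d in matched_docs if role in ACL.get(d, set())]
    if readable:
        return "answered", readable
    if matched_docs:
        # Something matched, but nothing the role may read: refuse and cite nothing.
        return "blocked", []
    return "insufficient_evidence", []


matches = ["Data Breach Response Playbook"]
print(answer_status("field_worker", matches))  # ('blocked', [])
print(answer_status("it", matches))  # ('answered', ['Data Breach Response Playbook'])
```

Filtering before the answer step means a restricted title never reaches the citation list, which is the property the leakage check looks for.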
ev-7
Pack v2 — payroll policy restricted for Field Worker.
Q: How do I submit a payroll change request to update my direct deposit banking information?
- Role: field_worker
- Status (expected → actual): blocked → blocked
- Expected source(s): — (blocked or none)
- Actual cited documents: —
- Citations / support %: 0 · 0%
- Restricted matches (signal): 25
Verdict: All checks passed for this case.
ev-8
Pack v2 — Project Closeout Checklist.
Q: What closeout documents and signoff steps are required when handing off a completed project?
- Role: manager
- Status (expected → actual): answered → answered
- Expected source(s): Project Closeout Document Checklist
- Actual cited documents: Project Closeout Document Checklist
- Citations / support %: 3 · 95%
- Restricted matches (signal): 0
Verdict: All checks passed for this case.
ev-9
No token overlap with corpus — insufficient evidence.
Q: qqqqqqq zzzzz kkkkk yyyyy vvvvv
- Role: admin
- Status (expected → actual): insufficient_evidence → insufficient_evidence
- Expected source(s): — (blocked or none)
- Actual cited documents: —
- Citations / support %: 0 · 0%
- Restricted matches (signal): 0
Verdict: All checks passed for this case.
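The ev-9 case relies on the query sharing no tokens with the corpus. A minimal sketch of such a check, assuming lowercase alphanumeric tokenization; `tokenize`, `has_overlap`, and the sample corpus sentence are hypothetical, and the build's retriever may tokenize or score differently.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def has_overlap(query: str, corpus_texts: list[str]) -> bool:
    """True if the query shares at least one token with any corpus text."""
    q = tokenize(query)
    return any(q & tokenize(doc) for doc in corpus_texts)


# Made-up sample sentence, not taken from the demo corpus.
corpus = ["Report a near miss to your supervisor within 24 hours."]
print(has_overlap("qqqqqqq zzzzz kkkkk yyyyy vvvvv", corpus))  # False
print(has_overlap("near miss reporting", corpus))  # True
```

A query with no overlap gives the retriever nothing to rank, so returning insufficient_evidence instead of a forced answer is the natural outcome.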
ev-10
Incident procedure (cross-pack regression).
Q: What is the incident reporting process after a near miss?
- Role: manager
- Status (expected → actual): answered → answered
- Expected source(s): Incident Reporting Procedure
- Actual cited documents: Incident Reporting Procedure
- Citations / support %: 3 · 95%
- Restricted matches (signal): 0
Verdict: All checks passed for this case.
Responsible use (demo)
- Public demo uses synthetic documents only. Do not upload private files in v1.
- Restricted documents are never revealed to roles that cannot access them.
- Grounded answers require cited sources; support scores are retrieval heuristics, not guarantees.
- In production, have a subject-matter expert review any answer you are unsure about.