Agent evaluation, reliability, and operational scorecards

Agent Eval Lab

A public workspace for small, repeatable evaluation workflows for tool-using AI agents. It connects papers, datasets, demos, package utilities, and scorecard templates.

View portfolio Open main site

Public Research Artifacts

Lightweight Agent Evaluation

Scenario design, expected behavior, failure modes, and readiness scorecards.

Zenodo DOI

AI Eval Forge

Mixed-check regression testing for LLM and agent workflows.

Zenodo DOI

RAG Guardrails

Small-rule checks for prompt injection and vector poisoning risks.

Figshare DOI

Scorecard Template

Area Question Signal
Task fit Does the agent know the boundary of the task? Clear goal, inputs, and stop condition
Tool use Are tool calls traceable and justified? Observable calls and recoverable failures
Reliability Can the workflow be repeated? Fixtures, expected behavior, and regression checks
Readiness Is it safe to widen access? Known risks, review notes, and rollout decision