AV safety matters most in edge cases.
We finally have a reliable way to benchmark and compare these systems when routine displacement metrics are not enough.
Next Challenge Coming in July
FluidTest turns long-tail driving scenes into comparable planner safety evidence, so model teams can see whether their systems introduce additional threats beyond human driving logs.
FluidTest benchmark
Reliable AV evaluation starts where routine metrics are weakest: rare scenes, planner disagreements, and behavior that can introduce new threats beyond the human driving log.
Labelers compare each planner result against the human driver's log and judge whether it brings additional threats.
Submissions are validated, labeled, reviewed by a neuro-symbolic agentic system, and ranked on the leaderboard.
How it works
Right-of-way projection
PoutineSFT parked-vehicle projection
PoutineSFT signal approach
Roadway position
PoutineSFT corridor drift
Join as a labeler
Register, learn the review protocol, pass the test, earn a certificate, and contribute labels that shape how autonomous planners are evaluated.
For AV teams
The FluidTest leaderboard is currently in warmup. To try the baseline, download Val151, run UniPlan inference, and upload the result to preflight; once the projection looks correct, the baseline pipeline is complete. To beat the baseline, train on WOD-E2E, evaluate ADE, run inference on Val151, review the projection, then submit an official result for human and agentic review.
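For teams who want to sanity-check the "evaluate ADE" step locally, here is a minimal sketch of average displacement error, the routine metric that FluidTest's review is meant to complement. It assumes each trajectory is a list of 2D `(x, y)` waypoints sampled at matching timesteps. The function name and that layout are illustrative assumptions, not the official WOD-E2E evaluation code, which may use different horizons or weighting.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and logged waypoints.

    Assumption: both sequences are sampled at the same timesteps, so the
    i-th predicted point is compared with the i-th logged point.
    """
    if len(pred) != len(truth):
        raise ValueError("trajectories must have the same number of waypoints")
    if not pred:
        raise ValueError("trajectories must not be empty")
    total = sum(
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    )
    return total / len(pred)


# Example: the second waypoint is 5 m away from the log, the first matches.
ade = average_displacement_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
print(ade)  # 2.5
```

A low ADE only says the plan stays close to the human log on average; it does not say whether the deviations that remain introduce new threats, which is what the labeled review is for.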
Download Val151, run the UniPlan warmup baseline, and upload the predictions for a projection check.
Use preflight as often as needed to validate JSON format, Val151 coverage, and projected trajectories.
After projection review, each account and team can submit up to 3 official leaderboard results per month.
Track previous submissions and choose public or private visibility after final review is ready.
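Before uploading to preflight, the format and coverage checks described above can be approximated locally. This is a rough sketch only. The submission schema is not specified here, so the code assumes a hypothetical layout: a JSON object that maps each Val151 scene ID to a list of `[x, y]` waypoints. The function name `check_submission` and that layout are illustrative assumptions; the real preflight service defines the actual format.

```python
import json
from typing import Iterable, List


def check_submission(raw_json: str, expected_ids: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means no issues.

    Hypothetical schema: {"<scene_id>": [[x, y], [x, y], ...], ...}
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be an object keyed by scene ID"]

    problems: List[str] = []
    expected = set(expected_ids)
    found = set(data)

    # Coverage: every expected scene needs a prediction, and nothing extra.
    for scene_id in sorted(expected - found):
        problems.append(f"missing scene: {scene_id}")
    for scene_id in sorted(found - expected):
        problems.append(f"unexpected scene: {scene_id}")

    # Shape: each trajectory is a non-empty list of two-number points.
    for scene_id, trajectory in data.items():
        if not isinstance(trajectory, list) or not trajectory:
            problems.append(f"empty or malformed trajectory: {scene_id}")
            continue
        for point in trajectory:
            is_pair = isinstance(point, list) and len(point) == 2
            if not is_pair or not all(
                isinstance(c, (int, float)) and not isinstance(c, bool)
                for c in point
            ):
                problems.append(f"malformed waypoint in scene: {scene_id}")
                break
    return problems
```

In use, `expected_ids` would come from the scene list shipped with the Val151 download. A clean local check does not replace preflight, which also renders the projected trajectories for visual review.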