Next Challenge Coming in July

AV safety in edge cases needs a reliable benchmark.

FluidTest turns long-tail driving scenes into comparable planner safety evidence, so model teams can see whether their systems introduce additional threats beyond human driving logs.

Join as a labeler | Submit and win | View Current Leaderboard

Long-Tail: with Challenging Scenarios

H: Pairwise Safety Preference

Agentic: Explainable and Auditable

FluidTest benchmark

Edge-case safety is finally comparable.

Reliable AV evaluation starts where routine metrics are weakest: rare scenes, planner disagreements, and behavior that can introduce new threats beyond the human driving log.

Edge-case safety

AV safety in edge cases is important.

FluidTest offers a reliable way to benchmark and compare AV planners when routine displacement metrics are not enough.

Human evidence

Planner outputs are judged against human logs.

Labelers compare each planner result against the human driver's log and judge whether it brings additional threats.

Comparable score

Teams can compete on validated safety.

Submissions are validated, labeled, reviewed by a neuro-symbolic agentic system, and ranked on the leaderboard.

How it works

Labelers compare planner behavior against the human driver's log.

Each panel pairs a projected planner path with the threat class a labeler can review before the final neuro-symbolic audit. The red trajectory is the planner's result, and the cyan trajectory is the expert reference trajectory.

Right-of-way projection

Vehicle yield conflict

Projected ego motion enters another road user's priority path instead of holding behind the human log.

S2 yield conflict

PoutineSFT parked-vehicle projection

Stationary vehicle collision

Projected ego occupancy overlaps a parked vehicle while the human log gives it clearance.

S3 collision course

PoutineSFT signal approach

Yellow-phase noncompliance

The planner proceeds through an amber phase it could have safely stopped for, while the human log remains stopped.

S2 signal conflict

Roadway position

Off-road driving

Planner trajectory leaves the normal lane and enters a gore, separator, shoulder, or curbside surface.

S1 departure

PoutineSFT corridor drift

Meaningless lateral drift

The predicted path deviates laterally from the human log inside a narrow corridor without a visible reason.

S1 lateral drift

Join as a labeler

Push AV systems forward with careful human review.

Register, learn the review protocol, pass the test, earn a certificate, and contribute labels that shape how autonomous planners are evaluated.

01 Register
02 Learn and test
03 Get certified
04 Contribute labels

Start labeler training

For AV teams

Warm up on the FluidTest leaderboard.

The FluidTest leaderboard is currently in warmup. To try the baseline, download Val151, run UniPlan inference, and upload the result to preflight; once the projection looks correct, the baseline pipeline is complete. To beat the baseline, train on WOD-E2E, evaluate ADE, run inference on Val151, review the projection, and then submit an official result for human and agentic review.
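For teams checking ADE locally before running inference on Val151, here is a minimal sketch. It assumes ADE means the usual average displacement error: the mean Euclidean distance between matching planner and human-log waypoints. The function name, the waypoint format (lists of 2D points), and the sample values are illustrative assumptions, not the FluidTest or WOD-E2E evaluation code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(
    predicted: Sequence[Point], reference: Sequence[Point]
) -> float:
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(predicted) != len(reference):
        raise ValueError("trajectories must have the same number of waypoints")
    if not predicted:
        raise ValueError("trajectories must not be empty")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical waypoints in metres; not taken from Val151.
planner_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
human_log = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

ade = average_displacement_error(planner_path, human_log)
print(ade)  # 1.0 (per-waypoint distances are 0, 1, and 2)
```

A low ADE only means the planner stays close to the human log on average. It does not show that the planner avoids the threat classes reviewed above, which is why the projection review and official submission steps follow.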

Read Baseline Instructions | Submit and validate | My submissions

Baseline Instructions

Download Val151, run the UniPlan warmup baseline, and upload the predictions for a projection check.

Unlimited Preflight Checks

Use preflight as often as needed to validate JSON format, Val151 coverage, and projected trajectories.
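Before uploading, a team might run a similar format and coverage check locally. The sketch below assumes a hypothetical submission schema: a JSON object mapping each scene ID to a list of `[x, y]` waypoints. The actual FluidTest format is defined in the baseline instructions and may differ. The function name and scene IDs are illustrative only.

```python
import json
from typing import Iterable, List


def check_submission(raw_json: str, expected_scene_ids: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the file passed."""
    try:
        predictions = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(predictions, dict):
        return ["top-level value must be an object keyed by scene ID"]

    problems = []
    expected = set(expected_scene_ids)
    provided = set(predictions)

    # Coverage: every expected scene must be present, with no extras.
    for scene_id in sorted(expected - provided):
        problems.append(f"missing scene: {scene_id}")
    for scene_id in sorted(provided - expected):
        problems.append(f"unexpected scene: {scene_id}")

    # Shape: each trajectory is a non-empty list of [x, y] number pairs.
    for scene_id in sorted(expected & provided):
        path = predictions[scene_id]
        ok = isinstance(path, list) and len(path) > 0 and all(
            isinstance(pt, list)
            and len(pt) == 2
            and all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in pt
            )
            for pt in path
        )
        if not ok:
            problems.append(f"malformed trajectory: {scene_id}")
    return problems


# Hypothetical scene IDs; the real Val151 identifiers come with the download.
expected_ids = ["scene_001", "scene_002"]
sample = '{"scene_001": [[0.0, 0.0], [1.0, 0.5]]}'
problems = check_submission(sample, expected_ids)
print(problems)  # ['missing scene: scene_002']
```

A local check like this catches only format and coverage problems. Whether the projected trajectories look correct still has to be confirmed in preflight.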

Official Submission Limit

After projection review, each account and team can submit up to 3 official leaderboard results per month.

Private Until Released

Track previous submissions and choose public or private visibility after final review is ready.