Leaderboard

Track-specific rankings for MedFlowBench.

This page covers only Track A, B, and C results. The table focuses on scores, methods, and interpretation.

Track A

Viewer-native model rankings sorted by evidence-constrained Strict score.

Track B

Advanced-operation deltas against matched viewer-native controls.

Track C

Runtime-agnostic pipelines that all output answers and evidence in the same schema.

Leaderboard

Track-specific results for models, operations, and pipelines.

Track A is sorted by evidence-constrained Strict score by default. Track B reports deltas from viewer-native controls, and Track C compares runtime-agnostic alternatives.
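As a rough illustration of the default Track A ordering and the Track B deltas, the sketch below sorts entries by Strict score and subtracts each matched control's score from the advanced-operation score. The names `Entry`, `rank_track_a`, and `track_b_deltas`, the name-based tie-break, and the idea that a run and its control share a key are all assumptions made for this example, not MedFlowBench's actual code or matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str      # hypothetical model identifier
    strict: float  # hypothetical Strict score in [0, 1]


def rank_track_a(entries):
    """Sort by Strict score, highest first; ties fall back to name (an assumption)."""
    return sorted(entries, key=lambda e: (-e.strict, e.name))


def track_b_deltas(advanced, controls):
    """Return advanced-operation score minus matched control score, per shared key.

    Keys present in only one of the two mappings are skipped, because a delta
    is only meaningful when a matched control exists.
    """
    return {
        key: round(advanced[key] - controls[key], 4)
        for key in advanced
        if key in controls
    }


if __name__ == "__main__":
    # Illustrative values only; these are not real leaderboard results.
    ranked = rank_track_a([Entry("model-x", 0.5), Entry("model-y", 0.75)])
    print([e.name for e in ranked])
    print(track_b_deltas({"m1": 0.75}, {"m1": 0.5}))
```

A positive delta in this sketch means the advanced operation scored higher than its viewer-native control; a negative delta means it scored lower.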

Strict is the product-critical metric: a final answer receives credit only when the required evidence passes deterministic hidden-reference checks.
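The gating rule above can be sketched as follows. This is a minimal illustration, assuming each result carries a boolean for answer correctness and one boolean per required evidence check, and assuming the reported score is the mean credit over all items. `Result`, `strict_credit`, and `strict_score` are hypothetical names, and the hidden-reference checks themselves are not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Result:
    answer_correct: bool               # did the final answer match the reference?
    evidence_checks: Tuple[bool, ...]  # one pass/fail flag per required evidence check


def strict_credit(result: Result) -> float:
    """Give full credit only if the answer is correct AND every evidence check passed."""
    if result.answer_correct and all(result.evidence_checks):
        return 1.0
    return 0.0


def strict_score(results: Sequence[Result]) -> float:
    """Mean Strict credit over all items (averaging is an assumption of this sketch)."""
    if not results:
        return 0.0
    return sum(strict_credit(r) for r in results) / len(results)
```

Under this sketch, a correct answer with one failed evidence check scores the same as a wrong answer, which is what makes the metric evidence-constrained.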