Leaderboard
Track-specific rankings for MedFlowBench.
This page covers Track A, Track B, and Track C results only. The tables focus on scores, methods, and interpretation.
Track A
Viewer-native model rankings sorted by evidence-constrained Strict score.
Track B
Advanced-operation deltas against matched viewer-native controls.
Track C
Runtime-agnostic pipelines that output the same answer and evidence schema.
Leaderboard
Track-specific results for models, operations, and pipelines.
Track A is sorted by evidence-constrained Strict score by default. Track B reports deltas from viewer-native controls, and Track C compares runtime-agnostic alternatives.
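The default ordering and the delta reporting described above can be sketched in a few lines of Python. This is an illustrative sketch, not MedFlowBench's own tooling: the row layout, the field names (`model`, `strict`), the function names, and the idea of matching each advanced run to its control by model name are all assumptions, and the numbers in the usage lines are made-up placeholders, not benchmark results.

```python
from __future__ import annotations


def rank_track_a(rows: list[dict]) -> list[dict]:
    """Return Track A rows sorted by Strict score, highest first.

    Assumes each row is a dict with a numeric "strict" field (hypothetical layout).
    """
    return sorted(rows, key=lambda row: row["strict"], reverse=True)


def track_b_deltas(
    advanced: dict[tuple[str, str], float],
    controls: dict[str, float],
) -> dict[tuple[str, str], float]:
    """Return Strict-score deltas of advanced-operation runs against controls.

    `advanced` maps (model, operation) to a Strict score; `controls` maps a
    model to its viewer-native Strict score. Matching on the model name is an
    assumption made for this sketch. Runs with no matching control are skipped.
    """
    deltas = {}
    for (model, operation), score in advanced.items():
        if model in controls:
            deltas[(model, operation)] = score - controls[model]
    return deltas


# Usage with placeholder values (not real results).
example_rows = [
    {"model": "model-x", "strict": 0.5},
    {"model": "model-y", "strict": 0.75},
]
ranked = rank_track_a(example_rows)
deltas = track_b_deltas(
    {("model-x", "op-1"): 0.75, ("model-z", "op-1"): 0.25},
    {"model-x": 0.5},
)
```

Skipping unmatched runs rather than treating the missing control as zero keeps a delta from being reported where no matched baseline exists.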
Strict is the product-critical metric: a final answer receives credit only when its required evidence passes deterministic checks against hidden references.
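The evidence gate behind Strict can be illustrated with a short Python sketch. The actual hidden-reference checks are not described on this page, so everything here is an assumption: the `Submission` and `Reference` types, the use of evidence identifiers, the subset test standing in for the evidence check, and the case-insensitive answer comparison are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    answer: str
    evidence: frozenset[str]  # hypothetical evidence identifiers


@dataclass(frozen=True)
class Reference:
    answer: str
    required_evidence: frozenset[str]  # hypothetical hidden reference


def strict_credit(sub: Submission, ref: Reference) -> int:
    """Return 1 only if the answer matches AND all required evidence is present."""
    # Assumed evidence check: every required identifier must be submitted.
    evidence_ok = ref.required_evidence <= sub.evidence
    # Assumed answer check: case-insensitive exact match.
    answer_ok = sub.answer.strip().lower() == ref.answer.strip().lower()
    return int(answer_ok and evidence_ok)


def strict_score(pairs: list[tuple[Submission, Reference]]) -> float:
    """Return the mean Strict credit over (submission, reference) pairs."""
    if not pairs:
        return 0.0
    return sum(strict_credit(sub, ref) for sub, ref in pairs) / len(pairs)
```

Under this sketch a correct answer with incomplete evidence scores zero, which is the behavior that separates Strict from an answer-only accuracy metric.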