Evaluation & Metrics

The evaluation will be based on two aspects: referral performance and justification performance.

Referral performance (P_ref): Assessed as the sensitivity at 95% specificity. From the algorithm's output, we will determine the operating point that yields 95% specificity and report the sensitivity at that point.
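
As an illustration of the referral metric, the following minimal sketch computes sensitivity at a target specificity from per-case scores. It is not the organizers' evaluation code. The function name, the convention that a case is called referable when its score is strictly above the threshold, and the choice of the lowest threshold that reaches at least 95% specificity are all assumptions made for this example.

```python
import numpy as np


def sensitivity_at_specificity(y_true, y_score, target_specificity=0.95):
    """Sensitivity at the operating point reaching the target specificity.

    Assumptions (not from the challenge text): higher score means more
    likely referable, and a case is called referable if score > threshold.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    neg = y_score[~y_true]  # scores of non-referable cases
    pos = y_score[y_true]   # scores of referable cases
    if neg.size == 0 or pos.size == 0:
        raise ValueError("need at least one referable and one non-referable case")

    best = 0.0
    # Try every observed score as a threshold and keep the best sensitivity
    # among thresholds whose specificity meets the target.
    for threshold in np.unique(y_score):
        specificity = np.mean(neg <= threshold)
        if specificity >= target_specificity:
            best = max(best, float(np.mean(pos > threshold)))
    return best
```

With 30 non-referable cases scored 1 to 30 and referable cases scored 100, 50, 29.5 and 5, the threshold 29 gives a specificity of 29/30 and the sketch returns a sensitivity of 0.75.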

Justification performance (P_just): For the referable glaucoma cases, the 10 additional labels assigned by the manual graders will be compared against the additional labels produced by the algorithm, using a modified Hamming distance. If the manual graders disagree on one or more of the additional labels, the algorithm's result will not be evaluated on those labels. The Hamming distance is normalized by the number of labels on which both graders agreed.
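
The modified Hamming distance for a single referable case can be sketched as follows. This is an illustrative reading of the description above, not the official implementation. The function name and the binary encoding of the labels are assumptions, and returning NaN when the graders agree on no label is a choice made here, since the text does not cover that situation.

```python
import numpy as np


def modified_hamming_distance(grader_a, grader_b, predicted):
    """Normalized Hamming distance over the labels both graders agree on.

    Each argument is a sequence of binary labels (10 per case in the text).
    """
    a = np.asarray(grader_a, dtype=bool)
    b = np.asarray(grader_b, dtype=bool)
    p = np.asarray(predicted, dtype=bool)
    if not (a.shape == b.shape == p.shape):
        raise ValueError("all label vectors must have the same length")

    agreed = a == b  # labels on which both graders gave the same answer
    n_agreed = int(agreed.sum())
    if n_agreed == 0:
        # Assumption: the case is undefined when no label is agreed on.
        return float("nan")

    # Count disagreements between the algorithm and the agreed reference.
    mismatches = int((p[agreed] != a[agreed]).sum())
    return mismatches / n_agreed
```

For example, if the graders agree on 8 of the 10 labels and the algorithm differs from them on 1 of those 8, the distance is 1/8 = 0.125, regardless of what the algorithm predicted for the 2 disputed labels.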

The final participant ranking will be calculated as follows:

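S_final = (R_ref + R_just)/2

where R_ref and R_just are the participant's ranks on referral performance and justification performance, respectively.

The final ranking will then be based on S_final, where a lower value of S_final results in a higher ranking.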
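
The combination step can be sketched as below, assuming that R_ref and R_just denote each participant's rank on the two metrics (rank 1 is best), that a higher P_ref is better, and that P_just is a distance for which lower is better. The function names and the tie handling (tied participants share the lowest rank) are assumptions for illustration only, since the text does not specify how ties are resolved.

```python
def competition_ranks(values, higher_is_better):
    """Rank values so that the best gets rank 1; ties share the lowest rank."""
    ordered = sorted(values, reverse=higher_is_better)
    # index() returns the first position, so equal values get the same rank.
    return [ordered.index(v) + 1 for v in values]


def final_scores(p_ref, p_just):
    """Return S_final = (R_ref + R_just) / 2 for every participant.

    p_ref and p_just map a participant name to the metric value.
    """
    names = list(p_ref)
    r_ref = competition_ranks([p_ref[n] for n in names], higher_is_better=True)
    r_just = competition_ranks([p_just[n] for n in names], higher_is_better=False)
    return {n: (rr + rj) / 2 for n, rr, rj in zip(names, r_ref, r_just)}


# Hypothetical participants and metric values, for illustration only.
scores = final_scores(
    {"A": 0.90, "B": 0.85, "C": 0.70},
    {"A": 0.20, "B": 0.10, "C": 0.30},
)
```

In this made-up example, A ranks first on referral and second on justification, B the reverse, and C third on both, so A and B both obtain S_final = 1.5 and C obtains 3.0.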