Methods Inf Med 2012; 51(06): 489-494
DOI: 10.3414/ME12-01-0005
Original Articles
Schattauer GmbH

Measuring Inter-observer Agreement in Contour Delineation of Medical Imaging in a Dummy Run Using Fleiss’ Kappa[*]

G. Rücker (1), T. Schimek-Jasch (2), U. Nestle (2)
1   Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany
2   Department of Radiology, University Medical Center Freiburg, Freiburg, Germany

Publication History

received: 11 January 2012

accepted: 03 July 2012

Publication Date:
20 January 2018 (online)


Summary

Background: In medical imaging used for planning of radiation therapy, observers delineate contours of a treatment volume in a series of images of uniform slice thickness.

Objective: To summarize agreement in contouring between an arbitrary number of observers by a single number, we generalized the kappa index proposed by Zijdenbos et al. (1994).

Methods: Observers characterized voxels by allocating them to one of two categories, inside or outside the contoured region. Fleiss’ kappa was used to measure agreement among n indistinguishable observers. Given the number V_i of voxels contoured by exactly i observers (i = 1, …, n), the resulting overall kappa is representable as a ratio of weighted sums of the V_i.
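The abstract does not reproduce the paper’s exact weighted-sum formula, but the computation it describes can be sketched as follows, under the assumption that the standard two-category Fleiss’ kappa is applied to the set of voxels contoured by at least one observer (voxels marked by no observer are excluded). The function name `overall_kappa` and the input convention (a list `v` with `v[i-1] = V_i`) are illustrative, not the authors’ implementation.

```python
def overall_kappa(v, n):
    """Fleiss' kappa for n observers and two categories (inside/outside).

    v[i-1] is V_i, the number of voxels contoured by exactly i of the
    n observers; voxels contoured by no observer are not counted.
    """
    N = sum(v)  # total number of voxels contoured by at least one observer
    # total number of "inside" ratings across all voxels and observers
    inside = sum(i * v[i - 1] for i in range(1, n + 1))
    p1 = inside / (N * n)  # overall proportion of "inside" ratings
    p0 = 1.0 - p1          # overall proportion of "outside" ratings
    # mean observed pairwise agreement per voxel:
    # a voxel rated "inside" by i observers contributes
    # [i(i-1) + (n-i)(n-i-1)] / [n(n-1)] agreeing observer pairs
    p_bar = sum(
        v[i - 1] * (i * (i - 1) + (n - i) * (n - i - 1))
        for i in range(1, n + 1)
    ) / (N * n * (n - 1))
    p_e = p1 ** 2 + p0 ** 2  # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, with n = 3 observers and counts V_1 = 2, V_2 = 1, V_3 = 3, the sketch yields kappa = 11/65 ≈ 0.17, reflecting the low overlap. Because only the counts V_i enter the formula, the measure cannot and need not distinguish individual observers.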

Results: Overall kappa was applied to analyze inter-center variation in a multicenter trial on radiotherapy planning in patients with locally advanced lung cancer. A contouring dummy run was performed within the quality assurance program. Contouring was done twice, once before and once after a training program. Observer agreement improved from 0.59 (95% confidence interval (CI) 0.51 – 0.67) to 0.69 (95% CI 0.59 – 0.78).

Conclusion: In contrast to average pairwise indices, overall kappa measures observer agreement for more than two observers using the full information about overlapping volumes, while not distinguishing between observers. It is particularly suitable for measuring observer agreement when identification of individual observers is not possible or not desirable and when there is no gold standard.

* Supplementary material published on our website www.methods-online.com