Discussion Points

1. Cruz et al [1] (Cruz CO, Meshberg EG, Shofer FS, et al. Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009;54:1-7) contains 2 parts, a comparison of the values gathered by trained research assistants and physicians about historical information in chest pain patients and a comparison of these participants' recordings with a "correct" value for each item.

A. For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

B. What did the authors use as their criterion standard for the validity analysis?

C. What are potential problems with their method of defining the criterion (gold) standard? Can you think of alternative approaches?

D. The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

2. Consider the following 2×2 table:

                     MD Recorded "Yes"   MD Recorded "No"   Total
    RA recorded yes        117                  6            123
    RA recorded no          18                  2             20
    Total                  135                  8            143

    MD, medical doctor; RA, research assistant.

A. Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

B. Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1.

C. What other measures can be used to measure reliability for binary, categorical, and continuous data?

3. Cruz et al quote the oft-cited Landis and Koch [2] (Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174) article stating that a κ of "less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement." Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

4. A. Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as "red is a color," "2+2=5," etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for "false" and in the right for "true." Questions are not repeated, and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100 and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker.
When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; and (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

B. Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean? Imagine that the right-hand table was from the true/false experiment described above and that planes were flying so frequently that every question was somewhat difficult to hear. Imagine 2 scenarios: in the first, both raters are told that there are 80 true statements and 20 false statements. In the second, raters are told that there could be 100 true statements with no false statements, 100 false statements with no true statements, or any combination in between, with each having an equal probability of occurring. Does κ mean the same thing in these 2 situations?

C. To further consider the meaning of κ, imagine that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements. Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?

5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2×2 table.

[Tables 1 and 2]

A. Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2×2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?

B. Can you comment on the relationship between the size of the smallest cell in the 2×2 table and the extent to which κ may deviate from percentage agreement?

C. Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2×2 table instead of reporting the percentage agreement or κ?

Answer 1

Q1. Cruz et al [1] contains 2 parts, a comparison of the values gathered by trained research assistants and physicians regarding historical information in chest pain patients, and a comparison of these participants' recordings with a "correct" value for each item.

Q1.a For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

The first part is an assessment of reliability, and the second is an assessment of validity.
The distinction between reliability and validity is an important one. At the racetrack, handicappers may unanimously agree (100% interrater reliability) that Galloping George will win the third race. When he comes in dead last, however, track aficionados receive a painful reminder that even a perfectly reliable analysis does not guarantee a valid result. The reliability of a test speaks only to the agreement obtained when multiple fallible observers independently conduct the test on the same persons, specimens, or images. In contrast, an assessment of validity compares a fallible observer against a criterion, or "gold," standard. Because the criterion standard is assumed to be correct, validity studies typically report the comparative performance of the fallible observer using statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.

Q1.b What did the authors use as their criterion standard for the validity analysis?

When the physician and the research assistant agree, it is assumed that their answer is correct. When they disagree, a different research assistant has the patient select which of the 2 discrepant answers is "correct."

Q1.c What are potential problems with their method of defining the gold (criterion) standard? Can you think of any alternative approaches?

Defining a gold (criterion) standard for this study is not trivial. For any item, there can be 2 truths: what the patient answers and what is actually true. For example, a patient asked "Do you have pain in the epigastric region?" might say yes, thinking that "epigastric" is a fancy word for butt cheeks, when in fact the true answer is "no." Or a patient might say that his cholesterol is normal despite its being 300 because he does not understand the laboratory results his physician shared with him.

What, then, is the criterion standard for this study: what the patient said, what the patient should have said, or what the patient would say if the information were optimally elicited? The answer, of course, is that we have no way to know what the "true" answer is. Most emergency medicine residents have had the experience of reporting some part of a patient's history to their attending physician and soon thereafter hearing the patient give the attending physician a completely different history! This "answer drift" could be because one or the other of the physicians asked the question in a manner that was clearer to the patient, or because extra time or reflection resulted in the patient expressing a different answer. We do not know whether the answers given on the second or third interview are more truthful, or whether they are provided to appease the interviewers and end the questioning. Some patients may be too confused, in too much pain, or too distracted to give an accurate reply.

A better approach, which the authors acknowledge, would have been to randomize the order in which the research assistant and physician interviewed each patient. That should equally distribute and minimize the effect of any bias related to a change in answer accuracy with repeated questioning.

Q1.d The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

The interquartile range (IQR) refers to the middle 50% of a distribution. The original definition of the IQR is a single number that represents the distance from the 25th percentile to the 75th percentile, though this format is seldom used. Instead, investigators typically present the 25th and 75th percentiles (in the format [25th percentile, 75th percentile]), from which the "real" IQR can easily be gleaned by subtraction.

In statistics, the term "quartiles" refers to the 3 points that divide a distribution into 4 equal parts; in epidemiology, the term is typically used to signify these 4 equal parts. The second quartile is called the "median," the first the "25th percentile," and the third the "75th percentile." The IQR is the difference between the third and first quartiles. It is a more robust (less influenced by outlier observations) descriptive statistic than the range of a distribution, and it is more relevant when data are not in the shape of a classic bell curve (ie, not "normally distributed").

[Figure 2. Adapted from http://en.wikipedia.org/wiki/Interquartile_range]
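To make the arithmetic concrete, here is a minimal Python sketch (ours, with invented ages rather than data from Cruz et al) that computes the 25th and 75th percentiles and the IQR:

    import numpy as np

    # Hypothetical example data (not from Cruz et al): ages of 11 patients.
    ages = np.array([23, 31, 35, 38, 41, 44, 47, 52, 58, 66, 90])

    q25, q50, q75 = np.percentile(ages, [25, 50, 75])
    iqr = q75 - q25  # the IQR as originally defined: a single number

    print(f"median = {q50}, IQR reported as [{q25}, {q75}], width = {iqr}")
    # Note how the 90-year-old outlier stretches the range (90 - 23 = 67)
    # but leaves the IQR untouched.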
For those questions in which the research assistant and physician did not agree, the authors report (by category) the percentage agreement with the "correct" answers (as determined by the tiebreaker criterion standard). Percentage agreement is a reasonable statistic for a reliability assessment, but it is not the appropriate statistic to describe this validity assessment (a comparison of a fallible observer with a criterion standard). Studies that are designed to estimate a test's validity should report statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.
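As a sketch of what such a validity analysis would compute (the counts below are invented for illustration and are not from Cruz et al), sensitivity, specificity, and likelihood ratios follow directly from a 2×2 table of observer answers versus the criterion standard:

    # Hypothetical validity table (observer vs criterion standard); counts are invented.
    tp, fp = 40, 10   # observer "yes" when criterion standard is positive / negative
    fn, tn = 5, 45    # observer "no"  when criterion standard is positive / negative

    sensitivity = tp / (tp + fn)               # P(observer yes | truly positive)
    specificity = tn / (tn + fp)               # P(observer no  | truly negative)
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio

    print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
          f"LR+={lr_pos:.1f}, LR-={lr_neg:.2f}")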
Answer 2

Q2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table (Table 1) for the question, "Was the quality of the chest pain crushing? (yes or no)":

    Table 1
                     MD Recorded "Yes"   MD Recorded "No"   Total
    RA recorded yes      117 [a]              6 [b]          123
    RA recorded no        18 [c]              2 [d]           20
    Total                135                  8              143

    MD, medical doctor; RA, research assistant.

Q2.a Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

The 2 observers both agreed "yes" 117 times and both agreed "no" 2 times. Thus, crude percentage agreement for this table is (117+2)/143 = 83.2%. Percentage agreement can range between 0% and 100%.

Q2.b Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1.

The κ statistic, introduced by Cohen [3] (Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46) in 1960, is defined as:

    κ = (observed agreement – agreement expected due to chance) / (1 – agreement expected due to chance)

κ is easily calculated with statistical software, but we will discuss the manual method as well. For 2×2 contingency tables, it is customary to refer to the inner cells by letters, with [a] and [b] on the top row and [c] and [d] just below. The outer 5 cells represent various row and column totals of the 4 inner cells a to d. Because these 5 cells occupy the margin of the table, they are often referred to as "marginal totals." Note that if one knows the values of inner cells a to d, one can calculate all 5 marginal totals. The reverse is not true; in most circumstances one cannot determine the inner cells from the marginal totals. The agreement cells for this table (where the 2 raters both recorded the same thing, either yes or no) are [a] and [d]. κ uses the marginal totals to calculate the count expected due to chance for each agreement cell; these are summed and divided by the total to determine the expected percentage agreement. Using the values from Table 1, κ is calculated as follows:

(i) The expected value of cell [a] due to chance alone is:

    [a] expected = ([a+b] × [a+c]) / [a+b+c+d] = (123 × 135) / 143 = 16,605 / 143 = 116.1

(ii) The expected value of cell [d] due to chance alone is:

    [d] expected = ([c+d] × [b+d]) / [a+b+c+d] = (20 × 8) / 143 = 160 / 143 = 1.1

(iii) The percentage agreement due to chance alone is:

    % agreement due to chance = ([a] expected + [d] expected) / [a+b+c+d] = 117.2 / 143 = 0.82

At the beginning of this section, we calculated that observed agreement was 83.2%, or 0.832. Now we have determined that the expected agreement due to chance alone is 0.820. Plugging these 2 numbers into the κ formula yields:

    κ = (0.832 – 0.820) / (1 – 0.820) = 0.07

κ was introduced as a "coefficient of agreement for nominal scales" [3] intended to measure agreement beyond chance. κ can range from –1 (with negative numbers indicating that observed agreement occurs less often than expected by chance) to 1 (perfect agreement, when observed percentage agreement is 100% regardless of the percentage agreement expected due to chance). A κ of zero signifies that observed agreement is exactly that expected by chance alone (percentage observed agreement = percentage expected agreement). An inherent assumption of the κ statistic is that the marginal totals of the observed agreement table adequately define "chance" agreement. This assumption, like many of the assumptions in classic statistics, implies that all observations are independent, identically distributed, and drawn from the same probability density function. Under these very limited and strict conditions, "chance" agreement will be a function of the observed marginals. We explain these assumptions in layman's terms in subsequent questions.
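The same arithmetic is easy to script. Here is a minimal Python sketch (ours, not part of the original article) that computes percentage agreement and Cohen's κ from the 4 inner cells and reproduces the values above:

    def percent_agreement_and_kappa(a, b, c, d):
        """Percentage agreement and Cohen's kappa for a 2x2 agreement table.

        a = both raters "yes", d = both raters "no",
        b and c = the two disagreement cells.
        """
        n = a + b + c + d
        observed = (a + d) / n
        # Chance-expected agreement, computed from the marginal totals.
        expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
        kappa = (observed - expected) / (1 - expected)
        return observed, kappa

    # The crushing-chest-pain table from the text: a=117, b=6, c=18, d=2.
    obs, kappa = percent_agreement_and_kappa(117, 6, 18, 2)
    print(f"observed agreement = {obs:.3f}, kappa = {kappa:.3f}")  # ~0.832 and ~0.07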
Q2.c What other measures can be used to measure reliability for binary, categorical, and continuous data?

Reliability can be measured with a multitude of methods. An excellent review [4] (Uebersax J. Statistical methods for rater agreement. http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm) emphasizes that there is little consensus about which is "best" and that no one method is appropriate for all occasions.

It is important to consider what kinds of data are being compared. Categorical (also called discrete) variables take on a small, finite number of values. These qualitative variables include nominal variables (no meaningful order, such as disposition = admitted, transferred, or home) and ordinal variables (ordered in a meaningful sequence, such as Glasgow Coma Scale score = 3 to 15). A binary variable is a categorical variable with only 2 options (female or male; yes or no). Continuous variables (such as pulse rate) can theoretically take on an infinite number of values (a patient's pulse could be precisely 88.228 beats/min), but both clinical relevance and measurement accuracy effectively categorize most continuous variables (pulse rate is estimated to the nearest integer). Many reliability measurements are intended for use only with continuous variables, and one must decide whether a variable is "continuous enough" to permit their use.

[Figure 3. Adapted from http://en.wikipedia.org/wiki/File:Correlation_examples.png]

Because correlation coefficients measure only how well observed data fit a straight line, a correlation coefficient of zero may indicate that the 2 variables are not associated with each other (or are "independent," as in the middle of the top row) or may be missing a more complex but potentially meaningful nonlinear association (as in the bottom row).

Another limitation of using correlation is that 2 judges' scores could be highly correlated but show little agreement, as in the following example [5] (Wuensch KL. Inter-rater agreement. http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc):

    Subject    1    2    3    4    5
    Rater A   10    8    6    4    2
    Rater B    6    5    3    2    1

The Kendall [6] (Kendall M. A new measure of rank correlation. Biometrika. 1938;30:81-89) and Spearman [7] (Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72-101) coefficients measure the degree of correlation between 2 rankings. These coefficients require ordinal, not simply nominal, data. Kendall's S is a simple way to measure the strength of a relationship between 2 rankings: S = C (the number of agreement pairs) – D (the number of disagreement pairs). A preponderance of agreement pairs (resulting in a large positive value of S) indicates a strong positive correlation between the 2 variables; a preponderance of disagreement pairs (resulting in a large negative value of S) indicates a strong inverse correlation. A disadvantage of S is that its range depends on the sample size, but a simple standardization (computed as τ = 2S / [n(n–1)]) gets around this problem, and Kendall's τ always ranges between –1 and 1. Spearman's ρ involves a more complicated, less intuitive calculation [8] (Noether GE. Why Kendall tau? http://rsscse.org.uk/ts/bts/noether/text.html) and is equivalent to Kendall's τ in terms of ability to measure correlation.
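As a quick check (our snippet, assuming SciPy is available; kendalltau and spearmanr are the scipy.stats functions), both rank-correlation coefficients equal 1.0 for the 2 judges above, even though the raters never assign the same score:

    from scipy.stats import kendalltau, spearmanr

    # The two judges' scores from the example quoted in the text.
    rater_a = [10, 8, 6, 4, 2]
    rater_b = [6, 5, 3, 2, 1]

    tau, _ = kendalltau(rater_a, rater_b)
    rho, _ = spearmanr(rater_a, rater_b)

    # Both equal 1.0: the raters rank the subjects identically,
    # yet their exact-score agreement is 0%.
    print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")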
The intraclass correlation (ICC) can also be used to measure reliability [9] (Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428). The ICC compares the variance among multiple raters (within a subject) with the overall variance (across all ratings and all subjects). Imagine that 4 physicians (raters) use a decision aid to independently estimate the likelihood of acute coronary syndrome in each of 20 patients (subjects). The 2 graphs (Figure 4) show 4 estimates (1 dot for each physician) for each patient. In the upper graph, the 4 raters give similar ratings for each patient. The variation in ratings for any given patient is small compared with the total variance of all the ratings. Said another way, there is more variance in the ratings among patients than there is in the ratings within patients. A high ICC suggests that the raters have good correlation (when one rater scores a subject high, so do the others), and the ratings for each patient tend to be clustered. In the bottom graph, ratings within each subject are all over the place; here the ICC would be lower because the raters are not highly correlated. A number of ICC estimators have been proposed within the framework of ANOVA. Unfortunately, the various ICC statistics can produce markedly different results when applied to the same data. We believe that the pictures tell the most complete story about agreement and are free from the assumptions made by the various statistics.

[Figure 4. Adapted from http://en.wikipedia.org/wiki/Intra-class_correlation]

The Bland-Altman approach is a graphic presentation of agreement data that plots, for each subject, the difference between the 2 measurements against their mean [10] (Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307-310). Consider a study that measures peak expiratory flow rate with 2 different meters in 17 patients:

[Figure 5]
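A Bland-Altman plot is straightforward to produce. The sketch below (ours) uses simulated peak-flow readings rather than the original 17 patients' values, and assumes numpy and matplotlib are available:

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated peak expiratory flow data for illustration only -- these are NOT
    # the 17 patients from Bland and Altman's 1986 paper.
    rng = np.random.default_rng(0)
    true_pefr = rng.uniform(200, 650, size=17)             # "true" flow rates, L/min
    meter_1 = true_pefr + rng.normal(0, 20, size=17)       # measurement error, meter 1
    meter_2 = true_pefr + 10 + rng.normal(0, 20, size=17)  # meter 2 with a small bias

    mean_of_pair = (meter_1 + meter_2) / 2
    difference = meter_1 - meter_2
    bias = difference.mean()
    loa = 1.96 * difference.std(ddof=1)                    # 95% limits of agreement

    plt.scatter(mean_of_pair, difference)
    plt.axhline(bias, linestyle="--", label="mean difference (bias)")
    plt.axhline(bias + loa, linestyle=":", label="limits of agreement")
    plt.axhline(bias - loa, linestyle=":")
    plt.xlabel("Mean of the 2 meters (L/min)")
    plt.ylabel("Meter 1 - Meter 2 (L/min)")
    plt.legend()
    plt.show()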
Answer 3

Q3. Cruz et al quote the oft-cited Landis and Koch [2] article stating that a κ of "less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement." Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

Many investigators contrast their κ values with arbitrary guidelines originally proposed by Landis and Koch [2] and further popularized by Fleiss [12] (Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons; 1981). As we hope our example demonstrates, the mechanical mapping of numeric values of κ to the adjectives poor, fair, moderate, good, and excellent is fraught with problems. A κ of 0.75 might be good enough if the cost of being wrong is low (such as categorizing subjects into personality types), but nothing less than near-perfect agreement is requisite if the decision has important consequences. We would not be pleased if our airplane's copilots attained a κ of 0.75 on "Is it safe to land?" Some tests (eg, a set of historical questions used to identify patients at high risk for alcohol addiction) might be useful even if their results are only somewhat reliable. Other tests, however (eg, a set of history and physical examination data used to identify which patients with traumatic neck pain can safely forgo cervical spine radiography), will be useful only if they are highly reliable. This is because no poorly reliable test will ever be highly valid when used by multiple fallible observers. Conceptualizing any specific degree of agreement as poor, excellent, or anywhere in between regardless of the test's clinical context is, therefore, a dangerous oversimplification.

Answer 4

Q4.a Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as "red is a color," "2+2=5," etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for "false" and in the right for "true." Questions are not repeated, and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100 and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; and (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

This example is designed to show that in certain situations κ can underestimate actual agreement, particularly when there are skewed marginals and when percentage agreement is fairly high.

Recall that κ = (% observed agreement – % expected agreement) / (100% – % expected agreement). Thus, when 2 raters agree on all observations, regardless of how these are distributed between true and false or what the expected agreements are, κ = 1. Tables 1 and 2 show 2 of the possible results under condition 1. Note that we do not know whether the one incomprehensible statement was true or false, or how each rater will classify it. We do know that because only 1 of the 100 observations was made with low confidence, all possible results will yield very similar percentage agreement and κ.

Tables 3 and 4 show 2 possible results under condition 2. The preponderance of true questions has skewed the marginal totals. Consequently, how each rater classifies the one unheard question affects κ a bit more than when the marginals are roughly equal.

[Tables 3 and 4]

Recognize that in these first 2 sets of conditions, the raters are asked to rate 99 easy statements and 1 hard (plane flying overhead) statement. The expected agreement should be the same in all 4 tables, yet the value of percentage agreement expected due to chance, as calculated for κ, changes. κ is lower in Table 4 than in Table 2, even though raters are performing equally well in both. Observed percentage agreement also varies, but only slightly.
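One way to see how κ behaves across the 4 conditions is to simulate the experiment. The sketch below (ours, not from the article) assumes both raters answer every audible statement correctly and guess true or false with equal probability, independently, on inaudible statements, which makes them agree about half the time on those items:

    import numpy as np

    rng = np.random.default_rng(42)

    def kappa_2x2(a, b, c, d):
        """Percentage agreement and Cohen's kappa for a 2x2 agreement table."""
        n = a + b + c + d
        observed = (a + d) / n
        expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
        if expected == 1:        # degenerate table: every answer falls in one cell
            return observed, 1.0
        return observed, (observed - expected) / (1 - expected)

    def simulate(n_true, n_inaudible, n_items=100, reps=2000):
        """Average agreement and kappa when both raters answer audible items
        correctly and guess 50/50, independently, on inaudible items."""
        agreements, kappas = [], []
        for _ in range(reps):
            truth = np.array([True] * n_true + [False] * (n_items - n_true))
            inaudible = np.zeros(n_items, dtype=bool)
            inaudible[rng.choice(n_items, size=n_inaudible, replace=False)] = True
            r1 = np.where(inaudible, rng.random(n_items) < 0.5, truth)
            r2 = np.where(inaudible, rng.random(n_items) < 0.5, truth)
            a = int(np.sum(r1 & r2))
            d = int(np.sum(~r1 & ~r2))
            b = int(np.sum(r1 & ~r2))
            c = int(np.sum(~r1 & r2))
            obs, k = kappa_2x2(a, b, c, d)
            agreements.append(obs)
            kappas.append(k)
        return np.mean(agreements), np.mean(kappas)

    # Conditions 1-4 from the question: (number of true statements, number inaudible).
    for n_true, n_inaud in [(50, 1), (90, 1), (50, 20), (90, 20)]:
        obs, k = simulate(n_true, n_inaud)
        print(f"{n_true} true, {n_inaud} inaudible: agreement ~ {obs:.2f}, kappa ~ {k:.2f}")
    # Observed agreement falls only with the amount of noise, whereas kappa also
    # falls as the marginals become skewed (90 true), even though the raters
    # behave identically in all 4 conditions.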
Table 5 shows the most likely result under condition 3 (50% of statements are true and 20% are incomprehensible). This can be derived by considering the 80 high-confidence and 20 low-confidence classifications separately (Tables 6 and 7). The audible questions will result in Table 6. If each rater has no knowledge about how often inaudible statements are true, then the modal result for the 20 inaudible questions is that depicted in Table 7. Summing Tables 6 and 7 results in Table 5.

[Tables 5, 6, and 7]

Of course, by chance alone, the 20 unheard questions might result in a more skewed table, like either Table 8 (all observations falling into disagreement cells b or c) or Table 9 (all observations falling into agreement cells a or d). Summing these results with Table 6 yields Tables 10 and 11, respectively. Thus, depending on how the incomprehensible 20 questions end up being classified, percentage agreement and κ could range from 80% and 0.615 to 100% and 1.

[Tables 8, 9, 10, and 11]

Table 12 shows a possible result under condition 4 (90% of statements are true and 20% are incomprehensible), again derived by considering the 80 high-confidence and 20 low-confidence classifications separately. The audible questions will result in Table 13. If both raters believe that 90% of the unheard questions are true, those 20 questions might result in Table 14. κ is negative for this table because the observed agreement (80%) is less than that expected due to chance ((0.9 × 0.9) + (0.1 × 0.1) = 0.82 = 82%). Summing Tables 13 and 14