Skip to main content

Challenging script concordance test reference standard by evidence: do judgments by emergency medicine consultants agree with likelihood ratios?



We aimed to compare the clinical judgments of a reference panel of emergency medicine academic physicians against evidence-based likelihood ratios (LRs) regarding the diagnostic value of selected clinical and paraclinical findings in the context of a script concordance test (SCT).


A SCT with six scenarios and five questions per scenario was developed. Subsequently, 15 emergency medicine attending physicians (reference panel) took the test and their judgments regarding the diagnostic value of those findings for given diseases were recorded. The LRs of the same findings for the same diseases were extracted from a series of published systematic reviews. Then, the reference panel judgments were compared to evidence-based LRs. To investigate the test-retest reliability, five participants took the test one month later, and the correlation of their first and second judgments were quantified using Spearman rank-order coefficient.

In 22 out of 30 (73.3%) findings, the expert judgments were significantly different from the LRs. The differences included overestimation (30%), underestimation (30%), and judging the diagnostic value in an opposite direction (13.3%). Moreover, the score of a hypothetical test-taker was calculated to be 21.73 out of 30 if his/her answers were based on evidence-based LRs.

The test showed an acceptable test-retest reliability coefficient (Spearman coefficient: 0.83).


Although SCT is an interesting test to evaluate clinical decision-making in emergency medicine, our results raise concerns regarding whether the judgments of an expert panel are sufficiently valid as the reference standard for this test.


Script concordance test (SCT) has become a recognized tool to assess clinical reasoning in various fields including emergency medicine [1]-[14]. This case-based test consists of short clinical scenarios followed by questions regarding diagnosis or management. The questions are presented in three parts: A) a diagnostic or management option, B) a clinical finding, and C) a scale to capture examinee's decision (Figure 1) [15]. The test is based on measuring the concordance of test-taker judgments with those of a reference panel of experts [15]. Expert physicians usually organize their knowledge regarding diseases in `illness scripts, and when they encounter patients, they effortlessly recall the relevant scripts and promptly recognize the most appropriate courses of action [16]. SCT is indeed an effort to capture how close the scripts of test-takers are with the scripts of experts, and the rationale behind it is the more close to the experts' scripts, the better the decision-making by the test-takers. However, the expert judgments are reported to be frequently incorrect [17] and therefore, the reference standard of the test, which is the expert judgments, seems to be not necessarily valid. The test is mainly used to assess reasoning in uncertain situations [15] in which robust evidence is usually limited. However, that the test reference standard is not necessarily valid is still a critical issue and should be carefully investigated.

Figure 1

Sample clinical scenario and questions.

The diagnostic value of clinical and/or paraclinical findings is an appropriate context in which expert opinions can be compared with the best current evidence. At one hand, findings can be presented to experts and how such findings would modify the experts' diagnostic judgments, regarding the likelihood of particular diseases, can be measured. On the other hand, the expected effect of the presented findings on the likelihood of the same disease can be sought from the best current evidence. According to Bayes' theorem, the likelihood ratio (LR) of any diagnostic finding is a precise indicator of the expected change in the likelihood of that disease if the suspected individual has that particular finding [18]. Fortunately, LRs for a wide variety of clinical and paraclinical findings are either available or calculable based on robust studies [19]. Hence, we aimed to seek the judgments of a panel of emergency medicine experts regarding the diagnostic value of select clinical and paraclinical findings, acquire the evidence-based LRs for the same findings, and finally compare the judgments against the LRs.


Study design and setting

We invited all emergency medicine attending physicians (consultants) of the main teaching hospitals of two academic universities (Iran University of Medical Sciences and Tehran University of Medical Sciences) to participate in our study. The two teaching hospitals have an ED yearly census of over 90,000 patients. Participating consultants consented to be enrolled after receiving detailed explanations regarding the purpose and the design of the study. The required sample size was 15 according to the SCT development guidelines [15].

Test development

We developed a test containing six clinical scenarios on the following emergent conditions: 1) meningitis, 2) myocardial infarction, 3) pneumonia (in a child), 4) thoracic aortic dissection, 5) appendicitis, and 6) congestive heart failure. Each scenario was followed by five questions, and each question was intended to measure the judgments of our panel of consultants regarding the diagnostic value of the presence or absence of a clinical, laboratory or imaging finding. To develop the test, two investigators (SK and SFA) studied a series of systematic reviews containing a wide variety of clinical scenarios and the corresponding evidence-based LRs for the related clinical and paraclinical findings [19]. The investigators selected the scenarios and findings that could properly represent diagnostic challenges in the emergency room. Afterwards, they designed a SCT based on those scenarios and findings according to the recommended guidelines [15]. The only variation from the ordinary SCT was using 10-cm visual analog scales (VASs) instead of five-point Likert scales since both tools yield comparable measurements [20],[21] while VAS was also able to quantify the judgments of the reference panel. In addition to the main test, a separate sample scenario with two questions was developed and utilized to familiarize the participants with the test-taking process, so that they could completely understand how the test works before taking the main test (Figure 1).

Test validation

Prior to the main experiment, the content validity of the test was carefully evaluated and confirmed by an emergency medicine expert (PHM) and a medical education expert (KSA). In addition, we invited five expert participants to take the test again one month after the main experiment in order to measure the test-retest reliability. The correlation of the two sets of responses was measured using Spearman rank-order coefficient.

Data collection

After a brief orientation, the consultants received the main test in their private office and answered the questions while having no access to any medical resources. To answer each question regarding the diagnostic value of a finding, they marked a point on the VAS. The numbers equivalent to the VAS markings were identified using an ordinary ruler.

Analytical approach

The numbers representing the consultants' judgments were multiplied by 2 in order to rescale the original range of ?5 to 5 to a range of ?10 to +10. For the LR values, LRs >10 or <0.1 were considered 10 and 0.1, respectively, as we needed to establish a bounded LR range. This conversion seemed rational as an LR?=?10 is considered sufficiently large to rule-in a disease and an LR?=?0.1 is considered sufficiently small to rule-out a disease [22], and whether an LR is 10 or higher, or whether it is 0.1 or lower is not substantially different for clinical reasoning purposes. Subsequently, LR values were converted to `10xlog(LR) in order to convert their naturally geometric scale to an arithmetic scale. We used one-sample t test to compare the transformed mean judgments with the corresponding transformed LRs. In addition, the score of a hypothetical test-taker was calculated if his/her answers were based on evidence-based LRs, and the answers were scored based on the judgments of our consultants as the reference standard. The calculation is described elsewhere [15]. For statistical analysis, IBM SPSS Statistics 19 was used. A P?<?0.05 was considered significant.


Participant characteristics

Fifteen emergency medicine consultants consented to participate in our study, from which 13 consultants were board certified in emergency medicine and the other two consultants were board certified in internal medicine and pediatrics, respectively, with additional fellowship training in emergency medicine. The mean age, clinical practice experience, and emergency medicine experience were 35.9, 10.3, and 6.6 years, respectively. The Spearman coefficient was 0.83 for the two sets of answers from a subset of five consultants.

Comparison of the reference panel judgments against the evidence-based LRs

We have summarized the results of comparing values representing consultants judgments with evidence-based LRs in Table 1. Our results showed that in 22 out of 30 (73.3%) findings, the mean judgments were significantly different from the corresponding LRs. Our results also demonstrated that consultants overestimated the value of the 9 (30%) findings and underestimated the value of another 9 (30%) findings. In addition to the above discrepancies regarding the magnitude of the diagnostic value, the consultants chose a different direction (regarding ruling in or ruling out) for 4 (13.3%) findings compared to the evidence-based LRs.

Table 1 Comparison of the participants' judgments with likelihood ratios (LRs)

Subgroups of positive and negative findings

When positive and negative findings (presence or absence of findings) were considered separately, we noted a significant difference between the consultants' judgments and the LRs in 17 out of 20 (85%) positive findings and 5 out of 10 (50%) negative findings. The diagnostic values of 8 (40%) positive and 1 (10%) negative findings were overestimated and the values of 7 (35%) positive and 2 (20%) negative findings were underestimated by the consultants. Moreover, the judgments were in opposite direction to the LRs in 2/20 (10%) and 2/10 (20%) of the positive and negative findings, respectively.

Subgroups of history, physical examination, and laboratory findings

When we calculated the percentage of significantly different consultants' judgments from the corresponding LRs in subgroups of findings from history, physical examination, and laboratory/imaging findings, we observed comparable percentages for findings of history (6 out of 8: 75.0%), physical examination (10 out of 14: 71.4%), and laboratory/imaging (6 out of 8: 75%). However, physical examination findings were more frequently overestimated (25%, 35.7%, and 25% for history, physical examination, and laboratory findings, respectively) and less frequently underestimated (37.5%, 21.4%, and 37.5% for history, physical examination, and laboratory findings, respectively).

The score of the hypothetical test-taker

The calculated score of a hypothetical test-taker was 21.73 out of 30 based on the consultants' judgments as the reference standard. The categorization of LRs, the score of each category, and the calculated score for each answer are summarized in Table 2.

Table 2 Calculation of the score of a hypothetical test-taker


In summary, we observed that in a considerable proportion of the questions, the consultants' judgments regarding the value of the findings were significantly different from the corresponding evidence-based LRs; the differences included discrepant magnitude (over/underestimation) and also discrepant direction. Moreover, the value of the physical examination findings was more frequently overestimated and less frequently underestimated. This is possibly due to a popular attitude that the objective clinical findings are more valuable than paraclinical findings in the diagnosis process [23]. Furthermore, we showed that if a hypothetical test-taker had answered the test based on evidence-based LRs and his/her answers were evaluated using the consultants' judgments as a reference standard, the test-taker would get approximately two-thirds of the total score.

Previous studies have investigated aspects of SCT such as comparing the answer keys obtained from panels with different levels of expertise [24], optimizing the answer keys [25], improving the development of the scoring key [26], investigating the effect of variability within the reference panel [27], and validating the test in different clinical fields [1]-[14]. However, to our knowledge, no study had challenged the reference standard of SCT by evidence before our study. Clinical decision-making is a complex process influenced by both clinical knowledge and experience. As physicians collect experience by practicing medicine, their knowledge may be outdated [28],[29]. Therefore, while the judgments of expert physicians benefit the most from valuable experiences, they may suffer from outdated knowledge and also cognitive biases [30]-[32]. A recent review has discussed the potential pitfalls of using SCT as a valid tool to measure clinical reasoning competencies, among which is implicitly discouraging the seeking of empirical evidence for the scoring key since this test assumes no single correct answer for any item [33].


Despite the novel idea and methods, this study had the following limitations: A) Although the transformations in the judgment numbers and the LRs made these two entities comparable, the transformations could have introduced bias in the results. Knowing this limitation, we found no alternative approach to compare the consultant's judgments with evidence-based LRs. B) The optimal number of the clinical scenarios and the questions per each scenario is reported to be 20 and 3, respectively [15]. However, we used six scenarios and five questions per scenario because this test structure needed less time and could better address the time limitations of our consultant participants. C) The results were derived from only two centers in Tehran and therefore they cannot be easily generalized to all settings. D) As this study was carried out in emergency medicine context that has substantial differences with other specialties, our findings cannot be directly extrapolated to other fields of clinical medicine.


SCT is an interesting tool to score the clinical decision-making practices of novice trainees based on the judgments of expert physicians. However, experts' judgments may occasionally be inconsistent with evidence. This should raise concerns regarding the validity of the experts' judgments as a valid reference standard for SCT. We suggest future investigators should explore alternative evidence-based approaches to establish more robust reference standards for clinical reasoning tests such as the script concordance test in the field of emergency medicine.

Authors contributions

SFA conceived of the study, participated in the design of the study, performed the statistical analysis, and drafted the manuscript. SK conceived of the study, participated in the design of the study, and contributed to the data collection, statistical analysis, and draft of the manuscript. KSA contributed to conceive and design of the study and critically reviewed and revised the manuscript. PHM participated in design of the study and critically reviewed and revised the manuscript. GZ contributed to the data collection and draft of the manuscript. PH contributed to the data collection and data analysis. DBB contributed to the data collection and draft of the manuscript; HRB contributed to conceive of the study and critically reviewed and revised the manuscript. SL critically reviewed and revised the manuscript. All authors read and approved the final manuscript.

Authors information

SL is a professor of emergency medicine and public health at the University of California, Irvine's School of Medicine. PHM is an associate professor of emergency medicine and deputy for education at Iran University of Medical Sciences School of Medicine. KSA is a professor of medicine and medical education and head of the Department of Medical Education at Iran University of Medical Sciences. HRB is an associate professor of clinical epidemiology and evidence-based medicine at Iran University of Medical Sciences. The other authors were medical students at the time of conducting this study.



likelihood ratio


script concordance test


visual analog scale.


  1. 1.

    Boulouffe C, Doucet B, Muschart X, Charlin B, Vanpee D: Assessing clinical reasoning using a script concordance test with electrocardiogram in an emergency medicine clerkship rotation. Emerg Med J 2013, 31: 313–316. 10.1136/emermed-2012-201737

    Article  PubMed  Google Scholar 

  2. 2.

    Humbert AJ, Besinger B, Miech EJ: Assessing clinical reasoning skills in scenarios of uncertainty: convergent validity for a script concordance test in an emergency medicine clerkship and residency. Acad Emerg Med 2011, 18: 627–634. 10.1111/j.1553-2712.2011.01084.x

    Article  PubMed  Google Scholar 

  3. 3.

    Carriere B, Gagnon R, Charlin B, Downing S, Bordage G: Assessing clinical reasoning in pediatric emergency medicine: validity evidence for a script concordance test. Ann Emerg Med 2009, 53: 647–652. 10.1016/j.annemergmed.2008.07.024

    Article  PubMed  Google Scholar 

  4. 4.

    Park AJ, Barber MD, Bent AE, Dooley YT, Dancz C, Sutkin G, Jelovsek JE: Assessment of intraoperative judgment during gynecologic surgery using the script concordance test. Am J Obstet Gynecol 2010,203(240):240. e241-246

    PubMed  Google Scholar 

  5. 5.

    Mathieu S, Couderc M, Glace B, Tournadre A, Malochet-Guinamand S, Pereira B, Dubost J-J, Soubrier M: Construction and utilization of a script concordance test as an assessment tool for dcem3 (5th year) medical students in rheumatology. BMC Med Educ 2013, 13: 166. 10.1186/1472-6920-13-166

    PubMed Central  Article  PubMed  Google Scholar 

  6. 6.

    Duggan P, Charlin B: Summative assessment of 5th year medical students' clinical reasoning by script concordance test: requirements and challenges. BMC Med Educ 2012, 12: 29. 10.1186/1472-6920-12-29

    PubMed Central  Article  PubMed  Google Scholar 

  7. 7.

    Nouh T, Boutros M, Gagnon R, Reid S, Leslie K, Pace D, Pitt D, Walker R, Schiller D, MacLean A, Hameed M, Fata P, Charlin B, Meterissian SH: The script concordance test as a measure of clinical reasoning: a national validation study. Am J Surg 2012, 203: 530–534. 10.1016/j.amjsurg.2011.11.006

    Article  PubMed  Google Scholar 

  8. 8.

    Piovezan RD, Custdio O, Cendoroglo MS, Batista NA, Lubarsky S, Charlin B: Assessment of undergraduate clinical reasoning in geriatric medicine: application of a script concordance test. J Am Geriatr Soc 2012, 60: 1946–1950. 10.1111/j.1532-5415.2012.04152.x

    Article  PubMed  Google Scholar 

  9. 9.

    Bursztejn AC, Cuny JF, Adam JL, Sido L, Schmutz JL, de Korwin JD, Latarche C, Braun M, Barbaud A: Usefulness of the script concordance test in dermatology. J Eur Acad Dermatol Venereol 2011, 25: 1471–1475. 10.1111/j.1468-3083.2011.04008.x

    Article  PubMed  Google Scholar 

  10. 10.

    Humbert AJ, Johnson MT, Miech E, Friedberg F, Grackin JA, Seidman PA: Assessment of clinical reasoning: a script concordance test designed for pre-clinical medical students. Med Teach 2011, 33: 472–477. 10.3109/0142159X.2010.531157

    Article  PubMed  Google Scholar 

  11. 11.

    Kania RE, Verillaud B, Tran H, Gagnon R, Kazitani D, Huy PTB, Herman P, Charlin B: Online script concordance test for clinical reasoning assessment in otorhinolaryngology: the association between performance and clinical experience. Arch Otolaryngol Head Neck Surg 2011, 137: 751–755. 10.1001/archoto.2011.106

    Article  PubMed  Google Scholar 

  12. 12.

    Lambert C, Gagnon R, Nguyen D, Charlin B: The script concordance test in radiation oncology: validation study of a new tool to assess clinical reasoning. Radiat Oncol 2009, 4: 7. 10.1186/1748-717X-4-7

    PubMed Central  Article  PubMed  Google Scholar 

  13. 13.

    Lubarsky S, Chalk C, Kazitani D, Gagnon R, Charlin B: The script concordance test: a new tool assessing clinical judgement in neurology. Can J Neurol Sci 2009, 36: 326–331.

    Article  PubMed  Google Scholar 

  14. 14.

    Meterissian SH: A novel method of assessing clinical reasoning in surgical residents. Surg Innov 2006, 13: 115–119. 10.1177/1553350606291042

    Article  PubMed  Google Scholar 

  15. 15.

    Fournier JP, Demeester A, Charlin B: Script concordance tests: guidelines for construction. BMC Med Inform Decis Mak 2008, 8: 18. 10.1186/1472-6947-8-18

    PubMed Central  Article  PubMed  Google Scholar 

  16. 16.

    Bowen JL: Educational strategies to promote clinical diagnostic reasoning. N Engl J Med 2006, 355: 2217–2225. 10.1056/NEJMra054782

    CAS  Article  PubMed  Google Scholar 

  17. 17.

    Oxman AD, Guyatt GH: The science of reviewing research. Ann N Y Acad Sci 1993, 703: 125–133. Discussion 133-124 10.1111/j.1749-6632.1993.tb26342.x

    CAS  Article  PubMed  Google Scholar 

  18. 18.

    Zehtabchi S, Kline JA: The art and science of probabilistic decision-making in emergency medicine. Acad Emerg Med 2010, 17: 521–523. 10.1111/j.1553-2712.2010.00739.x

    Article  PubMed  Google Scholar 

  19. 19.

    Simel DL, Rennie D: Rational clinical examination: Evidence-based clinical diagnosis. McGraw-Hill, Chicago; 2009.

    Google Scholar 

  20. 20.

    van Laerhoven H, van der Zaag-Loonen HJ, Derkx BHF: A comparison of Likert scale and visual analogue scales as response options in children's questionnaires. Acta Paediatr 2004, 93: 830–835. 10.1111/j.1651-2227.2004.tb03026.x

    CAS  Article  PubMed  Google Scholar 

  21. 21.

    Guyatt GH, Townsend M, Berman LB, Keller JL: A comparison of Likert and visual analogue scales for measuring change in function. J Chronic Dis 1987, 40: 1129–1133. 10.1016/0021-9681(87)90080-4

    CAS  Article  PubMed  Google Scholar 

  22. 22.

    Straus SERW, Glasziou P, Haynes RB: Diagnosis and screening. In Evidence-based medicine: how to practice and teach EBM. 3rd edition. Elsevier, London; 2005:67–99.

    Google Scholar 

  23. 23.

    Hampton JR, Harrison MJ, Mitchell JR, Prichard JS, Seymour C: Relative contributions of history-taking, physical examination, and laboratory investigation to diagnosis and management of medical outpatients. Br Med J 1975, 2: 486–489. 10.1136/bmj.2.5969.486

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  24. 24.

    Petrucci AM, Nouh T, Boutros M, Gagnon R, Meterissian SH: Assessing clinical judgment using the script concordance test: the importance of using specialty-specific experts to develop the scoring key. Am J Surg 2013, 205: 137–140. 10.1016/j.amjsurg.2012.09.002

    Article  PubMed  Google Scholar 

  25. 25.

    Gagnon R, Lubarsky S, Lambert C, Charlin B: Optimization of answer keys for script concordance testing: should we exclude deviant panelists, deviant responses, or neither? Adv Health Sci Educ Theory Pract 2011, 16: 601–608. 10.1007/s10459-011-9279-2

    Article  PubMed  Google Scholar 

  26. 26.

    Charlin B, Gagnon R, Lubarsky S, Lambert C, Meterissian S, Chalk C, Goudreau J, van der Vleuten C: Assessment in the context of uncertainty using the script concordance test: more meaning for scores. Teach Learn Med 2010, 22: 180–186. 10.1080/10401334.2010.488197

    Article  PubMed  Google Scholar 

  27. 27.

    Charlin B, Gagnon R, Pelletier J, Coletti M, Abi-Rizk G, Nasr C, Sauve E, van der Vleuten C: Assessment of clinical reasoning in the context of uncertainty: the effect of variability within the reference panel. Med Educ 2006, 40: 848–854. 10.1111/j.1365-2929.2006.02541.x

    Article  PubMed  Google Scholar 

  28. 28.

    Straus SE, Glasziou P, Richardson WS, Haynes RB: Introduction. In Evidence-based medicine: How to practice and teach it. 4th edition. Churchill Livingstone, Edinburgh; 2010:1–12.

    Google Scholar 

  29. 29.

    Ramos K, Linscheid R, Schafer S: Real-time information-seeking behavior of residency physicians. Fam Med 2003, 35: 257–260.

    PubMed  Google Scholar 

  30. 30.

    Graber M, Gordon R, Franklin N: Reducing diagnostic errors in medicine: what's the goal? Acad Med 2002, 77: 981–992. 10.1097/00001888-200210000-00009

    Article  PubMed  Google Scholar 

  31. 31.

    Norman GR, Eva KW: Diagnostic error and clinical reasoning. Med Educ 2010, 44: 94–100. 10.1111/j.1365-2923.2009.03507.x

    Article  PubMed  Google Scholar 

  32. 32.

    Nendaz M, Perrier A: Diagnostic errors and flaws in clinical reasoning: mechanisms and prevention in practice. Swiss Med Wkly 2012, 142: w13706.

    PubMed  Google Scholar 

  33. 33.

    Lineberry M, Kreiter CD, Bordage G: Threats to validity in the use and interpretation of script concordance test scores. Med Educ 2013, 47: 1175–1183. 10.1111/medu.12283

    Article  PubMed  Google Scholar 

Download references


We would like to acknowledge Dr. Amir Nejati for his contributions in collecting data for this study.

Sources of funding

This study was the M.D. thesis of SK and was funded by Iran University of Medical Sciences. The authors have not received fund from any other source.

Author information



Corresponding author

Correspondence to Shahram Lotfipour.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ahmadi, S., Khoshkish, S., Soltani-Arabshahi, K. et al. Challenging script concordance test reference standard by evidence: do judgments by emergency medicine consultants agree with likelihood ratios?. Int J Emerg Med 7, 34 (2014).

Download citation


  • Clinical judgment
  • Script concordance test
  • Likelihood ratio
  • Visual analog scales
  • Evidence-based medicine
  • Diagnosis
  • Decision-making