Proceedings Volume 7966

Medical Imaging 2011: Image Perception, Observer Performance, and Technology Assessment

View the digital version of this volume at SPIE Digital Library.

Volume Details

Date Published: 2 March 2011
Contents: 10 Sessions, 60 Papers, 0 Presentations
Conference: SPIE Medical Imaging 2011
Volume Number: 7966

Table of Contents

  • Front Matter: Volume 7966
  • Perception in Screening Exams
  • Human Performance
  • Model Observers
  • ROC and Decision Metrics
  • Keynote and Assessment in Pathology
  • Image Display and Presentation
  • Vision in Medical Imaging
  • Technology Assessment and Impact
  • Poster Session
Front Matter: Volume 7966
Front Matter: Volume 7966
This PDF file contains the front matter associated with SPIE Proceedings Volume 7966, including the Title Page, Copyright information, Table of Contents, Introduction, and the Conference Committee listing.
Perception in Screening Exams
Optimizing viewing procedures of breast tomosynthesis image volumes using eye tracking combined with a free response human observer study
Kristina Lång, Sophia Zackrisson, Kenneth Holmqvist, et al.
The purpose of this study was to evaluate four different viewing procedures as part of improving viewing conditions of breast tomosynthesis (BT) image volumes. The procedures consisted of free scroll volume browsing, and a combination of initial cine loops at three different frame rates (9, 14 and 25 fps) terminated upon request followed by free scroll volume browsing. Fifty-five normal BT image volumes in MLO view were collected. In these, simulated lesions (20 masses and 20 clusters of microcalcifications) were randomly inserted, creating four unique image sets for each procedure. Four readers interpreted the cases in a random order. Their task was to locate a lesion, mark it, and assign a confidence rating on a five-level scale. The diagnostic accuracy was analyzed using Jackknife Free Receiver Operating Characteristics (JAFROC). Time efficiency and visual search behavior were also investigated using eye tracking. The results indicate that there was no statistically significant difference in JAFROC FOM between the different viewing procedures; however, the medium cine loop speed seemed to be the preferred viewing procedure in terms of total analysis time and dwell time.
Assessment of breast density: reader performance using synthetic mammographic images
Janine Makaronidis, Michael Berks, Jamie Sergeant, et al.
The quantity and appearance of dense breast tissue in mammograms is related to the risk of developing breast cancer, the sensitivity of mammographic interpretation, and the likelihood of local recurrence of cancer following surgery. Visual assessment of breast density is widely used, often with readers indicating the percentage of dense tissue in a mammogram. Although real mammograms can be used to investigate intra- and inter-observer variability, ground truth is difficult to ascertain, so to investigate reader accuracy, we created 60 synthetic, mammogram-like images with densities comparable in area to those found in screening. The images contained either a single dense area, multiple or linear densities, or a variable breast size with a single density. The images were randomized and assessed by 9 expert and 6 non-expert readers who marked percentage area of density on a visual analogue scale. Non-expert readers' estimates of percentage area of density were closer to the truth (6-11% mean absolute difference) than the experts' estimates (10-19%). The readers were most accurate when the density formed a single area in the image, and least accurate when the dense area was composed of linear structures. In almost every case, the dense area was overestimated by the expert readers. When experts were ranked according to the degree of overestimation, this broadly reflected their relative performance on real mammograms.
Health professionals' agreement on density judgements and successful abnormality identification within the UK Breast Screening Programme
Higher breast density is associated with a greater chance of developing breast cancer. Additionally, it is well known that higher mammographic breast density is associated with increased difficulty in accurately identifying breast cancer. However, comparatively little is known of the reliability of breast density judgements. All UK breast screeners (primarily radiologists and technologists) annually participate in the PERFORMS self-assessment scheme where they make several judgements about series of challenging recent screening cases of known outcomes. As part of this process, for each case, they provide a radiological assessment of the likelihood of cancer on a confidence scale, alongside an assessment of case density using a three point scale. Analysis of the data from two years of the scheme found that the degree of agreement on case density was significantly greater than no agreement (p < .001). However, only a moderate degree of inter-rater reliability was exhibited (κ = .44) with significant differences between the occupational groups. The reasons for differences between the occupational groups and the relationship between agreement on density rating and case reading ability are explored.
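As a rough illustration of the kind of agreement statistic reported above, the following minimal sketch computes Cohen's kappa for two raters on a categorical scale. It is an assumption that a two-rater kappa is a useful stand-in here: the abstract does not say which kappa variant was used for the many PERFORMS participants, and the function name `cohen_kappa` and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same cases.

    Illustrative only: the abstract does not state which kappa
    variant was applied to the multi-rater screening data.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must rate the same non-empty case list")

    # Proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance from each rater's category frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical density ratings on a three-point scale for six cases.
rater_1 = [1, 1, 2, 2, 3, 3]
rater_2 = [1, 1, 2, 3, 3, 3]
kappa = cohen_kappa(rater_1, rater_2)
```

Kappa is 0 when agreement is no better than chance and 1 when agreement is perfect, which is why a value such as .44 is usually read as moderate agreement.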
The time course of cancer detection performance
Sian Taylor-Phillips, Aileen Clarke, Matthew Wallis, et al.
The purpose of this study was to measure how mammography readers' performance varies with time of day and time spent reading. This was investigated in screening practice and when reading an enriched case set. In screening practice, records of the time and date that each case was read, along with the outcome (whether the woman was recalled for further tests, and biopsy results where performed), were extracted from records from one breast screening centre in the UK (4 readers). Patterns of performance with time spent reading were also measured using an enriched test set (160 cases, 41% malignant, read three times by eight radiologists). Recall rates varied with time of day, with different patterns for each reader. Recall rates decreased as the reading session progressed both when reading the enriched test set and in screening practice. Further work is needed to expand this work to a greater number of breast screening centres, and to determine whether these patterns of performance over time can be used to optimize overall performance.
Can horizontally oriented breast tomosynthesis image volumes or the use of a systematic search strategy improve interpretation? An eye tracking and free response human observer study
Kristina Lång, Sophia Zackrisson, Kenneth Holmqvist, et al.
Our aim was to evaluate whether there is a benefit in diagnostic accuracy and efficiency from viewing breast tomosynthesis (BT) image volumes presented in horizontal orientation, and also to evaluate the use of a systematic search strategy where the breast is divided into two sections that are analyzed consecutively. These image presentations were compared to regular vertical image presentation. All methods were investigated using viewing procedures consisting of free scroll volume browsing, and a combination of initial cine loops at three different frame rates (9, 14, 25 fps) terminated upon request followed by free scroll volume browsing if needed. Fifty-five normal BT image volumes in MLO view were collected. In these, simulated lesions (20 masses and 20 clusters of microcalcifications) were randomly inserted, creating four unique image sets for each procedure. Four readers interpreted the cases in a random order. Their task was to locate the lesions, mark them, and assign a confidence rating on a five-level scale. The diagnostic accuracy was analyzed using Jackknife Free Receiver Operating Characteristics (JAFROC). Time efficiency and visual search behavior were also investigated using eye tracking. Results indicate there was no statistically significant difference in JAFROC FOM between the different image presentations, although visual search was more time efficient when viewing horizontally oriented image volumes in medium cine loops.
Human Performance
Modeling error in assessment of mammographic image features for improved computer-aided mammography training: initial experience
In this study we investigate the hypothesis that there exist patterns in erroneous assessment of BI-RADS image features among radiology trainees when performing diagnostic interpretation of mammograms. We also investigate whether these error making patterns can be captured by individual user models. To test our hypothesis we propose a user modeling algorithm that uses the previous readings of a trainee to identify whether certain BI-RADS feature values (e.g. "spiculated" value for "margin" feature) are associated with higher than usual likelihood that the feature will be assessed incorrectly. In our experiments we used readings of 3 radiology residents and 7 breast imaging experts for 33 breast masses for the following BI-RADS features: parenchyma density, mass margin, mass shape and mass density. The expert readings were considered as the gold standard. Rule-based individual user models were developed and tested using the leave-one-out cross-validation scheme. Our experimental evaluation showed that the individual user models are accurate in identifying cases for which errors are more likely to be made. The user models captured regularities in error making for all 3 residents. This finding supports our hypothesis about existence of individual error making patterns in assessment of mammographic image features using the BI-RADS lexicon. Explicit user models identifying the weaknesses of each resident could be of great use when developing and adapting a personalized training plan to meet the resident's individual needs. Such an approach fits well with the framework of adaptive computer-aided educational systems in mammography we have proposed before.
Time of day does not affect radiologists' accuracy in breast lesion detection
Muhammad Al-s'adi, Mark F. McEntee, Elaine Ryan
Mammographic image reporting accuracy among radiologists varies. This study examines whether radiologists' accuracy in detecting breast lesions varies at different times throughout the day. The observers comprised 69 experienced breast radiologists who reviewed 50 mammograms, consisting of 4 images each, of which 15 cases were abnormal. All the observers were grouped and assigned a specific hour, starting at 7:00am and finishing at 8:00pm. They were asked to detect the lesion if present and mark their confidence rating (1-5) in a provided booklet. Demographic details were recorded including age, experience and average number of mammographic readings undertaken per year. Radiologists' performance was measured and compared in terms of sensitivity, specificity and receiver operating characteristic (ROC) scores. A Kruskal-Wallis test with Dunn's post-hoc test was performed. Mean ROC scores demonstrated no significant differences (p≥0.46) between groups performing at different times of the day. Also, no significant differences were noted for sensitivity (p≥0.78) or specificity (p≥0.99) when groups were compared with each other. The findings from the study suggest that although radiologists' performance varies slightly throughout the day, the exact time of day has no significant effect on radiologists' detection accuracy. Further studies are required to investigate this effect.
Extended analysis of the effect of learning with feedback on the detectability of pulmonary nodules in chest tomosynthesis
Sara Asplund, Åse A. Johnsson, Jenny Vikgren, et al.
In chest tomosynthesis, low-dose projections collected over a limited angular range are used for reconstruction of section images of the chest, resulting in a reduction of disturbing anatomy at a moderate increase in radiation dose compared to chest radiography. In a previous study, we investigated the effects of learning with feedback on the detection of pulmonary nodules in chest tomosynthesis. Six observers with varying degrees of experience of chest tomosynthesis analyzed tomosynthesis cases for presence of pulmonary nodules. The cases were analyzed before and after learning with feedback. Multidetector computed tomography (MDCT) was used as reference. The differences in performance between the two readings were calculated using the jackknife alternative free-response receiver operating characteristics (JAFROC-2) as primary measure of detectability. Significant differences between the readings were found only for observers inexperienced in chest tomosynthesis. The purpose of the present study was to extend the statistical analysis of the results of the previous study, including JAFROC-1 analysis and FROC curves in the analysis. The results are consistent with the results of the previous study and, furthermore, JAFROC-1 gave lower p-values than JAFROC-2 for the observers who improved their performance after learning with feedback.
Classification of radiological errors in chest radiographs, using support vector machine on the spatial frequency features of false-negative and false-positive regions
Aim: To optimize automated classification of radiological errors during lung nodule detection from chest radiographs (CxR) using a support vector machine (SVM) run on the spatial frequency features extracted from the local background of selected regions. Background: The majority of the unreported pulmonary nodules are visually detected but not recognized, as shown by the prolonged dwell time values at false-negative regions. Similarly, overestimated nodule locations capture substantial amounts of foveal attention. Spatial frequency properties of selected local backgrounds are correlated with human observer responses either in terms of accuracy in indicating abnormality position or in the precision of visual sampling of the medical images. Methods: Seven radiologists participated in the eye tracking experiments conducted under conditions of pulmonary nodule detection from a set of 20 postero-anterior CxR. The most dwelled locations were identified and subjected to spatial frequency (SF) analysis. The image-based features of selected ROI were extracted with un-decimated Wavelet Packet Transform. An analysis of variance was run to select SF features and an SVM schema was implemented to classify False-Negative and False-Positive from all ROI. Results: A relatively high overall accuracy was obtained for each individually developed Wavelet-SVM algorithm, with over 90% average correct ratio for error recognition from all prolonged dwell locations. Conclusion: The preliminary results show that combined eye-tracking and image-based features can be used for automated detection of radiological error with SVM. The work is still in progress and not all analytical procedures have been completed, which might have an effect on the specificity of the algorithm.
A novel platform to simplify human observer performance experiments in clinical reading environments
J. Jacobs, F. Zanca, H. Bosmans
Human observer performance experiments (HOPE) are frequently carried out in controlled environments in order to maximize the influence of the performance parameter under study. As an example, the number of ambient reading variables can be kept as low as possible during HOPE. This contrasts with the dynamic nature of a clinical reading environment, which may therefore be suboptimal for the majority of the experiments. The aim of the current work was to extend our previously developed software platform Sara² to cope with the influences of the reading environment on HOPE. Generic modules for ROC, LROC, FROC, MAFC and visual grading analysis/image quality criteria (VGA/IQC) experiments were developed for 2D and 3D input images. Additional modules were included in the platform for finding unexpected interruptions due to clinical emergencies by means of idle time and for mouse trajectory monitoring. A generic approach towards the inclusion of reading questionnaires and an RFID-enabled secured login system were also added. Next, we created a sensor network consisting of off-the-shelf components which continuously monitor ambient reading conditions such as temperature, ambient lighting, humidity, ambient noise levels and observer reading distance. These measured parameters can be synchronized with the reading findings. Finally we included a link to incorporate the use of specialized 3rd party PACS viewers in our software framework. Using the proposed software and hardware solution, we could simplify the setup and performance of HOPE in clinical reading environments and we can now properly control our reading experiments.
Analysis of physiological impact while reading stereoscopic radiographs
Yasuko Y. Unno, Takashi Tajima, Takao Kuwabara, et al.
A stereoscopic viewing technology is expected to improve diagnostic performance in terms of reading efficiency by adding one more dimension to the conventional 2D images. Although stereoscopic technology has been applied to many different fields including TV, movies and medical applications, physiological fatigue from reading stereoscopic radiographs has been a concern, and no established physiological fatigue data have been provided. In this study, we measured the α-amylase concentration in saliva, heart rates and normalized tissue hemoglobin index (nTHI) in blood of the frontal area to estimate physiological fatigue from reading both stereoscopic radiographs and the conventional 2D radiographs. In addition, subjective assessments were also performed. As a result, pupil contraction occurred just after the reading of the stereoscopic images, but the subjective assessments regarding visual fatigue were nearly identical for reading the conventional 2D and stereoscopic radiographs. The α-amylase concentration and the nTHI continued to decline while examinees read both 2D and stereoscopic images, which reflected the result of the subjective assessment that almost half of the examinees reported feeling sleepy after reading. The subjective assessments regarding brain fatigue showed that there was little difference between 2D and stereoscopic reading. In summary, this study shows that the physiological fatigue caused by stereoscopic reading is equivalent to that of conventional 2D reading, including ocular fatigue and burden imposed on the brain.
Model Observers
Incorporating holistic visual search concepts into a SPECT myocardial perfusion imaging numerical observer
Previous Single Photon Emission Computed Tomography (SPECT) myocardial perfusion imaging (MPI) research has explored the utility of numerical observers. One previous study proposed that the model of holistic visual search of a myocardial perfusion image by an expert human observer might improve the development of a SPECT MPI numerical observer. Further examination of numerical processing techniques that seem to be analogous to the initial stage of human holistic image search has helped to further refine the numerical observer. The current numerical observer considers some fundamental issues in the refinement of the numerical observer: the need for background estimation, the determination of blobs and the 'search-like' selection of a few blobs for subsequent decision analysis.
Channelized relevance vector machine as a numerical observer for cardiac perfusion defect detection task
Mahdi M. Kalayeh, Thibault Marin, P. Hendrik Pretorius, et al.
In this paper, we present a numerical observer for image quality assessment, aiming to predict human observer accuracy in a cardiac perfusion defect detection task for single-photon emission computed tomography (SPECT). In medical imaging, image quality should be assessed by evaluating the human observer accuracy for a specific diagnostic task. This approach is known as task-based assessment. Such evaluations are important for optimizing and testing imaging devices and algorithms. Unfortunately, human observer studies with expert readers are costly and time-demanding. To address this problem, numerical observers have been developed as a surrogate for human readers to predict human diagnostic performance. The channelized Hotelling observer (CHO) with internal noise model has been found to predict human performance well in some situations, but does not always generalize well to unseen data. We have argued in the past that finding a model to predict human observers could be viewed as a machine learning problem. Following this approach, in this paper we propose a channelized relevance vector machine (CRVM) to predict human diagnostic scores in a detection task. We have previously used channelized support vector machines (CSVM) to predict human scores and have shown that this approach offers better and more robust predictions than the classical CHO method. The comparison of the proposed CRVM with our previously introduced CSVM method suggests that CRVM can achieve similar generalization accuracy, while dramatically reducing model complexity and computation time.
Development of model observers applied to 3D breast tomosynthesis microcalcifications and masses
Ivan Diaz, Pontus Timberg, Sheng Zhang, et al.
The development of model observers for mimicking human detection strategies has followed from symmetric signals in simple noise to increasingly complex backgrounds. In this study we implement different model observers for the complex task of detecting a signal in a 3D image stack. The backgrounds come from real breast tomosynthesis acquisitions and the signals were simulated and reconstructed within the volume. Two different tasks relevant to the early detection of breast cancer were considered: detecting an 8 mm mass and detecting a cluster of microcalcifications. The model observers were calculated using a channelized Hotelling observer (CHO) with dense difference-of-Gaussian channels, and a modified (Partial prewhitening [PPW]) observer which was adapted to realistic signals which are not circularly symmetric. The sustained temporal sensitivity function was used to filter the images before applying the spatial templates. For a frame rate of five frames per second, the only CHO that we calculated performed worse than the humans in a 4-AFC experiment. The other observers were variations of PPW and outperformed human observers in every single case. This initial frame rate was a rather low speed and the temporal filtering did not affect the results compared to a data set with no human temporal effects taken into account. We subsequently investigated two higher speeds, 15 and 30 frames per second. We observed that for large masses, the two types of model observers investigated outperformed the human observers and would be suitable with the appropriate addition of internal noise. However, for microcalcifications only the PPW observer consistently outperformed the humans. The study demonstrated the possibility of using a model observer which takes into account the temporal effects of scrolling through an image stack while being able to effectively detect a range of mass sizes and distributions.
Numerical observer for cardiac motion assessment using machine learning
Thibault Marin, Mahdi M. Kalayeh, P. Hendrik Pretorius, et al.
In medical imaging, image quality is commonly assessed by measuring the performance of a human observer performing a specific diagnostic task. However, in practice studies involving human observers are time consuming and difficult to implement. Therefore, numerical observers have been developed, aiming to predict human diagnostic performance to facilitate image quality assessment. In this paper, we present a numerical observer for assessment of cardiac motion in cardiac-gated SPECT images. Cardiac-gated SPECT is a nuclear medicine modality used routinely in the evaluation of coronary artery disease. Numerical observers have been developed for image quality assessment via analysis of detectability of myocardial perfusion defects (e.g., the channelized Hotelling observer), but no numerical observer for cardiac motion assessment has been reported. In this work, we present a method to design a numerical observer aiming to predict human performance in detection of cardiac motion defects. Cardiac motion is estimated from reconstructed gated images using a deformable mesh model. Motion features are then extracted from the estimated motion field and used to train a support vector machine regression model predicting human scores (human observers' confidence in the presence of the defect). Results show that the proposed method could accurately predict human detection performance and achieve good generalization properties when tested on data with different levels of post-reconstruction filtering.
Accounting for anatomical noise in SPECT with a visual-search human-model observer
H. C. Gifford, M. A. King, M. S. Smyczynski
Reliable human-model observers for clinically realistic detection studies are of considerable interest in medical imaging research, but current model observers require frequent revalidation with human data. A visual-search (VS) observer framework may improve reliability by better simulating realistic detection-localization tasks. Under this framework, model observers execute a holistic search to identify tumor-like candidates and then perform careful analysis of these candidates. With emission tomography, anatomical noise in the form of elevated uptake in neighboring tissue often complicates the task. Some scanning model observers simulate the human ability to read around such noise by presubtracting the mean normal background from the test image, but this background-known-exactly (BKE) assumption has several drawbacks. The extent to which the VS observer can overcome these drawbacks was investigated by comparing it against humans and a scanning observer for detection of solitary pulmonary nodules in a simulated SPECT lung study. Our results indicate that the VS observer offers a robust alternative to the scanning observer for modeling humans.
ROC and Decision Metrics
Support of the decision variable densities of the three-class ideal observer for bivariate trinormal data
Despite theoretical and practical difficulties, we are attempting to extend receiver operating characteristic (ROC) analysis to tasks with more than two classes. Previously we investigated a univariate trinormal model for the underlying data of a three-class ideal observer. Although analytically tractable, this is less realistic than a multivariate data model. We have developed expressions for the region of support of the decision variable probability density functions for bivariate trinormal underlying data, given certain constraints on the underlying data covariance matrices. We hope these results will aid in developing computational methods for evaluating observer performance under such a model.
Agreement between two versions of a CADx system: a simulation study
A simulation study was conducted to investigate the agreement between original and updated versions of a computer-aided diagnosis (CADx) system. Performances of two versions of a CADx system are traditionally compared using metrics derived from the receiver operating characteristic (ROC) curve. These aggregate standalone performance measures may reveal the overall improvement of the CADx system due to the update, but do not provide information about the specific change in CADx output for individual cases. To address this issue, we used the concordance measure, which compares the ordering of scores for pairs of cases between system versions (i.e., before and after the update of the system). In this preliminary study, the system update that we investigated was an enlargement of the training data set, which is often encountered in the development of a subsequent CADx system version for improving performance. We separately studied the effect of the size of the original training set, the number of features, and the distribution and separation of the two classes in the feature space on the concordance and AUC measures. When the effect of an update was compared among datasets with differences in intrinsic class separation, concordance was in general larger when the intrinsic class separation was larger. The amount of change in AUC between the original and updated CADx system did not always predict the degree of agreement between the two system versions. A large improvement in AUC could be accompanied by either a larger or smaller agreement between the original and updated systems. Quantification of the degree of agreement in standalone performance between different versions of a CADx system may serve to define a major algorithm update, and better depict the impact of that update.
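The idea of comparing the ordering of scores for pairs of cases between two system versions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' definition: the abstract does not say how ties are handled or which case pairs are included, so this sketch uses all pairs and skips any pair tied in either version. The function name `concordance` and the example scores are hypothetical.

```python
from itertools import combinations


def concordance(scores_old, scores_new):
    """Fraction of case pairs ordered the same way by both versions.

    Assumption of this sketch: pairs tied in either version are skipped.
    """
    if len(scores_old) != len(scores_new):
        raise ValueError("both versions must score the same cases")

    same_order = 0
    usable_pairs = 0
    for i, j in combinations(range(len(scores_old)), 2):
        diff_old = scores_old[i] - scores_old[j]
        diff_new = scores_new[i] - scores_new[j]
        if diff_old == 0 or diff_new == 0:
            continue  # tie in at least one version, ordering undefined
        usable_pairs += 1
        if (diff_old > 0) == (diff_new > 0):
            same_order += 1

    return same_order / usable_pairs if usable_pairs else float("nan")


# Hypothetical malignancy scores for three cases before and after an update.
before = [0.1, 0.5, 0.9]
after = [0.2, 0.8, 0.6]
agreement = concordance(before, after)  # two of the three pairs keep their order
```

Because the measure depends only on how individual cases are ranked, two versions can have similar AUC values and still show low concordance, which is the distinction the abstract draws.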
Reader characteristics linked to detection of pulmonary nodules on radiographs: ROC vs. JAFROC analyses of performance
Akshay Kohli, John W. Robinson, John Ryan, et al.
The purpose of this study is to explore whether reader characteristics are linked to heightened levels of diagnostic performance in chest radiology using receiver operating characteristic (ROC) and jackknife free response ROC (JAFROC) methodologies. A set of 40 postero-anterior chest radiographs was developed, of which 20 were abnormal containing one or more simulated nodules, of varying subtlety. Images were independently reviewed by 12 board-certified radiologists including six chest specialists. The observer performance was measured in terms of ROC and JAFROC scores. For the ROC analysis, readers were asked to rate their degree of suspicion for the presence of nodules by using a confidence rating scale (1-6). JAFROC analysis required the readers to locate and rate as many suspicious areas as they wished using the same scale and resultant data were used to generate Az and FOM scores for ROC and JAFROC analyses respectively. Using Pearson methods, scores of performance were correlated with 7 reader characteristics recorded using a questionnaire. JAFROC analysis showed that improved reader performance was significantly (p≤0.05) linked with chest specialty (p<0.03), hours per week reading chest radiographs (p<0.03) and chest readings per year (p<0.04). ROC analyses demonstrated only one significant relationship, hours per week reading chest radiographs (p<0.02). The results of this study have shown that radiologists' performance in the detection of pulmonary nodules on radiographs is significantly linked to chest specialty, hours reading per week and number of radiographs read per year. Also, JAFROC is a more powerful predictor of performance as compared to ROC.
Estimating the parameters of a model of visual search from ROC data: an alternate method for fitting proper ROC curves
The binormal receiver operating characteristic (ROC) model often predicts an unphysical "hook" near the upper-right corner (1,1) of the ROC plot. Several models for fitting proper ROC curves avoid this problem. The purpose of this work is to describe another method that involves a model of visual search that models free-response data, and to compare the search-model predicted ROC curves with those predicted by PROPROC (proper ROC) software. The highest rating rule was used to infer ROC data from FROC data. An expression for the search-model ROC likelihood function is derived, and maximizing it yielded estimates of the parameters and the fitted ROC curve. The method was applied to a dual-modality 5-reader FROC data set. The relative difference between the average AUCs for the two methods was less than 1%. A linear regression of the AUCs yielded an adjusted R-squared of 0.95 indicative of strong linear correlation between the search model AUC and PROPROC AUC, although the shapes of the predicted ROC curves were qualitatively different. This study shows the feasibility of estimating parameters characterizing visual search from data acquired in a non-search paradigm.
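The highest rating rule mentioned above can be sketched in a few lines: each case receives a single ROC rating equal to the highest rating among its FROC marks. This is a minimal sketch with assumptions labeled: the default rating of 0 for unmarked cases, the function names, and the example data are hypothetical, and the empirical (Wilcoxon) AUC shown here is a swapped-in simple estimate, not the search-model maximum-likelihood fit or the PROPROC fit described in the abstract.

```python
def highest_rating(marks_per_case, unmarked_rating=0):
    """Collapse FROC mark ratings to one ROC rating per case.

    marks_per_case: list of lists, one list of mark ratings per case.
    unmarked_rating: assumed rating for cases with no marks; it should be
    lower than any rating a reader can assign to a mark.
    """
    return [max(marks) if marks else unmarked_rating for marks in marks_per_case]


def empirical_auc(normal_ratings, abnormal_ratings):
    """Wilcoxon estimate of the area under the ROC curve (ties count half)."""
    if not normal_ratings or not abnormal_ratings:
        raise ValueError("need at least one normal and one abnormal case")
    wins = 0.0
    for x in normal_ratings:
        for y in abnormal_ratings:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(normal_ratings) * len(abnormal_ratings))


# Hypothetical FROC marks (ratings 1-5) for two normal and two abnormal cases.
normal_cases = [[], [1]]
abnormal_cases = [[2, 4], [1]]
normal_roc = highest_rating(normal_cases)      # one rating per normal case
abnormal_roc = highest_rating(abnormal_cases)  # one rating per abnormal case
auc = empirical_auc(normal_roc, abnormal_roc)
```

The inferred ratings discard localization information, which is why ROC data obtained this way can be analyzed with tools designed for a non-search paradigm.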
Characterizing and optimizing rater performance for internet-based collaborative labeling
Labeling structures on medical images is crucial in determining clinically relevant correlations with morphometric and volumetric features. For the exploration of new structures and new imaging modalities, validated automated methods do not yet exist, and so researchers must rely on manually drawn landmarks. Voxel-by-voxel labeling can be extremely resource intensive, so large-scale studies are problematic. Recently, statistical approaches and software have been proposed to enable Internet-based collaborative labeling of medical images. While numerous labeling software tools have been created, the use of these packages as high-throughput labeling systems has yet to become entirely viable given training requirements. Herein, we explore two modifications to a typical mouse-based labeling system: (1) a platform independent overlay for recognition of mouse gestures and (2) an inexpensive touch-screen tracking device for non-mouse input. Through this study we characterize rater reliability in point, line, curve, and region placement. For the mouse input, we find a placement accuracy of 2.48±5.29 pixels (point), 0.630±1.81 pixels (curve), 1.234±6.99 pixels (line), and 0.058±0.027 (1 - Jaccard Index for region). The gesture software increased labeling speed by 27% overall and accuracy by approximately 30-50% on point and line tracing tasks, but the touch screen module led to slower and more error-prone labeling on all tasks, likely due to relatively poor sensitivity. In summary, the mouse gesture integration layer runs as a seamless operating system overlay and could potentially benefit any labeling software; yet, the inexpensive touch screen system requires improved usability optimization and calibration before it can provide an efficient labeling system.
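The region metric quoted in this abstract, 1 - Jaccard Index, can be computed as in the sketch below. Representing each region as a set of pixel coordinates is an assumption made for illustration; the abstract does not describe the authors' data structures.

```python
def jaccard_distance(region_a, region_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two regions given as pixel coordinates.

    0.0 means the two labeled regions coincide exactly; 1.0 means they
    share no pixels. Two empty regions are treated as identical.
    """
    a, b = set(region_a), set(region_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical 2x2 regions offset by one column, for illustration only.
rater = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (0, 2), (1, 1), (1, 2)}
distance = jaccard_distance(rater, truth)
```

Here the regions share 2 of 6 distinct pixels, so the distance is about 0.67; the reported mouse-input value of 0.058 corresponds to a much closer overlap.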
Keynote and Assessment in Pathology
Changes in visual search patterns of pathology residents as they gain experience
Elizabeth A. Krupinski, Ronald S. Weinstein
The goal of this study was to examine and characterize changes in the ways that pathology residents examine digital or "virtual" slides as they gain more experience. A series of 20 digitized breast biopsy virtual slides (half benign and half malignant) were shown to 6 pathology residents at three points in time - at the beginning of their first year of residency, at the beginning of the second year, and at the beginning of the third year. Their task was to examine each image and select three areas that they would most want to zoom on in order to view the diagnostic detail at higher resolution. Eye position was recorded as they scanned each image. The data indicate that with each successive year of experience, the residents' search patterns do change. Overall it takes significantly less time to view an individual slide and decide where to zoom, significantly fewer fixations are generated overall, and there is less examination of non-diagnostic areas. Essentially, the residents' search becomes much more efficient and after only one year closely resembles that of an expert pathologist. These findings are similar to those in radiology, and support the theory that an important aspect of the development of expertise is improved pattern recognition (taking in more information during the initial Gestalt or gist view) as well as improved allocation of attention and visual processing resources.
Characterizing virtual slide exploration through the use of 'search maps'
Claudia R. Mello-Thoms, Carlos A. Mello, Olga Medvedeva, et al.
Currently very little is known about the process by which pathologists arrive at a diagnosis on a case. This process is an integration of the pathologist's slide exploration strategy, perceptual information gathering and cognitive decision making. We have developed a methodology to statically represent the pathologists' dynamic visual search of digital slides by creating a representation of visual sampling called 'search maps'. In these maps slide exploration is divided into three parts, according to the magnification range used. In other words, areas explored at low magnification (<=4x), medium magnification (>4x-10x) and high magnification (>10x-20x) are represented separately. Moreover, representation using the 'search maps' allows for quantitative analysis and pairwise comparison of slide exploration strategy. In this paper we have compared the search maps of experienced pathologists and those of Pathology residents. Our goal was to understand how search differs between the experts and the trainees.
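The three magnification ranges stated in this abstract can be expressed as a small binning routine. This is a sketch under stated assumptions: the function names and the `(x, y, magnification)` viewport tuples are hypothetical, and the actual search-map construction in the paper is not described in enough detail to reproduce.

```python
def magnification_band(magnification):
    """Map a magnification to the band named in the abstract.

    low: <=4x, medium: >4x-10x, high: >10x-20x.
    """
    if magnification <= 0:
        raise ValueError("magnification must be positive")
    if magnification <= 4:
        return "low"
    if magnification <= 10:
        return "medium"
    if magnification <= 20:
        return "high"
    raise ValueError("magnification above 20x is outside the stated ranges")


def split_by_band(viewports):
    """Group viewed positions by band so each band can be mapped separately."""
    bands = {"low": [], "medium": [], "high": []}
    for x, y, magnification in viewports:
        bands[magnification_band(magnification)].append((x, y))
    return bands


# Hypothetical viewing log, for illustration only.
log = [(10, 12, 2), (40, 8, 4), (41, 9, 10), (42, 9, 20)]
maps = split_by_band(log)
```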
Image Display and Presentation
Validation of a new digital breast tomosynthesis medical display
Cédric Marchessoux, Nicolas Vivien, Asli Kumcu, et al.
The main objective of this study is to evaluate and validate the new Barco medical display MDMG-5221 which has been optimized for the Digital Breast Tomosynthesis (DBT) imaging modality system, and to prove the benefit of the new DBT display in terms of image quality and clinical performance. The clinical performance is evaluated by the detection of micro-calcifications inserted in reconstructed Digital Breast Tomosynthesis slices. The slices are shown in dynamic cine loops, at two frame rates. The statistical analysis chosen for this study is the Receiver Operating Characteristic Multiple-Reader, Multiple-Case methodology, in order to measure the clinical performance of the two displays. Four experienced radiologists are involved in this study. For this clinical study, 50 normal and 50 abnormal independent datasets were used. The result is that the new display outperforms the mammography display for a signal detection task using real DBT images viewed at 25 and 50 slices per second. In the case of 50 slices per second, the p-value = 0.0664. For a cut-off where alpha=0.05, the conclusion is that the null hypothesis cannot be rejected; however, the trend is that the new display performs 6% better than the old display in terms of AUC. At 25 slices per second, the difference between the two displays is very apparent. The new display outperforms the mammography display by 10% in terms of AUC, with a good statistical significance of p=0.0415.
Is image manipulation necessary to interpret digital mammographic images efficiently?
Yan Chen, Alastair Gale, Anne Turnbull, et al.
With the introduction of digital breast screening across the UK, screeners need to learn how best to inspect these images. A key advantage over mammographic film is the facility to use workstation image manipulation tools. Forty two-view FFDM screening cases, representing malignant, normal and benign appearances, were examined by fourteen radiologists and advanced practitioners from two UK screening centres. For half the cases, the mammography workstation image manipulation tools could be employed and for the other half these were not used. Participants classified each case and indicated whether an abnormality was present. Throughout the study the participants' visual search behaviour as well as their image manipulations were recorded. Whether or not image manipulation tools were used made very little difference to overall performance (t-test, p>.05) as confirmed by JAFROC analysis Figure-Of-Merit values of 0.816 and 0.838 (with and without tools respectively); performance not using tools was better. However, using tools significantly increased inspection time (p<0.5) as well as participants' confidence. Detailed examination of participants' image inspection behaviour revealed that the average time on each case in the different viewing conditions differed significantly between the high experienced readers and low experienced readers. The visual data analysis revealed that the participants made a similar overall pattern of errors on both modalities. The visual search behaviour on both modalities is surprisingly similar.
Performance evaluation of medical LCD displays using 3D channelized Hotelling observers
High performance of the radiologists in the task of image lesion detection is crucial for successful medical practice. One relevant factor in clinical image reading is the quality of the medical display. With the current trends of stack-mode liquid crystal displays (LCDs), the slow temporal response of the display plays a significant role in image quality assurance. In this paper, we report on the experimental study performed to evaluate the quality of a novel LCD with advanced temporal response compensation, and compare it to an existing state-of-the-art display of the same category but with no temporal response compensation. The data in the study comprise clinical digital tomosynthesis images of the breast with added simulated mass lesions. The detectability for the two displays is estimated using the recent multi-slice channelized Hotelling observer (msCHO) model which is especially designed for multi-slice image data. Our results suggest that the novel LCD allows higher detectability than the existing one. Moreover, the msCHO results are used to advise on the parameters for the follow up image reading study with real medical doctors as observers. Finally, the main findings of the msCHO study were confirmed by a human reader study (details to be published in a separate paper).
Visual cues do not improve skin lesion ABC(D) grading
Matteo Zanotto, Lucia Ballerini, Ben Aldridge, et al.
In this work evidence is presented supporting the hypothesis that observers tend to evaluate very differently the same properties of given skin-lesion images. Results from previous experiments have been compared to new ones obtained where we gave additional prototypical visual cues to the users during their evaluation trials. Each property (colour, colour uniformity, asymmetry, border regularity, roughness of texture) had to be evaluated on a 0-10 range, with both linguistic descriptors and visual references at each end and in the middle (e.g. light/medium/dark for colour). A set of 22 images covering different clinical diagnoses has been used in the comparison with previous results. Statistical testing showed that only for a few test images the inclusion of the visual anchors reduced the variability of the grading for some of the properties. Despite such reduction, though, the average variance of each property still remains high even after the inclusion of the visual anchors. When considering each property, the average variance significantly changed for the roughness of texture, where the visual references caused an increase in the variability. With these results we can conclude that the variance of the answers observed in the previous experiments was not due to the lack of a standard definition of the extrema of the scale, but rather to a high variability in the way observers perceive and understand skin-lesion images.
The effect of defect cluster size and interpolation on radiographic image quality
For digital X-ray detectors, the need to control factory yield and cost invariably leads to the presence of some defective pixels. Recently, a standard procedure was developed to identify such pixels for industrial applications. However, no quality standards exist in medical or industrial imaging regarding the maximum allowable number and size of detector defects. While the answer may be application specific, the minimum requirement for any defect specification is that the diagnostic quality of the images be maintained. A more stringent criterion is to keep any changes in the images due to defects below the visual threshold. Two highly sensitive image simulation and evaluation methods were employed to specify the fraction of allowable defects as a function of defect cluster size in general radiography. First, the most critical situation of the defect being located in the center of the disease feature was explored using image simulation tools and a previously verified human observer model, incorporating a channelized Hotelling observer. Detectability index d' was obtained as a function of defect cluster size for three different disease features on clinical lung and extremity backgrounds. Second, four concentrations of defects of four different sizes were added to clinical images with subtle disease features and then interpolated. Twenty observers evaluated the images against the original on a single display using a 2-AFC method, which was highly sensitive to small changes in image detail. Based on a 50% just-noticeable difference, the fraction of allowed defects was specified vs. cluster size.
Verification of the QUBYX perfectlum calibration software using a PR-670 spectro radiometer and associated verification facility
Hans Roehrig, Syed F. Hashmi
At the University of Arizona a research project is underway which addresses consistent color and consistent gray-scale reproduction for digital color displays used in medical image interpretation, specifically for Pathology. Now the University of Arizona can enter the field of ICC Profiling and Color Management. Verification of PerfectLum Software was successful. FIT and LUM tests were performed to verify the conformance and the deviation was quantified. The maximum GSDF Error is about 5.968%. With respect to the results, all three objectives were met and the PerfectLum calibrated display conformed to the AAPM TG18 standards.
Vision in Medical Imaging
A study of attentional effects of intensity transforms for mammograms
This paper presents a study of the attentional effects of two types of intensity distribution variations upon observer behaviour when viewing mammograms: equalisation (to a uniform image intensity histogram) and normalisation (to match an industry best practice image intensity histogram). For untrained observers, some consistent attraction of attention towards the strongest intensity regions of the images for the more highly contrasting equalised images as compared with the unprocessed images was detected. For the normalised images, this effect was even more marked. For a trained observer, no substantial disruption of attentional patterns during viewing was detected for equalised images, but was for normalised images. The nature and extent of the changes in the attentional behaviour for both untrained and trained observers indicates potential value in further studies and emphasizes the need to conduct clinically related studies with trained observers.
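Equalisation to a uniform intensity histogram, the first of the two transforms studied in this abstract, is commonly implemented by remapping grey levels through the cumulative histogram. The sketch below uses that textbook approach on a nested-list image; it is an assumption that the authors used a comparable method, and the function name and the tiny 4-level image are hypothetical.

```python
def equalise(image, levels=256):
    """Histogram-equalise an integer image given as a list of rows.

    Grey levels are remapped through the cumulative histogram so that the
    output histogram is as close to uniform as the data allow.
    """
    flat = [pixel for row in image for pixel in row]
    hist = [0] * levels
    for pixel in flat:
        hist[pixel] += 1

    # Cumulative count of pixels at or below each grey level.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(flat)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]

    lut = [
        round((c - cdf_min) / (n - cdf_min) * (levels - 1)) if c > 0 else 0
        for c in cdf
    ]
    return [[lut[pixel] for pixel in row] for row in image]


# Hypothetical low-contrast 4-level image, for illustration only.
stretched = equalise([[1, 1], [1, 2]], levels=4)
```

In this toy case the two occupied grey levels (1 and 2) are pushed to the ends of the range (0 and 3), which is the kind of contrast increase the abstract associates with attention being drawn to the brightest regions.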
The impact of clinical indications on visual search behaviour in skeletal radiographs
A. Rutledge, M. F. McEntee, L. Rainford, et al.
The hazards associated with ionizing radiation have been documented in the literature and therefore justifying the need for X-ray examinations has come to the forefront of the radiation safety debate in recent years [1]. International legislation states that the referrer is responsible for the provision of sufficient clinical information to enable the justification of the medical exposure. Clinical indications are a set of systematically developed statements to assist in accurate diagnosis and appropriate patient management [2]. In this study, the impact of clinical indications upon fracture detection for musculoskeletal radiographs is analyzed. A group of radiographers (n=6) interpreted musculoskeletal radiology cases (n=33) with and without clinical indications. Radiographic images were selected to represent common trauma presentations of extremities and pelvis. Detection of the fracture was measured using ROC methodology. An eye-tracking device was employed to record radiographers' search behavior by analysing distinct fixation points and search patterns, resulting in a greater level of insight and understanding into the influence of clinical indications on observers' interpretation of radiographs. The influence of clinical information on fracture detection and search patterns was assessed. Findings of this study demonstrate that the inclusion of clinical indications results in impressionable search behavior. Differences in eye tracking parameters were also noted. This study also attempts to uncover fundamental observer search strategies and behavior with and without clinical indications, thus providing a greater understanding and insight into the image interpretation process. Results of this study suggest that availability of adequate clinical data should be emphasized for interpreting trauma radiographs.
Measurement of breast lesion display luminance and overall image display luminance relative to optimum luminance for contrast perception
Mohammad Rawashdeh, Warwick Lee, Patrick Brennan, et al.
Introduction: To minimize fatigue due to eye adaptation and maximize contrast perception, it has been suggested that lesion luminance be matched to overall image luminance to perceive the greatest number of grey level differences. This work examines whether lesion display luminance matches the overall image and breast tissue display luminance and whether these factors are positioned within the optimum luminance for maximal contrast sensitivity. Methods: A set of 42 mammograms, collected from 21 patients and containing 15 malignant and 6 benign lesions, was used to assess overall image luminance. Each image displayed on the monitor was divided into 16 equal regions. The luminance at the midpoint of each region was measured using a calibrated photometer and the overall image luminance was calculated. Average breast tissue display luminance was calculated from the subset of regions containing only breast tissue. Lesion display luminance was compared with both overall image display luminance and average breast tissue display luminance. Results: Statistically significant differences (p<0.0001) were noted between overall image display luminance (4.3±0.7 cd/m²) and lesion display luminance (15.0±6.8 cd/m²); and between average breast tissue display luminance (6.8±1.3 cd/m²) and lesion display luminance (p<0.002). Conclusions: Lesion luminance was significantly higher than the overall image and breast tissue luminance. Luminance of lesions and general breast tissue fell below the optimum luminance range for contrast perception. Breast lesion detection sensitivity and specificity may be enhanced by use of brighter monitor displays.
Motion perception in medical imaging
Francesc Massanes, Jovan G. Brankov
A potential drawback of image noise suppression in medical image sequence processing is a possible loss of the apparent motion: making objects appear to move slower or less than they move in reality. For medical imaging applications this can be of critical importance; for example, myocardium motion in cardiac gated single photon emission computed tomography (SPECT) imaging can differentiate viable muscle from scar tissue. Therefore, in this work we design a set of experiments to measure how human observers perceive apparent motion in the presence of image degradations like noise and blur. In addition we will try to identify relevant image features, based on a visual attention model and a block matching motion estimation method, that would allow development of an accurate numerical observer capable of predicting human observer motion perception.
Characterizing non-Gaussian properties of breast images with a noisy-Laplacian distribution
Craig K. Abbey, Anita Nosratieh, Sheng Zhang, et al.
It is generally well known that the appearance of breast tissue in a mammogram is considerably more complex in a statistical sense than a simple random Gaussian texture, even when the correlation structure of the Gaussian has been set to match the power-law power spectrum of mammograms. However, there has not been a systematic way to characterize the extent of departure from a Gaussian process. We address this topic here by proposing a noisy-Laplacian distribution to model response histograms derived from digital (or digitized) mammograms. We describe the distribution in terms of the probability density function and cumulative distribution function, as well as moments up to fourth order. We also demonstrate the usefulness of the new distribution by fitting it to responses from digital mammography.
Technology Assessment and Impact
Improved implementation of the abnormality manipulation software tools
Collecting clinical cases for medical imaging perception studies is often challenging. We have developed a suite of software tools for manipulating medical tomographic image sets that overcome these difficulties. In our initial development, abnormalities were removed or inserted on a slice-by-slice basis. To circumvent the problem with potential artifacts in orthogonal views, we have redesigned the tools so that they operate in 3 dimensions. An operator controlled ellipsoid mask region is used to select the removal and the replacement areas. This new approach has been validated on PET data sets and has also been implemented for CT studies.
A clinical image preference study comparing digital tomosynthesis with digital radiography for pediatric spinal imaging
Jenna M. King, Idris A. Elbakri, Martin Reed, et al.
The purpose of this study was to evaluate the diagnostic quality of digital tomosynthesis (DT) images for pediatric imaging of the spine. We performed a phantom image rating study to assess the visibility of anatomical spinal structures in DT images relative to digital radiography (DR) and computed tomography (CT). We collected DT and DR images of the cervical, thoracic and lumbar spine using anthropomorphic phantoms. Four pediatric radiologists and two residents rated the visibility of structures on the DT image sets compared to DR using a four point scale (0 = not visible; 1 = visible; 2 = superior to DR; 3 = excellent, CT unnecessary). In general, the structures in the spine received ratings between 1 and 3 (cervical), or 2 and 3 (thoracic, lumbar), with a few mixed scores for structures that are usually difficult to see on diagnostic images, such as vertebrae near the cervical-thoracic joint and the apophyseal joints of the lumbar spine. The DT image sets allow most critical structures to be visualized as well or better than DR. When DR imaging is inconclusive, DT is a valuable tool to consider before sending a pediatric patient for a higher-dose CT exam.
Computer-aided detection as a decision assistant in chest radiography
Maurice R. M. Samulski, Peter R. Snoeren, Bram Platel, et al.
Background. Contrary to what may be expected, finding abnormalities in complex images like pulmonary nodules in chest radiographs is not dominated by time-consuming search strategies but by an almost immediate global interpretation. This was already known in the nineteen-seventies from experiments with briefly flashed chest radiographs. Later on, experiments with eye-trackers showed that abnormalities attracted the attention quite fast but often without further reader actions. Prolonging one's search seldom leads to newly found abnormalities and may even increase the chance of errors. The problem of reading chest radiographs is therefore not dominated by finding the abnormalities, but by interpreting them. Hypothesis. This suggests that readers could benefit from computer-aided detection (CAD) systems not so much by their ability to prompt potential abnormalities, but more from their ability to 'interpret' the potential abnormalities. In this paper, this hypothesis was investigated by an observer experiment. Experiment. In one condition, the traditional CAD condition, the most suspicious CAD locations were shown to the subjects, without telling them the levels of suspiciousness according to CAD. In the other condition, interactive CAD condition, levels of suspiciousness were given, but only when readers requested them at specified locations. These two conditions focus on decreasing search errors and decision errors, respectively. Results of reading without CAD were also recorded. Six subjects, all non-radiologists, read 223 chest radiographs in both conditions. CAD results were obtained from the OnGuard 5.0 system developed by Riverain Medical (Miamisburg, Ohio). Results. The observer data were analyzed by Location Response Operating Characteristic analysis (LROC). It was found that: 1) With the aid of CAD, the performance is significantly better than without CAD; 2) The performance with interactive CAD is significantly better than with traditional CAD at low false positive rates.
Does stereo-endoscopy improve neurosurgical targeting in 3rd ventriculostomy?
Kamyar Abhari, Sandrine de Ribaupierre, Terry Peters, et al.
Endoscopic third ventriculostomy is a minimally invasive surgical technique to treat hydrocephalus, a condition where patients suffer from excessive amounts of cerebrospinal fluid (CSF) in the ventricular system of their brain. This technique involves using a monocular endoscope to locate the third ventricle, where a hole can be made to drain excessive fluid. Since a monocular endoscope provides only a 2D view, it is difficult to make this perforation due to the lack of monocular cues and depth perception. In a previous study, we investigated the use of a stereo-endoscope to allow neurosurgeons to locate and avoid hazardous areas on the surface of the third ventricle. In this paper, we extend our previous study by developing a new methodology to evaluate the targeting performance in piercing the hole in the membrane. We consider the accuracy of this surgical task and derive an index of performance for a task which does not have a well-defined position or width of target. Our performance metric is sensitive and can distinguish between experts and novices. We make use of this metric to demonstrate an objective learning curve on this task for each subject.
An analysis of the impact of tumor amount on the predictive power of a prostate biopsy prognostic assay
Faisal M. Khan, Stephen I. Fogarasi, Douglas Powell, et al.
The Prostate Px prognostic assay offered by Aureon Biosciences is designed to predict progression post primary treatment for prostate cancer patients based on their diagnostic biopsy specimen. The assay is driven by the automated image analysis of a diagnostic prostate needle biopsy (PNB) and incorporates pathologist acquired and digitally masked images which reflect the morphometric (Hematoxylin and Eosin, H&E) and protein expression (immunofluorescence, IF) properties of the PNB. Up to 9 images (3 H&E and 6 IF) from each of 1027 patients, with varying amounts of tumor content were included in the study. We wanted to understand what was the minimal tumor volume required to maintain assay predictive robustness as a result of overall PNB tumor content and assess the impact of pathologist tumor masking variability. 232 patients were selected who had a minimum of 80% tumor volume in a 20x magnification image. In each of the three imaging domains (2 different multiplex (Mplex) IF images and one H&E), the tumor volume was artificially reduced in increments from 80% to 2.5% of the original image area. This simulated decreasing amounts of tumor as well as variations in digital tumor masking. The univariate predictive power of individual imaging domains remained robust down to the 10% tumor level, whereas the total assay was robust through the 20% to 10% tumor level. This work presents one of the first assessments of the variety in tumor amounts on the predictive power of a commercially available prognostic assay that is reliant on multiple bioimaging domains.
Poster Session
Assessment of a CAD scheme in selecting the optimal focused microscopic scanning images of the metaphase chromosomes
Xingwei Wang, Jun Tan, Yuchen Qiu, et al.
Visually searching for analyzable metaphase chromosome cells under microscopes is a routine and time-consuming task in genetic laboratories to diagnose cancer and genetic disorders. To improve detection efficiency, consistency, and accuracy, we developed an automated microscopic image scanning system using a 100X oil immersion objective lens to acquire images that have sufficient spatial resolution to allow clinicians to make a diagnosis. Due to the high resolution, the field of image depth is very limited and multiple scans up to seven layers are required. Thus, a metaphase cell can spread over multiple images at different focal levels. Among them only one or two are adequate for the diagnosis and the others are typically fuzzy images. In this study, we developed and tested a computer-aided detection (CAD) scheme to automatically select one image with the sharpest image quality and discard all of the other fuzzy images based on the computed sharpness index. From three scanned bone marrow specimen slides, the on-line and offline metaphase finding modules automatically selected 100 chromosome cells with 534 images. These images were selected to build a testing dataset. For each cell, the CAD scheme selects one image with the maximum sharpness index. Three observers also independently visually selected one best image for diagnosis from each cell. The agreement rate between CAD and visually selected images ranges from 89% to 96%, which is also very comparable to the agreement rate between the two observers. This experiment demonstrated the feasibility of applying a CAD scheme to select the sharpest high-resolution images of metaphase chromosome cells and potentially improve diagnostic efficiency and accuracy in future clinical practice.
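The selection step in this abstract, keeping the focal layer with the maximum sharpness index, can be sketched as follows. The abstract does not define the sharpness index, so the gradient-energy measure used here is a stand-in assumption, and the function names and toy images are hypothetical.

```python
def sharpness_index(image):
    """Assumed sharpness measure: mean squared difference between
    horizontally and vertically adjacent pixels (higher = sharper)."""
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                total += (image[r][c + 1] - image[r][c]) ** 2
                count += 1
            if r + 1 < rows:
                total += (image[r + 1][c] - image[r][c]) ** 2
                count += 1
    return total / count if count else 0.0


def select_sharpest(focal_stack):
    """Return the index of the image with the maximum sharpness index."""
    return max(range(len(focal_stack)), key=lambda i: sharpness_index(focal_stack[i]))


# Hypothetical focal stack of one cell, for illustration only.
blurry = [[5, 5], [5, 5]]
sharp = [[0, 10], [10, 0]]
soft = [[4, 6], [6, 4]]
best = select_sharpest([blurry, sharp, soft])
```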
Quantitative evaluation of six graph based semi-automatic liver tumor segmentation techniques using multiple sets of reference segmentation
Zihua Su, Xiang Deng, Christophe Chefd'hotel, et al.
Graph based semi-automatic tumor segmentation techniques have demonstrated great potential in efficiently measuring tumor size from CT images. Comprehensive and quantitative validation is essential to ensure the efficacy of graph based tumor segmentation techniques in clinical applications. In this paper, we present a quantitative validation study of six graph based 3D semi-automatic tumor segmentation techniques using multiple sets of expert segmentation. The six segmentation techniques are Random Walk (RW), Watershed based Random Walk (WRW), LazySnapping (LS), GraphCut (GHC), GrabCut (GBC), and GrowCut (GWC) algorithms. The validation was conducted using clinical CT data of 29 liver tumors and four sets of expert segmentation. The performance of the six algorithms was evaluated using accuracy and reproducibility. The accuracy was quantified using Normalized Probabilistic Rand Index (NPRI), which takes into account the variation of multiple expert segmentations. The reproducibility was evaluated by the change of the NPRI from 10 different sets of user initializations. Our results from the accuracy test demonstrated that RW (0.63) showed the highest NPRI value, compared to WRW (0.61), GWC (0.60), GHC (0.58), LS (0.57), GBC (0.27). The results from the reproducibility test indicated that GBC is more sensitive to user initialization than the other five algorithms. Compared to previous tumor segmentation validation studies using one set of reference segmentation, our evaluation methods use multiple sets of expert segmentation to address the inter or intra rater variability issue in ground truth annotation, and provide quantitative assessment for comparing different segmentation algorithms.
Assessing risk of thyroid cancer using resonance-frequency based electrical impedance measurements
Bin Zheng, Mitchell E. Tublin, Dror Lederman, et al.
The incidence of thyroid cancer has risen faster than many malignancies and has nearly doubled in the USA over the past 30 years. Palpable nodules and subclinical nodules detected by imaging are found in a large percentage of the USA population. Most of these (>95%) are fortunately benign. This vast reservoir of nodules makes the detection and diagnosis of thyroid cancer a diagnostic dilemma. Ultrasound guided Fine Needle Aspiration Biopsy (FNAB) is excellent for triaging patients but up to 25% of FNABs are inconclusive. As a result, definitive diagnosis is often only possible with a diagnostic lobectomy; many thousands of these are performed in the USA annually for ultimately benign disease. It would be extremely beneficial if we could develop a non-invasive procedure that could assist the diagnostician in reliably predicting the likelihood of malignancy of otherwise indeterminate thyroid nodules, thereby reducing the number of these "exploratory/diagnostic" lobectomies performed under general anesthesia. Electrical Impedance Spectroscopy (EIS) was considered as a possible approach to address this problem. However, the diagnostic accuracy of EIS is too low for routine clinical use to date. In our group, we developed a substantially modified technology termed Resonance-frequency Electrical Impedance Spectroscopy (REIS), which yields usable information for classifying risk of having breast abnormalities. We preliminarily applied REIS to measure signals on participants having thyroid nodules, aiming to assess whether we can assist in improving diagnosis of indeterminate thyroid nodules. In this study we present a new multi-probe based REIS device specifically designed for the assessment of indeterminate thyroid nodules. Our preliminary assessment presented here demonstrates the feasibility of using this proposed REIS device in a busy tertiary care center.
Evaluation of agreement in corneal thickness measurements obtained using optical coherence tomography and ultrasound technique and determination of its specificity in keratoconus screening
P. Gunvant, R. Darner
The aims of the present study are 1) to evaluate inter- and intra-observer repeatability of optical coherence tomography corneal thickness measurements, 2) to investigate the agreement in corneal thickness obtained using an ultrasound pachymeter and the non-contact high resolution optical coherence tomography, and 3) to evaluate the false positive rate of identifying keratoconic suspects on the basis of standard machine protocol. Measurements were performed on 51 eyes of 51 individuals without any known corneal pathology. Bland-Altman plots were analyzed to determine agreement of corneal thickness measurements obtained using optical coherence tomography and ultrasound pachymeter; linear regression analysis was performed to evaluate their interchangeability. The agreement between the optical coherence tomography and ultrasonic pachymeter measurements was best for the central corneal thickness with a mean bias of 13.4 microns, with optical coherence tomography values being lower than the ultrasound pachymeter. The agreement of measurements in the mid-peripheral cornea was poor, with bias in measurements ranging from 33 to 55 microns. The optical coherence tomography measurements were repeatable with no differences in values between intra- and inter-observer repeat measurements. Using standard machine protocol for keratoconus screening, utilizing 1 out of 4 criteria gave a specificity of 86% and using 2 of the 4 criteria gave a specificity of 98%.
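As an illustration of the agreement analysis described above, the following is a minimal sketch of a Bland-Altman computation (mean bias and 95% limits of agreement). The function name `bland_altman`, the 1.96 multiplier for the limits, and the sample readings are assumptions made for illustration; they are not taken from the study.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the limits are
    bias +/- 1.96 sample standard deviations of those differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired measurements must have equal length")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical central corneal thickness readings in microns (not study data).
oct_um = [520.0, 535.0, 548.0, 510.0, 562.0]
ultrasound_um = [534.0, 547.0, 563.0, 522.0, 575.0]

bias, lower, upper = bland_altman(oct_um, ultrasound_um)
print(f"bias = {bias:.1f} um, limits of agreement = ({lower:.1f}, {upper:.1f}) um")
```

A negative bias in this sketch means the first method reads lower than the second, which is the direction the abstract reports for optical coherence tomography relative to ultrasound pachymetry.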
Fusion of classifiers for REIS-based detection of suspicious breast lesions
Dror Lederman, Xingwei Wang, Bin Zheng, et al.
After developing a multi-probe resonance-frequency electrical impedance spectroscopy (REIS) system aimed at detecting women with breast abnormalities that may indicate a developing breast cancer, we have been conducting a prospective clinical study to explore the feasibility of applying this REIS system to classify younger women (< 50 years old) into two groups of "higher-than-average risk" and "average risk" of having or developing breast cancer. The system comprises one central probe placed in contact with the nipple, and six additional probes uniformly distributed along an outside circle to be placed in contact with six points on the outer breast skin surface. In this preliminary study, we selected an initial set of 174 examinations on participants that have completed REIS examinations and have clinical status verification. Among these, 66 examinations were recommended for biopsy due to findings of a highly suspicious breast lesion ("positives"), and 108 were determined as negative during imaging based procedures ("negatives"). A set of REIS-based features, extracted using a mirror-matched approach, was computed and fed into five machine learning classifiers. A genetic algorithm was used to select an optimal subset of features for each of the five classifiers. Three fusion rules, namely sum rule, weighted sum rule and weighted median rule, were used to combine the results of the classifiers. Performance evaluation was performed using a leave-one-case-out cross-validation method. The results indicated that REIS may provide a new technology to identify younger women with higher than average risk of having or developing breast cancer. Furthermore, it was shown that fusion rules, such as the weighted median and weighted sum rules, may improve performance as compared with the highest performing single classifier.
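The three fusion rules named above can be sketched as follows. This is only an illustrative reading of the rule names: the abstract does not say how the weights were chosen or how the rules were normalized, so the function names, the normalization by count or total weight, and the example scores and weights are all assumptions.

```python
def sum_rule(scores):
    """Sum rule, normalized by the number of classifiers (an assumption)."""
    return sum(scores) / len(scores)


def weighted_sum_rule(scores, weights):
    """Weighted sum rule, normalized by the total weight (an assumption)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def weighted_median_rule(scores, weights):
    """Smallest score at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for score, weight in sorted(zip(scores, weights)):
        running += weight
        if running >= half:
            return score
    raise ValueError("scores and weights must be non-empty")


# Hypothetical outputs of five classifiers for one examination, with
# hypothetical per-classifier weights (for example, validation performance).
scores = [0.2, 0.9, 0.6, 0.7, 0.4]
weights = [1.0, 3.0, 2.0, 2.0, 2.0]

print(sum_rule(scores))
print(weighted_sum_rule(scores, weights))
print(weighted_median_rule(scores, weights))
```

The weighted median is less sensitive to a single outlying classifier score than either sum-based rule, which is one common reason for including it in a fusion comparison.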
A software tool to compare contrast-detail detection in uniform and in real mammographic backgrounds
A software tool is presented to merge CDMAM phantom images with real mammographic backgrounds. It allows SKE tasks in uniform and in real backgrounds. Tasks of this kind can be used to compare human, human visual metric or model observer performance in detail detection using uniform or mammographic backgrounds. As is well known, local characteristics of the structures in real mammographic backgrounds reduce human performance in contrast-detail detection tasks. Consequently, that performance cannot be inferred from data acquired in white noise (flat) backgrounds such as those a CDMAM phantom produces. It is of interest to compare the response of a mammography system to the same set of signals, either embedded in flat or in real backgrounds. This comparison achieves two goals. The first one is to analyze the variation of the recognition threshold of the system for both backgrounds. The second one is to analyze the performance of a human observer or a model observer over the same set of signals, varying the nature of the backgrounds. The software tool presented here uses CDMAM images to merge with a region of interest selected from a real mammogram. This region as well as the image mixing method (basically adding or multiplying pixels) can be freely selected by the user. In this work a set of measurements of 8 images has been analyzed. We can preview the variation of the contrast-detail detection for a human observer and a human visual system metric (R*).
Comparison of the detection rates in reduced image by difference of interpolation method
A. Horii, C. Kataoka, D. Yokoyama, et al.
In soft-copy diagnosis, each pixel of the detector is displayed on the corresponding pixel of a liquid crystal display (LCD). However, when the image is first displayed, the entire image may be reduced. We examined how the image reduction rate on the LCD influences detection performance by means of an observer performance experiment. Moreover, to find the best interpolation method, we investigated several interpolation methods. We made a simulation image similar to a Burger phantom. This image consists of 288 signals, each of a different size and contrast. The matrix size is the same as that of Phase Contrast Mammography (PCM). We blurred the simulation image using an MTF of a geometric blur, and added it to a noise image obtained by uniform exposure with PCM. The image was then reduced using the nearest-neighbor, the bilinear, and the bicubic methods. The reduction rates were calculated as the ratios of the number of pixels of the LCDs to those of PCM. We displayed the reduced images on an LCD and examined the detection performance. A prior physical evaluation showed that both sharpness and granularity worsened in proportion to the reduction rate. The detection performance deteriorated as the reduction rate increased. In the comparison of the interpolation methods, the detection performance of the nearest-neighbor method was worse than those of the other interpolation methods. The bilinear method is the most suitable for the reduction of the image.
Image processing of head CT images using neuro best contrast (NBC) and lesion detection performance
Sameer Tipnis, Diana Vincent, Zoran Rumboldt, et al.
Purpose: The purpose of this study was to objectively compare lesion detection performance of head CT images reconstructed using filtered back projection (FBP) algorithms with those reconstructed using NBC. Method: The observer study was conducted using the 2-AFC methodology. An AFC experiment consists of 128 observer choices and permits the computation of the intensity needed to achieve 92% correct (I92%). High values of I92% correspond to a poor level of detection performance, and vice versa. Head CT images were acquired at an x-ray tube voltage of 120 kVp with a CTDIvol value of 75 mGy in a helical scan. Nine randomly selected normal images from three patients and at three anatomical head locations were reconstructed using filtered back projection (FBP) and neuro-best-contrast (NBC) processing. Circular lesions were generated by projecting spheres onto the image plane, followed by a blurring function, with lesion sizes of 2.8 mm, 6.5 mm and 9.8 mm used in these experiments. Four readers were used, with 18 experiments performed by each observer (2 processing techniques × 3 lesion sizes × 3 repeats). The experimental order of the 18 experiments was randomized to eliminate learning curve and/or observer fatigue. The ratio R of the I92% value for NBC to the corresponding I92% value for FBP was calculated for each observer and each lesion size. Values of R greater than unity indicate that NBC is inferior to FBP, and vice versa. Results: Analysis of data from each observer showed that a total of four data points had R less than unity, and eight data points were greater than unity. Eleven of the twelve individual observer R values were within one standard deviation of unity. When data for the four observers were pooled, the resultant average R values were 0.98 ± 0.38, 0.96 ± 0.33 and 1.15 ± 0.45, for the 2.8 mm, 6.5 mm and 9.8 mm lesions respectively. The overall average R for all three lesion sizes was 1.03 ± 0.67. Conclusion: Our AFC investigation has shown no evidence that use of Neuro Best Contrast to process head CT images improves detection of circular, low contrast lesions less than 10 mm.
The effects of anatomical information and observer expertise on abnormality detection task
L. Zhang, C. Cavaro-Ménard, P. Le Callet, et al.
This paper presents a novel study investigating the influences of Magnetic Resonance (MR) image anatomical information and observer expertise on an abnormality detection task. MRI is exquisitely sensitive for detecting brain abnormalities, particularly in the evaluation of white matter diseases, e.g. multiple sclerosis (MS). For this reason, MS lesions are simulated as the target stimuli for detection in the present study. Two different image backgrounds are used in the following experiments: a) homogeneous region of white matter tissue, and b) one slice of a healthy brain MR image. One expert radiologist (more than 10 years' experience), three radiologists (less than 5 years' experience) and eight naïve observers (without any prior medical knowledge) performed these experiments, during which they were asked different questions depending on their level of experience; the three radiologists and eight naïve observers were asked if they were aware of any hyper-signal, likely to represent an MS lesion, while the most experienced consultant was asked if a clinically significant sign was present. With the percentages of response "yes" displayed on the y-axis and the lesion intensity contrasts on the x-axis, a psychometric function is generated from the observers' responses. Results of psychometric functions and calculated thresholds indicate that radiologists have better hyper-signal detection ability than naïve observers, which is intuitively shown by the lower simple visibility thresholds of radiologists. However, when radiologists perform a task with clinical implications, e.g. to detect a clinically significant sign, their detection thresholds are elevated. Moreover, the study indicates that for the radiologists, the simple visibility thresholds remain the same with and without the anatomical information, which reduces the threshold for the clinically significant sign detection task. Findings provide further insight into human visual system processing for this specific task, and this study provides the foundation for a series of studies investigating numerical observer modeling to be designed, with the ultimate aim of investigating the medical image quality assessment approach by addressing the perspective of radiologist diagnostic performance.
Application of artificial neural network in simulating subjective evaluation of tumor segmentation
Dongjiao Lv, Xiang Deng
Systematic validation of tumor segmentation technique is very important in ensuring the accuracy and reproducibility of tumor segmentation algorithm in clinical applications. In this paper, we present a new method for evaluating 3D tumor segmentation using Artificial Neural Network (ANN) and combined objective metrics. In our evaluation method, a three-layer feed-forwarding backpropagation ANN is first trained to simulate radiologist's subjective rating using a set of objective metrics. The trained neural network is then used to evaluate the tumor segmentation on a five-point scale in a way similar to expert's evaluation. The accuracy of segmentation evaluation is quantified using average correct rank and frequency of the reference rating in the top ranks of simulated score list. Experimental results from 93 lesions showed that our evaluation method performs better than individual metrics. The optimal combination of metrics from normalized volume difference, volume overlap, Root Mean Square symmetric surface distance and maximum symmetric surface distance showed the smallest average correct rank (1.43) and highest frequency of the reference rating in the top two places of simulated rating list (93.55%). Our results also demonstrate that the ANN based non-linear combination method showed better evaluation accuracy than linear combination method in all performance measures. Our evaluation technique has the potential to facilitate large scale segmentation validation study by predicting radiologists rating, and to assist development of new tumor segmentation algorithms. It can also be extended to validation of segmentation algorithms for other applications.
Optimization of hepatic lesion detection with computed tomography (CT): Is randomization of lesion location necessary?
K. Dobeli, S. Lewis, S. Meikle, et al.
Purpose: The purpose of this study was to compare observer performance for the detection of randomly-positioned lesions to that of location-known lesions to determine if randomization of lesion placement is necessary for optimization of hepatic lesion detection with CT. A phantom containing fixed lesions (diameter 2.4mm, 4.8mm and 9.5mm) was scanned at various exposure and slice thickness settings. A second image set was created by electronically cutting lesions from the phantom images and pasting them into background-only images. Nine observers, blinded to lesion location in the second image set, reviewed all images under standardized viewing conditions. Visualization of lesions was scored using a four-point scale. Observer scores for the two methods were correlated for all lesions, and for each lesion size using Spearman's rank correlation coefficient (r). There was very high correlation between the observer scores for all lesions (r=0.919, p<0.0001) and for the 9.5mm lesion (r=0.963, p<0.0001). There was moderate correlation for the 4.8mm and 2.4mm lesions (r=0.509, p=0.084, r=0.640, p=0.028). Discussion: When considering all lesions, or the 9.5mm lesion independently, randomization did not alter observer scores, suggesting random location of large lesions is unnecessary for dose optimization. For the smaller lesion sizes correlation between the two methods is less robust. Conclusion: If lesion size is large or unimportant, dose optimization can be performed using a phantom with fixed lesions. For small lesions, randomized lesion location may be warranted, thus having implications for phantom design.
Impact of hybrid SPECT/CT imaging on the detection of single parathyroid adenoma
Antony Morrison, Patrick C. Brennan, Warren Reed, et al.
Objective: The aim of this investigation is to determine the impact of hybrid single photon emission computed tomography/computed tomography (SPECT/CT) on the detection of parathyroid adenoma. Materials and methods: 16 patients presented with suspected parathyroid adenoma localised within the neck. All patients were injected with Tc-99m sestamibi and were scanned with a GE Infinia Hawkeye SPECT/CT. There were six negative and ten positive confirmed cases. Five expert radiologists specializing in nuclear medicine were asked to report on the 16 planar and SPECT data sets and were then asked to report on the same randomly ordered data sets with the addition of CT. Receiver operating characteristic (ROC) analysis was performed using the Dorfman-Berbaum-Metz multireader-multicase methodology and sensitivity and specificity values were generated. A significance level of p ≤ 0.05 was set for all comparisons. Results: ROC analysis demonstrated an AUC of 0.64 and 0.69 for SPECT and SPECT/CT respectively (p = 0.31). Mean sensitivity scores increased from 0.64 to 0.80 (p = 0.17) and specificity scores decreased from 0.57 to 0.40 (p = 0.17) with the addition of the CT data. Conclusion: This preliminary investigation suggests that extra CT information may increase lesion detection as well as false positive rates for SPECT-based investigations of a single parathyroid adenoma. However, the difference in diagnostic efficacy between the two groups was not found to be statistically significant, and therefore requires further investigation. These findings have implications beyond the clinical situation described here.
Role of expertise and contralateral symmetry in the diagnosis of pneumoconiosis: an experimental study
Varun Jampani, Vivek Vaidya, Jayanthi Sivaswamy, et al.
Pneumoconiosis, a lung disease caused by the inhalation of dust, is mainly diagnosed using chest radiographs. The effects of using contralateral symmetric (CS) information present in chest radiographs in the diagnosis of pneumoconiosis are studied using an eye tracking experimental study. The role of expertise and the influence of CS information on the performance of readers with different expertise levels are also of interest. Experimental subjects ranging from novices and medical students to staff radiologists were presented with 17 double and 16 single lung images, and were asked to give profusion ratings for each lung zone. Eye movements and the time for their diagnosis were also recorded. A Kruskal-Wallis test (χ2(6) = 13.38, p = .038) showed that the observer error (average sum of absolute differences) in double lung images differed significantly across the different expertise categories when considering all the participants. A Wilcoxon signed-rank test indicated that the observer error was significantly higher for single-lung images (Z = 3.13, p < .001) than for the double-lung images for all the participants. A Mann-Whitney test (U = 28, p = .038) showed that the differential error between single and double lung images is significantly higher in doctors [staff & residents] than in non-doctors [others]. Thus, expertise and CS information play a significant role in the diagnosis of pneumoconiosis. CS information helps in diagnosing pneumoconiosis by reducing the general tendency of giving lower profusion ratings. Training and experience appear to play important roles in learning to use the CS information present in the chest radiographs.
Analysis of the number of distinct findings obtained by multiple readers in an MRMC study: When do findings obtained from the addition of new readers become redundant, or otherwise negligible?
Sophie Paquerault, Berkman Sahiner, Anna Kettermann, et al.
The ultimate goal of this project is to investigate whether the effect of a computer-aided detection (CAD) system on readers' performance (especially in the case of an upgrade of the CAD system, or between two different CAD systems with similar design) can be accurately predicted without having to perform a multi-reader multi-case (MRMC) observer study and, if such prediction is possible, to establish the underlying methodology. Our current study is intended to provide evidence that would substantiate efforts toward such investigation. The objectives of this study were 1) to investigate the relationship between the number of radiologists reading a dataset of thoracic computed tomography (CT) images to identify lung nodules and the number of distinct findings and 2) to determine the number of readers needed to identify almost all clinically distinct findings in a dataset. We used data from a multi-reader multi-case (MRMC) observer study that consisted of six radiologists interpreting 85 thoracic CT examinations. To further illustrate our approach, we also utilized simulated data consisting of twelve readers interpreting 198 samples equally distributed between three levels of detection difficulty. For each possible reader grouping, the number of distinct findings identified by the readers in the group was calculated. Five types of regression models used to describe the relationship between the average number of distinct findings per case and the number of readers needed were compared. The results showed that the logistic model best fitted both the thoracic CT data and the simulated data. Our assumption is that adding more readers after a certain reader set size would mostly add redundant findings and, therefore, the benefit would be negligible. Using this model, the predicted number of readers was found to depend on the type of findings considered. Our study showed that the number of clinically distinct findings that can be identified by radiologists on CT lung examinations without the use of a CAD system may be limited and that identifying almost all of these findings may only require a limited number of readers.
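The counting step described above (distinct findings for each possible reader grouping) can be sketched for a single case as below. The function name `mean_distinct_findings`, the reader labels, and the finding identifiers are hypothetical; the study averaged over cases and then fitted regression models, which this sketch does not attempt.

```python
from itertools import combinations


def mean_distinct_findings(findings_by_reader, group_size):
    """Average number of distinct findings over all reader groups of a size.

    findings_by_reader maps a reader label to the set of finding
    identifiers that reader marked on one case.
    """
    if not 1 <= group_size <= len(findings_by_reader):
        raise ValueError("group_size must be between 1 and the reader count")
    counts = [
        len(set().union(*group))
        for group in combinations(findings_by_reader.values(), group_size)
    ]
    return sum(counts) / len(counts)


# Hypothetical marks by four readers on one case (not study data).
marks = {
    "reader_a": {1, 2, 3},
    "reader_b": {2, 3, 4},
    "reader_c": {1, 3},
    "reader_d": {3, 5},
}

curve = [mean_distinct_findings(marks, k) for k in range(1, len(marks) + 1)]
print(curve)
```

Because a larger group can only add findings, the resulting curve never decreases, and its flattening is what would indicate that additional readers contribute mostly redundant findings.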
Reproducibility of an imaging based prostate cancer prognostic assay
Faisal M. Khan, Douglas Powell, Valentina Bayer-Zubek, et al.
The Prostate Px prognostic assay offered by Aureon Biosciences is designed to predict progression post primary treatment for prostate cancer patients based on their diagnostic biopsy specimen. The assay is driven by the automated image analysis of biological specimens. Three different histological sections are analyzed for morphometric as well as immunofluorescence protein expression properties within areas of tumor digitally masked by expert pathologists. The assay was developed on a multi-institution cohort of up to 9 images from each of 1027 patients. The variation in histological sections, staining, pathologist tumor masking and the region of image acquisition all have the potential to significantly impact imaging features and consequently the reproducibility of the assay's results for the same patient. This study analyzed the reproducibility of the assay in 50 patients who were re-processed within 3 months in a blinded fashion as de-novo patients. The key assay results reported were in agreement in 94% of the cases. The two independent endpoints of risk classification reproduced results in 90% and 92% of the predictions. This work presents one of the first assessments of the reproducibility of a commercial assay's results given the inherent variations in images and quantitative imaging characteristics in a commercial setting.
Assessment of updated CAD without a new reader study: effect of calibration of computer output on the computer-aided reader performance in CADx
It is very resource-demanding to assess each new version of a CAD system through a new reader study. We conjecture that the aided reader performance on a new version can be predicted by using certain characteristics of the computer output and the reader study conducted when the CAD system was initially introduced. This would likely reduce the need for additional reader studies. However, investigations are needed to develop a sound scientific foundation to test this conjecture. In this work, we consider a CADx system that outputs a disease score to aid the physician in making a diagnostic decision on a located lesion. Our major contribution is to show that calibration, reflected as a change in scale, is a characteristic of the computer output that needs to be considered in order to predict the aided reader performance in a new CADx version without a reader study. We used a bivariate bi-beta distribution to model the joint distribution of the decision variable underlying the reader without aid and the decision variable underlying the version 1 computer output in the initial version. We then applied a monotonic transformation to the computer output to simulate the computer output in a new version, i.e., the scores in the two versions differ only in calibration (specifically a change in scale). By further modeling certain mechanisms that the human reader may use for combining the computer output and the reader-alone scores, we computed the aided reader performance in terms of AUC for the new version of the CADx system. Our results show that the aided reader performance could depend on the degree of calibration difference between the two CAD system outputs. We conclude that for the purpose of predicting the aided reader performance of a new version of the CADx system, ROC performance (or any other rank-based metric) of the stand-alone CADx system may not be sufficient by itself.
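The closing point, that a rank-based stand-alone metric cannot reveal a calibration change, can be illustrated with a small numeric sketch. Everything here is an assumption made for illustration: the scores are invented, the cubic rescaling stands in for an arbitrary monotonic transformation, and simple averaging stands in for one of many possible ways a reader might combine scores. It does not reproduce the bivariate bi-beta model of the paper.

```python
def auc(neg_scores, pos_scores):
    """Empirical AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def recalibrate(score, power=3.0):
    """Hypothetical monotonic change of scale for scores in [0, 1]."""
    return score ** power


def aided(reader_scores, cad_scores):
    """Hypothetical combination mechanism: average reader and CAD scores."""
    return [(r + c) / 2.0 for r, c in zip(reader_scores, cad_scores)]


# Invented scores for four non-diseased and four diseased cases.
cad_v1_neg = [0.10, 0.30, 0.40, 0.60]
cad_v1_pos = [0.35, 0.70, 0.80, 0.90]
cad_v2_neg = [recalibrate(s) for s in cad_v1_neg]
cad_v2_pos = [recalibrate(s) for s in cad_v1_pos]
reader_neg = [0.20, 0.50, 0.30, 0.40]
reader_pos = [0.60, 0.40, 0.70, 0.50]

standalone_v1 = auc(cad_v1_neg, cad_v1_pos)
standalone_v2 = auc(cad_v2_neg, cad_v2_pos)
aided_v1 = auc(aided(reader_neg, cad_v1_neg), aided(reader_pos, cad_v1_pos))
aided_v2 = auc(aided(reader_neg, cad_v2_neg), aided(reader_pos, cad_v2_pos))

print("stand-alone AUC:", standalone_v1, standalone_v2)
print("aided AUC:", aided_v1, aided_v2)
```

In this toy example the stand-alone AUC is identical for both versions because the transformation preserves the ranking of scores, while the aided AUC changes because averaging is sensitive to scale.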
Streak artefact quantification for abdominal CT
Michael Figl, Romana Fragner, Patrick Heimel, et al.
Streaking artefacts in computed tomography (CT) can be caused by photon starvation due to highly attenuating regions. Patient positioning can influence the attenuation, e.g. by arms raised or down in an abdominal CT scan. Positioning the arms alongside the body increases attenuation, therefore a higher dose can be expected. Additionally, the artefacts can cause a decrease in image quality. Measuring this quality decrease is the purpose of this article. We implemented different methods to quantify streaking artefacts and correlated them to the judgement of two radiologists in a study of 80 patients. High significance was found for a correlation coefficient of 0.57. This correlation between measurements and clinical usability (represented by the radiologists' ratings) makes it possible to predict the usability by means of image processing alone. This can be included in the patient image as a correlate to the diagnostic usability, or a new volume can be made depending on the number.
High luminance monochrome vs. color displays: impact on performance and search
To determine if diagnostic accuracy and visual search efficiency with a high luminance medical-grade color display are equivalent to a high luminance medical-grade monochrome display. Six radiologists viewed DR chest images, half with a solitary pulmonary nodule and half without. Observers reported whether or not a nodule was present and their confidence in that decision. Total viewing time per image was recorded. On a subset of 15 cases eye-position was recorded. Confidence data were analyzed using MRMC ROC techniques. There was no statistically significant difference (F = 0.0136, p = 0.9078) between color (mean Az = 0.8981, se = 0.0065) and monochrome (mean Az = 0.8945, se = 0.0148) diagnostic performance. Total viewing time per image did not differ significantly (F = 0.392, p = 0.5315) as a function of color (mean = 27.36 sec, sd = 12.95) vs monochrome (mean = 28.04, sd = 14.36) display. There were no significant differences in decision dwell times (true and false, positive and negative) overall for color vs monochrome displays (F = 0.133, p = 0.7154). The true positive (TP) and false positive (FP) decisions were associated with the longest dwell times, the false negatives (FN) with slightly shorter dwell times, and the true negative decisions (TN) with the shortest (F = 50.552, p < 0.0001) and these trends were consistent for both color and monochrome displays. Current color medical-grade displays are suitable for primary diagnostic interpretation in clinical radiology.
Study of signal-to-noise ratios considered human visual characteristics
The effects of imaging parameters on detectability have not yet been clarified. Therefore, we investigated the usefulness of signal-to-noise ratios (SNRs) that take human visual characteristics into account, such as the visual spatial frequency response and the internal noise in the eye-brain system. We examined the amplitude model (SNRa), matched filter model (SNRm), and internal noise model (SNRi) to study the relationship between these SNRs and the visual image quality for signal detection. The test images were simulated by the superimposition of low-contrast signals on a uniform noisy background. The SNRs were obtained for 15 imaging cases with various signal sizes, signal contrasts, exposure levels, and number of acrylic plates used as breast phantoms. The SNRs were calculated by measuring the spatial frequency characteristics of the signal, modulation transfer function (MTF) of the system, display MTF, and overall Wiener spectrum (WS). In the perceptual evaluation, we applied the 16-alternative forced choice (16-AFC) method. The signal detectability was defined as the number of detected signals divided by the total number of signals. We studied the relationship between SNR and signal detectability using Spearman's rank correlation coefficient. The correlation coefficient of SNRi was 0.93, making it the highest among the three SNR types. That of SNRm was 0.91; it correlated at the same level as SNRi although it does not take human visual characteristics into account. That of SNRa was 0.45. SNRi, which incorporated the visual characteristics, explained the visual image quality well.
Radiation dose reduction in digital radiography using wavelet-based image processing methods
Haruyuki Watanabe, Du-Yih Tsai, Yongbum Lee, et al.
In this paper, we investigate the effect of the use of wavelet transform for image processing on radiation dose reduction in computed radiography (CR), by measuring various physical characteristics of the wavelet-transformed images. Moreover, we propose a wavelet-based method that offers a possibility to reduce radiation dose while maintaining a clinically acceptable image quality. The proposed method integrates the advantages of a previously proposed technique, i.e., a sigmoid-type transfer curve for wavelet coefficient weighting adjustment, as well as a wavelet soft-thresholding technique. The former can improve contrast and spatial resolution of CR images, while the latter can improve the noise characteristics of the images. In the investigation of physical characteristics, modulation transfer function, noise power spectrum, and contrast-to-noise ratio of CR images processed by the proposed method and other different methods were measured and compared. Furthermore, visual evaluation was performed using Scheffe's pair comparison method. Experimental results showed that the proposed method could improve overall image quality as compared to other methods. Our visual evaluation showed that an approximately 40% reduction in exposure dose might be achieved in hip joint radiography by using the proposed method.
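For readers unfamiliar with wavelet soft-thresholding, the standard operation shrinks each detail coefficient toward zero by a fixed threshold and zeroes those below it. The sketch below shows only that generic rule; the function name, the threshold value, and the coefficients are hypothetical, and the abstract does not specify the wavelet, the decomposition level, or how the threshold was selected.

```python
def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by threshold; zero small ones."""
    shrunk = []
    for c in coefficients:
        magnitude = abs(c) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)  # coefficient treated as noise
        else:
            shrunk.append(magnitude if c > 0 else -magnitude)
    return shrunk


# Hypothetical wavelet detail coefficients and threshold (not study data).
detail = [3.0, -0.5, -2.0, 0.2]
print(soft_threshold(detail, 1.0))
```

Small coefficients, which in a noisy radiograph are dominated by noise, are removed, while large coefficients that carry edge information are kept with reduced amplitude.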