- Front Matter: Volume 7966
- Perception in Screening Exams
- Human Performance
- Model Observers
- ROC and Decision Metrics
- Keynote and Assessment in Pathology
- Image Display and Presentation
- Vision in Medical Imaging
- Technology Assessment and Impact
- Poster Session
Front Matter: Volume 7966
Front Matter: Volume 7966
This PDF file contains the front matter associated with SPIE Proceedings Volume 7966, including the Title Page, Copyright information, Table of Contents, Introduction, and the Conference Committee listing.
Perception in Screening Exams
Optimizing viewing procedures of breast tomosynthesis image volumes using eye tracking combined with a free response human observer study
Kristina Lång,
Sophia Zackrisson,
Kenneth Holmqvist,
et al.
The purpose of this study was to evaluate four different viewing procedures as part of improving viewing conditions of
breast tomosynthesis (BT) image volumes. The procedures consisted of free scroll volume browsing, and a combination
of initial cine loops at three different frame rates (9, 14 and 25 fps) terminated upon request followed by free scroll
volume browsing. Fifty-five normal BT image volumes in MLO view were collected. In these, simulated lesions (20
masses and 20 clusters of microcalcifications) were randomly inserted, creating four unique image sets for each
procedure. Four readers interpreted the cases in a random order. Their task was to locate a lesion, mark it, and assign a
rating on a five-level confidence scale. The diagnostic accuracy was analyzed using Jackknife Free Receiver Operating Characteristics
(JAFROC). Time efficiency and visual search behavior were also investigated using eye tracking. The results indicate
that there was no statistically significant difference in JAFROC FOM between the different viewing procedures;
however, the medium cine loop speed seemed to be the preferred viewing procedure in terms of total analysis time and
dwell time.
Assessment of breast density: reader performance using synthetic mammographic images
The quantity and appearance of dense breast tissue in mammograms are related to the risk of developing breast cancer, the
sensitivity of mammographic interpretation, and the likelihood of local recurrence of cancer following surgery. Visual
assessment of breast density is widely used, often with readers indicating the percentage of dense tissue in a
mammogram. Although real mammograms can be used to investigate intra- and inter-observer variability, ground truth
is difficult to ascertain, so to investigate reader accuracy, we created 60 synthetic, mammogram-like images with
densities comparable in area to those found in screening. The images contained either a single dense area, multiple or
linear densities, or a variable breast size with a single density. The images were randomized and assessed by 9 expert and
6 non-expert readers who marked percentage area of density on a visual analogue scale. Non-expert readers' estimates of
percentage area of density were closer to the truth (6-11% mean absolute difference) than the experts' estimates
(10-19%). The readers were most accurate when the density formed a single area in the image, and least accurate when the
dense area was composed of linear structures. In almost every case, the dense area was overestimated by the expert
readers. When experts were ranked according to the degree of overestimation, this broadly reflected their relative
performance on real mammograms.
Health professionals' agreement on density judgements and successful abnormality identification within the UK Breast Screening Programme
Higher breast density is associated with a greater chance of developing breast cancer. Additionally, it is well known that
higher mammographic breast density is associated with increased difficulty in accurately identifying breast cancer.
However, comparatively little is known of the reliability of breast density judgements. All UK breast screeners
(primarily radiologists and technologists) annually participate in the PERFORMS self-assessment scheme where they
make several judgements about series of challenging recent screening cases of known outcomes. As part of this process,
for each case, they provide a radiological assessment of the likelihood of cancer on a confidence scale, alongside an
assessment of case density using a three point scale. Analysis of the data from two years of the scheme found that the
degree of agreement on case density was significantly greater than no agreement (p < .001). However, only a moderate
degree of inter-rater reliability was exhibited (κ = .44) with significant differences between the occupational groups. The
reasons for differences between the occupational groups and the relationship between agreement on density rating and
case reading ability are explored.
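As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two raters. The abstract does not say which kappa variant was used (with many screeners a multi-rater form is likely), so the two-rater formula, the function name `cohen_kappa`, and the example ratings are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of agreement and p_e is the agreement expected by chance from each
    rater's marginal category frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category for every case.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical density ratings on a three-point scale (not study data).
reader_1 = [1, 1, 2, 2, 3, 3, 2, 1, 3, 2]
reader_2 = [1, 2, 2, 2, 3, 2, 2, 1, 3, 3]
kappa = cohen_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw percentage agreement for the agreement that would be expected by chance, which is why a value such as .44 is described as only moderate even when raters agree on most cases.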
The time course of cancer detection performance
The purpose of this study was to measure how mammography readers' performance varies with time of day and time
spent reading. This was investigated in screening practice and when reading an enriched case set. In screening practice,
records of the time and date that each case was read, along with the outcome (whether the woman was recalled for further tests,
and biopsy results where performed), were extracted from one breast screening centre in the UK (4 readers).
Patterns of performance with time spent reading were also measured using an enriched test set (160 cases, 41% malignant,
read three times by eight radiologists). Recall rates varied with time of day, with different patterns for each reader. Recall
rates decreased as the reading session progressed both when reading the enriched test set and in screening practice.
Further work is needed to expand this work to a greater number of breast screening centres, and to determine whether
these patterns of performance over time can be used to optimize overall performance.
Can horizontally oriented breast tomosynthesis image volumes or the use of a systematic search strategy improve interpretation? An eye tracking and free response human observer study
Kristina Lång,
Sophia Zackrisson,
Kenneth Holmqvist,
et al.
Our aim was to evaluate if there is a benefit in diagnostic accuracy and efficiency of viewing breast tomosynthesis (BT)
image volumes presented horizontally oriented, but also to evaluate the use of a systematic search strategy where the
breast is divided, and analyzed consecutively, into two sections. These image presentations were compared to regular
vertical image presentation. All methods were investigated using viewing procedures consisting of free scroll volume
browsing, and a combination of initial cine loops at three different frame rates (9, 14, 25 fps) terminated upon request
followed by free scroll volume browsing if needed. Fifty-five normal BT image volumes in MLO view were collected.
In these, simulated lesions (20 masses and 20 clusters of microcalcifications) were randomly inserted, creating four
unique image sets for each procedure. Four readers interpreted the cases in a random order. Their task was to locate the
lesions, mark them, and assign a rating on a five-level confidence scale. The diagnostic accuracy was analyzed using Jackknife Free
Receiver Operating Characteristics (JAFROC). Time efficiency and visual search behavior were also investigated using
eye tracking. Results indicate there was no statistically significant difference in JAFROC FOM between the different
image presentations, although visual search was more time efficient when viewing horizontally oriented image volumes
in medium cine loops.
Human Performance
Modeling error in assessment of mammographic image features for improved computer-aided mammography training: initial experience
In this study we investigate the hypothesis that there exist patterns in erroneous assessment of BI-RADS
image features among radiology trainees when performing diagnostic interpretation of mammograms.
We also investigate whether these error making patterns can be captured by individual user models. To
test our hypothesis we propose a user modeling algorithm that uses the previous readings of a trainee
to identify whether certain BI-RADS feature values (e.g. "spiculated" value for "margin" feature)
are associated with higher than usual likelihood that the feature will be assessed incorrectly. In our
experiments we used readings of 3 radiology residents and 7 breast imaging experts for 33 breast masses
for the following BI-RADS features: parenchyma density, mass margin, mass shape and mass density.
The expert readings were considered as the gold standard. Rule-based individual user models were
developed and tested using the leave-one-out cross-validation scheme. Our experimental evaluation
showed that the individual user models are accurate in identifying cases for which errors are more
likely to be made. The user models captured regularities in error making for all 3 residents. This
finding supports our hypothesis about existence of individual error making patterns in assessment
of mammographic image features using the BI-RADS lexicon. Explicit user models identifying the
weaknesses of each resident could be of great use when developing and adapting a personalized training
plan to meet the resident's individual needs. Such an approach fits well with the framework of adaptive
computer-aided educational systems in mammography we have proposed before.
Time of day does not affect radiologists' accuracy in breast lesion detection
Mammographic image reporting accuracy among radiologists varies. This study examines whether radiologists' accuracy
in detecting breast lesions varies at different times throughout the day. The observers comprised 69 experienced breast
radiologists who reviewed 50 mammograms, consisting of 4 images each, of which 15 cases were abnormal. All the
observers were grouped and assigned a specific hour, starting at 7:00am and finishing at 8:00pm. They were asked to detect
the lesion if present and mark their confidence rating (1-5) in a provided booklet. Demographic details were recorded
including age, experience and average number of mammographic readings undertaken per year. Radiologists'
performance was measured and compared in terms of sensitivity, specificity and receiver operating characteristic (ROC)
scores. A Kruskal-Wallis test with Dunn's post-hoc test was performed. Mean ROC scores demonstrated no
significant differences (p≥0.46) between groups performing at different times of the day. Also, no significant differences
were noted for sensitivity (p≥0.78) or specificity (p≥0.99) when groups were compared with each other. The findings
from the study suggest that although radiologists' performance varies slightly throughout the day, the exact time of day
has no significant effect on radiologists' detection accuracy. The results suggest that further studies are required to
investigate this effect.
Extended analysis of the effect of learning with feedback on the detectability of pulmonary nodules in chest tomosynthesis
In chest tomosynthesis, low-dose projections collected over a limited angular range are used for reconstruction of section
images of the chest, resulting in a reduction of disturbing anatomy at a moderate increase in radiation dose compared to
chest radiography. In a previous study, we investigated the effects of learning with feedback on the detection of
pulmonary nodules in chest tomosynthesis. Six observers with varying degrees of experience of chest tomosynthesis
analyzed tomosynthesis cases for presence of pulmonary nodules. The cases were analyzed before and after learning with
feedback. Multidetector computed tomography (MDCT) was used as reference. The differences in performance between
the two readings were calculated using the jackknife alternative free-response receiver operating characteristics
(JAFROC-2) as primary measure of detectability. Significant differences between the readings were found only for
observers inexperienced in chest tomosynthesis. The purpose of the present study was to extend the statistical analysis of
the results of the previous study, including JAFROC-1 analysis and FROC curves in the analysis. The results are
consistent with the results of the previous study and, furthermore, JAFROC-1 gave lower p-values than JAFROC-2 for
the observers who improved their performance after learning with feedback.
Classification of radiological errors in chest radiographs, using support vector machine on the spatial frequency features of false-negative and false-positive regions
Aim: To optimize automated classification of radiological errors during lung nodule detection from chest radiographs
(CxR) using a support vector machine (SVM) run on the spatial frequency features extracted from the local background
of selected regions. Background: The majority of the unreported pulmonary nodules are visually detected but not
recognized; shown by the prolonged dwell time values at false-negative regions. Similarly, overestimated nodule
locations are capturing substantial amounts of foveal attention. Spatial frequency properties of selected local
backgrounds are correlated with human observer responses either in terms of accuracy in indicating abnormality position
or in the precision of visual sampling of the medical images. Methods: Seven radiologists participated in the eye tracking
experiments conducted under conditions of pulmonary nodule detection from a set of 20 postero-anterior CxR. The most
dwelled locations have been identified and subjected to spatial frequency (SF) analysis. The image-based features of
selected ROI were extracted with un-decimated Wavelet Packet Transform. An analysis of variance was run to select SF
features and a SVM schema was implemented to classify False-Negative and False-Positive from all ROI. Results: A
relatively high overall accuracy was obtained for each individually developed Wavelet-SVM algorithm, with over 90%
average correct ratio for error recognition from all prolonged dwell locations. Conclusion: The preliminary results
show that combined eye-tracking and image-based features can be used for automated detection of radiological error
with SVM. The work is still in progress and not all analytical procedures have been completed, which might have an
effect on the specificity of the algorithm.
A novel platform to simplify human observer performance experiments in clinical reading environments
Human observer performance experiments (HOPE) are frequently carried out in controlled environments in order to maximize the
influence of the performance parameter under study. As an example, the number of ambient reading variables can be kept as low as
possible during HOPE. This contrasts with the dynamic nature of a clinical reading environment, which may therefore be
suboptimal for the majority of the experiments. The aim of current work was to extend our previously developed software platform
Sara² to cope with the influences of the reading environment on HOPE experiments. Generic modules for ROC, LROC, FROC, MAFC
and visual grading analysis/image quality criteria (VGA/IQC) experiments were developed for 2D and 3D input images.
Additional modules were included in the platform for finding unexpected interruptions due to clinical emergencies by means of idle
time and for mouse trajectory monitoring. A generic approach towards the inclusion of reading questionnaires and an RFID-enabled
secured login system were also added. Next, we created a sensor network consisting of off-the-shelf components which
continuously monitor ambient reading conditions like: temperature, ambient lighting, humidity, ambient noise levels and observer
reading distance. These measured parameters can be synchronized with the reading findings. Finally we included a link to incorporate
the use of specialized 3rd party PACS viewers in our software framework. Using the proposed software and hardware solution, we
could simplify the setup and the performing of HOPE in clinical reading environments and we can now properly control our reading
experiments.
Analysis of physiological impact while reading stereoscopic radiographs
A stereoscopic viewing technology is expected to improve diagnostic performance in terms of reading efficiency by
adding one more dimension to the conventional 2D images. Although a stereoscopic technology has been applied to
many different fields including TV, movies and medical applications, physiological fatigue from reading stereoscopic
radiographs remains a concern, and no established physiological fatigue data have been provided. In this study,
we measured the α-amylase concentration in saliva, heart rates and normalized tissue hemoglobin index (nTHI) in blood
of frontal area to estimate physiological fatigue through reading both stereoscopic radiographs and the conventional 2D
radiographs. In addition, subjective assessments were also performed.
As a result, the pupil contraction occurred just after the reading of the stereoscopic images, but the subjective
assessments regarding visual fatigue were nearly identical for reading the conventional 2D and stereoscopic
radiographs. The α-amylase concentration and the nTHI continued to decline while examinees read both 2D and
stereoscopic images, which reflected the result of subjective assessment that almost half of the examinees reported feeling
sleepy after reading. The subjective assessments regarding brain fatigue showed that there were little differences
between 2D and stereoscopic reading.
In summary, this study shows that the physiological fatigue caused by stereoscopic reading is equivalent to the
conventional 2D reading including ocular fatigue and burden imposed on brain.
Model Observers
Incorporating holistic visual search concepts into a SPECT myocardial perfusion imaging numerical observer
Previous Single Photon Emission Computed Tomography (SPECT) myocardial perfusion imaging (MPI) research has
explored the utility of numerical observers. One previous study proposed that the model of holistic visual search of a
myocardial perfusion image by an expert human observer might improve the development of a SPECT MPI numerical
observer. Further examination of numerical processing techniques that seem to be analogous to the initial stage of human
holistic image search has helped to further refine the numerical observer. The current numerical observer considers
some fundamental issues in the refinement of the numerical observer: the need for background estimation, the
determination of blobs and the 'search-like' selection of a few blobs for subsequent decision analysis.
Channelized relevance vector machine as a numerical observer for cardiac perfusion defect detection task
In this paper, we present a numerical observer for image quality assessment, aiming to predict human observer accuracy
in a cardiac perfusion defect detection task for single-photon emission computed tomography (SPECT). In medical
imaging, image quality should be assessed by evaluating the human observer accuracy for a specific diagnostic task.
This approach is known as task-based assessment. Such evaluations are important for optimizing and testing imaging
devices and algorithms. Unfortunately, human observer studies with expert readers are costly and time-demanding. To
address this problem, numerical observers have been developed as a surrogate for human readers to predict human
diagnostic performance. The channelized Hotelling observer (CHO) with internal noise model has been found to predict
human performance well in some situations, but does not always generalize well to unseen data. We have argued in the
past that finding a model to predict human observers could be viewed as a machine learning problem. Following this
approach, in this paper we propose a channelized relevance vector machine (CRVM) to predict human diagnostic scores
in a detection task. We have previously used channelized support vector machines (CSVM) to predict human scores and
have shown that this approach offers better and more robust predictions than the classical CHO method. The comparison
of the proposed CRVM with our previously introduced CSVM method suggests that CRVM can achieve similar
generalization accuracy, while dramatically reducing model complexity and computation time.
Development of model observers applied to 3D breast tomosynthesis microcalcifications and masses
The development of model observers for mimicking human detection strategies has followed from symmetric signals in
simple noise to increasingly complex backgrounds. In this study we implement different model observers for the
complex task of detecting a signal in a 3D image stack. The backgrounds come from real breast tomosynthesis
acquisitions and the signals were simulated and reconstructed within the volume. Two different tasks relevant to the
early detection of breast cancer were considered: detecting an 8 mm mass and detecting a cluster of microcalcifications.
The model observers were calculated using a channelized Hotelling observer (CHO) with dense difference-of-Gaussian
channels, and a modified (Partial prewhitening [PPW]) observer which was adapted to realistic signals which are not
circularly symmetric. The sustained temporal sensitivity function was used to filter the images before applying the
spatial templates. For a frame rate of five frames per second, the only CHO that we calculated performed worse than the
humans in a 4-AFC experiment. The other observers were variations of PPW and outperformed human observers in
every single case. This initial frame rate was a rather low speed and the temporal filtering did not affect the results
compared to a data set with no human temporal effects taken into account. We subsequently investigated two higher
speeds, so that 5, 15 and 30 frames per second were covered. We observed that for large masses, the two types of model observers
investigated outperformed the human observers and would be suitable with the appropriate addition of internal noise.
However, for microcalcifications only the PPW observer consistently outperformed the humans. The study
demonstrated the possibility of using a model observer which takes into account the temporal effects of scrolling through
an image stack while being able to effectively detect a range of mass sizes and distributions.
Numerical observer for cardiac motion assessment using machine learning
In medical imaging, image quality is commonly assessed by measuring the performance of a human observer performing
a specific diagnostic task. However, in practice studies involving human observers are time consuming and difficult to
implement. Therefore, numerical observers have been developed, aiming to predict human diagnostic performance to
facilitate image quality assessment. In this paper, we present a numerical observer for assessment of cardiac motion in
cardiac-gated SPECT images. Cardiac-gated SPECT is a nuclear medicine modality used routinely in the evaluation of
coronary artery disease. Numerical observers have been developed for image quality assessment via analysis of
detectability of myocardial perfusion defects (e.g., the channelized Hotelling observer), but no numerical observer for
cardiac motion assessment has been reported. In this work, we present a method to design a numerical observer aiming
to predict human performance in detection of cardiac motion defects. Cardiac motion is estimated from reconstructed
gated images using a deformable mesh model. Motion features are then extracted from the estimated motion field and
used to train a support vector machine regression model predicting human scores (human observers' confidence in the
presence of the defect). Results show that the proposed method could accurately predict human detection performance
and achieve good generalization properties when tested on data with different levels of post-reconstruction filtering.
Accounting for anatomical noise in SPECT with a visual-search human-model observer
Reliable human-model observers for clinically realistic detection studies are of considerable interest in medical
imaging research, but current model observers require frequent revalidation with human data. A visual-search
(VS) observer framework may improve reliability by better simulating realistic detection-localization tasks. Under
this framework, model observers execute a holistic search to identify tumor-like candidates and then perform
careful analysis of these candidates. With emission tomography, anatomical noise in the form of elevated uptake
in neighboring tissue often complicates the task. Some scanning model observers simulate the human ability to
read around such noise by presubtracting the mean normal background from the test image, but this
background-known-exactly (BKE) assumption has several drawbacks. The extent to which the VS observer can overcome
these drawbacks was investigated by comparing it against humans and a scanning observer for detection of
solitary pulmonary nodules in a simulated SPECT lung study. Our results indicate that the VS observer offers
a robust alternative to the scanning observer for modeling humans.
ROC and Decision Metrics
Support of the decision variable densities of the three-class ideal observer for bivariate trinormal data
Despite theoretical and practical difficulties, we are attempting to extend receiver operating characteristic (ROC)
analysis to tasks with more than two classes. Previously we investigated a univariate trinormal model for the
underlying data of a three-class ideal observer. Although analytically tractable, this is less realistic than a
multivariate data model. We have developed expressions for the region of support of the decision variable
probability density functions for bivariate trinormal underlying data, given certain constraints on the underlying
data covariance matrices. We hope these results will aid in developing computational methods for evaluating
observer performance under such a model.
Agreement between two versions of a CADx system: a simulation study
A simulation study was conducted to investigate the agreement between original and updated versions of a
computer-aided diagnosis (CADx) system. Performances of two versions of a CADx system are traditionally compared using
metrics derived from the receiver operating characteristic (ROC) curve. These aggregate standalone performance
measures may reveal the overall improvement of the CADx system due to the update, but do not provide information
about the specific change in CADx output for individual cases. To address this issue, we used the concordance measure,
which compares the ordering of scores for pairs of cases between system versions (i.e., before and after the update of the
system). In this preliminary study, the system update that we investigated was an enlargement of the training data set,
which is often encountered in the development of a subsequent CADx system version for improving performance. We
separately studied the effect of the size of the original training set, the number of features, and the distribution and
separation of the two classes in the feature space on the concordance and AUC measures. When the effect of an update
was compared among datasets with differences in intrinsic class separation, concordance was in general larger when the
intrinsic class separation was larger. The amount of change in AUC between the original and updated CADx system did
not always predict the degree of agreement between the two system versions. A large improvement in AUC could be
accompanied by either a larger or smaller agreement between the original and updated systems. Quantification of the
degree of agreement in standalone performance between different versions of a CADx system may serve to define a
major algorithm update, and better depict the impact of that update.
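The contrast drawn above between AUC and concordance can be sketched in a few lines. This is a minimal illustration under stated assumptions: concordance is taken here to be the fraction of case pairs whose score ordering is the same in both system versions, and AUC is the empirical (Mann-Whitney) estimate; the abstract does not give the exact definitions used, and the function names and scores below are hypothetical.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def concordance(scores_old, scores_new):
    """Fraction of case pairs ordered the same way by both versions."""
    if len(scores_old) != len(scores_new) or len(scores_old) < 2:
        raise ValueError("need two equal-length score lists with >= 2 cases")
    pairs = list(combinations(range(len(scores_old)), 2))
    agree = sum(
        sign(scores_old[i] - scores_old[j]) == sign(scores_new[i] - scores_new[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def empirical_auc(scores, labels):
    """Mann-Whitney estimate of the area under the ROC curve.

    Counts the fraction of (positive, negative) pairs in which the
    positive case scores higher; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical CADx output scores for six cases (1 = malignant, 0 = benign).
labels = [1, 1, 1, 0, 0, 0]
old = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
new = [0.8, 0.9, 0.6, 0.4, 0.1, 0.3]

print("AUC old:", empirical_auc(old, labels))
print("AUC new:", empirical_auc(new, labels))
print("concordance:", concordance(old, new))
```

In this toy data the updated scores separate the classes perfectly, yet several case pairs are reordered between versions, which is the kind of per-case change that an aggregate AUC comparison does not show.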
Reader characteristics linked to detection of pulmonary nodules on radiographs: ROC vs. JAFROC analyses of performance
The purpose of this study is to explore whether reader characteristics are linked to heightened levels of diagnostic
performance in chest radiology using receiver operating characteristic (ROC) and jackknife free response ROC
(JAFROC) methodologies. A set of 40 postero-anterior chest radiographs was developed, of which 20 were abnormal
containing one or more simulated nodules, of varying subtlety. Images were independently reviewed by 12
board-certified radiologists including six chest specialists. The observer performance was measured in terms of ROC and
JAFROC scores. For the ROC analysis, readers were asked to rate their degree of suspicion for the presence of nodules
by using a confidence rating scale (1-6). JAFROC analysis required the readers to locate and rate as many suspicious
areas as they wished using the same scale and resultant data were used to generate Az and FOM scores for ROC and
JAFROC analyses respectively. Using Pearson methods, scores of performance were correlated with 7 reader
characteristics recorded using a questionnaire. JAFROC analysis showed that improved reader performance was
significantly (p≤0.05) linked with chest specialty (p<0.03), hours per week reading chest radiographs (p<0.03) and chest
readings per year (p<0.04). ROC analyses demonstrated only one significant relationship, hours per week reading chest
radiographs (p<0.02). The results of this study have shown that radiologists' performance in the detection of pulmonary
nodules on radiographs is significantly linked to chest specialty, hours reading per week and number of radiographs read
per year. Also, JAFROC is a more powerful predictor of performance as compared to ROC.
Estimating the parameters of a model of visual search from ROC data: an alternate method for fitting proper ROC curves
The binormal receiver operating characteristic (ROC) model often predicts an unphysical "hook" near the
upper-right corner (1,1) of the ROC plot. Several models for fitting proper ROC curves avoid this problem. The purpose of
this work is to describe another method that involves a model of visual search that models free-response data, and to
compare the search-model predicted ROC curves with those predicted by PROPROC (proper ROC) software. The
highest rating rule was used to infer ROC data from FROC data. An expression for the search-model ROC
likelihood function is derived, maximizing which yielded estimates of the parameters and the fitted ROC curve. The
method was applied to a dual-modality 5-reader FROC data set. The relative difference between the average AUCs
for the two methods was less than 1%. A linear regression of the AUCs yielded an adjusted R-squared of 0.95
indicative of strong linear correlation between the search model AUC and PROPROC AUC, although the shapes of
the predicted ROC curves were qualitatively different. This study shows the feasibility of estimating parameters
characterizing visual search from data acquired in a non-search paradigm.
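The highest rating rule mentioned above can be sketched as follows. This is a minimal illustration, not the authors' software: each case receives the rating of its highest-rated mark, and a case with no marks receives a default value below the lowest allowed rating. The function name `infer_roc_ratings`, the default of 0, and the example marks are assumptions.

```python
def infer_roc_ratings(marks_per_case, no_mark_rating=0):
    """Collapse FROC mark ratings to one ROC rating per case.

    marks_per_case: list of lists; each inner list holds the confidence
        ratings of all marks the reader placed on that case.
    no_mark_rating: rating assigned to a case with no marks; it should be
        lower than any rating a reader can give (assumed 0 here).
    """
    return [max(marks) if marks else no_mark_rating for marks in marks_per_case]


# Hypothetical FROC data for four cases on a 1-5 rating scale.
froc_marks = [[2, 4], [], [1], [5, 3, 3]]
roc_ratings = infer_roc_ratings(froc_marks)
print(roc_ratings)
```

Mark locations are discarded by this rule, so the resulting per-case ratings can be analyzed with ordinary ROC methods.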
Characterizing and optimizing rater performance for internet-based collaborative labeling
Labeling structures on medical images is crucial in determining clinically relevant correlations with morphometric and
volumetric features. For the exploration of new structures and new imaging modalities, validated automated methods do
not yet exist, and so researchers must rely on manually drawn landmarks. Voxel-by-voxel labeling can be extremely
resource intensive, so large-scale studies are problematic. Recently, statistical approaches and software have been
proposed to enable Internet-based collaborative labeling of medical images. While numerous labeling software tools
have been created, the use of these packages as high-throughput labeling systems has yet to become entirely viable given
training requirements. Herein, we explore two modifications to a typical mouse-based labeling system: (1) a
platform-independent overlay for recognition of mouse gestures and (2) an inexpensive touch-screen tracking device for
non-mouse input. Through this study we characterize rater reliability in point, line, curve, and region placement. For the
mouse input, we find a placement accuracy of 2.48±5.29 pixels (point), 0.630±1.81 pixels (curve), 1.234±6.99 pixels
(line), and 0.058±0.027 (1 - Jaccard Index for region). The gesture software increased labeling speed by 27% overall
and accuracy by approximately 30-50% on point and line tracing tasks, but the touch-screen module led to slower and
more error-prone labeling on all tasks, likely due to relatively poor sensitivity. In summary, the mouse gesture
integration layer runs as a seamless operating system overlay and could potentially benefit any labeling software; yet, the
inexpensive touch screen system requires improved usability optimization and calibration before it can provide an
efficient labeling system.
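The region metric quoted in this abstract, 1 - Jaccard Index, can be sketched in a few lines of Python. This is only an illustration under the assumption that each labeled region is stored as a set of pixel coordinates; the function name and the sample regions are hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def jaccard_distance(region_a: Set[Pixel], region_b: Set[Pixel]) -> float:
    """Return 1 - |A intersect B| / |A union B| for two pixel sets.

    Assumption: two empty regions are treated as identical (distance 0).
    """
    union = region_a | region_b
    if not union:
        return 0.0
    return 1.0 - len(region_a & region_b) / len(union)


# Hypothetical example: a 10x10 reference region and a rater region
# shifted one pixel to the right.
reference = {(x, y) for x in range(10) for y in range(10)}
rater = {(x, y) for x in range(1, 11) for y in range(10)}
distance = jaccard_distance(reference, rater)
print(round(distance, 3))
```

A value of 0 means the two regions coincide exactly, and a value of 1 means they do not overlap at all.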
Keynote and Assessment in Pathology
Changes in visual search patterns of pathology residents as they gain experience
The goal of this study was to examine and characterize changes in the ways that pathology residents examine digital or
"virtual" slides as they gain more experience. A series of 20 digitized breast biopsy virtual slides (half benign and half
malignant) were shown to 6 pathology residents at three points in time - at the beginning of their first year of residency,
at the beginning of the second year, and at the beginning of the third year. Their task was to examine each image and
select three areas that they would most want to zoom on in order to view the diagnostic detail at higher resolution. Eye
position was recorded as they scanned each image. The data indicate that with each successive year of experience, the
residents' search patterns do change. Overall it takes significantly less time to view an individual slide and decide where
to zoom, significantly fewer fixations are generated overall, and there is less examination of non-diagnostic areas.
Essentially, the residents' search becomes much more efficient and after only one year closely resembles that of an expert
pathologist. These findings are similar to those in radiology, and support the theory that an important aspect of the
development of expertise is improved pattern recognition (taking in more information during the initial Gestalt or gist
view) as well as improved allocation of attention and visual processing resources.
Characterizing virtual slide exploration through the use of 'search maps'
Currently very little is known about the process by which pathologists arrive at a diagnosis
on a case. This process is an integration of the pathologist's slide exploration strategy, perceptual
information gathering and cognitive decision making. We have developed a methodology to
statically represent the pathologists' dynamic visual search of digital slides by creating a
representation of visual sampling called 'search maps'. In these maps slide exploration is divided
into three parts, according to the magnification range used. In other words, areas explored at low
magnification (<=4x), medium magnification (>4x-10x) and high magnification (>10x-20x) are
represented separately. Moreover, representation using the 'search maps' allows for quantitative
analysis and pairwise comparison of slide exploration strategy. In this paper we have compared the
search maps of experienced pathologists and those of Pathology residents. Our goal was to
understand how search differs between the experts and the trainees.
Image Display and Presentation
Validation of a new digital breast tomosynthesis medical display
The main objective of this study is to evaluate and validate the new Barco medical display MDMG-5221 which has been
optimized for the Digital Breast Tomosynthesis (DBT) imaging modality system, and to prove the benefit of the new
DBT display in terms of image quality and clinical performance. The clinical performance is evaluated by the detection
of micro-calcifications inserted in reconstructed Digital Breast Tomosynthesis slices. The slices are shown in dynamic
cine loops, at two frame rates. The statistical analysis chosen for this study is the Receiver Operating Characteristic
Multiple-Reader, Multiple-Case methodology, in order to measure the clinical performance of the two displays. Four
experienced radiologists are involved in this study. For this clinical study, 50 normal and 50 abnormal independent
datasets were used. The result is that the new display outperforms the mammography display for a signal detection task
using real DBT images viewed at 25 and 50 slices per second. In the case of 50 slices per second, the p-value = 0.0664.
For a cut-off where alpha=0.05, the conclusion is that the null hypothesis cannot be rejected, however the trend is that
the new display performs 6% better than the old display in terms of AUC. At 25 slices per second, the difference
between the two displays is very apparent. The new display outperforms the mammography display by 10% in terms of
AUC, with a good statistical significance of p=0.0415.
Is image manipulation necessary to interpret digital mammographic images efficiently?
With the introduction of digital breast screening across the UK, screeners need to learn how best to inspect these images.
A key advantage over mammographic film is the facility to use workstation image manipulation tools. Forty two-view
FFDM screening cases, representing malignant, normal and benign appearances were examined by fourteen radiologists
and advanced practitioners from two UK screening centres. For half the cases, the mammography workstation image
manipulation tools could be employed and for the other half these were not used. Participants classified each case and
indicated whether an abnormality was present. Throughout the study the participants' visual search behaviour as well as
their image manipulations was recorded. Whether or not image manipulation tools were used made very little difference
to overall performance (t-test, p>.05) as confirmed by JAFROC analysis Figure-Of-Merit values of 0.816 and 0.838
(with and without tools respectively); performance not using tools was better. However, using tools significantly
increased inspection time (p<0.5) as well as participants' confidence. Detailed examination of participants' image
inspection behaviour showed that the average time on each case in the different viewing conditions differed significantly
between the highly experienced and less experienced readers. The visual data analysis revealed that the participants
made a similar overall pattern of errors on both modalities. The visual search behaviour on both modalities was surprisingly
similar.
Performance evaluation of medical LCD displays using 3D channelized Hotelling observers
High performance of the radiologists in the task of image lesion detection is crucial for successful medical practice.
One relevant factor in clinical image reading is the quality of the medical display. With the current trends of
stack-mode liquid crystal displays (LCDs), the slow temporal response of the display plays a significant role in
image quality assurance. In this paper, we report on the experimental study performed to evaluate the quality
of a novel LCD with advanced temporal response compensation, and compare it to an existing state-of-the-art
display of the same category but with no temporal response compensation. The data in the study comprise
clinical digital tomosynthesis images of the breast with added simulated mass lesions. The detectability for the
two displays is estimated using the recent multi-slice channelized Hotelling observer (msCHO) model which is
especially designed for multi-slice image data. Our results suggest that the novel LCD allows higher detectability
than the existing one. Moreover, the msCHO results are used to advise on the parameters for the follow up
image reading study with real medical doctors as observers. Finally, the main findings of the msCHO study were
confirmed by a human reader study (details to be published in a separate paper).
Visual cues do not improve skin lesion ABC(D) grading
In this work evidence is presented supporting the hypothesis that observers tend to evaluate very differently
the same properties of given skin-lesion images. Results from previous experiments have been compared to new
ones obtained where we gave additional prototypical visual cues to the users during their evaluation trials. Each
property (colour, colour uniformity, asymmetry, border regularity, roughness of texture) had to be evaluated
on a 0-10 range, with both linguistic descriptors and visual references at each end and in the middle (e.g.
light/medium/dark for colour). A set of 22 images covering different clinical diagnoses has been used in the
comparison with previous results. Statistical testing showed that only for a few test images the inclusion of the
visual anchors reduced the variability of the grading for some of the properties. Despite such reduction, though,
the average variance of each property still remains high even after the inclusion of the visual anchors. When
considering each property, the average variance significantly changed for the roughness of texture, where the
visual references caused an increase in the variability. With these results we can conclude that the variance of
the answers observed in the previous experiments was not due to the lack of a standard definition of the extrema
of the scale, but rather to a high variability in the way observers perceive and understand skin-lesion images.
The effect of defect cluster size and interpolation on radiographic image quality
For digital X-ray detectors, the need to control factory yield and cost invariably leads to the presence of some defective
pixels. Recently, a standard procedure was developed to identify such pixels for industrial applications. However, no
quality standards exist in medical or industrial imaging regarding the maximum allowable number and size of detector
defects. While the answer may be application specific, the minimum requirement for any defect specification is that the
diagnostic quality of the images be maintained. A more stringent criterion is to keep any changes in the images due to
defects below the visual threshold. Two highly sensitive image simulation and evaluation methods were employed to
specify the fraction of allowable defects as a function of defect cluster size in general radiography. First, the most critical
situation of the defect being located in the center of the disease feature was explored using image simulation tools and a
previously verified human observer model, incorporating a channelized Hotelling observer. Detectability index d' was
obtained as a function of defect cluster size for three different disease features on clinical lung and extremity
backgrounds. Second, four concentrations of defects of four different sizes were added to clinical images with subtle
disease features and then interpolated. Twenty observers evaluated the images against the original on a single display
using a 2-AFC method, which was highly sensitive to small changes in image detail. Based on a 50% just-noticeable
difference, the fraction of allowed defects was specified vs. cluster size.
Verification of the QUBYX perfectlum calibration software using a PR-670 spectro radiometer and associated verification facility
At the University of Arizona a research project is underway which addresses consistent color and consistent gray-scale reproduction for digital color displays used in medical image interpretation, specifically for Pathology. Now the University of Arizona can enter the field of ICC Profiling and Color Management. Verification of PerfectLum Software was successful. FIT and LUM tests were performed to verify the conformance and the deviation was quantified. The maximum GSDF Error is about 5.968%. With respect to the results, all three objectives were met and the PerfectLum calibrated display conformed to the AAPM TG18 standards.
Vision in Medical Imaging
A study of attentional effects of intensity transforms for mammograms
This paper presents a study of the attentional effects of two types of intensity distribution variations upon observer
behaviour when viewing mammograms: equalisation (to a uniform image intensity histogram) and normalisation (to
match an industry best practice image intensity histogram). For untrained observers, some consistent attraction of
attention towards the strongest intensity regions of the images for the more highly contrasting equalised images as
compared with the unprocessed images was detected. For the normalised images, this effect was even more marked. For
a trained observer, no substantial disruption of attentional patterns during viewing was detected for equalised images, but
was for normalised images. The nature and extent of the changes in the attentional behaviour for both untrained and
trained observers indicates potential value in further studies and emphasizes the need to conduct clinically related studies
with trained observers.
The impact of clinical indications on visual search behaviour in skeletal radiographs
The hazards associated with ionizing radiation have been documented in the literature and therefore justifying the need
for X-ray examinations has come to the forefront of the radiation safety debate in recent years1. International legislation
states that the referrer is responsible for the provision of sufficient clinical information to enable the justification of the
medical exposure. Clinical indications are a set of systematically developed statements to assist in accurate diagnosis and
appropriate patient management2. In this study, the impact of clinical indications upon fracture detection for
musculoskeletal radiographs is analyzed. A group of radiographers (n=6) interpreted musculoskeletal radiology cases
(n=33) with and without clinical indications. Radiographic images were selected to represent common trauma
presentations of extremities and pelvis. Detection of the fracture was measured using ROC methodology. An eye-tracking
device was employed to record radiographers' search behavior by analysing distinct fixation points and search
patterns, resulting in a greater level of insight and understanding into the influence of clinical indications on observers'
interpretation of radiographs. The influence of clinical information on fracture detection and search patterns was
assessed. Findings of this study demonstrate that the inclusion of clinical indications results in impressionable search
behavior. Differences in eye tracking parameters were also noted. This study also attempts to uncover fundamental
observer search strategies and behavior with and without clinical indications, thus providing a greater understanding and
insight into the image interpretation process. Results of this study suggest that availability of adequate clinical data
should be emphasized for interpreting trauma radiographs.
Measurement of breast lesion display luminance and overall image display luminance relative to optimum luminance for contrast perception
Introduction: To minimize fatigue due to eye adaptation and maximize contrast perception, it has been suggested that
lesion luminance be matched to overall image luminance to perceive the greatest number of grey level differences. This
work examines whether lesion display luminance matches the overall image and breast tissue display luminance and
whether these factors are positioned within the optimum luminance for maximal contrast sensitivity.
Methods: A set of 42 mammograms, collected from 21 patients and containing 15 malignant and 6 benign lesions, was
used to assess overall image luminance. Each image displayed on the monitor was divided into 16 equal regions. The
luminance at the midpoint of each region was measured using a calibrated photometer and the overall image luminance
was calculated. Average breast tissue display luminance was calculated from the subset of regions containing only
breast tissue. Lesion display luminance was compared with both overall image display luminance and average breast
tissue display luminance.
Results: Statistically significant differences (p<0.0001) were noted between overall image display luminance (4.3±0.7
cd/m2) and lesion display luminance (15.0±6.8 cd/m2); and between average breast tissue display luminance (6.8±1.3
cd/m2) and lesion display luminance (p<0.002).
Conclusions: Lesion luminance was significantly higher than the overall image and breast tissue luminance. Luminance
of lesions and general breast tissue fell below the optimum luminance range for contrast perception. Breast lesion
detection sensitivity and specificity may be enhanced by use of brighter monitor displays.
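The sampling scheme described in the Methods of this abstract (16 equal regions, one photometer reading at each midpoint, then an overall average) can be sketched as follows. The abstract does not state how the 16 regions were arranged, so the 4x4 grid, the pixel dimensions, the function names, and the readings below are all assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def region_midpoints(width: float, height: float, rows: int = 4, cols: int = 4) -> List[Tuple[float, float]]:
    """Return the (x, y) midpoint of every cell in a rows x cols grid.

    Assumption: the 16 equal regions form a 4x4 grid over the displayed image.
    """
    cell_w, cell_h = width / cols, height / rows
    return [((c + 0.5) * cell_w, (r + 0.5) * cell_h)
            for r in range(rows) for c in range(cols)]


def mean_luminance(readings: Sequence[float]) -> float:
    """Average a set of photometer readings (cd/m2)."""
    return sum(readings) / len(readings)


# Hypothetical display size in pixels and made-up readings, one per region.
points = region_midpoints(2048, 2560)
hypothetical_readings = [4.0 + 0.1 * i for i in range(16)]
overall = mean_luminance(hypothetical_readings)
print(points[0], round(overall, 2))
```

Averaging only the readings from regions that contain breast tissue would give the tissue-only value in the same way.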
Motion perception in medical imaging
A potential drawback of image noise suppression in medical image sequence processing is a possible loss of the apparent
motion: objects appear to move more slowly or less than they do in reality. For medical imaging applications this
can be of critical importance, for example myocardium motion in cardiac gated single photon emission computed
tomography (SPECT) imaging can differentiate viable muscle from scar tissue. Therefore, in this work we design a set
of experiments to measure how human observers perceive apparent motion in the presence of image degradations like
noise and blur. In addition we will try to identify relevant image features, based on a visual attention model and a block
matching motion estimation method that would allow development of an accurate numerical observer capable of
predicting human observer motion perception.
Characterizing non-Gaussian properties of breast images with a noisy-Laplacian distribution
It is generally well known that the appearance of breast tissue in a mammogram is considerably more complex
in a statistical sense than a simple random Gaussian texture, even when the correlation structure of the Gaussian has
been set to match the power-law power spectrum of mammograms. However there has not been a systematic way to
characterize the extent of departure from a Gaussian process. We address this topic here by proposing a noisy-Laplacian
distribution to model response histograms derived from digital (or digitized) mammograms.
We describe the distribution in terms of the probability density function and cumulative distribution function, as
well as moments up to fourth order. We also demonstrate the usefulness of the new distribution by fitting it to responses
from digital mammography.
Technology Assessment and Impact
Improved implementation of the abnormality manipulation software tools
Collecting clinical cases for medical imaging perception studies is often challenging. We have developed a suite of
software tools for manipulating medical tomographic image sets that overcome these difficulties. In our initial
development, abnormalities were removed or inserted on a slice-by-slice basis. To circumvent the problem with potential
artifacts in orthogonal views, we have redesigned the tools so that they operate in 3 dimensions. An operator-controlled
ellipsoid mask region is used to select the removal and the replacement areas. This new approach has been validated on
PET data sets and has also been implemented for CT studies.
A clinical image preference study comparing digital tomosynthesis with digital radiography for pediatric spinal imaging
The purpose of this study was to evaluate the diagnostic quality of digital tomosynthesis (DT) images for pediatric
imaging of the spine. We performed a phantom image rating study to assess the visibility of anatomical spinal structures
in DT images relative to digital radiography (DR) and computed tomography (CT). We collected DT and DR images of
the cervical, thoracic and lumbar spine using anthropomorphic phantoms. Four pediatric radiologists and two residents
rated the visibility of structures on the DT image sets compared to DR using a four point scale (0 = not visible; 1 =
visible; 2 = superior to DR; 3 = excellent, CT unnecessary). In general, the structures in the spine received ratings
between 1 and 3 (cervical), or 2 and 3 (thoracic, lumbar), with a few mixed scores for structures that are usually difficult
to see on diagnostic images, such as vertebrae near the cervical-thoracic joint and the apophyseal joints of the lumbar
spine. The DT image sets allow most critical structures to be visualized as well as or better than DR. When DR imaging is
inconclusive, DT is a valuable tool to consider before sending a pediatric patient for a higher-dose CT exam.
Computer-aided detection as a decision assistant in chest radiography
Background. Contrary to what may be expected, finding abnormalities in complex images like pulmonary
nodules in chest radiographs is not dominated by time-consuming search strategies but by an almost immediate
global interpretation. This was already known in the nineteen-seventies from experiments with briefly flashed
chest radiographs. Later on, experiments with eye-trackers showed that abnormalities attracted the attention
quite fast but often without further reader actions. Prolonging one's search seldom leads to newly found abnormalities
and may even increase the chance of errors. The problem of reading chest radiographs is therefore
not dominated by finding the abnormalities, but by interpreting them. Hypothesis. This suggests that readers
could benefit from computer-aided detection (CAD) systems not so much by their ability to prompt potential
abnormalities, but more from their ability to 'interpret' the potential abnormalities. In this paper, this hypothesis
was investigated by an observer experiment. Experiment. In one condition, the traditional CAD condition,
the most suspicious CAD locations were shown to the subjects, without telling them the levels of suspiciousness
according to CAD. In the other condition, interactive CAD condition, levels of suspiciousness were given,
but only when readers requested them at specified locations. These two conditions focus on decreasing search
errors and decision errors, respectively. Results of reading without CAD were also recorded. Six subjects, all
non-radiologists, read 223 chest radiographs in both conditions. CAD results were obtained from the OnGuard
5.0 system developed by Riverain Medical (Miamisburg, Ohio). Results. The observer data were analyzed by
localization receiver operating characteristic (LROC) analysis. It was found that: 1) With the aid of CAD, the
performance is significantly better than without CAD; 2) The performance with interactive CAD is significantly
better than with traditional CAD at low false positive rates.
Does stereo-endoscopy improve neurosurgical targeting in 3rd ventriculostomy?
Endoscopic third ventriculostomy is a minimally invasive surgical technique to treat hydrocephalus, a condition
where patients suffer from excessive amounts of cerebrospinal fluid (CSF) in the ventricular system of their
brain. This technique involves using a monocular endoscope to locate the third ventricle, where a hole can be
made to drain excessive fluid. Since a monocular endoscope provides only a 2D view, it is difficult to make this
perforation due to the lack of monocular cues and depth perception. In a previous study, we investigated
the use of a stereo-endoscope to allow neurosurgeons to locate and avoid hazardous areas on the surface of the
third ventricle. In this paper, we extend our previous study by developing a new methodology to evaluate the
targeting performance in piercing the hole in the membrane. We consider the accuracy of this surgical task and
derive an index of performance for a task which does not have a well-defined position or width of target. Our
performance metric is sensitive and can distinguish between experts and novices. We make use of this metric to
demonstrate an objective learning curve on this task for each subject.
An analysis of the impact of tumor amount on the predictive power of a prostate biopsy prognostic assay
Faisal M. Khan,
Stephen I. Fogarasi,
Douglas Powell,
et al.
The Prostate Px prognostic assay offered by Aureon Biosciences is designed to predict progression post primary
treatment for prostate cancer patients based on their diagnostic biopsy specimen. The assay is driven by the automated
image analysis of a diagnostic prostate needle biopsy (PNB) and incorporates pathologist acquired and digitally masked
images which reflect the morphometric (Hematoxylin and Eosin, H&E) and protein expression (immunofluorescence,
IF) properties of the PNB. Up to 9 images (3 H&E and 6 IF) from each of 1027 patients, with varying amounts of
tumor content were included in the study. We sought to determine the minimal tumor volume required to
maintain assay predictive robustness as a result of overall PNB tumor content and assess the impact of pathologist tumor
masking variability.
232 patients were selected who had a minimum of 80% tumor volume in a 20x magnification image. In each of the three
imaging domains (2 different multiplex (Mplex) IF images and one H&E), the tumor volume was artificially reduced in
increments from 80% to 2.5% of the original image area. This simulated decreasing amounts of tumor as well as
variations in digital tumor masking.
The univariate predictive power of individual imaging domains remained robust down to the 10% tumor level, whereas
the total assay was robust through the 20% to 10% tumor level. This work presents one of the first assessments of the
variety in tumor amounts on the predictive power of a commercially available prognostic assay that is reliant on multiple
bioimaging domains.
Poster Session
Assessment of a CAD scheme in selecting the optimal focused microscopic scanning images of the metaphase chromosomes
Visually searching for analyzable metaphase chromosome cells under microscopes is a routine and time-consuming
task in genetic laboratories to diagnose cancer and genetic disorders. To improve detection efficiency,
consistency, and accuracy, we developed an automated microscopic image scanning system using a 100X oil immersion
objective lens to acquire images that have sufficient spatial resolution to allow clinicians to make a diagnosis. Due to the high resolution,
the depth of field is very limited and multiple scans of up to seven layers are required. Thus, a metaphase
cell can spread over multiple images at different focal levels. Among them only one or two are adequate for the
diagnosis and the others are typically fuzzy images. In this study, we developed and tested a computer-aided detection
(CAD) scheme to automatically select one image with the sharpest image quality and discard all of the other fuzzy
images based on the computed sharpness index. From three scanned bone marrow specimen slides, the on-line and off-line
metaphase finding modules automatically selected 100 chromosome cells with 534 images. These images were
selected to build a testing dataset. For each cell, the CAD scheme selects one image with the maximum sharpness index.
Three observers also independently visually selected one best image for diagnosis from each cell. The agreement rate
between CAD and visually selected images ranges from 89% to 96%, which is also very comparable to the agreement
rate between the two observers. This experiment demonstrated the feasibility of applying a CAD scheme to select the
sharpest high-resolution images of metaphase chromosome cells and potentially improve diagnostic efficiency and
accuracy in future clinical practice.
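The selection step described in this abstract, keeping the one image per cell with the maximum sharpness index, can be sketched as follows. The abstract does not define the sharpness index, so this sketch substitutes a common focus measure, the variance of a 4-neighbour Laplacian, as a stand-in; the function names and the tiny test images are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence

Image = Sequence[Sequence[float]]


def laplacian_variance(img: Image) -> float:
    """Stand-in sharpness index: variance of the 4-neighbour Laplacian.

    Assumption: this is NOT the index used in the paper, only a common
    focus measure. The image must be at least 3x3 pixels.
    """
    h, w = len(img), len(img[0])
    responses: List[float] = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def select_sharpest(stack: Sequence[Image]) -> int:
    """Return the index of the image with the largest sharpness index."""
    return max(range(len(stack)), key=lambda i: laplacian_variance(stack[i]))


# Hypothetical focal stack for one cell: a smooth image and a high-contrast one.
blurry = [[1, 1, 1, 1], [1, 2, 2, 1], [1, 2, 2, 1], [1, 1, 1, 1]]
sharp = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]]
best = select_sharpest([blurry, sharp])
print(best)
```

Any other scalar focus measure could be swapped in for `laplacian_variance` without changing the selection logic.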
Quantitative evaluation of six graph based semi-automatic liver tumor segmentation techniques using multiple sets of reference segmentation
Graph based semi-automatic tumor segmentation techniques have demonstrated great potential in efficiently measuring
tumor size from CT images. Comprehensive and quantitative validation is essential to ensure the efficacy of graph based
tumor segmentation techniques in clinical applications. In this paper, we present a quantitative validation study of six
graph based 3D semi-automatic tumor segmentation techniques using multiple sets of expert segmentation. The six
segmentation techniques are Random Walk (RW), Watershed based Random Walk (WRW), LazySnapping (LS),
GraphCut (GHC), GrabCut (GBC), and GrowCut (GWC) algorithms. The validation was conducted using clinical CT
data of 29 liver tumors and four sets of expert segmentation. The performance of the six algorithms was evaluated using
accuracy and reproducibility. The accuracy was quantified using the Normalized Probabilistic Rand Index (NPRI), which
takes into account the variation among multiple expert segmentations. The reproducibility was evaluated by the change of
the NPRI from 10 different sets of user initializations. Our results from the accuracy test demonstrated that RW (0.63)
showed the highest NPRI value, compared to WRW (0.61), GWC (0.60), GHC (0.58), LS (0.57), GBC (0.27). The
results from the reproducibility test indicated that GBC is more sensitive to user initialization than the other five
algorithms. Compared to previous tumor segmentation validation studies using one set of reference segmentation, our
evaluation methods use multiple sets of expert segmentation to address the inter- or intra-rater variability issue in ground
truth annotation, and provide quantitative assessment for comparing different segmentation algorithms.
Assessing risk of thyroid cancer using resonance-frequency based electrical impedance measurements
The incidence of thyroid cancer has risen faster than many malignancies and has nearly doubled in the USA
over the past 30 years. Palpable nodules and subclinical nodules detected by imaging are found in a large percentage of
the USA population. Most of these (>95%) are fortunately benign. This vast reservoir of nodules makes the detection
and diagnosis of thyroid cancer a diagnostic dilemma. Ultrasound guided Fine Needle Aspiration Biopsy (FNAB) is
excellent for triaging patients but up to 25% of FNABs are inconclusive. As a result, definitive diagnosis is often only
possible with a diagnostic lobectomy; many thousands of these are performed in the USA annually for ultimately benign
disease. It would be extremely beneficial if we could develop a non-invasive procedure that could assist the
diagnostician in reliably predicting the likelihood of malignancy of otherwise indeterminate thyroid nodules, thereby
reducing the number of these "exploratory/diagnostic" lobectomies performed under general anesthesia. Electrical
Impedance Spectroscopy (EIS) was considered as a possible approach to address this problem. However, the diagnostic
accuracy of EIS is too low for routine clinical use to date. In our group, we developed a substantially modified
technology termed Resonance-frequency Electrical Impedance Spectroscopy (REIS), which yields usable information
for classifying risk of having breast abnormalities. We preliminarily applied REIS to measure signals on participants
having thyroid nodules aiming to assess whether we can assist in improving diagnosis of indeterminate thyroid nodules.
In this study we present a new multi-probe based REIS device specifically designed for the assessment of indeterminate
thyroid nodules. Our preliminary assessment presented here demonstrates the feasibility of using this proposed REIS
device in a busy tertiary care center.
Evaluation of agreement in corneal thickness measurements obtained using optical coherence tomography and ultrasound technique and determination of its specificity in keratoconus screening
P. Gunvant,
R. Darner
Show abstract
The aims of the present study are 1) to evaluate inter- and intra-observer repeatability of optical coherence tomography
corneal thickness measurements, 2) to investigate the agreement in corneal thickness obtained using an ultrasound
pachymeter and non-contact high-resolution optical coherence tomography, and 3) to evaluate the false positive rate of
identifying keratoconic suspects on the basis of the standard machine protocol. Measurements were performed on 51 eyes of
51 individuals without any known corneal pathology. Bland-Altman plots were analyzed to determine agreement of
corneal thickness measurements obtained using optical coherence tomography and ultrasound pachymeter; linear
regression analysis was performed to evaluate their interchangeability. The agreement between the optical coherence
tomography and ultrasonic pachymeter measurements was best for the central corneal thickness with a mean bias of 13.4
microns, with optical coherence tomography values being lower than the ultrasound pachymeter values. The agreement of
measurements in the mid-peripheral cornea was poor, with bias in measurements ranging from 33 to 55 microns. The
optical coherence tomography measurements were repeatable with no differences in values between intra- and inter-observer
repeat measurements. Using the standard machine protocol for keratoconus screening, utilizing 1 out of 4 criteria
gave a specificity of 86% and using 2 of the 4 criteria gave a specificity of 98%.
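The agreement analysis described above can be illustrated with a minimal Python sketch of a Bland-Altman calculation (mean bias and 95% limits of agreement). The function name `bland_altman` and the thickness values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)                 # mean difference between the two methods
    spread = 1.96 * stdev(diffs)       # half-width of the 95% limits of agreement
    return bias, bias - spread, bias + spread

# Hypothetical central corneal thickness readings in microns (illustrative only)
ultrasound = [545.0, 560.0, 532.0, 571.0, 550.0]
oct_values = [533.0, 545.0, 520.0, 556.0, 538.0]

bias, lower, upper = bland_altman(ultrasound, oct_values)
print(f"bias = {bias:.1f} microns, limits of agreement = [{lower:.1f}, {upper:.1f}]")
```

A positive bias here means the ultrasound readings are higher than the OCT readings, which is the direction reported in the abstract.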
Fusion of classifiers for REIS-based detection of suspicious breast lesions
Show abstract
After developing a multi-probe resonance-frequency electrical impedance spectroscopy (REIS) system aimed at
detecting women with breast abnormalities that may indicate a developing breast cancer, we have been conducting a
prospective clinical study to explore the feasibility of applying this REIS system to classify younger women (< 50 years
old) into two groups of "higher-than-average risk" and "average risk" of having or developing breast cancer. The system
comprises one central probe placed in contact with the nipple, and six additional probes uniformly distributed along an
outside circle to be placed in contact with six points on the outer breast skin surface. In this preliminary study, we
selected an initial set of 174 examinations on participants that have completed REIS examinations and have clinical
status verification. Among these, 66 examinations were recommended for biopsy due to findings of a highly suspicious
breast lesion ("positives"), and 108 were determined as negative during imaging based procedures ("negatives"). A set
of REIS-based features, extracted using a mirror-matched approach, was computed and fed into five machine learning
classifiers. A genetic algorithm was used to select an optimal subset of features for each of the five classifiers. Three
fusion rules, namely sum rule, weighted sum rule and weighted median rule, were used to combine the results of the
classifiers. Performance evaluation was performed using a leave-one-case-out cross-validation method. The results
indicated that REIS may provide a new technology to identify younger women with higher than average risk of having
or developing breast cancer. Furthermore, it was shown that fusion rules, such as a weighted median fusion rule and a
weighted sum fusion rule, may improve performance as compared with the highest performing single classifier.
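The three fusion rules named above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the sum rule is normalized to an average so the fused score stays in the same range as the inputs, the weighted median returns the first sorted score at which the cumulative weight reaches half of the total, and the scores and weights are hypothetical.

```python
def sum_rule(scores):
    """Average of the classifier scores (sum rule, normalized by the count)."""
    return sum(scores) / len(scores)

def weighted_sum_rule(scores, weights):
    """Weighted average of the classifier scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def weighted_median_rule(scores, weights):
    """First sorted score at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    running = 0.0
    for score, weight in sorted(zip(scores, weights)):
        running += weight
        if running >= half:
            return score

# Hypothetical outputs of five classifiers for one examination, with assumed weights
scores = [0.2, 0.9, 0.6, 0.7, 0.4]
weights = [1, 3, 2, 2, 1]

fused_sum = sum_rule(scores)
fused_weighted = weighted_sum_rule(scores, weights)
fused_median = weighted_median_rule(scores, weights)
```

In practice the weights would come from some measure of each classifier's performance on training data; how they were chosen in the study is not stated in the abstract.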
A software tool to compare contrast-detail detection in uniform and in real mammographic backgrounds
Show abstract
A software tool is presented to merge CDMAM phantom images with real mammographic backgrounds. It allows SKE
tasks in uniform and in real backgrounds. Tasks of this kind can be used to compare human, human visual metric or
model observer performance in detail detection using uniform or mammographic backgrounds.
As is well known, local characteristics of the structures in real mammographic backgrounds reduce human
performance in contrast-detail detection tasks. Consequently, that performance cannot be inferred from data
acquired in white noise (flat) backgrounds such as those a CDMAM phantom produces.
It is of interest to compare the response of a mammography system to the same set of signals, either embedded in flat or
in real backgrounds. This comparison achieves two goals. The first one is to analyze the variation of the recognition
threshold of the system for both backgrounds. The second one is to analyze the performance of a human observer or a
model observer over the same set of signals, varying the nature of the backgrounds.
The software tool presented here merges CDMAM images with a region of interest selected from a real
mammogram. This region, as well as the image mixing method (basically adding or multiplying pixels), can be freely
selected by the user. In this work a set of measurements on 8 images has been analyzed. We can preview the variation of
the contrast-detail detection for a human observer and a human visual system metric (R*).
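The two mixing methods mentioned above (adding or multiplying pixels) can be sketched in Python. The function `merge` and the pixel values are hypothetical. The sketch assumes the phantom is expressed as a signal offset for additive mixing (0 outside the disc) and as a relative transmission for multiplicative mixing (1.0 outside the disc).

```python
def merge(background, phantom, mode="add"):
    """Merge two equally sized 2-D images (lists of rows) pixel by pixel."""
    if mode == "add":
        combine = lambda b, p: b + p
    elif mode == "multiply":
        combine = lambda b, p: b * p
    else:
        raise ValueError("mode must be 'add' or 'multiply'")
    return [[combine(b, p) for b, p in zip(b_row, p_row)]
            for b_row, p_row in zip(background, phantom)]

# Hypothetical 2x2 region of interest from a mammographic background
background = [[100.0, 110.0], [120.0, 130.0]]

# The same small signal expressed for each mixing method (assumed encodings)
offset_signal = [[0.0, -5.0], [0.0, 0.0]]        # additive: change in pixel value
transmission_signal = [[1.0, 0.95], [1.0, 1.0]]  # multiplicative: relative factor

added = merge(background, offset_signal, mode="add")
multiplied = merge(background, transmission_signal, mode="multiply")
```

With multiplicative mixing the absolute signal amplitude scales with the local background value, whereas additive mixing keeps it constant.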
Comparison of the detection rates in reduced image by difference of interpolation method
Show abstract
In soft-copy diagnosis, each pixel of the detector is displayed on the corresponding pixel of the liquid crystal display
(LCD). However, when an image is first displayed, the entire image may be reduced. We examined the influence
of the image reduction rate on the LCD on detection performance using an observer performance
experiment. Moreover, to find the best interpolation method, we investigated several interpolation methods. We
made a simulation image similar to a Burger phantom. This image consists of 288 signals, each of a different size
and contrast. The matrix size is the same as that of Phase Contrast Mammography (PCM). We blurred the simulation image
using the MTF of a geometric blur, and added it to a noise image obtained by uniform exposure with PCM.
Then the image was reduced using the nearest-neighbor, bilinear, and bicubic methods. The reduction rates
were calculated as the ratios of the number of pixels of the LCDs to that of PCM. We displayed the reduced images on
an LCD and examined the detection performance. The results of a prior physical evaluation showed that both sharpness
and granularity worsened in proportion to the reduction rate. The detection performance deteriorated as the
reduction rate became higher. In the comparison of the interpolation methods, the detection performance of the nearest-neighbor
method was worse than that of the other interpolation methods. The bilinear method was the most suitable for
reducing the image.
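As a rough illustration of how the choice of interpolation can affect small signals, the following Python sketch halves a tiny image with nearest-neighbor and bilinear reduction. It assumes a fixed reduction to half size and output samples centered on each 2x2 block, in which case bilinear interpolation reduces to the block average. The image is hypothetical, and the sketch is not offered as an explanation of the study's results.

```python
def reduce_nearest(img, factor=2):
    """Keep every factor-th pixel in each direction (nearest-neighbor reduction)."""
    return [row[::factor] for row in img[::factor]]

def reduce_bilinear_half(img):
    """Halve an image; sampling midway between pixels averages each 2x2 block."""
    height, width = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(width)]
            for i in range(height)]

# Hypothetical 4x4 image with a single-pixel signal on a zero background
image = [[0.0] * 4 for _ in range(4)]
image[1][1] = 100.0

nearest = reduce_nearest(image)
bilinear = reduce_bilinear_half(image)
```

In this example the single-pixel signal is skipped entirely by nearest-neighbor sampling, while bilinear reduction retains it at reduced amplitude.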
Image processing of head CT images using neuro best contrast (NBC) and lesion detection performance
Show abstract
Purpose: The purpose of this study was to objectively compare lesion detection performance of head CT
images reconstructed using filtered back projection (FBP) algorithms with those reconstructed using NBC.
Method: The observer study was conducted using the 2-AFC methodology. An AFC experiment consists of
128 observer choices and permits the computation of the intensity needed to achieve 92% correct (I92%).
High values of I92% correspond to a poor level of detection performance, and vice versa. Head CT images
were acquired at an x-ray tube voltage of 120 kVp with a CTDIvol value of 75 mGy in a helical scan. Nine
randomly selected normal images from three patients and at three anatomical head locations were
reconstructed using filtered back projection (FBP) and neuro-best-contrast (NBC) processing. Circular
lesions were generated by projecting spheres onto the image plane, followed by a blurring function, with
lesion sizes of 2.8 mm, 6.5 mm and 9.8 mm used in these experiments. Four readers were used, with 18
experiments performed by each observer (2 processing techniques × 3 lesion sizes × 3 repeats). The
experimental order of the 18 experiments was randomized to eliminate learning curve and/or observer
fatigue. The ratio R of the I92% value for NBC to the corresponding I92% value for FBP was calculated for
each observer and each lesion size. Values of R greater than unity indicate that NBC is inferior to FBP, and
vice versa.
Results: Analysis of data from each observer showed that a total of four data points had R less than unity,
and eight data points had R greater than unity. Eleven of the twelve individual observer R values were within one
standard deviation of unity. When data for the four observers were pooled, the resultant average R values
were 0.98 ± 0.38, 0.96 ± 0.33 and 1.15 ± 0.45, for the 2.8 mm, 6.5 mm and 9.8 mm lesions respectively.
The overall average R for all three lesion sizes was 1.03 ± 0.67.
Conclusion: Our AFC investigation has shown no evidence that use of Neuro Best Contrast to process head
CT images improves detection of circular, low contrast lesions less than 10 mm.
The effects of anatomical information and observer expertise on abnormality detection task
Show abstract
This paper presents a novel study investigating the influences of Magnetic Resonance (MR) image anatomical
information and observer expertise on an abnormality detection task.
MRI is exquisitely sensitive for detecting brain abnormalities, particularly in the evaluation of white matter diseases, e.g.
multiple sclerosis (MS). For this reason, MS lesions are simulated as the target stimuli for detection in the present study.
Two different image backgrounds are used in the following experiments: a) homogeneous region of white matter tissue,
and b) one slice of a healthy brain MR image. One expert radiologist (more than 10 years' experience), three radiologists
(less than 5 years' experience) and eight naïve observers (without any prior medical knowledge) have performed these
experiments, during which they have been asked different questions dependent upon level of experience; the three
radiologists and eight naïve observers were asked if they were aware of any hyper-signal, likely to represent an MS
lesion, while the most experienced consultant was asked if a clinically significant sign was present. With the percentages
of response "yes" displayed on the y-axis and the lesion intensity contrasts on the x-axis, a psychometric function is
generated from the observers' responses.
Results of psychometric functions and calculated thresholds indicate that radiologists have better hyper-signal detection
ability than naïve observers, which is intuitively shown by the lower simple visibility thresholds of radiologists.
However, when radiologists perform a task with clinical implications, e.g. to detect a clinically significant sign, their
detection thresholds are elevated. Moreover, the study indicates that for the radiologists, the simple visibility thresholds
remain the same with and without the anatomical information, which reduces the threshold for the clinically significant
sign detection task.
Findings provide further insight into human visual system processing for this specific task, and this study provides the
foundation for a series of studies investigating numerical observer modeling to be designed, with the ultimate aim of
investigating the medical image quality assessment approach by addressing the perspective of radiologist diagnostic
performance.
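A detection threshold can be read off a psychometric function in several ways. The Python sketch below assumes the threshold is the contrast at which the proportion of "yes" responses reaches 50%, and it uses linear interpolation between measured points rather than a fitted curve. The function name `threshold_at`, the 50% criterion and all response data are hypothetical.

```python
def threshold_at(contrasts, yes_fraction, level=0.5):
    """Contrast at which the 'yes' fraction first reaches `level` (linear interpolation)."""
    points = list(zip(contrasts, yes_fraction))
    for (c0, p0), (c1, p1) in zip(points, points[1:]):
        if p0 < level <= p1:
            return c0 + (level - p0) * (c1 - c0) / (p1 - p0)
    return None  # the level is never reached in the measured range

# Hypothetical lesion contrasts and proportions of "yes" responses
contrasts = [1.0, 2.0, 3.0, 4.0, 5.0]
naive_yes = [0.0, 0.1, 0.3, 0.7, 1.0]
radiologist_yes = [0.0, 0.2, 0.6, 0.9, 1.0]

naive_threshold = threshold_at(contrasts, naive_yes)
radiologist_threshold = threshold_at(contrasts, radiologist_yes)
```

A lower threshold corresponds to detection at lower contrast, which is the sense in which the abstract describes radiologists as having better hyper-signal detection ability.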
Application of artificial neural network in simulating subjective evaluation of tumor segmentation
Show abstract
Systematic validation of tumor segmentation techniques is very important in ensuring the accuracy and reproducibility
of tumor segmentation algorithms in clinical applications. In this paper, we present a new method for
evaluating 3D tumor segmentation using Artificial Neural Network (ANN) and combined objective metrics. In
our evaluation method, a three-layer feed-forward backpropagation ANN is first trained to simulate a radiologist's
subjective rating using a set of objective metrics. The trained neural network is then used to evaluate the
tumor segmentation on a five-point scale in a way similar to an expert's evaluation. The accuracy of segmentation
evaluation is quantified using average correct rank and frequency of the reference rating in the top ranks of
simulated score list. Experimental results from 93 lesions showed that our evaluation method performs better
than individual metrics. The optimal combination of metrics from normalized volume difference, volume overlap,
Root Mean Square symmetric surface distance and maximum symmetric surface distance showed the smallest
average correct rank (1.43) and highest frequency of the reference rating in the top two places of simulated rating
list (93.55%). Our results also demonstrate that the ANN based non-linear combination method showed better
evaluation accuracy than the linear combination method in all performance measures. Our evaluation technique
has the potential to facilitate large-scale segmentation validation studies by predicting radiologists' ratings, and to
assist development of new tumor segmentation algorithms. It can also be extended to validation of segmentation
algorithms for other applications.
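Two of the objective metrics listed above can be sketched in Python on voxel sets. The exact definitions used in the paper are not given in the abstract, so this sketch assumes volume overlap is the intersection over union and normalized volume difference is the absolute volume difference divided by the reference volume. The helper `box` and the example shapes are hypothetical.

```python
def volume_overlap(ref, seg):
    """Intersection over union of two voxel sets (assumed definition of overlap)."""
    union = ref | seg
    return len(ref & seg) / len(union) if union else 1.0

def normalized_volume_difference(ref, seg):
    """Absolute difference in voxel count, relative to the reference volume."""
    return abs(len(seg) - len(ref)) / len(ref)

def box(x0, x1, size=4):
    """Hypothetical helper: voxels with x in [x0, x1) and y, z in [0, size)."""
    return {(x, y, z) for x in range(x0, x1) for y in range(size) for z in range(size)}

reference = box(0, 4)      # reference tumor mask, 64 voxels
segmentation = box(1, 5)   # same shape shifted by one voxel along x

overlap = volume_overlap(reference, segmentation)
vol_diff = normalized_volume_difference(reference, segmentation)
```

The shifted segmentation has zero volume difference but imperfect overlap, a simple case in which a single metric gives an incomplete picture.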
Optimization of hepatic lesion detection with computed tomography (CT): Is randomization of lesion location necessary?
Show abstract
Purpose: The purpose of this study was to compare observer performance for the detection of
randomly-positioned lesions to that of location-known lesions to determine if randomization of
lesion placement is necessary for optimization of hepatic lesion detection with CT. A phantom
containing fixed lesions (diameter 2.4mm, 4.8mm and 9.5mm) was scanned at various exposure and
slice thickness settings. A second image set was created by electronically cutting lesions from the
phantom images and pasting them into background-only images. Nine observers, blinded to lesion
location in the second image set, reviewed all images under standardized viewing conditions.
Visualization of lesions was scored using a four-point scale. Observer scores for the two methods
were correlated for all lesions, and for each lesion size using Spearman's rank correlation coefficient
(r). There was very high correlation between the observer scores for all lesions (r=0.919, p<0.0001)
and for the 9.5mm lesion (r=0.963, p<0.0001). There was moderate correlation for the 4.8mm and
2.4mm lesions (r=0.509, p=0.084 and r=0.640, p=0.028, respectively). Discussion: When considering all lesions, or
the 9.5mm lesion independently, randomization did not alter observer scores, suggesting random
location of large lesions is unnecessary for dose optimization. For the smaller lesion sizes correlation
between the two methods is less robust. Conclusion: If lesion size is large or unimportant, dose
optimization can be performed using a phantom with fixed lesions. For small lesions, randomized
lesion location may be warranted, thus having implications for phantom design.
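Spearman's rank correlation, used above to compare the two scoring methods, can be sketched in plain Python as the Pearson correlation of the ranks, with tied values receiving their average rank. The scores below are hypothetical four-point visualization scores and are not data from the study.

```python
from math import sqrt
from statistics import mean

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical observer scores for the same lesions under the two methods
fixed_location = [1, 2, 2, 3, 4, 4]
random_location = [1, 1, 2, 3, 3, 4]
r = spearman(fixed_location, random_location)
```

The sketch computes only the coefficient; the p-values reported in the abstract would require an additional significance test.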
Impact of hybrid SPECT/CT imaging on the detection of single parathyroid adenoma
Show abstract
Objective: The aim of this investigation is to determine the impact of hybrid single photon emission computed
tomography/computed tomography (SPECT/CT) on the detection of parathyroid adenoma.
Materials and methods: 16 patients presented with suspected parathyroid adenoma localised within the neck. All patients
were injected with Tc-99m sestamibi and were scanned with a GE Infinia Hawkeye SPECT/CT. There were six negative
and ten positive confirmed cases. Five expert radiologists specializing in nuclear medicine were asked to report on the 16
planar and SPECT data sets and were then asked to report on the same randomly ordered data sets with the addition of
CT. Receiver operating characteristic (ROC) analysis was performed using the Dorfman-Berbaum-Metz multireader-multicase
methodology, and sensitivity and specificity values were generated. A significance level of p ≤ 0.05 was set for
all comparisons.
Results: ROC analysis demonstrated an AUC of 0.64 and 0.69 for SPECT and SPECT/CT respectively (p = 0.31). Mean
sensitivity scores increased from 0.64 to 0.80 (p = 0.17) and specificity scores decreased from 0.57 to 0.40 (p = 0.17)
with the addition of the CT data.
Conclusion: This preliminary investigation suggests that extra CT information may increase lesion detection as well as
false positive rates for SPECT-based investigations of a single parathyroid adenoma. However, the difference in
diagnostic efficacy between the two groups was not found to be statistically significant and therefore requires further
investigation. These findings have implications beyond the clinical situation described here.
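For reference, sensitivity and specificity for a single reader can be computed from case-level calls as in the Python sketch below. The case mix of ten positive and six negative cases matches the abstract, but the reader's calls are hypothetical, and the study reported means across five readers.

```python
def sensitivity_specificity(truth, calls):
    """truth and calls are equal-length lists of booleans (True = adenoma present / reported)."""
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    return tp / (tp + fn), tn / (tn + fp)

# Ten confirmed positive and six confirmed negative cases, as in the abstract
truth = [True] * 10 + [False] * 6
# Hypothetical calls by one reader: 8 true positives, 3 false positives
calls = [True] * 8 + [False] * 2 + [True] * 3 + [False] * 3

sens, spec = sensitivity_specificity(truth, calls)
```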
Role of expertise and contralateral symmetry in the diagnosis of pneumoconiosis: an experimental study
Show abstract
Pneumoconiosis, a lung disease caused by the inhalation of dust, is mainly diagnosed using chest radiographs. The
effects of using contralateral symmetric (CS) information present in chest radiographs in the diagnosis of
pneumoconiosis are studied using an eye tracking experimental study. The role of expertise and the influence of CS
information on the performance of readers with different expertise level are also of interest. Experimental subjects
ranging from novices & medical students to staff radiologists were presented with 17 double and 16 single lung images,
and were asked to give profusion ratings for each lung zone. Eye movements and the time for their diagnosis were also
recorded. A Kruskal-Wallis test (χ2(6) = 13.38, p = .038) showed that the observer error (average sum of absolute
differences) in double lung images differed significantly across the different expertise categories when considering all
the participants. A Wilcoxon signed-rank test indicated that the observer error was significantly higher for single-lung
images (Z = 3.13, p < .001) than for the double-lung images for all the participants. A Mann-Whitney test (U = 28, p =
.038) showed that the differential error between single and double lung images is significantly higher in doctors [staff &
residents] than in non-doctors [others]. Thus, expertise and CS information play a significant role in the diagnosis of
pneumoconiosis. CS information helps in diagnosing pneumoconiosis by reducing the general tendency of giving lower
profusion ratings. Training and experience appear to play important roles in learning to use the CS information present in
the chest radiographs.
Analysis of the number of distinct findings obtained by multiple readers in an MRMC study: When do findings obtained from the addition of new readers become redundant, or otherwise negligible?
Show abstract
The ultimate goal of this project is to investigate whether the effect of a computer-aided detection (CAD) system on
readers' performance (especially in the situation of an upgrade of the CAD system, or between two different CAD systems
with similar design) can be accurately predicted without having to perform a multi-reader multi-case (MRMC) observer
study and, if such prediction is possible, to establish the underlying methodology. Our current study is intended to
provide evidence that would substantiate efforts toward such investigation. The objectives of this study were 1) to
investigate the relationship between the number of radiologists reading a dataset of thoracic computed tomography (CT)
images to identify lung nodules and the number of distinct findings and 2) to determine the number of readers needed to
identify almost all clinically distinct findings in a dataset. We used data from a multi-reader multi-case (MRMC)
observer study that consisted of six radiologists interpreting 85 thoracic CT examinations. To further illustrate our
approach, we also utilized simulated data consisting of twelve readers interpreting 198 samples equally distributed
between three levels of detection difficulty. For each possible reader grouping, the number of distinct findings identified
by the readers in the group was calculated. Five types of regression models used to describe the relationship between the
average number of distinct findings per case and the number of readers needed were compared. The results showed that
the logistic model best fitted both the thoracic CT data and the simulated data. Our assumption is that adding more
readers after a certain reader set size would mostly add redundant findings and, therefore, the benefit would be
negligible. Using this model, the predicted number of readers was found to depend on the type of findings considered.
Our study showed that the number of clinically distinct findings that can be identified by radiologists on CT lung
examinations without the use of a CAD system may be limited and that identifying almost all of these findings may only
require a limited number of readers.
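The reader-grouping calculation described above can be sketched in Python for a single case: for every possible group of readers of a given size, count the distinct findings in the union of their marks, then average over groups. The function name, the reader labels and the finding identifiers are hypothetical, and the regression modeling step is not shown.

```python
from itertools import combinations
from statistics import mean

def mean_distinct_findings(findings_by_reader):
    """Map group size to the mean number of distinct findings over all reader groups of that size."""
    readers = list(findings_by_reader)
    result = {}
    for size in range(1, len(readers) + 1):
        counts = [len(set().union(*(findings_by_reader[r] for r in group)))
                  for group in combinations(readers, size)]
        result[size] = mean(counts)
    return result

# Hypothetical findings (by identifier) marked by three readers on one case
findings = {
    "reader_a": {1, 2, 3},
    "reader_b": {2, 3, 4},
    "reader_c": {1, 3, 5},
}

curve = mean_distinct_findings(findings)
for size, value in curve.items():
    print(size, round(value, 2))
```

The flattening of this curve as readers are added is what the abstract's regression models are meant to describe.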
Reproducibility of an imaging based prostate cancer prognostic assay
Faisal M. Khan,
Douglas Powell,
Valentina Bayer-Zubek,
et al.
Show abstract
The Prostate Px prognostic assay offered by Aureon Biosciences is designed to predict progression post primary
treatment for prostate cancer patients based on their diagnostic biopsy specimen. The assay is driven by the automated
image analysis of biological specimens. Three different histological sections are analyzed for morphometric as well as
immunofluorescence protein expression properties within areas of tumor digitally masked by expert pathologists.
The assay was developed on a multi-institution cohort of up to 9 images from each of 1027 patients. The variation in
histological sections, staining, pathologist tumor masking and the region of image acquisition all have the potential to
significantly impact imaging features and consequently the reproducibility of the assay's results for the same patient.
This study analyzed the reproducibility of the assay in 50 patients who were re-processed within 3 months in a blinded
fashion as de-novo patients.
The key assay results reported were in agreement in 94% of the cases. The two independent endpoints of risk
classification reproduced results in 90% and 92% of the predictions. This work presents one of the first assessments of
the reproducibility of a commercial assay's results given the inherent variations in images and quantitative imaging
characteristics in a commercial setting.
Assessment of updated CAD without a new reader study: effect of calibration of computer output on the computer-aided reader performance in CADx
Show abstract
It is very resource-demanding to assess each new version of a CAD system through a new reader study. We conjecture that the aided reader performance on a new version can be predicted by using certain characteristics of the computer output and the reader study conducted when the CAD system was initially introduced. This would likely reduce the need for additional reader studies. However, investigations are needed to develop a sound scientific foundation to test this conjecture. In this work, we consider a CADx system that outputs a disease score to aid the physician in making a diagnostic decision on a located lesion. Our major contribution is to show that calibration, reflected as a change in scale, is a characteristic of the computer output that needs to be considered in order to predict the aided reader performance in a new CADx version without a reader study. We used a bivariate bi-beta distribution to model the joint distribution of the decision variable underlying the reader without aid and the decision variable underlying the version 1 computer output in the initial version. We then applied a monotonic transformation to the computer output to simulate the computer output in a new version, i.e., the scores in the two versions differ only in calibration (specifically a change in scale). By further modeling certain mechanisms that the human reader may use for combining the computer output and the reader-alone scores, we computed the aided reader performance in terms of AUC for the new version of the CADx system. Our results show that the aided reader performance could depend on the degree of calibration difference between the two CAD system outputs. We conclude that for the purpose of predicting the aided reader performance of a new version of the CADx system, ROC performance (or any other rank-based metric) of the stand-alone CADx system may not be sufficient by itself.
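The central point above, that a change in calibration leaves the stand-alone AUC unchanged but can change the aided reader's AUC, can be illustrated with a small Python sketch. It assumes the reader simply averages the reader-alone score with the computer score, which is only one hypothetical combination mechanism. The cubic rescaling and all scores are invented for illustration.

```python
def auc(pos, neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly, ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aided(reader, computer):
    """Assumed combination mechanism: average of reader-alone and computer scores."""
    return [(r + c) / 2.0 for r, c in zip(reader, computer)]

# Hypothetical scores for three diseased and three non-diseased cases
reader_pos, reader_neg = [0.3, 0.5, 0.6], [0.65, 0.4, 0.2]
v1_pos, v1_neg = [0.7, 0.8, 0.9], [0.3, 0.2, 0.75]

# Version 2 differs from version 1 only by a monotonic rescaling (here, cubing)
v2_pos = [s ** 3 for s in v1_pos]
v2_neg = [s ** 3 for s in v1_neg]

standalone_v1 = auc(v1_pos, v1_neg)
standalone_v2 = auc(v2_pos, v2_neg)
aided_v1 = auc(aided(reader_pos, v1_pos), aided(reader_neg, v1_neg))
aided_v2 = auc(aided(reader_pos, v2_pos), aided(reader_neg, v2_neg))
```

In this example the two stand-alone AUC values are identical because cubing preserves the ranking of the computer scores, while the aided AUC differs between versions because the rescaled scores carry different weight relative to the reader's own scores.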
Streak artefact quantification for abdominal CT
Show abstract
Streaking artefacts in computed tomography (CT) can result from photon starvation caused by highly attenuating
regions. Patient positioning can influence the attenuation, e.g. whether the arms are raised or down in an abdominal
CT scan.
Positioning the arms alongside the body increases attenuation; therefore a higher dose can be expected. Additionally,
the artefacts can cause a decrease in image quality.
Measuring this quality decrease is the purpose of this article. We implemented different methods to quantify
streaking artefacts and correlated them with the judgement of two radiologists in a study of 80 patients. High
significance was found for a correlation coefficient of 0.57.
This correlation between measurements and clinical usability (represented by the radiologists' ratings) makes it possible
to predict the usability by means of image processing alone. The resulting number can be included in the patient image as a
correlate of the diagnostic usability, or a new volume can be made depending on the number.
High luminance monochrome vs. color displays: impact on performance and search
Show abstract
The purpose of this study was to determine whether diagnostic accuracy and visual search efficiency with a high luminance medical-grade color display are
equivalent to those with a high luminance medical-grade monochrome display. Six radiologists viewed DR chest images, half with
a solitary pulmonary nodule and half without. Observers reported whether or not a nodule was present and their
confidence in that decision. Total viewing time per image was recorded. On a subset of 15 cases eye-position was
recorded. Confidence data were analyzed using MRMC ROC techniques. There was no statistically significant difference
(F = 0.0136, p = 0.9078) between color (mean Az = 0.8981, se = 0.0065) and monochrome (mean Az = 0.8945, se =
0.0148) diagnostic performance. Total viewing time per image did not differ significantly (F = 0.392, p = 0.5315) as a
function of color (mean = 27.36 sec, sd = 12.95) vs monochrome (mean = 28.04, sd = 14.36) display. There were no
significant differences in decision dwell times (true and false, positive and negative) overall for color vs monochrome
displays (F = 0.133, p = 0.7154). The true positive (TP) and false positive (FP) decisions were associated with the
longest dwell times, the false negatives (FN) with slightly shorter dwell times, and the true negative decisions (TN) with
the shortest (F = 50.552, p < 0.0001) and these trends were consistent for both color and monochrome displays. Current
color medical-grade displays are suitable for primary diagnostic interpretation in clinical radiology.
Study of signal-to-noise ratios considered human visual characteristics
Show abstract
The effects of imaging parameters on detectability have not yet been clarified. Therefore, we investigated the
usefulness of signal-to-noise ratios (SNRs) that take into account human visual characteristics, such as the visual spatial
frequency response and the internal noise in the eye-brain system.
We examined the amplitude model (SNRa), matched filter model (SNRm), and internal noise model (SNRi) to study
the relationship between these SNRs and the visual image quality for signal detection. The test images were simulated by
the superimposition of low-contrast signals on a uniform noisy background. The SNRs were obtained for 15 imaging
cases with various signal sizes, signal contrasts, exposure levels, and number of acrylic plates used as breast phantoms.
The SNRs were calculated by measuring the spatial frequency characteristics of the signal, modulation transfer
function (MTF) of the system, display MTF, and overall Wiener spectrum (WS).
In the perceptual evaluation, we applied the 16-alternative forced choice (16-AFC) method. The signal detectability
was defined as the number of detected signals divided by the total number of signals. We studied the relationship
between SNR and signal detectability using Spearman's rank correlation coefficient.
The correlation coefficient of SNRi was 0.93, making it the highest among the three SNR types. That of SNRm was
0.91; it correlated at the same level as SNRi although it does not take human visual characteristics into account. That of SNRa
was 0.45. SNRi, which incorporated the visual characteristics, explained the visual image quality well.
Radiation dose reduction in digital radiography using wavelet-based image processing methods
Show abstract
In this paper, we investigate the effect of the use of wavelet transform for image processing on radiation dose reduction
in computed radiography (CR), by measuring various physical characteristics of the wavelet-transformed images.
Moreover, we propose a wavelet-based method for offering a possibility to reduce radiation dose while maintaining a
clinically acceptable image quality. The proposed method integrates the advantages of a previously proposed technique,
i.e., a sigmoid-type transfer curve for wavelet coefficient weighting adjustment, as well as a wavelet soft-thresholding
technique. The former can improve contrast and spatial resolution of CR images, while the latter can
reduce image noise. In the investigation of physical characteristics, modulation transfer function,
noise power spectrum, and contrast-to-noise ratio of CR images processed by the proposed method and other different
methods were measured and compared. Furthermore, visual evaluation was performed using Scheffe's pair comparison
method. Experimental results showed that the proposed method could improve overall image quality as compared to
other methods. Our visual evaluation showed that an approximately 40% reduction in exposure dose might be achieved
in hip joint radiography by using the proposed method.
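The soft-thresholding component mentioned above can be sketched in Python on a one-dimensional signal. The wavelet used in the paper is not stated in the abstract, so this sketch assumes a single-level orthonormal Haar transform, and it omits the sigmoid-type coefficient weighting. The signal values and threshold are hypothetical.

```python
from math import sqrt

def soft_threshold(coefficient, t):
    """Shrink a coefficient toward zero by t; values within [-t, t] become zero."""
    if coefficient > t:
        return coefficient - t
    if coefficient < -t:
        return coefficient + t
    return 0.0

def haar_forward(signal):
    """Single-level orthonormal Haar transform of an even-length signal."""
    s = sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s, (a - d) / s]
    return out

def denoise(signal, t):
    """Soft-threshold the detail coefficients and reconstruct the signal."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [soft_threshold(d, t) for d in detail])

# Hypothetical noisy step: small fluctuations within each pair are suppressed
noisy = [10.0, 10.2, 20.0, 19.8]
smoothed = denoise(noisy, t=0.5)
```

Small detail coefficients, which are assumed here to be dominated by noise, are set to zero, while the approximation coefficients that carry the step are left untouched.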