
Biomedical Optics & Medical Imaging

A novel convolutional neural network for deep-learning classification

Preliminary results from a single-trial rapid serial visual presentation task demonstrate the potential for enabling generalized human-autonomy sensor fusion across multiple subjects.
13 August 2016, SPIE Newsroom. DOI: 10.1117/2.1201607.006632

Brain–computer interfaces (BCIs) have traditionally been used to enable communication and control for paralyzed patients.1 However, BCIs also hold promise for fulfilling the longstanding goal of creating artificial systems that can perform with the adaptability, robustness, and general intelligence of humans. BCI systems can thus be used on healthy individuals to augment the sensing and processing capabilities of such artificial systems. In this way, the biological machinery that enables human cognition can be leveraged. Image triage—a visual target search over a set of images—is a prime application for this new class of BCI. Humans can effortlessly identify target objects in scenes that stymie even the best machine vision techniques. Manual inspection by humans, however, is limited by the speed at which targets can be consciously detected and reported by a behavioral response. For example, when targets are identified by pressing a button and the image stream is presented at 5Hz, the button is typically pressed two to five images after the target image is shown. This forces the system to attribute the target to one of the several images that precede the button press.2 In addition, humans perform inconsistently because of exogenous distractions or endogenous factors (e.g., fatigue), whereas computer vision algorithms offer constant and predictable performance.
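The response-lag problem described above can be made concrete with a small sketch. The lag bounds and the helper function below are illustrative assumptions, not part of the cited work:

```python
# Sketch: attributing a button press in a 5Hz RSVP stream to the
# preceding images. With a two-to-five-image response lag, the true
# target could be any of several images shown before the press.

RATE_HZ = 5              # images per second (as in the task described)
MIN_LAG, MAX_LAG = 2, 5  # assumed press lag, in images, after the target

def candidate_targets(press_index):
    """Indices of images that could have triggered a press at press_index."""
    first = max(0, press_index - MAX_LAG)
    last = press_index - MIN_LAG
    return list(range(first, last + 1)) if last >= first else []

# A press while image 10 is on screen implicates images 5 through 8.
print(candidate_targets(10))   # [5, 6, 7, 8]
```

At 5Hz, that ambiguity spans roughly 0.4 to 1 second of the image stream, which is why a behavioral response alone cannot pinpoint the target image.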

As an alternative, machine learning approaches can be applied to raw human neurophysiological data and thus reveal signals that are relevant to the detection of target images. Ultimately, this can increase both the accuracy and the response rate of image triage classification tasks. Indeed, in recent work, it has been shown that the classification performance in a rapid serial visual presentation (RSVP) image triage task (see Figure 1) can be improved by combining human neurophysiological data with machine vision classifiers.2 To date, such methods have relied on the late fusion of human and machine-generated classifier outputs. In other words, the classifiers for image and human data are trained separately and their outputs are later fused. It may be possible to improve the classification performance even further if the complementary information carried in the human signals and the image data can be trained in tandem. To realize this aim, however, relevant neurophysiological data (which carries a discriminatory signal) and the ability to process and convert these signals to useful task determinants are required. In addition, it is necessary to have a common framework, within which it is possible to train a classifier that directly learns combined models of human neurophysiological data and image data.
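The late-fusion scheme described above can be sketched as follows. The two classifiers are trained independently and only their output scores are combined; the weighted average shown here is an illustrative assumption, not the fusion rule used in the cited work:

```python
import numpy as np

def late_fusion(p_eeg, p_vision, w_eeg=0.5):
    """Fuse per-image target probabilities from two independently
    trained classifiers by a weighted average of their outputs."""
    p_eeg = np.asarray(p_eeg, dtype=float)
    p_vision = np.asarray(p_vision, dtype=float)
    return w_eeg * p_eeg + (1.0 - w_eeg) * p_vision

p_eeg = [0.9, 0.2, 0.6]      # EEG classifier's target probabilities
p_vision = [0.7, 0.1, 0.8]   # machine-vision classifier's probabilities
print(late_fusion(p_eeg, p_vision))   # [0.8  0.15 0.7 ]
```

Because fusion happens only at the output stage, neither classifier can exploit structure in the other's raw inputs, which is the limitation that motivates training a joint model instead.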


Figure 1. Illustrating the display of images during a rapid serial visual presentation (RSVP) task. In this case, the images are presented at 5Hz and the subjects are required to indicate (via a behavioral response) when an infrequently occurring target image is shown.

The aim of our work3 is to improve human-autonomy classification performance by developing a single framework that builds codependent models of human neurophysiological information and image data to generate fused target estimates. Convolutional neural networks (CNNs) are a type of supervised deep-learning architecture that has set record benchmarks in many domains, including speech recognition, drug discovery, genomics, and visual object recognition.4 CNNs enable automatic feature selection and extraction from raw data. This is achieved by hierarchically stacking linear and nonlinear filtering modules to form a network, where each layer transforms an input into a representation at a higher and more abstract level. The resulting non-convex optimization is performed through the iterative application of a back-propagation algorithm until the maximum performance is achieved or the network converges. Given the success of using CNNs in machine vision for object recognition, as well as recent work in which CNNs are used for multimodal fusion,5–7 we believe that CNNs are thus promising for early fusion models of electroencephalography (EEG) data and computer vision.
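The "stacked linear and nonlinear filtering" idea can be illustrated with a minimal forward pass. Real CNN layers have many learned filters per layer; the single fixed difference filter below is an assumption chosen only to show how each layer transforms its input into a more abstract representation (this is not the architecture from the paper):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1D convolution: the linear filtering step of a layer."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(n)])

def relu(x):
    """Rectified linear unit: the nonlinear step of a layer."""
    return np.maximum(x, 0.0)

signal = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])
edge = np.array([1.0, -1.0])         # illustrative difference filter

h1 = relu(conv1d(signal, edge))      # layer 1: responds to local decreases
h2 = relu(conv1d(h1, edge))          # layer 2: filters layer 1's output
print(h1)
print(h2)
```

Each layer's output feeds the next, so deeper layers respond to patterns of patterns; back-propagation adjusts the filter weights (fixed here) to minimize the classification loss.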

As a first step in our study, we investigated the use of CNNs for multiclass single-trial classification of EEG recordings across multiple subjects during an RSVP task. Our results suggest that the EEG RSVP CNN classifier was able to meet—and exceed—the performance of other classifiers. This was the case for a single generalized model across 18 subjects (where our method learned one model and each of the other classifiers learned 18 models). Our classifier achieved this while automatically selecting its features from the raw data, without explicitly relying on the detection of any known features. A sample of the filters learned by our network in these tests is shown in Figure 2.


Figure 2. Three sample layers of spatial filters learned by the electroencephalography (EEG) RSVP convolutional neural network. The filters are shown mapped to the position of EEG electrodes that were placed on the scalp of subjects. Dark red and dark blue indicate larger magnitudes of activation (no units). Black dots denote the true spatial location of each electrode on the scalp.

Our CNN design includes four convolutional layers, two fully connected layers, and a readout layer. We provide a more detailed rationale for this network architecture, and its hyperparameters, elsewhere.3 Our preliminary results (see Figure 3) show that—compared with the other classifiers—our CNN achieves the highest performance level. The area-under-the-curve (AUC) value we obtain (0.72) is higher than for the second-best classifier (0.71), despite the fact that we use a generalized model and individual models are used for the other classifier. In addition, when we train our classifier to convergence, we see overfitting and a corresponding increase in loss. To prevent this, we only train the classifier until the loss on the validation set starts to increase.
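The early-stopping rule described above can be sketched as follows. The loss values and the one-step patience rule are illustrative assumptions, not the exact criterion used in our training:

```python
# Sketch: halt training when the validation loss starts to increase,
# rather than training to convergence (which overfits the training set).

def stop_iteration(val_losses, patience=1):
    """Return the iteration index at which training should stop, i.e.
    the point where the validation loss has risen for `patience`
    consecutive checks, or None if it never does."""
    worse = 0
    for i in range(1, len(val_losses)):
        worse = worse + 1 if val_losses[i] > val_losses[i - 1] else 0
        if worse >= patience:
            return i
    return None

val_losses = [0.9, 0.7, 0.6, 0.55, 0.58, 0.63]  # hypothetical loss curve
print(stop_iteration(val_losses))   # 4 (loss rose between checks 3 and 4)
```

In practice the model weights from the best validation checkpoint, rather than the stopping iteration itself, would be kept for evaluation.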


Figure 3. Area under the curve (AUC) and loss of the EEG RSVP CNN classifier over training iterations, evaluated on the test set. The dashed red line indicates when the maximum AUC value was reached.

In summary, we have designed a CNN deep-learning classifier that learns a single generalized model across multiple subjects for single-trial RSVP EEG classification. We have demonstrated that our CNN is a viable alternative to existing neural classifiers, by showing that it meets and exceeds the classification performance of several leading classifiers. By designing a CNN classifier to automatically detect the features that maximally separate target and non-target samples from raw data, we hypothesized that our framework would learn a feature set that enabled high performance. We also predicted that the classifier would have an increased robustness across subjects. Our preliminary analysis, however, is not sufficient to make meaningful statistical comparisons between the performance of our network and the state-of-the-art methods. Nonetheless, we are able to achieve a similar performance with our CNN (i.e., without requiring individualized models for each subject or explicit reliance on known features in the raw data). We are currently extending this work to statistically validate the performance of our CNN. We will quantify the contributions of different spatial weights and analyze the significance of the features that the network has learned.


Jared Shamwell, Hyungtae Lee, Heesung Kwon, Vernon Lawhern, William Nothwang
US Army Research Laboratory (ARL)
Adelphi, MD

Jared Shamwell earned his BA in economics and philosophy from Columbia University in 2009, and is currently pursuing a PhD in neuroscience at the University of Maryland, College Park. He is also a researcher at ARL, with research interests in machine learning, computational neuroscience, and robotics.

Hyungtae Lee received his BS in electrical engineering and mechanical engineering from Sogang University, Republic of Korea, in 2006, his MS in electrical engineering from the Korea Advanced Institute of Science and Technology in 2008, and his PhD in electrical and computer engineering from the University of Maryland, College Park, in 2014. He is currently employed as an electrical engineering senior consultant by Booz Allen Hamilton Inc. (working at ARL). His research interests include object, action, event, and pose recognition, computer vision, and pattern recognition.

Heesung Kwon is a team leader of the imagery analytics team in the Image Processing Branch at ARL. His current research interests include image/video analytics, human-autonomy interactions, deep learning, and machine learning. He has published about 100 journal articles, book chapters, and conference papers on these topics.

Vernon Lawhern is currently working as a mathematical statistician in the Human Research and Engineering Directorate at ARL. He is interested in machine learning, statistical signal processing, and data mining of large neurophysiological data collections for the development of improved brain–computer interfaces.

William Nothwang is currently the team leader for the Electronics for Sense and Control Team within the Sensors and Electron Devices Directorate at ARL. His team conducts basic and applied scientific research in distributed state estimation for the dismounted soldier and microair vehicles (specifically applied to microautonomous air systems and human physiological state monitoring).

Amar R. Marathe
US ARL
Aberdeen, MD

Amar Marathe is currently a biomedical engineer in the Human Research and Engineering Directorate at ARL. He is interested in using modern machine learning approaches to characterize and quantify human variability.


References:
1. J. R. Wolpaw, D. J. McFarland, Control of a two-dimensional movement signal by a noninvasive brain-computer interface in humans, Proc. Nat'l Acad. Sci. USA 101, p. 17849-17854, 2004.
2. R. M. Robinson, H. Lee, M. J. McCourt, A. R. Marathe, H. Kwon, C. Ton, W. D. Nothwang, Human-autonomy sensor fusion for rapid object detection, IEEE/RSJ Int'l Conf. Intelligent Robots and Systems (IROS), p. 305-312, 2015. doi:10.1109/IROS.2015.7353390
3. J. Shamwell, H. Lee, H. Kwon, A. R. Marathe, V. Lawhern, W. Nothwang, Single-trial EEG RSVP classification using convolutional neural networks, Proc. SPIE 9836, p. 983622, 2016. doi:10.1117/12.2224172
4. Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521, p. 436-444, 2015.
5. O. Mangin, D. Filliat, L. ten Bosch, P.-Y. Oudeyer, MCA-NMF: multimodal concept acquisition with non-negative matrix factorization, PLOS ONE 10, p. e0140732, 2015. doi:10.1371/journal.pone.0140732
6. L. Ma, Z. Lu, L. Shang, H. Li, Multimodal convolutional neural networks for matching image and sentence, IEEE Int'l Conf. Comp. Vision, p. 2623-2631, 2015. doi:10.1109/ICCV.2015.301
7. H. P. Martínez, G. N. Yannakakis, Deep multimodal fusion: combining discrete events and continuous signals, Proc. 16th Int'l Conf. Multimodal Interaction, p. 34-41, 2014. doi:10.1145/2663204.2663236