
Proceedings Paper

Transforming a research-oriented dataset for evaluation of tactical information extraction technologies
Author(s): Heather Roy; Sue E. Kase; Joanne Knight

Paper Abstract

The most representative and accurate data for testing and evaluating information extraction technologies is real-world data. Real-world operational data can provide important insights into human and sensor characteristics, interactions, and behavior. However, several challenges limit the feasibility of experimentation with real-world operational data. Real-world data lacks precise knowledge of a “ground truth,” a critical factor for benchmarking the progress of automated information processing technologies under development. Additionally, the use of real-world data is often limited by classification restrictions due to the methods of collection, procedures for processing, and tactical sensitivities related to the sources, events, or objects of interest. These challenges, along with an increase in the development of automated information extraction technologies, are fueling an emerging demand for operationally realistic datasets for benchmarking. One approach to meeting this demand is to create synthetic datasets, which are operationally realistic yet unclassified in content. The unclassified nature of these synthetic datasets facilitates the sharing of data between military and academic researchers, thus increasing coordinated testing efforts. This paper describes the expansion and augmentation of two synthetic text datasets, one initially developed through academic research collaborations with the Army. Both datasets feature simulated tactical intelligence reports regarding fictitious terrorist activity occurring within a counterinsurgency (COIN) operation. The datasets were expanded and augmented to create two military-relevant datasets. The first resulting dataset was created by augmenting and merging the two to create a single larger dataset containing ground truth. The second resulting dataset was restructured to more realistically represent the format and content of intelligence reports. The dataset transformation effort, the final datasets, and their applicability for research are presented.

Paper Details

Date Published: 12 May 2016
PDF: 15 pages
Proc. SPIE 9851, Next-Generation Analyst IV, 98510O (12 May 2016); doi: 10.1117/12.2224032
Heather Roy, U.S. Army Research Lab. (United States)
Sue E. Kase, U.S. Army Research Lab. (United States)
Joanne Knight, U.S. Army Research Lab. (United States)

Published in SPIE Proceedings Vol. 9851:
Next-Generation Analyst IV
Barbara D. Broome; Timothy P. Hanratty; David L. Hall; James Llinas, Editor(s)
