MEDLINE, a biomedical literature database compiled by the US National Library of Medicine, contains 15 million records from approximately 5000 selected journals, and is searched over 3 million times a day worldwide. With more journal articles being published online in hypertext markup language (HTML), the automatic extraction of bibliographic data from HTML articles is important for creating MEDLINE records affordably. Generating MEDLINE data requires extracting heterogeneous information such as metadata (e.g., title, author, affiliation), recognizing named entities (e.g., grant number or accession number to various bioinformation repositories), and assigning medical subject heading (MeSH) indexing terms.
Most existing methods deal with each of the above tasks separately and extract the information through text processing alone. In the HTML journal article domain, other cues besides text, such as physical layout, can and should also be used [1]. More importantly, all MEDLINE data-generation tasks are related and therefore should be solved within one paradigm.
We propose systematic processing of journal articles for MEDLINE data extraction. This technique starts with layout analysis, that is, segmenting the articles into logical zones. Next, zones that have internal structure, such as references, are semantically parsed. Finally, all the information is fused to solve the most challenging tasks, such as MeSH indexing term assignment, which is a difficult multiclass, multilabel hierarchical text-categorization problem.
The Document Object Model (DOM) is a tree structure specified by the World Wide Web Consortium for accessing and manipulating HTML documents. Our logical layout analysis starts from the HTML DOM tree and fuses four kinds of information. ‘Text features,’ including word frequency and orthographic characteristics (e.g., capitalization), are important for distinguishing different logical components. ‘Geometric features’ are another important cue that is usually overlooked. As shown in Figure 1, the logical components are clearly grouped by their geometric relations. After the HTML file is rendered in a browser, the geometric features can easily be extracted through the browser application programming interface (API). ‘Common subtrees,’ corresponding to duplicated structures in the HTML page, are a feature of typical articles. The two zones with black bounding boxes in Figure 1 are examples. ‘Internal links’ (<A href> and <A name> pairs) are also found in typical HTML journal articles. Figure 1 illustrates the internal links from body text to references with red arrows. The DOM tree and internal links naturally form a graph model of the HTML article. By applying statistical machine learning methods to text, geometry, internal links, and common subtrees, HTML journal articles can be reliably decomposed into logical components [2,3].
Figure 1. The logical layout of a medical journal article published in HTML can be parsed via text, geometry, internal link, and common subtree analysis.
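Two of the four cues, internal links and common subtrees, can be illustrated with a small sketch. This is not the authors' implementation: it uses only Python's standard `html.parser`, the `Node`, `TreeBuilder`, `internal_links`, and `common_subtrees` names are hypothetical, the sample page is made up, and "common subtree" is simplified here to mean subtrees with identical tags and text. Geometric features are omitted because they require a rendering browser.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class Node:
    """A minimal DOM-like node (hypothetical helper, not a W3C DOM object)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def internal_links(root):
    """Pair each <a href="#x"> with the <a name="x"> anchor it points to."""
    targets = {n.attrs["name"]: n for n in walk(root)
               if n.tag == "a" and n.attrs.get("name")}
    pairs = []
    for n in walk(root):
        href = n.attrs.get("href") or ""
        if n.tag == "a" and href.startswith("#") and href[1:] in targets:
            pairs.append((n, targets[href[1:]]))
    return pairs


def signature(node):
    """Serialize a subtree's tags and text so identical subtrees compare equal."""
    inner = "".join(signature(child) for child in node.children)
    return "(" + node.tag + ":" + node.text + inner + ")"


def common_subtrees(root):
    """Group non-leaf subtrees that occur more than once in the page."""
    groups = defaultdict(list)
    for n in walk(root):
        if n.children:
            groups[signature(n)].append(n)
    return {sig: nodes for sig, nodes in groups.items() if len(nodes) > 1}


# Made-up page: a repeated navigation bar and one body-to-reference link.
SAMPLE = """
<html><body>
<div class="nav"><a href="/toc">Contents</a><a href="/pdf">PDF</a></div>
<p>Prior work <a href="#ref1">[1]</a> used text only.</p>
<p><a name="ref1"></a>1. A. Author, Example title, 2006.</p>
<div class="nav"><a href="/toc">Contents</a><a href="/pdf">PDF</a></div>
</body></html>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
links = internal_links(builder.root)
repeats = common_subtrees(builder.root)
```

In this toy page, `links` holds the single citation-to-reference pair and `repeats` holds the duplicated navigation bar. A real system would treat such findings as features for a learned classifier rather than as final zone labels.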
Because logical layout analysis takes care of labeling the metadata, including title, author, affiliation, and abstract, the named entity recognition step can concentrate solely on body text paragraphs. We implemented a coarse-to-fine hierarchical approach to extract databank accession numbers and grant numbers [4]. Text analysis with a support vector machine (SVM) classifier first eliminated many body text paragraphs. The few remaining candidate paragraphs were then analyzed more carefully to extract the named entities. Figure 2 illustrates the idea. We also used an SVM classifier to parse the references into authors, article titles, journal titles, volume, year, and pagination. The parsed references provide valuable information for identifying commentary article pairs and for assigning MeSH indexing terms [5].
Figure 2. Coarse-to-fine analysis can extract sparsely located grant numbers. The grant number zone is marked with a thick black bounding box. The numbers themselves are highlighted with solid red boxes, and the informative words, which are helpful in detecting relevant paragraphs, are highlighted with dotted blue boxes.
We evaluated the logical layout analysis on 129 articles from 22 medical journals that follow very different HTML implementation styles. The F measures (an information retrieval metric) for title, author, affiliation, abstract, and references were 99.6, 100.0, 88.6, 91.7, and 99.5%, respectively. Databank accession number and grant number recognition, evaluated on the MEDLINE 2006 database (citations of articles indexed in 2006), similarly achieved high precision and recall rates. Finally, our reference-parsing algorithm was applied to 1000 references randomly collected from MEDLINE-indexed articles and achieved a parsing accuracy of 99.52%.
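For readers unfamiliar with the metric, the F measure combines precision and recall into one score. The sketch below assumes the balanced form (beta = 1, the harmonic mean of precision and recall); the article does not state which weighting was used, and the counts in the example are hypothetical, not the study's data.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """Return the F measure computed from raw counts.

    beta = 1 gives the balanced F1 score (an assumption; other weightings exist).
    """
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 90 zones labeled correctly, 10 spurious, 30 missed.
score = f_measure(true_pos=90, false_pos=10, false_neg=30)
```

With these counts, precision is 0.90 and recall is 0.75, so the balanced F measure is about 0.818.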
We have described a systematic approach for creating MEDLINE bibliographic data, and have demonstrated that HTML medical journal articles can be successfully decomposed through text, geometry, internal link, and common subtree analysis. This semantic parsing of an article can significantly improve the performance of subsequent bibliographic data extraction. Indeed, so equipped, we are ready to tackle the most challenging task: MeSH indexing term assignment.
This research was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine, and Lister Hill National Center for Biomedical Communications.
Lister Hill National Center for Biomedical Communications
National Library of Medicine
Jie Zou received his PhD degree in computer engineering from Rensselaer Polytechnic Institute in 2004. Since 2005, he has been with the Lister Hill National Center for Biomedical Communications at the National Library of Medicine. His research interests are in the areas of information retrieval, pattern recognition, image processing, and computer vision.