Proceedings Paper

Web mining for topics defined by complex and precise predicates

Paper Abstract

The enormous growth of the World Wide Web has made efficient resource discovery for a given topic increasingly important. Several techniques have been proposed in recent years for this kind of topic-specific web mining. A key one is focused crawling, which crawls topic-specific portions of the web without having to explore all pages. Most existing research on focused crawling considers a simple topic definition, typically one or more keywords connected by an OR operator. However, such a simple definition can return many irrelevant pages in which a keyword appears in the wrong context. In this research we explore new strategies for crawling topic-specific portions of the web using complex and precise predicates. A complex predicate allows the user to specify a topic precisely using Boolean operators such as "AND", "OR" and "NOT". Our work concentrates first on defining a format for specifying this kind of complex topic definition, and second on devising a crawl strategy that covers the topic-specific portions of the web defined by the complex predicate efficiently and with minimal overhead. Our new crawl strategy will improve the performance of topic-specific web crawling by reducing the number of irrelevant pages crawled. To demonstrate the effectiveness of this approach, we have built a complete focused crawler called "Eureka" with complex predicate support, and a search engine that indexes and supports end-user searches on the crawled pages.
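
To make the idea of a complex predicate concrete, the following is a minimal sketch of how a Boolean topic definition could be represented and tested against page text. It is not the format defined in the paper, nor Eureka's implementation; the class names (Term, Not, And, Or), the helper functions (tokenize, matches), and the sample topic and pages are all assumptions introduced here for illustration. A real focused crawler would also need link prioritization, which this sketch does not attempt.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical predicate nodes; the paper's actual format is not shown in the abstract.
@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class Not:
    operand: "Predicate"


@dataclass(frozen=True)
class And:
    operands: Tuple["Predicate", ...]


@dataclass(frozen=True)
class Or:
    operands: Tuple["Predicate", ...]


Predicate = Union[Term, Not, And, Or]


def tokenize(text: str) -> frozenset:
    """Lowercase the text and return its set of alphanumeric words."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def matches(pred: Predicate, words: frozenset) -> bool:
    """Recursively evaluate a Boolean predicate against a set of page words."""
    if isinstance(pred, Term):
        return pred.word.lower() in words
    if isinstance(pred, Not):
        return not matches(pred.operand, words)
    if isinstance(pred, And):
        return all(matches(p, words) for p in pred.operands)
    if isinstance(pred, Or):
        return any(matches(p, words) for p in pred.operands)
    raise TypeError(f"unsupported predicate node: {pred!r}")


# Illustrative topic: jaguar AND (car OR automobile) AND NOT animal
topic = And((
    Term("jaguar"),
    Or((Term("car"), Term("automobile"))),
    Not(Term("animal")),
))

page_a = "The new Jaguar car was unveiled at the auto show."
page_b = "The jaguar is a large animal native to the Americas."

relevant_a = matches(topic, tokenize(page_a))
relevant_b = matches(topic, tokenize(page_b))
```

Under a plain OR-of-keywords topic such as "jaguar OR car", both sample pages would count as relevant. With the predicate above, only the first page matches, which is the kind of wrong-context page the abstract aims to exclude.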

Paper Details

Date Published: 12 April 2004
PDF: 8 pages
Proc. SPIE 5433, Data Mining and Knowledge Discovery: Theory, Tools, and Technology VI, (12 April 2004); doi: 10.1117/12.542359
Ching-Cheng Lee, California State Univ./Hayward (United States)
Sushma Sampathkumar, California State Univ./Hayward (United States)

Published in SPIE Proceedings Vol. 5433:
Data Mining and Knowledge Discovery: Theory, Tools, and Technology VI
Belur V. Dasarathy, Editor
