Proceedings Volume 4057

Data Mining and Knowledge Discovery: Theory, Tools, and Technology II

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 6 April 2000
Contents: 7 Sessions, 47 Papers, 0 Presentations
Conference: AeroSense 2000
Volume Number: 4057

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Cluster Analysis and Classification Tools
  • Approximate Reasoning, Neural Networks, and Wavelets
  • Rule-Discerning Approaches
  • Temporal Data Mining
  • Scientific and Remotely Sensed Data Mining
  • Biomedical Applications
  • Web and Other Miscellaneous Applications
Cluster Analysis and Classification Tools
Metric sensitivity of reciprocal relationship bonds in the knowledge discovery process
The concept of reciprocal relationship (RR) bonds, conceived and developed as a tool in the knowledge discovery process, was reported and illustrated at this conference last year. Various distance metrics for measuring distances in the multi-dimensional attribute space of the database were reviewed in this context, identifying their relationship to the type of the attributes: continuous valued, integer valued, symbolic valued, or a mix thereof. In this study, we explore the sensitivity of the bonding process to the metrics best suited to continuous-valued attributes, such as Euclidean, city-block, and chessboard, using once again the previously used data set for illustrative purposes. This sensitivity to the metric employed is investigated at various stages of the RR bond delineation process. Results obtained under the different metrics are compared and fused to identify the degree of consensus, which in turn can provide a measure of confidence in the results of the knowledge discovery process.
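For readers unfamiliar with the three continuous-attribute metrics named above, the following is a minimal sketch; the sample records are illustrative and not taken from the paper's data set.

```python
import numpy as np

def euclidean(x, y):
    """L2 distance: straight-line distance in attribute space."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def city_block(x, y):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def chessboard(x, y):
    """L-infinity (Chebyshev) distance: largest single-coordinate difference."""
    return float(np.max(np.abs(x - y)))

# Two records with continuous-valued attributes
a = np.array([1.0, 4.0, 2.5])
b = np.array([3.0, 1.0, 2.0])
for name, dist in [("Euclidean", euclidean), ("city-block", city_block), ("chessboard", chessboard)]:
    print(name, round(dist(a, b), 3))
```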
Discovering fuzzy clusters in databases using an evolutionary approach
Lewis L. H. Chung, Keith C. C. Chan, Henry Leung
In this paper, we present a fuzzy clustering technique for relational databases for the data mining task. Clustering for data mining applications can be performed more effectively if the technique is able to handle both the continuous- and discrete-valued data commonly found in real-life relational databases. However, many fuzzy clustering techniques such as fuzzy c-means are developed only for continuous-valued data because their distance measure is defined in Euclidean space. When records are also characterized by discrete-valued attributes, these techniques are unable to perform their task. Besides, how to deal with fuzzy input data, in addition to mixed continuous and discrete data, is not clearly discussed. Instead of using a distance measure for defining similarity between records, we propose a technique based on a genetic algorithm (GA). By representing a specific grouping of records in a chromosome and using an objective measure as a fitness measure to determine whether such a grouping is meaningful and interesting, our technique is able to handle continuous, discrete, and even fuzzy input data. Unlike many existing clustering techniques, which can only produce a grouping with no interpretation, our proposed algorithm is able to generate a set of rules describing the interestingness of the discovered clusters. This feature, in turn, eases the understandability of the discovered result.
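The chromosome-per-grouping idea can be illustrated with a toy genetic algorithm; the data, the fitness function, and the GA parameters below are invented for illustration and are not the paper's objective measure.

```python
import random

# Toy records with one continuous and one discrete attribute.
records = [(1.0, "red"), (1.2, "red"), (5.0, "blue"), (5.3, "blue"), (5.1, "green")]
K, POP, GENS = 2, 20, 60

def fitness(chrom):
    """Toy objective measure: reward clusters whose members agree on the
    discrete attribute and lie close together on the continuous one."""
    score = 0.0
    for k in range(K):
        members = [records[i] for i, g in enumerate(chrom) if g == k]
        if not members:
            continue
        values = [m[0] for m in members]
        labels = [m[1] for m in members]
        majority = max(labels.count(c) for c in set(labels))
        score += majority - (max(values) - min(values))
    return score

def mutate(chrom):
    """Reassign one randomly chosen record to a random cluster."""
    child = chrom[:]
    child[random.randrange(len(child))] = random.randrange(K)
    return child

random.seed(0)
population = [[random.randrange(K) for _ in records] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP // 2]
    population = survivors + [mutate(random.choice(survivors)) for _ in survivors]

best = max(population, key=fitness)
print("best grouping:", best, "fitness:", round(fitness(best), 2))
```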
Distance functions in dynamic integration of data mining techniques
Seppo Jumani Puuronen, Alexey Tsymbal, Vagan Terziyan
One of the most important directions in the improvement of data mining and knowledge discovery is the integration of multiple data mining techniques. An integration method needs to be able either to evaluate and select the most appropriate data mining technique or to combine two or more techniques efficiently. A recent method for the dynamic integration of multiple data mining techniques is based on the assumption that each of the data mining techniques is the best one inside a certain subarea of the whole domain area. This method uses an instance-based learning approach to collect information about the competence areas of the mining techniques and applies a distance function to determine how close a new instance is to each instance of the training set. The nearest instance or instances are used to predict the performance of the data mining techniques. Because the quality of the integration depends heavily on the suitability of the distance function used, our goal is to analyze the characteristics of different distance functions. In this paper we investigate several distance functions, such as the very commonly used Euclidean distance function, the Heterogeneous Euclidean-Overlap Metric (HEOM), and the Heterogeneous Value Difference Metric (HVDM), among others. We analyze the effects of using different distance functions on the accuracy achieved by dynamic integration when the parameters describing the datasets vary. We also include results of our experiments with different datasets which include both nominal and continuous attributes.
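For reference, HEOM can be sketched as follows: a minimal version, assuming a fixed set of nominal attribute indices and precomputed ranges for the continuous ones; the records are illustrative.

```python
import math

def heom(x, y, nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric for mixed nominal/continuous records.

    x, y     : attribute tuples (None marks a missing value)
    nominal  : set of attribute indices treated as nominal
    ranges   : per-attribute (max - min) for the continuous attributes
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            d = 1.0                          # missing values get maximal distance
        elif i in nominal:
            d = 0.0 if a == b else 1.0       # overlap metric for nominal attributes
        else:
            d = abs(a - b) / ranges[i]       # range-normalised difference otherwise
        total += d * d
    return math.sqrt(total)

# attribute 1 is nominal, attributes 0 and 2 are continuous
print(heom((1.5, "yes", 10.0), (3.0, "no", None),
           nominal={1}, ranges={0: 5.0, 2: 20.0}))
```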
Value-based customer grouping from large retail data sets
Alexander Strehl, Joydeep Ghosh
In this paper, we propose OPOSSUM, a novel similarity-based clustering algorithm using constrained, weighted graph- partitioning. Instead of binary presence or absence of products in a market-basket, we use an extended 'revenue per product' measure to better account for management objectives. Typically the number of clusters desired in a database marketing application is only in the teens or less. OPOSSUM proceeds top-down, which is more efficient and takes a small number of steps to attain the desired number of clusters as compared to bottom-up agglomerative clustering approaches. OPOSSUM delivers clusters that are balanced in terms of either customers (samples) or revenue (value). To facilitate data exploration and validation of results we introduce CLUSION, a visualization toolkit for high-dimensional clustering problems. To enable closed loop deployment of the algorithm, OPOSSUM has no user-specified parameters. Thresholding heuristics are avoided and the optimal number of clusters is automatically determined by a search for maximum performance. Results are presented on a real retail industry data-set of several thousand customers and products, to demonstrate the power of the proposed technique.
Classification analysis of customer satisfaction and repeat buyers survey data
Girish J. Kulkarni
Abstract not available.
Design of a hybrid model classifier for data mining applications
Sayee Sumathi, S. N. Sivanandam, R. Ravindran
Abstract not available.
Approximate Reasoning, Neural Networks, and Wavelets
Genetic-algorithm-based optimization of a fuzzy logic resource manager for electronic attack
James F. Smith III, Robert D. Rhyne II
A fuzzy logic based expert system has been developed that automatically allocates electronic attack (EA) resources in real-time over many dissimilar platforms. The platforms can be very general, e.g., ships, planes, robots, land based facilities, etc. Potential foes the platforms deal with can also be general. This paper describes data mining activities related to development of the resource manager with a focus on genetic algorithm based optimization. A genetic algorithm requires the construction of a fitness function, a function that must be maximized to give optimal or near optimal results. The fitness functions are in general non-differentiable at many points and highly non-linear, neither property providing difficulty for a genetic algorithm. The fitness functions are constructed using insights from geometry, physics, engineering, and military doctrine. Examples are given as to how fitness functions are constructed including how the fitness function is averaged over a database of military scenarios. The use of a database of scenarios prevents the algorithm from having too narrow a range of behaviors, i.e., it creates a more robust solution.
Discovery of approximate concepts in clinical databases based on a rough set model
Rule discovery methods have been introduced to find useful and unexpected patterns in databases. However, one of the most important problems with these methods is that the extracted rules capture only positive knowledge; they do not include the negative information that medical experts need in order to confirm whether a patient will suffer from symptoms caused by a drug side-effect. This paper first discusses the characteristics of medical reasoning and defines positive and negative rules based on a rough set model. Then, algorithms for the induction of positive and negative rules are introduced. Finally, the proposed method was evaluated on clinical databases; the experiments showed that several interesting patterns were discovered, such as a rule describing a relation between urticaria caused by antibiotics and food.
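A rough-set flavored sketch of how such rules can be graded: the accuracy and coverage of a candidate rule are computed from a toy symptom table, and a rule whose accuracy is (near) zero acts as negative, exclusion-style knowledge. The table, thresholds, and labels are illustrative, not the paper's definitions.

```python
# Toy clinical table: each row is (symptoms, diagnosis)
cases = [
    (frozenset({"fever", "rash"}), "measles"),
    (frozenset({"fever", "rash"}), "measles"),
    (frozenset({"fever"}),          "flu"),
    (frozenset({"rash"}),           "allergy"),
    (frozenset({"fever", "cough"}), "flu"),
]

def accuracy_and_coverage(condition, diagnosis):
    """Rough-set style measures for the rule 'condition -> diagnosis'."""
    matching = [c for c in cases if condition <= c[0]]        # rows satisfying the condition
    positive = [c for c in matching if c[1] == diagnosis]     # ... that also have the diagnosis
    in_class = [c for c in cases if c[1] == diagnosis]
    accuracy = len(positive) / len(matching) if matching else 0.0
    coverage = len(positive) / len(in_class) if in_class else 0.0
    return accuracy, coverage

for cond, dx in [(frozenset({"fever", "rash"}), "measles"),
                 (frozenset({"rash"}), "flu")]:
    acc, cov = accuracy_and_coverage(cond, dx)
    kind = "positive rule" if acc >= 0.9 else ("negative (exclusion) rule" if acc == 0.0 else "uncertain")
    print(sorted(cond), "->", dx, f"accuracy={acc:.2f} coverage={cov:.2f}", kind)
```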
Granularity refined by knowledge: contingency tables and rough sets as tools of discovery
Jan M. Zytkow
Contingency tables represent data in a granular way and are a well-established tool for inductive generalization of knowledge from data. We show that the basic concepts of rough sets, such as concept approximation, indiscernibility, and reduct, can be expressed in the language of contingency tables. We further demonstrate the relevance to rough sets theory of the additional probabilistic information available in contingency tables, in particular of statistical tests of significance and predictive strength applied to contingency tables. Tests of both types can help the evaluation mechanisms used in inductive generalization based on rough sets. The granularity of attributes can be improved in feedback with knowledge discovered in the data. We demonstrate how 49er's facilities for (1) contingency table refinement, (2) column and row grouping based on correspondence analysis, and (3) the search for equivalence relations between attributes improve both the granularization of attributes and the quality of knowledge. Finally we demonstrate the limitations of knowledge viewed as concept approximation, which is the focus of rough sets. Transcending that focus and reorienting towards predictive knowledge and towards the related distinction between possible and impossible (or statistically improbable) situations will be very useful in expanding the rough sets approach to more expressive forms of knowledge.
Numerical-linguistic knowledge discovery using granular neural networks
In this paper, a granular-neural-network-based Knowledge Discovery and Data Mining (KDDM) method based on granular computing, neural computing, fuzzy computing, linguistic computing and pattern recognition is presented. The major issues include (1) how to use neural networks to discover granular knowledge from numerical-linguistic databases, and (2) how to use discovered granular knowledge to predict missing data. A Granular Neural Network (GNN) is designed to deal with numerical-linguistic data fusion and granular knowledge discovery in numerical-linguistic databases. From a data granulation point of view, the GNN can process granular data in a database. From a data fusion point of view, the GNN makes decisions based on different kinds of granular data. From a KDDM point of view, the GNN is able to learn internal granular relations between numerical-linguistic inputs and outputs, and predict new relations in a database.
Fuzzification of attribute-oriented generalization and its application to medicine
Conventional studies on knowledge discovery in databases (KDD) show that the combination of rule induction methods and attribute-oriented generalization is very useful for extracting knowledge from data. However, attribute-oriented generalization, in which a concept hierarchy is used for the transformation of attributes, assumes that the given hierarchy is consistent. Thus, if this condition is violated, application of hierarchical knowledge generates inconsistent rules. In this paper, we first show that this phenomenon is easily found in data mining contexts: when we apply attribute-oriented generalization to attributes in databases, the generalized attributes will have fuzziness for classification. Then, we introduce two approaches to solve this problem, one of which suggests that the combination of rule induction and attribute-oriented generalization can be used to validate a concept hierarchy. Finally, we briefly discuss the mathematical generalization of this solution, in which context-free fuzzy sets are a key idea.
Information tables with neighborhood semantics
Information tables provide a convenient and useful tool for representing a set of objects using a group of attributes. This notion is enriched by introducing neighborhood systems on attribute values. The neighborhood systems represent the semantic relationships between, and knowledge about, attribute values. With the added semantics, neighborhood-based information tables may provide a more general framework for knowledge discovery, data mining, and information retrieval.
Knowledge acquisition: neural network learning
As the amount of information in the world is steadily increasing, there is a growing demand for tools for analyzing the information. Many scholars have been working hard to study machine learning in order to obtain knowledge from domain data sets. They hope to find patterns in terms of implicit dependencies in data. Artificial neural networks are efficient computing models which have shown their strengths in solving hard problems in artificial intelligence. They have also been shown to be universal approximators. Some scholars have done much work to interpret neural networks so that they will no longer be seen as black boxes and provided some plots and methods for knowledge acquisition using neural networks. These can be classified into three categories: fuzzy neural networks, CF (certainty factor) based neural networks, and logical neurons. We review some of these research works in this paper.
Rule-Discerning Approaches
Auditing health insurance reimbursement by constructing association rules
I-Jen Chiang
Two months of reimbursement claim data for admitted patients at National Taiwan University Hospital (about 200 MB) were used as the training set, and a quick method was used to find association rules among the illnesses, the examinations and treatments, the drugs, and the equipment. The rules filtered by setting a minimum support and a minimum confidence are then used to screen one month of claim data from another hospital, so that improper orders given to patients can be detected. In this paper, we discuss the algorithm for generating the association rules and the experiments in using them to screen out improper orders in health reimbursement claims.
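A minimal sketch of this kind of screening: one-to-one rules above a support/confidence threshold are mined from a handful of toy claims, and claims containing items that no frequent rule accounts for are flagged for audit. The item names, thresholds, and flagging heuristic are illustrative, not the paper's algorithm.

```python
from itertools import combinations

# Each claim is the set of items (diagnoses, drugs, procedures) billed together.
claims = [
    {"pneumonia", "chest_xray", "antibiotic_A"},
    {"pneumonia", "chest_xray", "antibiotic_A"},
    {"pneumonia", "antibiotic_A"},
    {"fracture", "xray_limb", "cast"},
    {"pneumonia", "chest_xray", "antibiotic_A", "mri_brain"},   # suspicious extra order
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.8

def support(itemset):
    return sum(itemset <= c for c in claims) / len(claims)

# Mine simple one-to-one rules {a} -> {b} above the thresholds.
items = set().union(*claims)
rules = []
for a, b in combinations(items, 2):
    for ante, cons in [({a}, {b}), ({b}, {a})]:
        s = support(ante | cons)
        conf = s / support(ante) if support(ante) else 0.0
        if s >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
            rules.append((ante, cons))

# Screening: items that no applicable frequent rule covers are worth a manual audit.
for i, claim in enumerate(claims):
    applicable = [r for r in rules if r[0] <= claim]
    explained = set().union(*(r[0] | r[1] for r in applicable)) if applicable else set()
    unexplained = claim - explained
    if unexplained:
        print(f"claim {i}: items not covered by any frequent rule -> {sorted(unexplained)}")
```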
Discovering spatial associations in images
Osmar R. Zaiane, Jiawei Han
In this paper, our focus in data mining is concerned with the discovery of spatial associations within images. Our work concentrates on the problem of finding associations between visual content in large image databases. Discovering association rules has been the focus of many studies in the last few years. However, for multimedia data such as images or video frames, the algorithms proposed in the literature are not sufficient since they miss relevant frequent item-sets due to the peculiarity of visual data, like repetition of features, resolution levels, etc. We present in this paper an approach for mining spatial relationships from large visual data repositories. The approach proceeds in three steps: feature localization, spatial relationship abstraction, and spatial association discovery. The mining process considers the issue of scalability and contemplates various feature localization abstractions at different resolution levels.
Data mining approach using machine-oriented modeling: finding association rules using canonical names
An attribute value, in a relational model, is a meaningful label of a collection of objects; the collection is referred to as a granule of the universe of discourse. The granule itself can be regarded as a label of the collection; it will be referred to as the canonical name of the granule. A relational model using these canonical names themselves as attribute values (their bit patterns or lists of members) is called a machine-oriented data model. For moderate-size databases, finding association rules, decision rules, etc., is reduced to easy computation of set-theoretical operations on these collections. In this paper, a very fast computing algorithm is presented.
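A small sketch of the idea, assuming granules are kept as plain Python sets of record ids rather than bit patterns; the toy relation and rule are illustrative.

```python
# Toy relation: record id -> (outlook, play)
table = {
    1: ("sunny", "no"), 2: ("sunny", "no"), 3: ("overcast", "yes"),
    4: ("rain", "yes"), 5: ("rain", "yes"), 6: ("rain", "no"),
}

# Canonical names: each attribute value is represented by the granule
# (set of record ids) in which it occurs.
granule = {}
for rid, (outlook, play) in table.items():
    granule.setdefault(("outlook", outlook), set()).add(rid)
    granule.setdefault(("play", play), set()).add(rid)

# The association rule "outlook=rain -> play=yes" reduces to set operations.
ante = granule[("outlook", "rain")]
cons = granule[("play", "yes")]
both = ante & cons
support = len(both) / len(table)
confidence = len(both) / len(ante)
print(f"support={support:.2f} confidence={confidence:.2f}")
```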
Using closed itemsets in association rule mining with taxonomies
Petteri Sevon
Taxonomies or item hierarchies are often useful in association rule mining. The most time consuming subtask in rule mining is the discovery of the frequent itemsets. A standard algorithm for frequent itemset discovery, such as Apriori, would however produce many redundant itemsets if taxonomies are used, because any two itemsets that share their most specific items always match the same set of transactions. An efficient algorithm must therefore choose a canonical representative for each equivalence class of itemsets. Srikant and Agrawal solved the problem of redundancy in their Cumulate and Stratify algorithms by not allowing an itemset to contain an item and its ancestor. In this paper I will present a new algorithm, Closed Sets, for finding frequent itemsets in the presence of taxonomies. The algorithm requires itemsets to be closed under the taxonomies. If an item is a member of a closed itemset, then all its ancestors also are. The algorithm makes fewer passes over the database than Stratify and it prunes the search space optimally. This is not the case with Cumulate, and not even with Stratify, if there are items in the taxonomy with multiple parents. Furthermore, only modest modifications to Apriori are needed.
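The closure requirement can be sketched directly: closing an itemset under the taxonomy adds every ancestor of its members, so itemsets that share their most specific items collapse to one canonical representative. The toy taxonomy below is illustrative.

```python
# Taxonomy as child -> parents (an item may have several parents).
parents = {
    "cola": ["soft drink"], "soft drink": ["beverage"],
    "lager": ["beer"], "beer": ["beverage"],
}

def close(itemset):
    """Close an itemset under the taxonomy: every ancestor of a member is added."""
    closed = set(itemset)
    frontier = list(itemset)
    while frontier:
        item = frontier.pop()
        for p in parents.get(item, []):
            if p not in closed:
                closed.add(p)
                frontier.append(p)
    return frozenset(closed)

# Both itemsets share the same most specific items, so they match the same
# transactions -- closure maps them to one canonical representative.
print(close({"cola"}))
print(close({"cola", "soft drink"}))
print(close({"cola"}) == close({"cola", "soft drink"}))   # True
```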
Distributed data mining: an attribute-oriented key-preserving method
Maybin Muyeba, John A. Keane
Data mining algorithms are constantly being challenged by the need to process large data volumes efficiently. Attribute-Oriented Induction (AOI) is an inductive, set-oriented technique used to mine large data sets by reducing their search space through attribute generalization and forming summary rules. Most data mining techniques stop at producing rules for user analysis. The Key-Preserving method of AOI (AOI-KP) allows users to query data related to the learning task efficiently by using keys (attributes that index relations) to relations in the database and the generated rules. However, the initial problem is loading the whole data set into memory on a single-memory machine. As the input data size increases, the preserved keys and the data itself use up memory. Further, to address the file I/O bottleneck for writing preserved keys, concurrency mechanisms were used on a single Windows NT machine and improvements in execution time were obtained. One of the major solutions is to employ parallelism, i.e., utilizing a distributed-memory machine with explicit message passing. A Network of Workstations (NOW) offers attractive scalability in terms of computational power and memory availability. We analyze the performance of our algorithm on a NOW and compare speed-up and scalability, which show significant improvements.
Customer and household matching: resolving entity identity in data warehouses
Donald J. Berndt, Ronald K. Satterfield
The data preparation and cleansing tasks necessary to ensure high quality data are among the most difficult challenges faced in data warehousing and data mining projects. The extraction of source data, transformation into new forms, and loading into a data warehouse environment are all time consuming tasks that can be supported by methodologies and tools. This paper focuses on the problem of record linkage or entity matching, tasks that can be very important in providing high quality data. Merging two or more large databases into a single integrated system is a difficult problem in many industries, especially in the wake of acquisitions. For example, managing customer lists can be challenging when duplicate entries, data entry problems, and changing information conspire to make data quality an elusive target. Common tasks with regard to customer lists include customer matching to reduce duplicate entries and household matching to group customers. These often O(n²) problems can consume significant resources, both in computing infrastructure and human oversight, and the goal of high accuracy in the final integrated database can be difficult to assure. This paper distinguishes between attribute corruption and entity corruption, discussing the various impacts on quality. A metajoin operator is proposed and used to organize past and current entity matching techniques. Finally, a logistic regression approach to implementing the metajoin operator is discussed and illustrated with an example. The metajoin can be used to determine whether two records match, don't match, or require further evaluation by human experts. Properly implemented, the metajoin operator could allow the integration of individual databases with greater accuracy and lower cost.
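A sketch of the three-way metajoin decision, with hand-set logistic coefficients standing in for ones that would in practice be fitted to labelled matched/unmatched record pairs; the field names and thresholds are illustrative.

```python
import math

def comparison_features(rec_a, rec_b):
    """Crude field-level agreement features for a pair of customer records."""
    return [
        1.0 if rec_a["last"].lower() == rec_b["last"].lower() else 0.0,
        1.0 if rec_a["zip"] == rec_b["zip"] else 0.0,
        1.0 if rec_a["phone"] == rec_b["phone"] else 0.0,
    ]

# Illustrative, hand-set logistic coefficients; in practice they would be
# estimated from labelled matched/unmatched pairs.
WEIGHTS, BIAS = [2.5, 1.5, 3.0], -4.0
MATCH, NON_MATCH = 0.9, 0.2          # decision thresholds on the match probability

def metajoin_decision(rec_a, rec_b):
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, comparison_features(rec_a, rec_b)))
    p = 1.0 / (1.0 + math.exp(-z))   # logistic match probability
    if p >= MATCH:
        return "match", p
    if p <= NON_MATCH:
        return "non-match", p
    return "refer to human expert", p

a = {"last": "Smith", "zip": "33620", "phone": "555-0101"}
b = {"last": "Smith", "zip": "33620", "phone": "555-0199"}
print(metajoin_decision(a, b))
```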
Rule generation based on rough set theory
In this paper, we propose an approach that can generate logical rules from an information system. It is based on Pawlak's rough set theory. There are two steps in our rule generation approach. First, attribute reduction is performed on an information table using Skowron's discernibility matrix and logic function simplification, so that important and valuable attributes are extracted. Then, value reduction is performed and the corresponding logic rules are generated. All reducts, including the minimal reduct of an information system, can be obtained through these two reductions. Our approach can generate the maximal generalized decision rules as well as potentially interesting and useful rules, according to requirements.
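A compact sketch of the first step: Skowron's discernibility matrix built from a toy decision table, from which a reduct can be read off by intersecting every non-empty entry. The table is illustrative.

```python
from itertools import combinations

# Decision table: object -> ({condition attribute: value}, decision)
table = {
    "o1": ({"headache": "yes", "temp": "high"},   "flu"),
    "o2": ({"headache": "yes", "temp": "normal"}, "healthy"),
    "o3": ({"headache": "no",  "temp": "high"},   "flu"),
    "o4": ({"headache": "no",  "temp": "normal"}, "healthy"),
}

# Skowron's discernibility matrix: for every pair of objects with different
# decisions, record the condition attributes on which they disagree.
matrix = {}
for (a, (attrs_a, dec_a)), (b, (attrs_b, dec_b)) in combinations(table.items(), 2):
    if dec_a != dec_b:
        matrix[(a, b)] = {attr for attr in attrs_a if attrs_a[attr] != attrs_b[attr]}

for pair, entry in matrix.items():
    print(pair, sorted(entry))

# A reduct must intersect every non-empty entry; here {'temp'} already does,
# so 'headache' can be dropped before value reduction and rule generation.
print(all("temp" in entry for entry in matrix.values() if entry))   # True
```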
Temporal Data Mining
Theoretical sampling for data mining
Given a finite sequence of vectors (numerical tuples), there is a complexity associated with it, called the data complexity. The 'simplest' pattern that is supported by this data set has a complexity, called the pattern complexity. Then the 'smallest' sub-sequence whose pattern complexity and data complexity are both equal to the pattern complexity of the original sequence is the smallest sample, called the theoretical sample. This paper investigates such samples.
Trend similarity and prediction in time-series databases
Jong P. Yoon, Jieun Lee, Sung-Rim Kim
Many algorithms for discovering similar patterns in time-series databases involve three phases: First, sequential data in the time domain is transformed into the frequency domain using the DFT. Then, the first few data points are depicted in an R*-tree. The points in the R*-tree are compared by their distance, and any pair of data points whose distance is within a certain threshold is found to be similar. This approach results in a performance problem due to its emphasis on each data point itself. This paper proposes a novel method of finding similar trend patterns, rather than similar data patterns, in time-series databases. As opposed to similar data patterns in the frequency domain, a limited number of points in the time series that play a dominant role in making a movement direction are taken into account. These data points are called a trend sequence. Trend sequences can be defined in various ways; of these, we focus on obtaining trend sequences by a data smoothing technique. A trend sequence contains far fewer data points than the original data sequence, but captures the sequence's movements at an abstract level. Given a trend sequence, we apply the smoothing algorithm to predict the very next trend data point: once a trend sequence is found, the very next trend data point can be expected. This paper also shows a method for trend prediction. Our approach can be applied both to finding similarity among many large time-series data sequences and to the prediction of the next possible data points.
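A minimal sketch of one way to derive a trend sequence by smoothing (moving average plus direction-change points) and to extrapolate the next trend value; the smoothing window, turning-point rule, and data are illustrative, not the paper's exact definitions.

```python
import numpy as np

def trend_sequence(series, window=3):
    """Smooth a time series, then keep only the points where the smoothed
    movement changes direction -- a compact 'trend sequence'."""
    kernel = np.ones(window) / window
    smooth = np.convolve(series, kernel, mode="valid")
    direction = np.sign(np.diff(smooth))
    turning = [0] + [i + 1 for i in range(len(direction) - 1)
                     if direction[i] != direction[i + 1]] + [len(smooth) - 1]
    return [(t, float(smooth[t])) for t in turning]

prices = np.array([10, 11, 12, 14, 13, 12, 11, 12, 14, 16, 15, 14.0])
trend = trend_sequence(prices)
print(trend)                        # far fewer points than the raw series

# Naive prediction of the next trend point: continue the last trend segment.
(t0, v0), (t1, v1) = trend[-2:]
slope = (v1 - v0) / (t1 - t0)
print("next trend value estimate:", round(v1 + slope, 2))
```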
Mining sequential patterns including time intervals
Mariko Yoshida, Tetsuya Iizuka, Hisako Shiohara, et al.
We introduce the problem of mining sequential patterns among items in a large database of sequences. For example, let us consider a database recording storm patterns in a given area at a given time. An example of the patterns we are interested in is: '10% of storms go through area C 3 days after they strike areas A and B.' Previous research would have considered an equivalent pattern, but such work would use only 'after' (a succession in time) and omit '3 days after' (a period). Obtaining such patterns is very useful because we then know when actions should be taken. To address this issue, we study an algorithm for discovering ordered lists of itemsets (sets of items), together with the time intervals between itemsets, that occur in a sufficient number of sequences of transactions; we call these patterns 'delta patterns.' In this algorithm, we cluster the time intervals between two neighboring itemsets using the CF-tree method while scanning the database and counting the number of occurrences of each candidate pattern. Extensive simulations are being conducted to evaluate the discovered patterns and the power and performance of this algorithm. The algorithm has very good scale-up properties in execution time with respect to the number of data sequences.
How can knowledge discovery methods uncover spatio-temporal patterns in environmental data?
Monica Wachowicz
This paper proposes the integration of KDD, GVis and STDB as a long-term strategy, which will allow users to apply knowledge discovery methods for uncovering spatio-temporal patterns in environmental data. The main goal is to combine innovative techniques and associated tools for exploring very large environmental data sets in order to arrive at valid, novel, potentially useful, and ultimately understandable spatio-temporal patterns. The GeoInsight approach is described using the principles and key developments in the research domains of KDD, GVis, and STDB. The GeoInsight approach aims at the integration of these research domains in order to provide tools for performing information retrieval, exploration, analysis, and visualization. The result is a knowledge-based design, which involves visual thinking (perceptual-cognitive process) and automated information processing (computer-analytical process).
Cycle mining in active database environments
Jennifer Seitzer, James P. Buckley
Traditional data mining algorithms identify patterns in data that are not explicit. These patterns are denoted in the form of IF-THEN rules (IF antecedent THEN consequent), where the antecedent and consequent are logical conjunctions of propositions or first-order predicates. Generally, the mined rules apply to all time periods and specify no temporal interval between antecedent detection and consequent firing. Cycle mining algorithms identify meta-patterns of these associations depicting inferences forming cyclic chains of rule dependencies. Because traditional rules comprise these cycles, the mined cycles also apply to all time periods and do not currently possess the temporal interval of applicability. An active database is one that responds to stimuli in real time, operating in the event-condition-action (ECA) paradigm where a specific event is monitored, a condition is evaluated, and one or more actions are taken. The actions often involve real-time modification of the database. In this paper, we introduce the concepts and present algorithms for mining rules with firing intervals, and intervals of applicability. Using an active database environment, we describe a real time framework that incorporates the active database concept in order to ascertain previously undefined cycles in data over a specific time interval and thereby introduce the concept of interval of discovery. Comprised of discovered rules with firing intervals and intervals of applicability, the encompassing discovered cycles also possess a variation of these attributes. We illustrate this framework with an example from an E-commerce endeavor where data is mined for rules with firing intervals and intervals of applicability, which amalgamate to form a cycle in its interval of discovery. We describe the computer system INDED, the author's implementation of cycle mining, which we are currently interfacing to an active Oracle database using triggers and PL/SQL stored procedures.
Autonomous visual discovery
Michael C. Burl, Dominic Lucchetti
This paper describes a prototype visual discovery algorithm that is designed to identify regions of an image that differ significantly from the local background. Image regions are projected into a visually-relevant subspace using a set of multi-orientation, multi-scale Gabor filters that model the receptive field properties of simple cells in the human visual cortex. Within this filter response subspace, deviant areas are identified through an adaptive statistical test that compares the filter-space description of a region against a model derived from the local background. Deviant regions are then spatially agglomerated and grouped across scale. Experimentation on a variety of archived imagery collected by JPL spacecraft and ground-based telescopes shows that the algorithm is able to autonomously 're-discover' a number of important geological objects such as impact craters, volcanoes, sand dunes, and ice geysers that are known to be of interest to planetary scientists.
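A rough, self-contained sketch of the pipeline described above: Gabor energy over a few orientations projects the image into a filter-response subspace, and an adaptive local statistical test flags deviant pixels. The kernel parameters, window size, threshold, and synthetic image are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def gabor_pair(theta, wavelength=6.0, sigma=3.0, size=15):
    """Even (cosine) and odd (sine) Gabor kernels at orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return (envelope * np.cos(2 * np.pi * xr / wavelength),
            envelope * np.sin(2 * np.pi * xr / wavelength))

# Synthetic scene: noise background plus one bright square ("crater-like" anomaly).
rng = np.random.default_rng(0)
image = rng.normal(0.0, 1.0, (128, 128))
image[60:70, 60:70] += 6.0

# Filter-response subspace: Gabor energy summed over a few orientations.
energy = np.zeros_like(image)
for theta in np.linspace(0.0, np.pi, 4, endpoint=False):
    even, odd = gabor_pair(theta)
    energy += convolve(image, even) ** 2 + convolve(image, odd) ** 2

# Adaptive statistical test: compare each pixel's energy with its local background.
local_mean = uniform_filter(energy, size=31)
local_sq = uniform_filter(energy ** 2, size=31)
local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 1e-9))
deviant = (energy - local_mean) / local_std > 3.0

ys, xs = np.nonzero(deviant)
if ys.size:
    print("deviant pixels roughly in rows", ys.min(), "-", ys.max(),
          "cols", xs.min(), "-", xs.max())
else:
    print("no deviant region found")
```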
Scientific and Remotely Sensed Data Mining
Knowledge discovery in scientific data
Stephan Rudolph
From many industrial projects, large collections of data from experiments and numerical simulations have been accumulated in the past. Knowledge discovery in scientific data from technical processes, i.e. the extraction of the hidden engineering knowledge in the form of a mathematical model description of the experimental data, is therefore a major challenge and an important part of the industrial re-engineering information processing chain for improved future knowledge reuse. Scientific data possess special properties because of their domain of origin. Based on these properties, a similarity transformation using the measurement unit information of the data can be performed. This similarity transformation eliminates the scale-dependence of the numerical data values and creates a multitude of dimensionless similarity numbers. Together with several reasoning strategies from artificial intelligence, such as case-based reasoning and neural networks, these similarity numbers may be used to estimate many engineering properties of the technical process under consideration.
Challenges and solutions to mining Earth science data
Rahul Ramachandran, Helen T. Conover, Sara J. Graves, et al.
Data Mining has an enormous potential as a processing tool for Earth Science data. It provides a solution for extracting information from massive amounts of data. However, designing a data mining system for earth science applications is complex and challenging. The two key issues that need to be addressed in the design are (1) variability of data sets and (2) operations for extracting information. Data sets not only come in different formats, types and structures; they are also typically in different states of processing such as raw data, calibrated data, validated data, derived data or interpreted data. The mining system must be designed to be flexible to handle these variations in data sets. The operations needed in the mining system vary for different application areas within earth science. Operations could range from general-purpose operations such as image processing techniques or statistical analysis to highly specialized data set-specific science algorithms. The mining system should be extensible in its ability to process new data sets and add new operations without too much effort. The ADaM (Algorithm Development and Mining) system, developed at the Information Technology and Systems Center at the University of Alabama in Huntsville, is one such mining system designed with these capabilities. The system provides knowledge discovery, content-based searching and data mining capabilities for data values, as well as for metadata. It contains over 100 different operations, which can be performed on the input data stream.
Mining remote sensing image data: an integration of fuzzy set theory and image understanding techniques for environmental change detection
Peter W. Eklund, Jane You, Peter Deer
This paper presents an image understanding approach to mining remotely sensed image data from different source dates for environmental change detection. It is focused on the immediate need for knowledge discovery from large sets of image data for environmental monitoring. In contrast to traditional approaches to change detection, we introduce a wavelet-based hierarchical scheme which integrates fuzzy set theory and image understanding techniques for knowledge discovery from remote image data. The proposed approach includes algorithms for hierarchical change detection, region representation, and classification. The effectiveness of the proposed algorithms is demonstrated through the completion of three tasks, namely hierarchical detection of change by fuzzy post-classification comparisons, localization of change by B-spline based region representation, and categorization of change by hierarchical texture classification.
Systematic method to identify patterns in engineering data
Peter Hertkorn, Stephan Rudolph
In physics and engineering, data is represented as attribute-value pairs with corresponding measurement units. Due to the functional model building process in the natural sciences, any functional relationship with measurement units based on a dimensions concept can be written in terms of dimensionless groups. This is guaranteed by the Pi-Theorem of Buckingham, which is based on the condition of dimensional homogeneity. These dimensionless groups are determined in the model building process by selecting the relevant problem variables and applying the Pi-Theorem. The Pi-Theorem helps in some cases to check whether the relevance list is formally complete and whether it contains variables which are not relevant for the problem. This work focuses on the transformation of engineering data with dimensionless groups leading to a dimensionality reduction. The application of dimensionless groups to data mining problems has been observed to lead to improved results. It is shown that the similarity transformations map all physically completely similar data onto the very same point in the dimensionless space, which represents a whole class of data. Using this property, the patterns found by a data mining algorithm can be verified by the physically completely similar data of an attribute-value pair in the database. An application of the knowledge discovery in databases process based on dimensionless groups is demonstrated. The limitations and underlying assumptions of the approach are enumerated and discussed.
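The Pi-Theorem step can be sketched numerically: the null space of the dimension matrix yields the exponents of the dimensionless groups. The pendulum example below is illustrative and not taken from the paper.

```python
from sympy import Matrix

# Columns: pendulum period t, length l, gravitational acceleration g, mass m.
# Rows: exponents of the base dimensions M, L and T for each variable.
variables = ["t", "l", "g", "m"]
dim_matrix = Matrix([
    [0, 0,  0, 1],   # mass   M
    [0, 1,  1, 0],   # length L
    [1, 0, -2, 0],   # time   T
])

# Every null-space vector of the dimension matrix gives the exponents of one
# dimensionless group (Buckingham Pi-Theorem).
for vec in dim_matrix.nullspace():
    group = " * ".join(f"{v}^{e}" for v, e in zip(variables, vec) if e != 0)
    print("dimensionless group:", group)   # e.g.  t^2 * l^-1 * g^1
```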
Integration inconsistency removal in data mining
Julius Stuller
Technological progress in hardware, especially in (secondary) memories, whose ever-increasing capacities have in recent years paradoxically become available at ever decreasing prices and smaller physical sizes, and in software, which is continuously becoming more user friendly, efficient, and cheaper, together with the general expansion of computers into almost all human activities, makes it easier to integrate many already existing databases. Unfortunately, the process of database integration can be accompanied by various difficulties and problems. One of them is the possible occurrence of inconsistencies during the integration. As we will see, these inconsistencies can occur at various levels and can be of different types. At the next stage, some users go even further and try to get more from the accumulated data through data mining techniques. A data warehouse can be considered a suitable technology for this purpose. Taking the data mining view of a data warehouse, one needs to know the sources of possible inconsistencies when building such a data warehouse in order to eliminate them as much as possible. In the paper we define several existence conditions under which different types of inconsistencies can occur in a warehouse, and we propose a classification of these inconsistencies based on their sources. We also propose a methodology and a procedure, both of which aim at the elimination of these inconsistencies.
NASA aviation safety program: Aircraft Engine Health Management Data Mining Tools roadmap
Jonathan S. Litt, Donald L. Simon, Claudia Meyer, et al.
Aircraft Engine Health Management Data Mining Tools is a project led by NASA Glenn Research Center in support of the NASA Aviation Safety Program's Aviation System Monitoring and Modeling Thrust. The objective of the Glenn-led effort is to develop enhanced aircraft engine health management prognostic and diagnostic methods through the application of data mining technologies to operational data and maintenance records. This will lead to the improved safety of air transportation, optimized scheduling of engine maintenance, and optimization of engine usage. This paper presents a roadmap for achieving these goals.
Toward text understanding: classification of text documents by word map
Ari J. E. Visa, Jarmo Toivanen, Barbro Back, et al.
In many fields, for example in business, engineering, and law, there is interest in the search and classification of text documents in large databases. Methods exist for information retrieval purposes; they are mainly based on keywords. In cases where keywords are lacking, information retrieval is problematic. One approach is to use the whole text document as a search key. Neural networks offer an adaptive tool for this purpose. This paper suggests a new adaptive approach to the problem of clustering and search in large text document databases. The approach is a multilevel one based on word, sentence, and paragraph level maps. Here only the word map level is reported. The reported approach is based on smart encoding, on Self-Organizing Maps, and on document histograms. The results are very promising.
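A greatly simplified sketch of the document-histogram idea: each document is reduced to a histogram over word-map cells and histograms are compared by cosine similarity. Here a crude deterministic word hash stands in for the Self-Organizing Map cell assignment; the documents and query are illustrative.

```python
import math
from collections import Counter

BINS = 16

def word_cell(word):
    """Crude stand-in for the word map: a deterministic hash assigns each word
    to a cell. In the paper a Self-Organizing Map places similar word
    encodings in nearby cells; here the assignment is arbitrary."""
    return sum(ord(c) for c in word.lower()) % BINS

def histogram(text):
    counts = Counter(word_cell(w) for w in text.split())
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(BINS)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "contract A": "the supplier shall deliver the goods within thirty days",
    "contract B": "the goods shall be delivered by the supplier within sixty days",
    "recipe":     "whisk the eggs and fold in the sugar and flour gently",
}
query_hist = histogram("delivery of goods by the supplier")
for name, text in docs.items():
    print(name, round(cosine(query_hist, histogram(text)), 3))
```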
Biomedical Applications
Application of syntactic methods of pattern recognition for data mining and knowledge discovery in medicine
Marek R. Ogiela, Ryszard Tadeusiewicz
This paper presents and discusses possibilities of application of selected algorithms belonging to the group of syntactic methods of pattern recognition used to analyze and extract features of shapes and to diagnose morphological lesions seen on selected medical images. This method is particularly useful for specialist morphological analysis of shapes of selected organs of the abdominal cavity conducted to diagnose disease symptoms occurring in the main pancreatic ducts, upper segments of ureters, and renal pelvis. Analysis of the correct morphology of these organs is possible with the application of the sequential and tree methods belonging to the group of syntactic methods of pattern recognition. The objective of this analysis is to support early diagnosis of disease lesions, mainly characteristic for carcinoma and pancreatitis, based on examinations of ERCP images, and the diagnosis of morphological lesions in ureters as well as renal pelvis based on an analysis of urograms. In the analysis of ERCP images the main objective is to recognize morphological lesions in pancreas ducts characteristic for carcinoma and chronic pancreatitis, while in the case of kidney radiogram analysis the aim is to diagnose local irregularities of ureter lumen and to examine the morphology of renal pelvis and renal calyxes. Diagnosis of the above-mentioned lesions has been conducted with the use of syntactic methods of pattern recognition, in particular the languages of description of features of shapes and context-free sequential attributed grammars. These methods allow us to recognize and describe the aforementioned lesions in a very efficient way on images obtained as a result of initial image processing of width diagrams of the examined structures. Additionally, in order to support the analysis of the correct structure of the renal pelvis, a method using the tree grammar for syntactic pattern recognition to define its correct morphological shapes has been presented.
Correlation of HIV protease structure with Indinavir resistance: a data mining and neural networks approach
Sorin Draghici, Lonnie T. Cumberland Jr., Ladislau C. Kovari
This paper presents some results of data mining HIV genotypic and structural data. Our aim is to try to relate structural features of HIV enzymes essential to its reproductive abilities to the drug resistance phenomenon. This paper concentrates on the HIV protease enzyme and Indinavir which is one of the FDA approved protease inhibitors. Our starting point was the current list of HIV mutations related to drug resistance. We used the fact that some molecular structures determined through high resolution X-ray crystallography were available for the protease-Indinavir complex. Starting with these structures and the known mutations, we modelled the mutant proteases and studied the pattern of atomic contacts between the protease and the drug. After suitable pre-processing, these patterns have been used as the input of our data mining process. We have used both supervised and unsupervised learning techniques with the aim of understanding the relationship between structural features at a molecular level and resistance to Indinavir. The supervised learning was aimed at predicting IC90 values for arbitrary mutants. The SOFM was aimed at identifying those structural features that are important for drug resistance and discovering a classifier based on such features. We have used validation and cross validation to test the generalization abilities of the learning paradigm we have designed. The straightforward supervised learning was able to learn very successfully but validation results are less than satisfactory. This is due to the insufficient number of patterns in the training set which in turn is due to the scarcity of the available data. The data mining using SOFM was very successful. We have managed to distinguish between resistant and non-resistant mutants using structural features. We have been able to divide all reported HIV mutants into several categories based on their 3-dimensional molecular structures and the pattern of contacts between the mutant protease and Indinavir. Our classifier shows reasonably good prediction performance being able to predict the drug resistance of previously unseen mutants with an accuracy of between 60% and 70%. We believe that this performance can be greatly improved once more data becomes available. The results presented here support the hypothesis that structural features of the molecular structure can be used in antiviral drug treatment selection and drug design.
Data mining algorithm for discovering matrix association regions (MARs)
Gautam B. Singh, Shephan A. Krawetz
Lately, there has been considerable interest in applying Data Mining techniques to scientific and data analysis problems in bioinformatics. Data mining research is being fueled by novel application areas that are helping the development of newer applied algorithms in the field of bioinformatics, an emerging discipline representing the integration of biological and information sciences. This is a shift in paradigm from the earlier and the continuing data mining efforts in marketing research and support for business intelligence. The problem described in this paper is along a new dimension in DNA sequence analysis research and supplements the previously studied stochastic models for evolution and variability. The discovery of novel patterns from genetic databases as described is quite significant because biological patterns play an important role in a large variety of cellular processes and constitute the basis for gene therapy. Biological databases containing the genetic codes from a wide variety of organisms, including humans, have continued their exponential growth over the last decade. At the time of this writing, the GenBank database contains over 300 million sequences and over 2.5 billion characters of sequenced nucleotides. The focus of this paper is on developing a general data mining algorithm for discovering regions of locus control, i.e. those regions that are instrumental for determining cell type. One such type of element of locus control is the MAR, or Matrix Association Region. Our limited knowledge about MARs has hampered their detection using classical pattern recognition techniques. Consequently, their detection is formulated by utilizing a statistical interestingness measure derived from a set of empirical features that are known to be associated with MARs. This paper presents a systematic approach for finding associations between such empirical features in genomic sequences, and for utilizing this knowledge to detect biologically interesting control signals, such as MARs. This computational MAR discovery tool is implemented as a web-based software called MAR-Wiz and is available for public access. As our knowledge about the living system continues to evolve, and as the biological databases continue to grow, a pattern learning methodology similar to that described in this paper will be significant for the detection of regulatory signals embedded in genomic sequences.
Data mining approaches for information retrieval from genomic databases
Donglin Liu, Gautam B. Singh
Sequence retrieval in genomic databases is used for finding sequences related to a query sequence specified by a user. Comparison is the main part of the retrieval system in genomic databases. An efficient sequence comparison algorithm is critical in bioinformatics. There are several different algorithms to perform sequence comparison, such as the suffix array based database search, divergence measurement, methods that rely upon the existence of a local similarity between the query sequence and sequences in the database, or common mutual information between the query and sequences in the database. In this paper we describe a new method for DNA sequence retrieval based on data mining techniques. Data mining tools generally find patterns among data and have been successfully applied in industry to improve marketing, sales, and customer support operations. We have applied descriptive data mining techniques to find relevant patterns that are significant for comparing genetic sequences. A relevance feedback score based on common patterns is developed and employed to compute the distance between sequences. The contigs of human chromosomes are used to test the retrieval accuracy and the experimental results are presented.
Application of hidden Markov models to biological data mining: a case study
Michael M. Yin, Jason Tsong-Li Wang
In this paper we present an example of biological data mining: the detection of splicing junction acceptors in eukaryotic genes. Identification or prediction of transcribed sequences from within genomic DNA has been a major rate-limiting step in the pursuit of genes. Programs currently available are far from being powerful enough to elucidate the gene structure completely. Here we develop a hidden Markov model (HMM) to represent the degeneracy features of splicing junction acceptor sites in eukaryotic genes. The HMM system is fully trained using an expectation maximization (EM) algorithm and the system performance is evaluated using the 10-way cross-validation method. Experimental results show that our HMM system can correctly classify more than 94% of the candidate sequences (including true and false acceptor sites) into the right categories. About 90% of the true acceptor sites and 96% of the false acceptor sites in the test data are classified correctly. These results are very promising considering that only the local information in DNA is used. The proposed model will be a very important component of an effective and accurate gene structure detection system currently being developed in our lab.
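A minimal sketch of HMM-based scoring of candidate acceptor windows: a scaled forward algorithm computes the log-likelihood under a toy two-state "acceptor-like" model and under a uniform background model, and the log-odds decides the class. The models, parameters, and candidate strings are invented for illustration and are far smaller than the trained system described.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def forward_loglik(seq, start, trans, emit):
    """Log-likelihood of a DNA string under a small HMM (scaled forward algorithm)."""
    obs = [BASES[b] for b in seq]
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()           # rescale to avoid underflow
    return loglik

# Toy two-state "acceptor-like" model: a pyrimidine-rich state feeding an
# A/G-rich state, versus a uniform background model.
start = np.array([0.5, 0.5])
acceptor_trans = np.array([[0.8, 0.2],
                           [0.1, 0.9]])
acceptor_emit = np.array([[0.05, 0.45, 0.05, 0.45],   # state 0: C/T rich
                          [0.45, 0.05, 0.45, 0.05]])  # state 1: A/G rich
background_trans = np.full((2, 2), 0.5)
background_emit = np.full((2, 4), 0.25)

for candidate in ["TTTCTTTCAG", "GATCGGATAC"]:
    log_odds = (forward_loglik(candidate, start, acceptor_trans, acceptor_emit)
                - forward_loglik(candidate, start, background_trans, background_emit))
    verdict = "acceptor-like" if log_odds > 0 else "non-site"
    print(candidate, "log-odds", round(float(log_odds), 2), "->", verdict)
```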
Mining knowledge in medical image databases
Petra Perner
Availability of digital data within picture archiving and communication systems raises a possibility of health care and research enhancement associated with manipulation, processing and handling of data by computers. That is the basis for computer-assisted radiology development. Further development of computer-assisted radiology is associated with the use of new intelligent capabilities such as multimedia support and data mining in order to discover the relevant knowledge for diagnosis. In this paper, we present our work on data mining in medical picture archiving systems. We use decision tree induction in order to learn the knowledge for computer-assisted image analysis. We are applying our method to interpretation of x-ray images for lung cancer diagnosis. We are describing our methodology on how to perform data mining on picture archiving systems and our tool for data mining. Results are given. The method has shown very good results so that we are going on to apply it to other medical image diagnosis tasks such as lymph node diagnosis in MRI and investigation of breast MRI.
Data mining and intelligent queries in a knowledge-based multimedia medical database system
Shuhua Zhang, John D. Coleman
Multimedia medical databases have accumulated large quantities of data and information about patients and their medical conditions. Patterns and relationships within this data could provide new knowledge for making better medical decisions. Unfortunately, few technologies have been developed and applied to discover and use this hidden knowledge. We are currently developing a next generation knowledge-based multimedia medical database, named MedBase, with advanced behaviors for data analysis and data fusion. As part of this R&D effort, a knowledge-rich data model is constructed to incorporate data mining techniques/tools to assist the building of medical knowledge bases, and to facilitate intelligent answering of users' investigative and knowledge queries in the database. Techniques such as data generalization, classification, clustering, semantic structures, and concept hierarchies, are used to acquire and represent both symbolic and spatial knowledge implicit in the database. With the availability of semantic structures, concept hierarchies and generalized knowledge, queries may be posed and answered at multiple levels of abstraction. In this article we provide a general description of the approaches and efforts undertaken so far in the MedBase project.
Web and Other Miscellaneous Applications
Adapting the right web pages to the right users
Xiong Hui, Sam Yuan Sung, Stephen Huang
With the explosive use of the Internet, there is an ever-increasing volume of Web usage data being generated and warehoused in numerous successful Web sites. Analyzing Web usage data can help Web developers to improve the organization and presentation of their Web sites. Considering the fact that mining for patterns and rules in market basket data is well studied in the data mining field, we provide a mapping approach which can transform Web usage data into a form like market basket data. Using our model, all the methods developed by data mining research groups can be directly applied to Web usage data without much change. Existing methods for knowledge discovery in Web logs are restricted by the difficulty of getting complete and reliable Web usage data and of effectively identifying user sessions using the current Web server log mechanism. The problem is due to Web caching and the existence of proxy servers. As an effort to remedy this problem, we built our own Web server log mechanism that can effectively capture user access behavior and will not be deliberately bypassed by proxy servers and end users.
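A minimal sketch of the mapping from raw page requests to market-basket form: requests are grouped into per-user sessions with an inactivity timeout, and each session's set of pages becomes one basket. The log entries and the 30-minute timeout are illustrative.

```python
from datetime import datetime, timedelta

# Simplified Web server log: (user/IP, timestamp, requested page)
log = [
    ("10.0.0.1", "2000-04-06 10:00:05", "/index.html"),
    ("10.0.0.1", "2000-04-06 10:01:10", "/products.html"),
    ("10.0.0.1", "2000-04-06 11:30:00", "/index.html"),     # new session (gap > timeout)
    ("10.0.0.2", "2000-04-06 10:02:00", "/products.html"),
    ("10.0.0.2", "2000-04-06 10:03:30", "/checkout.html"),
]
TIMEOUT = timedelta(minutes=30)

def sessions_as_baskets(entries):
    """Turn raw page requests into per-session 'baskets' of pages, so that
    market-basket algorithms can be applied to Web usage data unchanged."""
    baskets, last_seen, current = [], {}, {}
    for user, ts, page in sorted(entries, key=lambda e: (e[0], e[1])):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if user not in current or t - last_seen[user] > TIMEOUT:
            if user in current:
                baskets.append(current[user])
            current[user] = set()
        current[user].add(page)
        last_seen[user] = t
    baskets.extend(current.values())
    return baskets

for basket in sessions_as_baskets(log):
    print(sorted(basket))
```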
Data mining for the e-business: developments and directions
Alfred Grasso, Harry Sleeper, Bhavani M. Thuraisingham, et al.
This paper describes data mining and e-business and then shows how data mining may be applied to e-business to gather consumer/supplier intelligence so that targeted marketing and merchandising may be carried out.
Harnessing agent technologies for data mining and knowledge discovery
Jenifer S. McCormack, Brian Wohlschlaeger
Data mining and knowledge discovery in databases are providing means to analyze and discover new knowledge from large datasets. The growth of the Internet has provided the average user with the ability to more easily access and gather data. Many of the existing data mining tools require users to have advanced knowledge. New graphical-based tools are needed to allow the average user to easily and quickly discover new patterns and trends from heterogeneous data. SAIC is developing an agent-based data mining tool called AgentMiner™ as part of an internal research project. AgentMiner™ will allow the user to perform advanced information retrieval and data mining to discover patterns and relationships across multiple distributed, heterogeneous data sources. The current system prototype utilizes an ontology to define common concepts and data elements that are contained in the distributed data sources. AgentMiner™ can access data from relational databases, structured text, web pages, and open text sources. It is a Java-based application that contains a suite of graphical tools such as the Mission Manager, Graphical Ontology Builder (GOB), and Qualified English Interpreter (QEI). In addition, AgentMiner™ provides the capability to support both 2-D and 3-D data visualization, including animation across a selected independent variable.
Application of data mining and knowledge discovery in emergency models and simulations
Eduardo R. Sevilla
With the increasing amount of open information available, the problems encountered in finding the pertinent information are growing in difficulty. There are problems in the validation of the material, resolving conflicting data, and bringing a dynamic situation into a coherent and manageable data stream. With the advances in tools and the human in the loop, combined with the use of unusual sources, a good data mining team will be able to discover the majority of the information needed to populate a model's database. The application of this process is shown as it applies to the medical situation in the Consequence Assessment Tools Set.
National traffic system evaluation using data mining techniques
Edmond Chin Ping Chang
This paper describes a study, using data mining techniques, that examines incident response time statistics before and after incident management systems were deployed, based on the information collected in the Metropolitan ITS Infrastructure Deployment Tracking system.
Using KDD to analyze the impact of curriculum revisions in a Brazilian university
Karin Becker, Cinara Guellner Ghedini, Egidio Loch Terra
This work presents an experience in the use of KDD (Knowledge Discovery in Databases) to identify and understand whether curriculum revisions affect students in a Brazilian university. Presently, there is no framework defining the notion of impact caused by curriculum revisions, and the use of KDD can bring significant contributions, given the amount of data involved. The paper describes the analysis framework defined so far for measuring the impact of curriculum revisions, and reports the results obtained after the analysis of student records related to five distinct degrees. The results obtained so far indicate that individual revisions quite often do not affect students, sometimes even being beneficial to them. However, considering the set of revisions students face during their academic lifetime, it is possible to generalize that many students are lightly harmed. This harm influences the number of extra classes they have to take to fulfill the requirements for obtaining a given degree, but the time required to graduate is not affected by revisions.
Self-organized neural network scheme as a data mining tool
Sayee Sumathi, S. N. Sivanandam, Jagadeeswari
Abstract not available.