Proceedings Volume 5433

Data Mining and Knowledge Discovery: Theory, Tools, and Technology VI


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 12 April 2004
Contents: 6 Sessions, 27 Papers, 0 Presentations
Conference: Defense and Security 2004
Volume Number: 5433

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Feature Extraction/Selection, Classification, and Clustering
  • Genetic Algorithms and Encoding Schemes
  • Miscellaneous Approaches I
  • Applications
  • Miscellaneous Approaches II
  • Data Mining and Related Topics
  • Miscellaneous Approaches II
  • Miscellaneous Approaches I
Feature Extraction/Selection, Classification, and Clustering
Attributes significance elucidation as a knowledge discovery tool under case-based reasoning
Case-based reasoning (CBR) has come to be recognized as one of the standard approaches in data mining and knowledge discovery for determining the best match of a new case among the currently defined cases in the database. It is well known that CBR is highly sensitive to the attribute space in which such reasoning is carried out. A priori assessment of the relevancy and effectiveness of the potential attributes is often carried out using feature selection techniques adapted from the pattern recognition world. It would be beneficial to be able to elucidate the significance of the attributes so selected on the decisions made by the CBR process on the individual cases. Towards this end, the concept of elucidation proposed recently in the context of information fusion systems is adapted here to develop a methodology for attribute significance assessment. The methodology is illustrated using a couple of well-known data sets available in the literature and/or on the Internet.
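The abstract does not spell out the elucidation procedure, so the following is only a hedged sketch of one natural reading, assuming a 1-NN matcher over numeric attributes: mask one attribute at a time and flag it as significant for a case when the retrieved match changes. The cases, query, and masking scheme are invented for illustration.

```python
# Hedged sketch: per-attribute significance for a 1-NN CBR matcher,
# probed by masking one attribute at a time (assumption, not the paper's
# exact method, which adapts "elucidation" from information fusion).
import numpy as np

cases = np.array([[5.1, 3.5, 1.4], [6.2, 2.9, 4.3], [5.9, 3.0, 5.1]])
query = np.array([6.0, 3.0, 4.5])

def best_match(cases, query, mask=None):
    use = np.ones(cases.shape[1], bool) if mask is None else mask
    d = np.linalg.norm((cases - query)[:, use], axis=1)
    return int(np.argmin(d))

base = best_match(cases, query)
for a in range(cases.shape[1]):
    mask = np.ones(cases.shape[1], bool)
    mask[a] = False                       # drop attribute a and re-match
    flipped = best_match(cases, query, mask) != base
    print(f"attribute {a}: {'significant' if flipped else 'not decisive'}")
```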
Cluster analysis of long time-series medical datasets
This paper presents a comparative study of the characteristics of clustering methods for inhomogeneous time-series medical datasets. Using various combinations of comparison methods and grouping methods, we performed clustering experiments on the hepatitis data set and evaluated the validity of the results. The results suggested that (1) the complete-linkage (CL) criterion in agglomerative hierarchical clustering (AHC) outperformed the average-linkage (AL) criterion in terms of the interpretability of the dendrogram and clustering results, (2) the combination of dynamic time warping (DTW) and CL-AHC consistently produced interpretable results, (3) the combination of DTW and rough clustering (RC) can be used to find the core sequences of the clusters, and (4) multiscale matching may suffer from the treatment of 'no-match' pairs, but the problem may be avoided by using RC as a subsequent grouping method.
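As a concrete illustration of finding (2) above, here is a minimal sketch, not the authors' code, pairing dynamic time warping with complete-linkage AHC over unequal-length series; the toy series stand in for the hepatitis lab-test sequences.

```python
# Hedged sketch: DTW distances between unequal-length series, fed to
# complete-linkage agglomerative hierarchical clustering (CL-AHC).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy stand-ins for inhomogeneous (unequal-length) medical time series.
series = ([np.sin(np.linspace(0, 6, L)) + 0.1 * np.random.randn(L)
           for L in (40, 55, 70)]
          + [np.cos(np.linspace(0, 6, L)) for L in (45, 60)])

n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

Z = linkage(squareform(dist), method="complete")   # complete linkage
print(fcluster(Z, t=2, criterion="maxclust"))      # 2-cluster labels
```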
Retrieval using texture features in high-resolution multispectral satellite imagery
Texture features have long been used in remote sensing applications to represent and retrieve image regions similar to a query region. Various representations of texture have been proposed based on the Fourier power spectrum, spatial co-occurrence, wavelets, Gabor filters, etc. These representations vary in their computational complexity and their suitability for representing different region types. Much of the work done thus far has focused on panchromatic imagery at low to moderate spatial resolutions, such as images from Landsat 1-7, which have a resolution of 15-30 m/pixel, and from SPOT 1-5, which have a resolution of 2.5-20 m/pixel. However, it is not clear which texture representation works best for the new classes of high-resolution panchromatic (60-100 cm/pixel) and multi-spectral (4 bands for red, green, blue, and near-infrared at 2.4-4 m/pixel) imagery. It is also not clear how the different spectral bands should be combined. In this paper, we investigate the retrieval performance of several different texture representations using multi-spectral satellite images from IKONOS. A query-by-example framework, along with a manually chosen ground truth dataset, allows different combinations of texture representations and spectral bands to be compared. We focus on the specific problem of retrieving inhabited regions from images of urban and rural scenes. Preliminary results show that 1) the use of all spectral bands improves the retrieval performance, and 2) co-occurrence, wavelet, and Gabor texture features perform comparably.
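One of the texture representations compared above can be sketched concretely. The following is a hedged example, not the paper's implementation, of extracting mean/std features from a small Gabor filter bank per spectral band; the filter parameters and the 4-band toy region are illustrative assumptions.

```python
# Hedged sketch: Gabor filter-bank texture features per spectral band,
# concatenated across bands into one region descriptor for retrieval.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma=3.0, size=15):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def gabor_features(band, freqs=(0.1, 0.2, 0.3), n_orient=4):
    """Mean and std of each filter response -> feature vector per band."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            resp = fftconvolve(band, gabor_kernel(f, np.pi * k / n_orient),
                               mode="same")
            feats += [resp.mean(), resp.std()]
    return np.array(feats)

# Toy 4-band "IKONOS-like" region; real tiles would come from imagery.
region = np.random.rand(4, 64, 64)
feature_vec = np.concatenate([gabor_features(b) for b in region])
print(feature_vec.shape)  # 4 bands * 3 freqs * 4 orientations * 2 = (96,)
```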
Discovering social groups without having relational data
John Salerno, Zhongfei Zhang, Ronny Lewin, et al.
Who is associated with whom? Who communicates with whom? When two or more individuals get together is there an intended purpose? Who are the leaders/important individuals of the group? What is the organizational structure of the group? These are just a few of the questions that are covered under the topic of social network analysis. Data mining, specifically community generation, attempts to automatically discover and learn these social models. In this paper we present one class of problems which we have called the uni-party data community generation paradigm. We discuss various applications, a methodology and results from two problem domains.
Code-concatenation-based multiple classifier systems for automatic target recognition
Widhyakorn Asdornwised, Somchai Jitapunkul
In this paper, we propose a new multiple classifier system (MCS) based on two concatenated stages of multiple description coding (MDC) models and Support Vector Machines (SVM). This paper draws on concepts from a variety of disciplines, including classical concatenated coding of error correcting codes, multiple description coding in wavelet-based image compression, error correcting output codes (ECOC) in multiple classifier systems, and antithetic-common variates in Monte Carlo methods. In our previous work, we proposed and extended several methods from MDC to MCS with inspirations from two frameworks. First, we found that one of our methods is equivalent to one of the variance reduction techniques, called antithetic-common variates, in Monte Carlo methods (MCM). Second, an equivalent relation between one of our methods and transmitting data over heterogeneous networks, especially wireless networks, is established. In this paper, we also include several supporting ideas, such as preliminary surveys on the biological plausibility of the MDC concepts. One of the benefits of our approach is that it allows us to formulate a generalized class of signal-processing-based weak classification algorithms. This is very applicable to MDC-SVM in high-dimensional classification problems, such as image/target recognition. Performance results for automatic target recognition are presented for synthetic aperture radar (SAR) images from the MSTAR public release data set. The experimental results show that our proposed method outperforms state-of-the-art multiple classifier systems such as SVM-ECOC.
Genetic Algorithms and Encoding Schemes
Self-adaptive parameters in genetic algorithms
Eric Pellerin, Luc Pigeon, Sylvain Delisle
Genetic algorithms are powerful search algorithms that can be applied to a wide range of problems. Generally, parameter setting is accomplished prior to running a Genetic Algorithm (GA) and this setting remains unchanged during execution. The problem of interest to us here is the self-adaptive adjustment of a GA's parameters. In this research, we propose an approach in which the control of a genetic algorithm's parameters can be encoded within the chromosome of each individual. The parameters' values are entirely dependent on the evolution mechanism and on the problem context. Our preliminary results show that a GA is able to learn and evaluate the quality of self-set parameters according to their degree of contribution to the resolution of the problem. These results are indicative of a promising approach to the development of GAs with self-adaptive parameter settings that do not require the user to pre-adjust parameters at the outset.
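A minimal sketch of the idea, under assumptions the abstract does not fix: each individual carries its own mutation rate as an extra gene, so the parameter evolves with the solution. The OneMax task, population sizes, and log-normal perturbation are illustrative choices, not the authors' settings.

```python
# Hedged sketch: self-adaptive GA where the mutation rate is encoded in
# the chromosome and evolves alongside the solution bits.
import random

GENOME_LEN = 20  # fitness = number of ones (toy OneMax task)

def fitness(bits):
    return sum(bits)

def make_individual():
    return {"bits": [random.randint(0, 1) for _ in range(GENOME_LEN)],
            "mut_rate": random.uniform(0.001, 0.2)}  # self-adapted gene

def mutate(ind):
    # Perturb the rate gene first, then apply it to the solution bits.
    rate = min(0.5, max(1e-4, ind["mut_rate"] * random.lognormvariate(0, 0.2)))
    bits = [b ^ (random.random() < rate) for b in ind["bits"]]
    return {"bits": bits, "mut_rate": rate}

pop = [make_individual() for _ in range(30)]
for gen in range(50):
    pop.sort(key=lambda i: fitness(i["bits"]), reverse=True)
    parents = pop[:10]                       # truncation selection
    pop = parents + [mutate(random.choice(parents)) for _ in range(20)]

best = max(pop, key=lambda i: fitness(i["bits"]))
print(fitness(best["bits"]), round(best["mut_rate"], 4))
```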
A meta-learning system based on genetic algorithms
Eric Pellerin, Luc Pigeon, Sylvain Delisle
The design of an efficient machine learning process through self-adaptation is a great challenge. The goal of meta-learning is to build a self-adaptive learning system that is constantly adapting to its specific (and dynamic) environment. To that end, the meta-learning mechanism must improve its bias dynamically by updating the current learning strategy in accordance with its available experiences or meta-knowledge. We suggest using genetic algorithms as the basis of an adaptive system. In this work, we propose a meta-learning system based on a combination of the a priori and a posteriori concepts. A priori refers to input information and knowledge available at the outset, used to build and evolve one or more sets of parameters by exploiting the context of the system's information. The self-learning component is based on genetic algorithms and neural Darwinism. A posteriori refers to the implicit knowledge discovered by estimating the future states of parameters and is also applied to finding optimal parameter values. The in-progress research presented here suggests a framework for the discovery of knowledge that can support human experts in their intelligence information assessment tasks. The conclusion presents avenues for further research in genetic algorithms and their capability to learn to learn.
Protein secondary structure prediction using different encoding schemes and neural network architectures
Wei Zhong, Yi Pan, Robert Harrison, et al.
Protein secondary structure prediction is very important for drug design, protein engineering, and immunological studies. This research uses fully connected multilayer perceptron (MLP) neural networks with one, two, and three hidden layers to predict protein secondary structure. Orthogonal matrix, BLOSUM62 matrix, and hydrophobicity matrix encodings are used for the input profiles. To increase the input information for the neural networks, the combined matrix from BLOSUM62 and the orthogonal matrix and the combined matrix from BLOSUM62 and the hydrophobicity matrix are also tested. The binary classifiers indicate that the test accuracy with one hidden layer is better than that with two or three hidden layers. This may indicate that increasing the complexity of the architecture does not help the neural network recognize the structural patterns of a protein sequence more accurately. The results also show that the combined input profile of the BLOSUM62 matrix and the orthogonal matrix is the best of the five encoding schemes. The accuracy of the tertiary classifier reaches 63.20%, while the binary classifier for H/~H reaches 78.70%, which is comparable to other researchers' results.
Protein secondary structure prediction using support vector machine with advanced encoding schemes
Hae-Jin Hu, Yi Pan, Robert Harrison, et al.
Over the decades, many studies have been done on the prediction of protein structure. Since the protein secondary structure is closely related to the protein tertiary structure, many approaches begin with the prediction of the secondary structure and apply the results to predict the tertiary structure. The recent trend in secondary structure prediction studies is mostly based on neural networks or the support vector machine (SVM). In this study, SVM is used as a machine learning tool for the prediction of secondary structure, and several new encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrices of these, are developed and optimized to improve the prediction accuracy. Based on the best encoding scheme, each protein sequence is expressed as consecutive sliding windows and each amino acid inside a window is represented with 20 different matrix values. Once the optimal window length for the six SVM binary classifiers is chosen to be 13 through many experiments, the new encoding scheme is tested with this optimal window size using 7-fold cross-validation tests. The results show a 2% increase in the accuracy of the binary classifiers when compared with the instances in which the classical orthogonal matrix is used. For the training and testing of the SVM binary classifiers, the RS126 data set is used, since this is the common set adopted by previous research groups. Finally, to combine the results of the six SVM binary classifiers, several existing tertiary classifiers are applied and the efficiency of each tertiary classifier is compared.
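The sliding-window orthogonal encoding that both protein-structure abstracts build on can be sketched directly; the window length 13 matches the optimum reported above, while the example sequence is an arbitrary stand-in.

```python
# Hedged sketch: sliding-window orthogonal (one-hot) encoding. Each
# residue in a window of 13 becomes a 20-dim one-hot vector, giving a
# 13 * 20 = 260-dim input vector per window position.
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard amino acids
IDX = {aa: i for i, aa in enumerate(AMINO)}

def encode_windows(seq, window=13):
    """Yield one 260-dim orthogonal vector per residue-centered window."""
    half = window // 2
    for center in range(len(seq)):
        vec = np.zeros((window, len(AMINO)))
        for k in range(window):
            pos = center - half + k
            if 0 <= pos < len(seq):        # off-sequence slots stay zero
                vec[k, IDX[seq[pos]]] = 1.0
        yield vec.ravel()

windows = list(encode_windows("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
print(len(windows), windows[0].shape)      # 33 (260,)
```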
Miscellaneous Approaches I
An evolutionary algorithmic approach to learning a Bayesian network from complete data
Ferat Sahin, Jason Tillet, Raghuveer Rao, et al.
Discovering relationships between variables is crucial for interpreting data from large databases. Relationships between variables can be modeled using a Bayesian network. The challenge of learning a Bayesian network from a complete dataset grows exponentially with the number of variables in the database and the number of states in each variable. It therefore becomes important to identify promising heuristics for exploring the space of possible networks. This paper utilizes an evolutionary algorithmic approach, Particle Swarm Optimization (PSO), to perform this search. A fundamental problem with a search for a Bayesian network is the handling of cyclic networks, which are not allowed. This paper explores the PSO approach, handling cyclic networks in two different ways. Results of network extraction for the well-studied ALARM network are presented for PSO simulations where cycles are broken heuristically at each step of the optimization and where networks with cycles are allowed to exist as candidate solutions but are assigned a poor fitness. The results of the two approaches are compared, and it is found that allowing cyclic networks to exist in the particle swarm of candidate solutions can dramatically reduce the number of objective function evaluations required to converge to a target fitness value.
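The second strategy described above, letting cyclic candidates exist but penalizing them, can be sketched as follows; this is an illustration, not the paper's code, and the hypothetical score_fn stands in for whatever data-driven network score the optimization uses.

```python
# Hedged sketch: candidate network structures as adjacency matrices;
# cyclic candidates survive in the swarm but get a heavily penalized
# fitness instead of being repaired.
import numpy as np

def has_cycle(adj):
    """Kahn's algorithm: acyclic iff every node can be removed in
    topological order."""
    indeg = adj.sum(axis=0).astype(int)
    queue = [v for v in range(len(adj)) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for w in np.nonzero(adj[v])[0]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen < len(adj)

def fitness(adj, score_fn, penalty=-1e9):
    # score_fn is a hypothetical data-driven network score (e.g. BIC/K2).
    return penalty if has_cycle(adj) else score_fn(adj)

toy = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]])          # edges 0->1->2->0 form a cycle
print(has_cycle(toy), fitness(toy, lambda a: a.sum()))  # True, penalty
```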
Alternative target functions for protein structure prediction with neural networks
Hai Deng, Robert Harrison, Yi Pan, et al.
The prediction and modeling of protein structure is a central problem in bioinformatics. Neural networks have been used extensively to predict the secondary structure of proteins. While significant progress has been made by using multiple sequence data, the ability to predict secondary structure from a single sequence and a single prediction network has stagnated at an accuracy of about 75%. This implies that there is some limit to the accuracy of the prediction. In order to understand this behavior, we asked what happens as we change the target function for the prediction. Instead of predicting a derived quantity, such as whether a given chain is a helix, sheet, or turn, we tested whether a more directly observed quantity, such as the distance between a pair of α-carbon atoms, could be predicted with reasonable accuracy. The α-carbon atom position is central to each residue in the protein, and the distances between successive α-carbons define the backbone of the protein. Knowledge of the distances between the α-carbon atoms is sufficient to determine the three-dimensional structure of the protein. We have trained on distance data derived from the complete protein structure database (PDB) using a multi-layered perceptron feedforward neural network with back propagation. The results show that the root mean square error is 0.4 Å with orthogonal coding of the protein primary sequence. This is comparable to the experimental error in the structures used to form the database. The effects of exploring other encoding schemes and different complexities of neural networks, as well as related target functions such as distance thresholds, will be presented.
Protein secondary structure prediction using neural networks
Preeti Singh, Yan-Qing Zhang
The three dimensional structure of a protein affects the structural, functional and biological characteristics of the various cells and genes significantly. Therefore, prediction of the structure of a protein is the key to understanding the biological functions of proteins. In this paper, a new hybrid neural network has been proposed for predicting protein secondary structure. A new encoding scheme for representing amino acid protein sequences has also been designed which greatly reduces the convergence time of the prediction process.
A direct approach for incomplete information systems
There are already several extensions of rough set theory for incomplete information systems, such as the tolerance relation, the limited tolerance relation, and the similarity relation. However, practical processing approaches and algorithms for these extensions are lacking. A direct approach for processing incomplete information systems is developed in this paper, covering discretization, attribute reduction, value reduction, and rule matching. This approach can be used with all kinds of extensions of rough set theory for incomplete information systems, and it is effective in both complete and incomplete information systems.
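For concreteness, here is a hedged sketch of the tolerance relation mentioned above: in an incomplete table, two objects are tolerant when every attribute either agrees or is missing in at least one of them. The toy table is invented.

```python
# Hedged sketch: tolerance relation for incomplete information systems,
# with None marking a missing attribute value.
def tolerant(x, y):
    return all(a is None or b is None or a == b for a, b in zip(x, y))

def tolerance_classes(table):
    """For each object, the set of objects it is tolerant with."""
    return {i: {j for j, y in enumerate(table) if tolerant(x, y)}
            for i, x in enumerate(table)}

# Toy incomplete table: rows are objects, columns are attributes.
table = [
    (None, 2),
    (1, 2),
    (0, 2),
]
print(tolerance_classes(table))
# {0: {0, 1, 2}, 1: {0, 1}, 2: {0, 2}} -- tolerance classes overlap and
# the relation is not transitive, unlike classical equivalence classes.
```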
Applications
Measuring the quality of information: a metric for military intelligence, surveillance, and reconnaissance (ISR)
Remote sensing is valuable for ISR insofar as it provides information relevant to the detection, classification, identification, and tracking of one or more targets of interest. A new system-level measure of the quality of ISR is given here. By combining information and detection theory, the measure gives a quantitative estimate of the relative proportions of reliable (good), misleading (bad), and missing information. Being probabilistic, the measure can be used to assess the quality of day-to-day surveillance, or to estimate the quality expected for more narrowly defined and speculative scenarios. The measure is intended for use in a cost-benefit study of sorts, comparing the quality of different sensors or mixes of sensors for ISR. The quality metric is demonstrated here using a maritime interdiction scenario drawn from maritime ISR.
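The abstract does not give the measure's formula, so the following is only a rough toy illustration of splitting reports into reliable (good), misleading (bad), and missing fractions from detection probabilities; the paper's actual metric also brings in information theory.

```python
# Hedged toy illustration (an assumption, not the paper's metric):
# partition the unit of surveillance information using a detection
# confusion model with target prior, detection and false-alarm rates.
def report_quality(p_target, p_detect, p_false_alarm):
    good = p_target * p_detect + (1 - p_target) * (1 - p_false_alarm)
    bad = (1 - p_target) * p_false_alarm        # misleading reports
    missing = p_target * (1 - p_detect)         # missed targets
    return good, bad, missing

print(report_quality(p_target=0.1, p_detect=0.8, p_false_alarm=0.05))
# -> (0.935, 0.045, 0.02); the three fractions sum to 1.
```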
A new data mining tool for analyzing coumarin-based prodrugs
Hao Fang, Jun Li, Yi Sun, et al.
This paper focuses on using fuzzy neural network data mining techniques to analyze nonlinear relations among chemical factors. After standardizing and rescaling the raw data, we fed the data into a fuzzy neural network, not only to learn chemical knowledge from large amounts of experimental data, but also to predict future chemical parameters for further experimental verification. The results show that the most relevant chemical factor can be obtained by analyzing the experimental errors using fuzzy rules.
Sensitive analysis and segmentation of low-contrast images: knowledge discovery based on multiparameter topological resonance method
Alexander M. Akhmetshin, Lyudmila G. Akhmetshina
A new method for the analysis and segmentation of low-contrast images is outlined. It provides high sensitivity for detecting visually invisible low-contrast areas while remaining stable against small local brightness variations. The method consists of four steps. 1) A moving window of size 3x3 is formed; the image brightness pixels within the window are treated as the coefficients of a virtual nonrecursive digital filter, which is characterized by its spectral characteristic. This makes it possible to associate a virtual spectral characteristic with each pixel of the analyzed image. 2) The resonance points (magnitudes and frequencies) of the virtual magnitude-frequency and group-delay characteristics are used as information features for further analysis (hence the method's designation as a multiparameter topological resonance method). 3) The newly synthesized features are visualized and the new images are analyzed separately. 4) If needed, the multiparameter information is fused into one resulting image using the self-organizing map method. Our experimental investigations established the importance of the virtual group-delay function (a new information characteristic) for detecting and segmenting low-contrast fuzzy regions of analyzed images. Experiments with real low-contrast images (X-ray CT, digital mammograms, and geophysical field images) yielded very good practical results in all cases.
Miscellaneous Approaches II
Sampling in association rule mining
A relation is a representation of a set, called the universe V, of entities by a set of tuples. Hence it is associated with a unique sub-lattice, called the relation lattice, of the partition lattice of V. In this paper, we examine the relation lattices on V and on a sample V' (a subset). The analysis concludes that only very special types of samples may be able to have the "same" association rules as those of the original universe; here "same" means allowing some statistical errors. In other words, finding association rules by mere random sampling may not fully reflect the association rules of the original universe; special attention to sampling is needed.
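A hedged toy illustration of the abstract's caution: an itemset's support estimated from a random sample fluctuates around the true support, so a minimum-support threshold can accept or reject the itemset depending on the draw.

```python
# Hedged sketch: sampled support vs. true support of an itemset.
import random

random.seed(0)
# 10,000 toy transactions; {"a","b"} has true support exactly 0.3.
universe = [{"a", "b"} if i % 10 < 3 else {"a"} for i in range(10000)]
sample = random.sample(universe, 200)

def support(transactions, itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

full = support(universe, {"a", "b"})
est = support(sample, {"a", "b"})
print(full, est)
# est lands near 0.3, but a minimum-support threshold set at exactly
# 0.3 may keep or drop {"a","b"} depending on the particular sample.
```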
Automatic document clustering of concept hypergraph decompositions
Tsau Young Lin, I-Jen Chiang
This paper presents an approach to classify/cluster web documents by decompositions of hypergraphs. The various levels of co-occurring frequent terms, called association rules (undirected rules), of the documents form a hypergraph. Clustering methods are then applied to analyze such hypergraphs; a simple and fast clustering algorithm is used to decompose the hypergraph into connected components. Each connected component represents a primitive concept within the given documents. The documents are then classified/clustered by these primitive concepts.
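The decomposition step can be sketched with a few lines of union-find; this is an illustration, not the paper's algorithm, and the hyperedges are invented frequent term sets.

```python
# Hedged sketch: decompose a term hypergraph into connected components.
# Hyperedges are frequent co-occurring term sets; each component plays
# the role of a "primitive concept".
def connected_components(hyperedges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for edge in hyperedges:                 # all terms in an edge join up
        for term in edge[1:]:
            union(edge[0], term)

    groups = {}
    for term in parent:
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())

# Toy hyperedges mined as frequent term sets from documents.
edges = [("data", "mining"), ("mining", "rules"), ("web", "crawler")]
print(connected_components(edges))
# -> [{'data', 'mining', 'rules'}, {'web', 'crawler'}]
```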
Rank and independence in contingency table
A contingency table summarizes the conditional frequencies of two attributes and shows how these two attributes are dependent on each other. Thus, this table is a fundamental tool for pattern discovery with conditional probabilities, such as rule discovery. In this paper, a contingency table is interpreted from the viewpoint of statistical independence and granular computing. The first important observation is that a contingency table compares two attributes with respect to the number of equivalence classes. For example, an n × n table compares two attributes with the same granularity, while an m × n (m ≥ n) table compares two attributes with different granularities. The second important observation is that matrix algebra is a key tool for the analysis of this table. In particular, rank, as a degree of independence, plays a very important role in evaluating the degree of statistical independence. Relations between rank and the degree of dependence are also investigated.
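The rank observation admits a short numeric illustration: a two-way contingency table of statistically independent attributes is, up to sampling noise, an outer product of its marginals and therefore has matrix rank 1.

```python
# Hedged numeric illustration: rank 1 signals independence, higher rank
# signals dependence between the two attributes of a contingency table.
import numpy as np

independent = np.outer([0.2, 0.8], [0.5, 0.3, 0.2]) * 100  # rank-1 table
dependent = np.array([[30.0, 5.0, 5.0],
                      [10.0, 25.0, 25.0]])

print(np.linalg.matrix_rank(independent))  # 1 -> independent attributes
print(np.linalg.matrix_rank(dependent))    # 2 -> dependent attributes
```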
Mining multiple table databases with multiple minimum support constraints
Nita Mital, Naveen Kumar, Vasudha Bhatnagar
Most current data mining algorithms handle databases consisting of a single table. Recently, a few algorithms have been proposed that mine data stored in multiple tables for association rules. Classical association rule mining implicitly assumes that all items in the data are of the same nature and/or have similar frequencies. In real-life applications this is not the case, so the use of only one minimum support value for the whole database may not be sufficient. If the minimum support is set high, rules that involve infrequent items will not be found; if it is set very low, the result may be a flood of rules. In this paper we present a technique that allows a user to specify multiple minimum supports to account for the natures of items and their varied frequencies in a multiple-table database environment. We discuss previous approaches that partly address the related issues and present a couple of algorithms to address the issue of multiple support constraints in a multiple-table environment.
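The multiple-minimum-support idea can be sketched as in Liu et al.'s single-table MSapriori formulation, where an itemset's threshold is the lowest minimum item support (MIS) among its items; the transactions and MIS values below are invented, and the paper extends this kind of constraint to multiple tables.

```python
# Hedged sketch: per-item minimum supports, so rare items can still form
# rules without a low global threshold flooding the output.
mis = {"milk": 0.05, "bread": 0.05, "caviar": 0.005}  # per-item MIS

transactions = ([{"milk", "bread"}] * 120
                + [{"bread"}] * 79
                + [{"bread", "caviar"}])              # 200 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent(itemset):
    # The itemset's threshold is the lowest MIS among its items.
    return support(itemset) >= min(mis[i] for i in itemset)

print(support({"bread", "caviar"}))    # 0.005
print(frequent({"bread", "caviar"}))   # True under caviar's low MIS
# A single global 5% minimum support would have pruned this itemset.
```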
Data Mining and Related Topics
XML algebras for data mining
Ming Zhang, JingTao Yao
XML is a new standard for data representation and exchange on the Internet. There are studies on XML query languages as well as XML algebras in the literature. However, little attention has been paid to research on XML algebras for data mining, partially due to the fact that there is no widely accepted definition of XML mining tasks. This paper examines XML mining tasks and provides guidelines for designing XML algebras for data mining. We also summarize and compare existing XML algebras. We argue that by adding operators for mining tasks, XML algebras may work well for data mining with XML documents.
Data mining support systems
Yinliang Zhao, JingTao Yao, Yiyu Yao
The main stream of research in data mining (or knowledge discovery in databases) focuses on algorithms and automatic or semi-automatic processes for discovering knowledge hidden in data. In this paper, we adopt a more general and goal oriented view of data mining. Data mining is regarded as a field of study covering the theories, methodologies, techniques, and activities with the goal of discovering new and useful knowledge. One of its objectives is to design and implement data mining systems. A miner solves problems of data mining manually, or semi-automatically by using such systems. However, there is a lack of studies on how to assist a miner in solving data mining problems. From the experiences and lessons of decision support systems, we introduce the concept of data mining support systems (DMSS). We draw an analogy between the field of decision-making and the field of data mining, and between the role of a manager and the role of a data miner. A DMSS is an active and highly interactive computer system that assists data mining activities. The needs and the basic features of DMSS are discussed.
Web mining for topics defined by complex and precise predicates
The enormous growth of the World Wide Web has made it important to perform resource discovery efficiently for any given topic. Several new techniques have been proposed in recent years for this kind of topic-specific web mining; among them is a key new technique called focused crawling, which is able to crawl topic-specific portions of the web without having to explore all pages. Most existing research on focused crawling considers a simple topic definition that typically consists of one or more keywords connected by an OR operator. However, this kind of simple topic definition may result in too many irrelevant pages in which the same keyword appears in the wrong context. In this research we explore new strategies for crawling topic-specific portions of the web using complex and precise predicates. A complex predicate allows the user to precisely specify a topic using Boolean operators such as "AND", "OR", and "NOT". Our work concentrates, first, on defining a format to specify this kind of complex topic definition and, second, on devising a crawl strategy to crawl the topic-specific portions of the web defined by the complex predicate, efficiently and with minimal overhead. Our new crawl strategy improves the performance of topic-specific web crawling by reducing the number of irrelevant pages crawled. To demonstrate the effectiveness of this approach, we have built a complete focused crawler called "Eureka" with complex predicate support, and a search engine that indexes and supports end-user searches on the crawled pages.
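A hedged sketch of evaluating such a complex predicate against a fetched page; the nested-tuple predicate format is an assumption, since the abstract does not specify the format "Eureka" actually uses.

```python
# Hedged sketch: recursive evaluation of an AND/OR/NOT topic predicate
# over a page's word set, as a focused crawler might do per fetched page.
def matches(pred, words):
    op = pred[0]
    if op == "TERM":
        return pred[1] in words
    if op == "AND":
        return all(matches(p, words) for p in pred[1:])
    if op == "OR":
        return any(matches(p, words) for p in pred[1:])
    if op == "NOT":
        return not matches(pred[1], words)
    raise ValueError(op)

# Topic: pages about the jaguar animal, excluding car-related pages.
topic = ("AND", ("TERM", "jaguar"),
                ("NOT", ("OR", ("TERM", "car"), ("TERM", "dealership"))))

page = set("the jaguar is a large cat native to the americas".split())
print(matches(topic, page))   # True: follow this page's outlinks
```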
Toward building a comprehensive data mart
Douglas Boulware, John Salerno, Richard Bleich, et al.
To uncover new relationships or patterns, one must first build a corpus of data, or what some call a data mart. How can we make sure we have collected all the pertinent data and have maximized coverage? There are hundreds of search engines available for use on the Internet today. Which one is best? Is one better for one problem and a second better for another? Are meta-search engines better than individual search engines? In this paper we look at one possible approach to developing a methodology for comparing a number of search engines. Before we present this methodology, we first provide our motivation for the need for increased coverage. We next investigate how we can obtain ground truth and what insight it can give us into the Internet and search engine capabilities. We then conclude by developing a methodology for comparing a number of the search engines and discussing how we can increase overall coverage and thus build a more comprehensive data mart.
Materialized view selection based on query cost in data warehouse
Lijuan Zhou, Chi Liu, Daxin Liu
Selecting views to materialize impacts the efficiency as well as the total cost of establishing and running a data warehouse. One of the most important decisions in designing a data warehouse is the selection of the right views to materialize. The classical problem is to select a set of views that minimizes the total query response time and the cost of view maintenance under a storage space constraint. In our practical application, however, the factor that refrains us from materializing all views in the data warehouse is not the space constraint but the query response time: queries may require fast answers. So we develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total view maintenance time under the constraint of a given query response time. We call this the query-cost view selection problem. First, we design algorithms for the query-cost view selection problem, introducing a view node matrix to solve it. Second, we use experiments to demonstrate the power of our approach. The results show that our algorithm works better in practical cases. We implemented our algorithms, and a performance study shows that the proposed algorithm delivers an optimal solution. Finally, we discuss the observed behavior of the algorithms and identify some important issues for future investigation.
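Since the abstract does not detail the algorithm, the following is only a hedged greedy sketch of the query-cost view selection problem as stated: materialize views until the query-time budget is met, preferring views with the best time saving per unit of maintenance cost. All costs are invented.

```python
# Hedged greedy sketch (not the paper's view-node-matrix algorithm):
# minimize added maintenance cost subject to a query response time budget.
views = {  # view -> (query time saved if materialized, maintenance cost)
    "v1": (50, 10), "v2": (30, 2), "v3": (20, 15), "v4": (40, 8),
}
BASE_QUERY_TIME, QUERY_BUDGET = 200, 120

chosen, qtime, maint = set(), BASE_QUERY_TIME, 0
while qtime > QUERY_BUDGET:
    # Greedy: best time saving per unit of added maintenance cost.
    pick = max((v for v in views if v not in chosen),
               key=lambda v: views[v][0] / views[v][1], default=None)
    if pick is None:
        break                      # budget unreachable with these views
    chosen.add(pick)
    qtime -= views[pick][0]
    maint += views[pick][1]

print(chosen, qtime, maint)        # e.g. {'v2', 'v1'} 120 12
```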
Miscellaneous Approaches II
Mining fuzzy conceptual clusters and constructing the fuzzy conceptual frame lattices
The key idea here is to use formal concept analysis and a fuzzy membership criterion to partition the data space into clusters and provide knowledge through fuzzy lattices. The procedures described here are regarded as a mapping or transform of the original space (samples) onto concepts. The mapping is further given the fuzzy membership criteria for clustering, from which the clustered concepts of various degrees are found. A bucket hashing measure is used as the measure of similarity in the proposed algorithm; the concepts are evaluated on the basis of this criterion and then clustered. The intuitive appeal of this approach lies in the fact that once the concepts are clustered, the data analyst is equipped with the concept measure as well as the identification of the bridging points. An interactive concept map visualization technique, called the Fuzzy Conceptual Frame Lattice or Fuzzy Concept Lattice, is presented for user-guided knowledge discovery from the knowledge base.
Miscellaneous Approaches I
A new framework for intrusion detection based on rough set theory
Intrusion detection is an essential component of critical infrastructure protection mechanism. Since many current IDSs are constructed by manual encoding of expert knowledge, it is time-consuming to update their knowledge. In order to solve this problem, an effective method for misuse intrusion detection with low cost and high efficiency is presented. This paper gives an overview of our research in building a detection model for identifying known intrusions, their variations and novel attacks with unknown natures. The method is based on rough set theory and capable of extracting a set of detection rules from network packet features. After getting a decision table through preprocessing raw packet data, rough-set-based reduction and rule generation algorithms are applied, and useful rules for intrusion detection are obtained. In addition, a rough set and rule-tree-based incremental knowledge acquisition algorithm is presented in order to solve problems of updating rule set when new attacks appear. Compared with other methods, our method requires a smaller size of training data set and less effort to collect training data. Experimental results demonstrate that our system is effective and more suitable for online intrusion detection.