Proceedings Volume 4730

Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 12 March 2002
Contents: 13 Sessions, 51 Papers, 0 Presentations
Conference: AeroSense 2002
Volume Number: 4730

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Attributes Extraction, Assessment, and Selection
  • Approximate Reasoning: Fuzzy Logic and Rough Sets
  • Pattern Analysis and Visualization
  • Neural Nets and Genetic Algorithms
  • Association Rules
  • Knowledge Discovery Processes
  • Data Mining Tools
  • Miscellaneous Tools I
  • Web Applications
  • Text Data and Electronic Commerce Applications
  • Medical, Laboratory, and Natural Resources Applications
  • Miscellaneous Applications
  • Miscellaneous Tools II
Attributes Extraction, Assessment, and Selection
Feature transformations and structure of attributes
Let V be a set of real-world entities that admits a relational model. An attribute induces an equivalence relation on V and will be regarded as such. Let A* be the set of such equivalence relations and their refinements. Then the smallest sublattice L(A*) generated by A* in the partition lattice of V is called the generalized relation lattice. The set of granules in L(A*) whose cardinality is above a certain threshold is the set of all possible patterns derivable from the given attributes (features) of the given relation.
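A minimal sketch of the construction just described, in Python with hypothetical toy data: each attribute induces a partition of V into equivalence classes (granules), the lattice meet of two attribute-induced partitions is their common refinement, and only granules above a cardinality threshold are kept as candidate patterns.

```python
from collections import defaultdict

# Toy relation: each entity in V is a dict of attribute values (hypothetical data).
V = [
    {"color": "red", "size": "S"}, {"color": "red", "size": "L"},
    {"color": "blue", "size": "S"}, {"color": "red", "size": "S"},
    {"color": "blue", "size": "L"}, {"color": "red", "size": "S"},
]

def partition(entities, attrs):
    """Equivalence classes (granules) induced by a set of attributes:
    two entities are equivalent iff they agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, e in enumerate(entities):
        classes[tuple(e[a] for a in attrs)].append(i)
    return list(classes.values())

def meet(attrs1, attrs2, entities):
    """Lattice meet of two attribute-induced partitions = their common refinement."""
    return partition(entities, sorted(set(attrs1) | set(attrs2)))

def large_granules(entities, attrs, threshold):
    """Granules whose cardinality is at least the threshold: candidate patterns."""
    return [g for g in partition(entities, attrs) if len(g) >= threshold]

print(partition(V, ["color"]))                   # partition induced by one attribute
print(meet(["color"], ["size"], V))              # common refinement of two attributes
print(large_granules(V, ["color", "size"], 2))   # patterns above threshold 2
```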
Multiclass kernel-based feature extraction
Yi Zhao, Honglin Li, Stanley C. Ahalt
Feature Extraction (FE) algorithms have attracted great attention in recent years. In order to improve the performance of FE algorithms, nonlinear kernel transformations (e.g., the kernel trick) and scatter-matrix-based class separability criteria have been introduced in Kernel-based Feature Extraction (KFE). However, for any L-class problem, at most L-1 nonlinear kernel features can be extracted by KFE, which is not desirable for many applications. To solve this problem, a modified kernel-based feature extraction (MKFE) based on nonparametric scatter matrices was proposed, but with the limitation of only being able to extract multiple features for 2-class problems. In this paper, we present a general MKFE algorithm for multi-class problems. The core of our algorithm is a novel expression of the nonparametric between-class matrix, which is shown to be consistent with the definition of the parametric between-class matrix in the sense of the scatter-matrix-based class separability criteria. Based on this expression of the between-class matrix, our algorithm is able to extract multiple kernel features in multi-class problems. To speed up the computation, we also propose a simplified formula. Experimental results using synthetic data are provided to demonstrate the effectiveness of our proposed algorithm.
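For orientation, here is a sketch of the baseline scatter-matrix KFE that the abstract builds on: kernel discriminant-style extraction via a generalized eigenproblem on dual between-class and within-class scatter matrices. It is not the paper's MKFE; the RBF kernel, the regularization term, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kfe_fit(X, y, gamma=1.0, reg=1e-3):
    """Baseline scatter-matrix kernel feature extraction (kernel discriminant style):
    maximize between-class vs. within-class scatter in the kernel-induced space.
    At most L-1 useful directions for an L-class problem, as noted in the abstract."""
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    m_all = K.mean(axis=1)
    M = np.zeros((n, n))   # between-class scatter (dual form)
    N = np.zeros((n, n))   # within-class scatter (dual form)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]
        mc = Kc.mean(axis=1)
        diff = mc - m_all
        M += len(idx) * np.outer(diff, diff)
        H = np.eye(len(idx)) - np.full((len(idx), len(idx)), 1.0 / len(idx))
        N += Kc @ H @ Kc.T
    # Generalized eigenproblem: leading eigenvectors give the extraction coefficients.
    _, A = eigh(M, N + reg * np.eye(n))
    return X, A[:, ::-1], gamma        # training data, dual coefficients, kernel width

def kfe_transform(model, Xnew, n_features):
    Xtr, A, gamma = model
    return rbf_kernel(Xnew, Xtr, gamma) @ A[:, :n_features]

# Hypothetical 3-class synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in ((0, 0), (2, 0), (0, 2))])
y = np.repeat([0, 1, 2], 30)
Z = kfe_transform(kfe_fit(X, y), X, n_features=2)
print(Z.shape)  # (90, 2)
```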
Approximate Reasoning: Fuzzy Logic and Rough Sets
Data mining for multiagent rules, strategies, and fuzzy decision tree structure
James F. Smith III, Robert D. Rhyne II, Kristin Fisher
A fuzzy logic based resource manager (RM) has been developed that automatically allocates electronic attack resources in real-time over many dissimilar platforms. Two different data mining algorithms have been developed to determine rules, strategies, and fuzzy decision tree structure. The first data mining algorithm uses a genetic algorithm as a data mining function and is called from an electronic game. The game allows a human expert to play against the resource manager in a simulated battlespace with each of the defending platforms being exclusively directed by the fuzzy resource manager and the attacking platforms being controlled by the human expert or operating autonomously under their own logic. This approach automates the data mining problem. The game automatically creates a database reflecting the domain expert's knowledge. It calls a data mining function, a genetic algorithm, for data mining of the database as required and allows easy evaluation of the information mined in the second step. The criterion for re-optimization is discussed as well as experimental results. Then a second data mining algorithm that uses a genetic program as a data mining function is introduced to automatically discover fuzzy decision tree structures. Finally, a fuzzy decision tree generated through this process is discussed.
Evolutionary approach for discovering changing patterns in historical data
Wai-Ho Au, Keith C. C. Chan
In this paper, we propose a new data mining approach, called dAR, for discovering interesting association rules and their changes by evolutionary computation. dAR searches through huge rule spaces effectively using a genetic algorithm. It has the following characteristics: (i) it encodes a complete set of rules in one single chromosome; (ii) each allele encodes one rule and each rule is represented by some non-binary symbolic values; (iii) the evolutionary process begins with the generation of an initial set of first-order rules (i.e., rules with one condition) using a probabilistic induction technique and based on these rules, rules of higher order (two or more conditions) are obtained iteratively; (iv) it adopts a steady-state reproduction scheme in which only two chromosomes are replaced every time; (v) when identifying interesting rules, an objective interestingness measure is used; and (vi) the fitness of a chromosome is defined in terms of the probability that the attribute values of a tuple can be correctly determined using the rules it encodes. Furthermore, dAR can also be used to mine the changes in discovered rules over time. This allows the accurate prediction of the future based on the historical data in the past. The experimental results on a synthetic database have shown that dAR is very effective at mining interesting association rules and their changes over time.
Statistical test for rough set approximation
Rough set based rule induction methods have been applied to knowledge discovery in databases, and the empirical results obtained show that they are very powerful and that some important knowledge has been extracted from datasets. However, quantitative evaluation of the lower and upper approximations is based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce a new approach to the quantitative evaluation of the lower and upper approximations of the original and variable precision rough set models, which can be viewed as a statistical test for rough set methods. For this extension, the chi-square distribution, the F-test, and the likelihood ratio test play an important role in the statistical evaluation. The chi-square test statistic measures statistical information about an information table, while the F-test statistic and the likelihood ratio statistic are used to measure the difference between two tables.
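A small illustration of this kind of statistical evaluation (a sketch with a hypothetical information table; Pearson's chi-square and the likelihood-ratio G statistic from scipy stand in for the paper's exact formulation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical information table: each row is (condition attribute value, decision class).
table = [("a", "yes"), ("a", "yes"), ("a", "no"), ("b", "no"),
         ("b", "no"), ("b", "yes"), ("a", "yes"), ("b", "no")]

def contingency(rows):
    conds = sorted({c for c, _ in rows})
    decs = sorted({d for _, d in rows})
    m = np.zeros((len(conds), len(decs)), dtype=int)
    for c, d in rows:
        m[conds.index(c), decs.index(d)] += 1
    return m

obs = contingency(table)

# Pearson chi-square: statistical evidence that the condition granules carry
# information about the decision, beyond a raw conditional probability.
chi2, p_chi2, dof, _ = chi2_contingency(obs, correction=False)

# Likelihood ratio (G) statistic on the same table.
g, p_g, _, _ = chi2_contingency(obs, correction=False, lambda_="log-likelihood")

print(obs)
print(f"chi2={chi2:.3f} (p={p_chi2:.3f}), G={g:.3f} (p={p_g:.3f}), dof={dof}")
```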
Knowledge reduction algorithms based on rough set and conditional information entropy
Hong Yu, Guoyin Wang, Dachun Yang, et al.
Rough set theory is a valid mathematical theory developed in recent years, which has the ability to deal with imprecise, uncertain, and vague information. It has been applied successfully in such fields as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Many researchers have studied rough sets from different viewpoints. In this paper, the authors discuss the reduction of knowledge using information entropy in rough set theory. First, the changing tendency of the conditional entropy of the decision attributes given the condition attributes is studied from the viewpoint of information theory. Then, two new algorithms based on conditional entropy are developed. These two algorithms are analyzed and compared with the MIBARK algorithm. Furthermore, our simulation results show that the algorithms can find the minimal reduction in most cases.
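A minimal sketch of the conditional-entropy criterion such reduction algorithms rest on: H(D|C) is computed over the granules induced by the condition attributes, and attributes are added greedily until the entropy of the full attribute set is matched. The toy decision table and the greedy strategy are illustrative assumptions, not the paper's two algorithms.

```python
from collections import Counter
from math import log2

# Hypothetical decision table: condition attributes a, b, c and decision d.
rows = [
    {"a": 1, "b": 0, "c": 1, "d": "yes"}, {"a": 1, "b": 1, "c": 1, "d": "yes"},
    {"a": 0, "b": 0, "c": 1, "d": "no"},  {"a": 0, "b": 1, "c": 0, "d": "no"},
    {"a": 1, "b": 0, "c": 0, "d": "yes"}, {"a": 0, "b": 0, "c": 0, "d": "no"},
]

def cond_entropy(rows, conds, dec="d"):
    """H(D | C): entropy of the decision given the condition attributes in conds."""
    n = len(rows)
    groups = Counter(tuple(r[a] for a in conds) for r in rows)
    h = 0.0
    for key, cnt in groups.items():
        dec_counts = Counter(r[dec] for r in rows
                             if tuple(r[a] for a in conds) == key)
        h_given = -sum((k / cnt) * log2(k / cnt) for k in dec_counts.values())
        h += (cnt / n) * h_given
    return h

def greedy_reduct(rows, conds, dec="d"):
    """Greedily add the attribute that lowers H(D|C) most, until it matches the
    conditional entropy of the full attribute set (a heuristic reduct)."""
    target = cond_entropy(rows, conds, dec)
    chosen = []
    while cond_entropy(rows, chosen, dec) > target:
        best = min((a for a in conds if a not in chosen),
                   key=lambda a: cond_entropy(rows, chosen + [a], dec))
        chosen.append(best)
    return chosen

print(greedy_reduct(rows, ["a", "b", "c"]))  # e.g. ['a']
```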
Pattern Analysis and Visualization
Analysis and summarization of correlations in data cubes
Chien-Yu Chen, Shien-Ching Hwang, Yen-Jen Oyang
This paper presents a novel mechanism to analyze and summarize the statistical correlations among the attributes of a data cube. To perform the analysis and summarization, this paper proposes a new measure of statistical significance. The main reason for proposing the new measure of statistical significance is to have an essential closure property, which is exploited in the summarization stage of the data mining process. In addition to the closure property, the proposed measure of statistical significance has two other important properties. First, the proposed measure of statistical significance is more conservative than the well-known chi-square test in classical statistics and, therefore, inherits its statistical robustness. This paper does not simply employ the chi-square test due to lack of the desired closure property, which may lead to a precision problem in the summarization process. The second additional property is that, though the proposed measure of statistical significance is more conservative than the chi-square test, for most cases, the proposed measure yields a value that is almost equal to a conventional measurement of statistical significance based on the normal distribution. Based on the closure property addressed above, this paper develops an algorithm to summarize the results from performing statistical analysis in the data cube. Though the proposed measure of statistical significance avoids the precision problem due to having the closure property, its conservative nature may lead to a recall rate problem in the data mining process. On the other hand, if the chi-square test, which does not have the closure property, was employed, then the summarization process may suffer a precision problem.
Selection of views to materialize using simulated annealing algorithms
Lijuan Zhou, Chi Liu, Hongfeng Wang, et al.
A data warehouse contains many materialized views over the data provided by distributed heterogeneous databases, for the purpose of efficiently implementing decision-support or OLAP queries. It is important to select the right views to materialize so that a given set of queries can be answered efficiently. The goal is the minimization of the combination of the query evaluation and view maintenance costs. In this paper, we address and design algorithms for selecting a set of views to be materialized so that the sum of the cost of processing a set of queries and the cost of maintaining the materialized views is minimized. We develop an approach using simulated annealing algorithms to solve this problem. First, we explore simulated annealing algorithms to optimize the selection of materialized views. Then we use experiments to demonstrate our approach. The experimental results show that our algorithm works well. We implemented our algorithms, and a performance study shows that the proposed algorithm gives an optimal solution.
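A compact sketch of view selection by simulated annealing under the stated cost model (query-evaluation cost plus maintenance cost). The candidate views, per-query costs, base-table cost, and cooling schedule are all hypothetical; the point is the accept/reject loop over neighboring view sets.

```python
import math, random

random.seed(42)

# Hypothetical candidate views with maintenance costs, and per-query evaluation
# costs when a given view is materialized (otherwise queries hit the base tables).
views = {f"v{i}": random.uniform(1, 5) for i in range(8)}          # maintenance costs
queries = [{v: random.uniform(1, 10) for v in views} for _ in range(20)]
BASE_COST = 20.0   # assumed cost of answering a query from the base tables

def total_cost(selected):
    """Query-evaluation cost (cheapest usable view or base tables) plus
    maintenance cost of the materialized views."""
    q = sum(min([BASE_COST] + [c for v, c in qc.items() if v in selected])
            for qc in queries)
    return q + sum(views[v] for v in selected)

def neighbor(selected):
    """Flip one view in or out of the materialized set."""
    s = set(selected)
    s.symmetric_difference_update({random.choice(list(views))})
    return s

def anneal(t0=50.0, alpha=0.95, steps=2000):
    cur = set(random.sample(list(views), 3))
    best, t = cur, t0
    for _ in range(steps):
        cand = neighbor(cur)
        delta = total_cost(cand) - total_cost(cur)
        if delta < 0 or random.random() < math.exp(-delta / t):   # accept rule
            cur = cand
            if total_cost(cur) < total_cost(best):
                best = set(cur)
        t *= alpha                                                # cooling
    return best, total_cost(best)

print(anneal())
```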
Visualization method and tool for interactive learning of large decision trees
Trong Dung Nguyen, TuBao Ho
When learning from large datasets, decision tree induction programs often produce very large trees. How to visualize trees efficiently during the learning process, particularly large trees, remains an open question and requires efficient tools. This paper presents a visualization method and tool for interactive learning of large decision trees, which includes a new visualization technique called T2.5D (which stands for Trees 2.5 Dimensions). After a brief discussion of requirements for tree visualizers and related work, the paper focuses on techniques for two issues: (1) how to visualize large decision trees efficiently; and (2) how to visualize decision trees during the learning process.
Neural Nets and Genetic Algorithms
Processing Landsat TM data using complex valued neural networks
Neural networks are massively parallel arrays of simple processing units that can be used for computationally complicated tasks such as image processing. This paper develops an efficient method for processing remote-sensing satellite data using complex-valued artificial neurons, as an approach to the computer vision problems of region identification and classification as they are applied to satellite data. Because of the amount of data to be processed and the complexity of the tasks required, problems arise when using ANNs, specifically the very long training time required for large ANNs on conventional computers. These problems effectively prevent an average user from performing his own analysis. The solution presented here applies a recently developed complex-valued artificial neuron model to this real-world problem. This model was then coded, run, and verified on personal computers. Results show the CVN to be an accurate and computationally efficient model.
Computational study of developing high-quality decision trees
Zhiwei Fu
Recently, decision tree algorithms have been widely used in dealing with data mining problems to find valuable rules and patterns. However, scalability, accuracy, and efficiency are significant concerns regarding how to effectively deal with large and complex data sets in the implementation. In this paper, we propose an innovative machine learning approach (which we call GAIT), combining a genetic algorithm, statistical sampling, and decision trees, to develop intelligent decision trees that can alleviate some of these problems. We design our computational experiments and run GAIT on three different data sets (namely Socio-Olympic data, Westinghouse data, and FAA data) to test its performance against a standard decision tree algorithm, a neural network classifier, and a statistical discriminant technique, respectively. The computational results show that our approach substantially outperforms the standard decision tree algorithm at lower sampling levels, and achieves significantly better results with less effort than both the neural network and discriminant classifiers.
Neural network and genetic algorithm technology in data mining of manufacturing quality information
Limei Song, Xing-Hua Qu, Shenghua Ye
Data mining of Manufacturing Quality Information (MQI) is a key technology in quality lead control. Among data mining methods, neural networks and genetic algorithms are widely used for their strong advantages, such as nonlinearity, parallelism, and accuracy. Used singly, however, each has limitations, such as slow convergence and blind search. This paper combines their merits and uses a genetic BP algorithm for data mining of MQI. It has been successfully used in a key project of the Natural Science Foundation of China (NSFC), Quality Control and Zero-Defect Engineering (Project No. 59735120).
Association Rules
Data mining in metadata repositories
Dragos Arotaritei
Metadata are data about data and can refer to a wide range of information categories: journals, digital libraries, structured and semistructured documents, etc. Our approach mainly addresses the problem of discovering association rules in a metadata repository associated with semistructured documents. Extensions to heterogeneous documents and a possible application to unstructured documents are also taken into account. The metadata stored in metadata repositories are processed by translating them into a table, similar to the well-known basket of the association rule discovery problem. Slightly modified Apriori and AprioriAll algorithms are used to discover association rules among the values of metadata attributes. Experimental results over a selected collection of metadata stored in a repository are presented.
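A compact level-wise Apriori sketch over "baskets" of attribute=value items, the translation step the abstract describes. The metadata records, the minimum support, and the simplified candidate generation (no subset-based pruning) are assumptions for illustration; AprioriAll for sequences is not shown.

```python
from itertools import combinations

# Hypothetical baskets: each metadata record translated into a set of
# attribute=value items, as described in the abstract.
baskets = [
    {"type=article", "lang=en", "subject=physics"},
    {"type=article", "lang=en", "subject=biology"},
    {"type=report", "lang=en", "subject=physics"},
    {"type=article", "lang=en", "subject=physics"},
    {"type=report", "lang=fr", "subject=physics"},
]

def apriori(baskets, min_support=0.4):
    """Level-wise Apriori: frequent k-itemsets are built only from frequent
    (k-1)-itemsets, and support is counted one level at a time."""
    n = len(baskets)
    support = lambda items: sum(items <= b for b in baskets) / n
    level = [frozenset([i]) for i in set().union(*baskets)]
    frequent = {}
    while level:
        level = [s for s in level if support(s) >= min_support]
        frequent.update({s: support(s) for s in level})
        # Candidate generation: join frequent k-itemsets that differ in one item.
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

for itemset, sup in sorted(apriori(baskets).items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(sup, 2))
```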
Algebraic specification of association rule queries
Raj P. Gopalan, Tariq Nuruddin, Yudho Giri Sucahyo
In this paper, we present an algebraic specification for association rule queries that can form the foundation for integrating data mining and database management. We first define a set of nested algebraic operators needed to specify association rule queries. Association rule discovery is then expressed as a query tree of these operators. The expressiveness of the algebra is indicated by specifying some of the variants of association rule queries as query trees. Other variants of association rule queries discussed in the literature can also be represented using the algebra. Constrained association queries (CAQs) have been proposed by researchers to limit the number of rules discovered. We discuss the representation of CAQs using the algebra. Certain sequences of algebraic operators occur together in most of the query variants. These sequences are combined as modules to simplify the presentation of query trees. While the focus of the paper is the algebraic specification of association rule queries, we briefly discuss the optimization issues in implementing the algebra for association rule mining. The grouping of algebraic operators into modules facilitates the use of existing algorithms for association rules in query optimization.
Study and improvement on hierarchical algorithm of association rule
Luo Zhong, Hongxia Xia, Jingling Yuan
This paper introduces the problem of mining association rules. We adopt an iterative method that enlarges the size of the itemsets gradually and describe the hierarchical (level-wise) algorithm in detail. The hierarchical algorithm produces larger candidate sets based on the frequent itemsets already obtained, and ensures that candidate sets which can never become frequent itemsets are ignored on the basis of the known information. Finally, we present an improved algorithm that combines the last several iterations into a single scan of the database D, mainly because the closer the iterative process comes to its end, the fewer candidate sets remain.
Discovering fuzzy spatial association rules
Esen Kacar, Nihan Kesim Cicekli
Discovering interesting, implicit knowledge and general relationships in geographic information databases is very important for understanding and using these spatial data. One of the methods for discovering this implicit knowledge is mining spatial association rules. A spatial association rule is a rule indicating certain association relationships among a set of spatial and possibly non-spatial predicates. In the mining process, data is organized in a hierarchical manner. However, in real-world applications it may not be possible to construct a crisp structure for this data; instead, some fuzzy structures should be used. Fuzziness, i.e. partial belonging of an item to more than one sub-item in the hierarchy, could be applied to the data itself, and also to the hierarchy of spatial relations. This paper shows that strong association rules can be mined from large spatial databases using fuzzy concept and spatial relation hierarchies.
Knowledge Discovery Processes
Knowledge discovery about scientific papers or proceedings referenced NASA/DAAC data with a rule-based classifier
Donglian Sun, Chris Lynnes, Richard K. Kiang, et al.
We perform knowledge discovery from online journals, abstracts, and citation indices, cross-referenced with the NASA Distributed Active Archive Center (DAAC) user/order database, to close the data-knowledge loop. Knowledge discovery in databases (KDD) has been defined as the nontrivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns from data. The core step of the KDD process is data mining. Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be used to gain insight into aspects of the organization's operations and predict outcomes for future situations. Patterns often concern the categories to which situations belong. In our case, the task is to decide whether a journal paper used NASA DAAC data or not. Starting from the Goddard DAAC user/order database records, a rule-based classifier was developed: rules were first found from training samples, and these rules were then applied to recognize new patterns.
Aspects of knowledge discovery in technical data
Steffen Brueckner, Stephan Rudolph
In many engineering applications numerical software has reached a more or less satisfactory quality of predicting the system behavior. A major disadvantage with this kind of software is that it can only be used in a later step in the engineering design process since it requires a detailed system model, such as a finite element simulation model for structural mechanics. Finite element software can only give satisfying results when the complete geometry and all material parameters are specified. However, despite all the parameter definitions in such simulation models, still a severe validation effort with experiments is needed to investigate the model abstractions. Especially in the early conceptual design phase, a need for simplified modeling and the prediction of the system behavior using only little knowledge about a new design exists. This kind of conceptual knowledge can be given e.g. in simple algebraic equations. These equations can either be derived from first principles or from knowledge discovery in data of previous designs, the latter being the topic of this work. Huge amounts of experimental data have been recorded and stored by industry especially in the past ten years with microcomputers being available throughout the companies. Additionally the engineering domain, other than e.g. the business domain, has the advantage that at least a small number of planned experiments can be conducted to enhance the data quantity and quality and to validate the knowledge discovery results. This paper emphasizes the need for a modified knowledge discovery process for engineering (and other scientific domains) and shows the differences to the traditional knowledge discovery in data bases.
Knowledge discovery process for scientific and engineering data
Luis J. Barrios, Stephan Rudolph
Scientists and engineers are often confronted with the problem of modeling the physical laws that govern complex processes and systems. This task may generally be accomplished following traditional modeling procedures. However, when dealing with multivariate problems and/or huge quantities of experimental data, the modeling problem can easily become unmanageable. In such cases, knowledge discovery techniques may help to address this problem. Current knowledge discovery methods, however, rely mainly on inductive data mining techniques and do not make use of the structural properties of the specific physical context. Hence, they are not yet the ideal process solution for discovering functional models in science and engineering. This paper discusses a knowledge discovery process which combines deductive and inductive reasoning techniques to find mathematical models of physical systems. In the supplementary deductive process, the technique of dimensional analysis is used. This allows the incorporation of background knowledge of the involved domain to enrich the general process of knowledge discovery. The background knowledge thereby forms the specific context for a knowledge discovery process for concrete scientific data. As an example, the introduced method is used to find the expression for the drag force that a viscous fluid exerts on a submersed and uniformly moving solid. The various issues that arise in the development and implementation of such a knowledge discovery system based on the method of dimensional analysis are analyzed and discussed.
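As a sketch of the deductive, dimensional-analysis step for the drag-force example (standard Buckingham-pi reasoning; the paper's full discovery process goes further): five governing quantities with three base dimensions leave two dimensionless groups, so the data-driven part only has to identify one function of the Reynolds number.

```latex
F = f(\rho, v, L, \mu), \qquad
[F]=\mathrm{M\,L\,T^{-2}},\;
[\rho]=\mathrm{M\,L^{-3}},\;
[v]=\mathrm{L\,T^{-1}},\;
[L]=\mathrm{L},\;
[\mu]=\mathrm{M\,L^{-1}\,T^{-1}}

% Buckingham pi: 5 quantities - 3 base dimensions = 2 dimensionless groups
\Pi_1 = \frac{F}{\rho v^2 L^2}, \qquad
\Pi_2 = \mathrm{Re} = \frac{\rho v L}{\mu}

% Hence the model to be identified from data reduces to one unknown function:
\frac{F}{\rho v^2 L^2} = \varphi(\mathrm{Re})
\quad\Longrightarrow\quad
F = \rho v^2 L^2 \, \varphi\!\left(\frac{\rho v L}{\mu}\right)
```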
Discovery of knowledge from diagnostic databases
Wojciech A. Moczulski, Pawel Kostka
The paper deals with the acquisition of diagnostic knowledge that is relevant for detection and isolation of a special class of malfunctions of rotating machinery called shaft misalignment. To detect a misalignment of a given shaft supported by multiple journal bearings, decision trees have been applied. These trees have been discovered in a database collected in a numerical experiment performed with a well-verified simulation system. A novel approach to the definition of classes of misalignment has been introduced. Several new methods of attribute selection and evaluation of the classifier's performance have been suggested and verified. Finally, a new method of diagnosing misalignment of rotating machinery has been formulated. This method may be efficiently implemented for real rotating machinery.
Data Mining Tools
Data modeling for data mining
Data mining has been treated as a set of operations added to the classical data model, so data mining uses the same terms as databases. This is an incorrect or confusing approach; though the syntax is the same, the semantics are very different. This paper presents a series of critiques, analyzes the current status, and proposes a data model for data mining in both traditional relational and semantically richer databases.
Data mining approach to bipolar cognitive map development and decision analysis
Wen-Ran Zhang
A data mining approach to cognitive mapping is presented based on bipolar logic, bipolar relations, and bipolar clustering. It is shown that a correlation network derived from a database can be converted to a bipolar cognitive map (or bipolar relation). A transitive, symmetric, and reflexive bipolar relation (equilibrium relation) can be used to identify focal links in decision analysis. It can also be used to cluster a set of events or itemsets into three different clusters: coalition sets, conflict sets, and harmony sets. The coalition sets are positively correlated events or itemsets; each conflict set is a negatively correlated set of two coalition subsets; and a harmony set consists of events that are both negatively and positively correlated. A cognitive map and the clusters can then be used for online decision analysis. This approach combines knowledge discovery with the views of decision makers and provides an effective means for online analytical processing (OLAP) and online analytical mining (OLAM).
Data mining and decision making
Models and algorithms for effective decision-making in a data-driven environment are discussed. To enhance the quality of the extracted knowledge and decision-making, the data sets are transformed, the knowledge is extracted with multiple algorithms, the impact of the decisions on the modeled process is simulated, and the parameters optimizing process performance are recommended. The applications discussed in this paper differ from most data mining tasks, where the extracted knowledge is used to assign decision values to new objects that have not been included in the training data. For example, in a typical data mining application the equipment fault is recognized based on the failure symptoms. In this paper, a subset of rules is selected from the extracted knowledge to meet the established decision-making criteria. The parameter values represented by the conditions of this set of rules are called a decision signature. A model and algorithms for the selection of the desired parameters (decision signatures) will be developed. The parameters of this model are updated using a framework provided by the learning classifier systems.
Data mining on time series of sequential patterns
Ordinary time series analysis has a long tradition in statistics [3] and it has also been considered in data mining [4,13]. Sequential patterns, which are common in many measurements in the process industry and elsewhere, have also been considered [1,14,11] in data mining. However, in some cases these two approaches can be merged together into a suitable transform. This kind of 2D transform should be selected in such a way that the basis functions support the data mining and the interpretation of the results. As an example, a runnability problem on a paper machine was considered. There were problems with fluctuations in paper basis weight. Data mining was successfully applied to the problem to identify and to remove the disturbances. The whole disturbance analysis was based on 86 sequential patterns consisting of 62 point-wise measurements in the cross direction. These measurements were acquired from the process control system. The 86 consecutive patterns were Slant-transformed and the results were data mined. It was quite easy to find the uneven static distribution of pressure in the head box and to find that the pressure fluctuated in the head box. Based on the considered case, it can be claimed that data mining might be a good tool in many troubleshooting problems.
Using data mining to minimize database reverse engineering constraints
Aziz Barbar, Martine Collard
In this paper we propose to use data mining techniques in the database reverse engineering process. A crucial problem in this process concerns the discovery of similarity between attributes before constructing the conceptual model. The essence of our approach is to mine user queries collected on the database in order to extract a specific similarity measure that we call the distance between two attributes. Indeed, most database reverse engineering methods are based on the observation of several sources, which generally are the existing database schema, the data themselves, and application programs including queries. Unlike previous propositions which analyze only the structure of joins in queries, the main idea of this paper is to exploit the large volume of information stored in queries in order to extract some semantic properties of attributes. Thus we propose to apply a data mining algorithm to a query base collected on the data. The objective is to extract semantic links that do not appear obviously in the schema or in the data and are suggested implicitly by expert users in their queries. In this paper, we focus mainly on the problem of attribute similarity, which is quite important in database reverse engineering. We describe a method by which similarities between attributes are discovered according to context measures, without taking into consideration the naming policy used by database designers.
Miscellaneous Tools I
Self-similarity for data mining and predictive modeling: a case study for network data
Jafar Adibi, Wei-Min Shen, Eaman Noorbakhsh
Recently there have been a handful of studies on observing self-similarity and fractals in natural structures and scientific databases, such as traffic data from networks. However, there are few works on employing such information for predictive modeling, data mining, and knowledge discovery. In this paper we study and analyze our experiments and observations of the self-similar structure embedded in network data for prediction through the Self-Similar Layered Hidden Markov Model (SSLHMM). SSLHMM is a novel alternative to Hidden Markov Models (HMM), which have proven to be useful in a variety of real-world applications. SSLHMM leverages the power of HMMs and extends this capability to self-similar structures, exploiting this property to reduce the complexity of the predictive modeling process. We show that the SSLHMM approach can capture self-similar information and provides a more accurate and interpretable model compared to conventional techniques and one-step, flat methods for model construction such as HMM.
Comparison of induced rules based on likelihood estimation
Rule induction methods have been applied to knowledge discovery in databases and data mining, and the empirical results obtained show that they are very powerful and that important knowledge has been extracted from datasets. However, comparison and evaluation of rules are based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce two approaches to the statistical comparison of induced rules. For the statistical evaluation, the likelihood ratio test and Fisher's exact test play an important role: the likelihood ratio statistic measures statistical information about an information table, and it is used to measure the difference between two tables.
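A small sketch of the two tests named above, applied to 2x2 tables summarizing two hypothetical induced rules (rule fires / does not fire versus target class / other class). The tables are made up; scipy's fisher_exact and the log-likelihood variant of chi2_contingency stand in for the paper's exact formulation.

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 tables for two induced rules over the same dataset:
# rows = rule condition holds / does not hold, cols = target class / other class.
rule_a = [[30, 10],
          [20, 40]]
rule_b = [[12,  8],
          [38, 42]]

for name, tab in (("rule A", rule_a), ("rule B", rule_b)):
    # Fisher's exact test: exact p-value for the association in the 2x2 table.
    odds, p_fisher = fisher_exact(tab)
    # Likelihood ratio (G) statistic: chi2_contingency with the log-likelihood
    # lambda_ option computes the G-test on the same table.
    g, p_g, dof, _ = chi2_contingency(tab, correction=False, lambda_="log-likelihood")
    print(f"{name}: odds ratio={odds:.2f}, Fisher p={p_fisher:.4f}, "
          f"G={g:.2f} (dof={dof}, p={p_g:.4f})")
```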
Advantage of the modified Lunn-McNeil technique over Kalbfleisch-Prentice technique in competing risks
Iing Lukman, Noor Akma Ibrahim, Isa Bin Daud, et al.
Survival analysis algorithms are often applied in the data mining process. Cox regression is one of the survival analysis tools that has been used in many areas, and it can be used to analyze the failure times of crashed aircraft. Another survival analysis tool is competing risks, where we have more than one cause of failure acting simultaneously. Lunn and McNeil analyzed competing risks in the survival model using Cox regression with censored data. The modified Lunn-McNeil technique is a simplification of the Lunn-McNeil technique. The Kalbfleisch-Prentice technique involves fitting models separately for each type of failure, treating other failure types as censored. To compare the two techniques (the modified Lunn-McNeil and Kalbfleisch-Prentice), a simulation study was performed. Samples with various sizes and censoring percentages were generated and fitted using both techniques. The study was conducted by comparing the inference of models using the root mean square error (RMSE), the power tests, and the Schoenfeld residual analysis. The power tests in this study were the likelihood ratio test, Rao's score test, and the Wald statistic. The Schoenfeld residual analysis was conducted to check the proportionality of the model through its covariates. The estimated parameters were computed for the cause-specific hazard situation. Results showed that the modified Lunn-McNeil technique was better than the Kalbfleisch-Prentice technique based on the RMSE measurement and the Schoenfeld residual analysis. However, the Kalbfleisch-Prentice technique was better than the modified Lunn-McNeil technique based on the power test measurements.
Runtime support for parallelizing data mining algorithms
Ruoming Jin, Gagan Agrawal
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm.
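To make the contrast concrete, here is a toy sketch of two of the named strategies on a simple counting reduction: full locking (one shared reduction object guarded by a lock) versus full replication (per-thread copies merged at the end). It uses Python threads, so it only illustrates the reduction-object idea, not shared-memory speedups; the transaction data are hypothetical.

```python
import threading
from collections import Counter

# Hypothetical transactions; the "reduction object" here is a counter of item
# frequencies, the kind of reduction at the heart of many mining algorithms.
transactions = [["a", "b"], ["b", "c"], ["a", "c"], ["a", "b", "c"]] * 1000
N_THREADS = 4

def chunks(data, k):
    step = (len(data) + k - 1) // k
    return [data[i:i + step] for i in range(0, len(data), step)]

# --- Full locking: one shared reduction object guarded by a single lock. ---
shared, lock = Counter(), threading.Lock()
def worker_locked(part):
    for t in part:
        for item in t:
            with lock:                      # every update acquires the lock
                shared[item] += 1

# --- Full replication: each thread owns a private copy, merged at the end. ---
locals_ = [Counter() for _ in range(N_THREADS)]
def worker_replicated(part, local):
    for t in part:
        local.update(t)                     # no synchronization needed

for target, args in [(worker_locked, lambda i, p: (p,)),
                     (worker_replicated, lambda i, p: (p, locals_[i]))]:
    threads = [threading.Thread(target=target, args=args(i, p))
               for i, p in enumerate(chunks(transactions, N_THREADS))]
    [t.start() for t in threads]
    [t.join() for t in threads]

merged = sum(locals_, Counter())            # reduction/merge step
print(shared == merged, merged)
```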
Web Applications
Web usage data mining agent
Praveen Madiraju, Yanqing Zhang
When a user logs in to a website, behind the scenes the user leaves his/her impressions, usage patterns, and access patterns in the web server's log file. A web usage mining agent can analyze these web logs to help web developers improve the organization and presentation of their websites. It can help system administrators improve system performance. Web logs provide invaluable help in creating adaptive web sites and in analyzing network traffic. This paper presents the design and implementation of a web usage mining agent for digging into web log files.
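A minimal sketch of the first stage such an agent performs: parsing Common Log Format lines and aggregating page popularity and per-visitor access patterns. The log lines, the regular expression, and the host-based sessionization are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# A few hypothetical lines in Common Log Format, as found in a web server log.
LOG = """\
10.0.0.1 - - [12/Mar/2002:10:01:02 +0000] "GET /index.html HTTP/1.0" 200 1043
10.0.0.1 - - [12/Mar/2002:10:01:40 +0000] "GET /products.html HTTP/1.0" 200 2311
10.0.0.2 - - [12/Mar/2002:10:02:05 +0000] "GET /index.html HTTP/1.0" 200 1043
10.0.0.2 - - [12/Mar/2002:10:02:30 +0000] "GET /contact.html HTTP/1.0" 404 512
"""

LINE = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                  r'"(?P<method>\S+) (?P<url>\S+) [^"]+" (?P<status>\d+) \S+')

page_hits = Counter()
sessions = defaultdict(list)     # naive sessionization: group requests by host
for line in LOG.splitlines():
    m = LINE.match(line)
    if not m or m.group("status") != "200":
        continue
    page_hits[m.group("url")] += 1
    sessions[m.group("host")].append(m.group("url"))

print("Most visited pages:", page_hits.most_common(3))
print("Per-visitor access patterns:", dict(sessions))
```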
Using ant-behavior-based simulation model AntWeb to improve website organization
Weigang Li, Marcos Vinicius Pinheiro Dib, Wesley Martins Teles, et al.
Some web usage mining algorithms have shown the potential to find the difference between the organization expected by visitors and the actual organization of a website. However, there is still no efficient method or criterion for a web administrator to measure the performance of a modification. In this paper, we developed AntWeb, a model inspired by ants' behavior, to simulate the sequence of visits to a website in order to measure the efficiency of the web structure. We implemented a web usage mining algorithm using backtracking on the intranet website of Politec Informatic Ltd., Brazil. We defined throughput (the number of visitors who reach their target pages per time unit, relative to the total number of visitors) as an index to measure the website's performance. We also used the links in a web page to represent the effect of visitors' pheromone trails. For every modification in the website organization, for example putting a link from the expected location to the target object, the simulation reported the value of throughput as a quick answer about this modification. The experiment showed the stability of our simulation model, and a positive effect of the modification to the intranet website of Politec.
Web data mining
Kasanda John Wibonele, Yanqing Zhang
A web data mining system using granular computing and ASP programming is proposed. This is a web based application, which allows web users to submit survey data for many different companies. This survey is a collection of questions that will help these companies develop and improve their business and customer service with their clients by analyzing survey data. This web application allows users to submit data anywhere. All the survey data is collected into a database for further analysis. An administrator of this web application can login to the system and view all the data submitted. This web application resides on a web server, and the database resides on the MS SQL server.
Identifying web usage behavior of bank customers
Sandro Araya, Mariano Silva, Richard Weber
The bank Banco Credito e Inversiones (BCI) started its virtual bank in 1996, and its registered customers currently perform more than 10,000 Internet transactions daily, which typically incur less than 10% of traditional transaction costs. Since most of the customers are still not registered for online banking, one of the goals of the virtual bank is to increase the number of registered customers. The objective of the presented work was to identify customers who are likely to perform online banking but still do not use this medium for their transactions. This objective has been reached by determining profiles of registered customers who perform many transactions online. Based on these profiles, the bank's data warehouse is explored for twins of these heavy users that are still not registered for online banking. We applied clustering in order to group the registered customers into five classes. One of these classes contained almost 30% of all registered customers and could clearly be identified as the class of heavy users. Next, a neural network assigned online customers to the previously found five classes. Applying the network trained on online customers to all the bank's customers identified twins of heavy users that, however, had not performed online transactions so far. A mailing to these candidates informing them about the advantages of online banking doubled the number of registrations compared to previous campaigns.
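A compact sketch of that cluster-then-classify pipeline with scikit-learn: k-means into five classes, a small neural network to assign any customer to a class, then scoring the full customer base for heavy-user twins. The customer features, the network size, and the way the heavy-user cluster is picked are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical customer features (age, income, n_products, branch visits, ...).
X_registered = rng.normal(size=(1000, 5))     # customers already using online banking
X_all = rng.normal(size=(5000, 5))            # the full customer base

scaler = StandardScaler().fit(X_registered)
Z = scaler.transform(X_registered)

# 1) Cluster registered customers into five classes; one of them plays the
#    role of the "heavy users" class described in the abstract.
classes = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)
heavy_class = np.bincount(classes).argmax()   # stand-in choice of heavy-user cluster

# 2) Train a neural network to assign any customer to one of the five classes.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(Z, classes)

# 3) Score the whole customer base and pick the "twins" of heavy users.
twins = np.where(net.predict(scaler.transform(X_all)) == heavy_class)[0]
print(f"{len(twins)} candidate customers resemble the heavy-user profile")
```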
PNP: mining of profile navigational patterns
Hua-Fu Li, Man-Kwan Shan
Web usage mining is a key knowledge discovery research area and as such has been well studied. So far, this research has focused mainly on databases containing access log data only. However, many real-world databases also contain user profile data, and current solutions for this situation are still insufficient. In this paper we consider a large database containing user profile information together with user web-page navigation patterns. The user profile data include quantitative attributes, such as salary or age, and categorical attributes, such as sex or marital status. We introduce the concept of profile navigation patterns, which addresses the problem of relating user profile information to navigational behavior. An example of such a profile navigation pattern might be: 20% of married people between the ages of 25 and 30 have the similar navigational behavior <(a,c)(c,h)(h,i)(i,h)(h,l)>, where a, c, h, i, l are web pages in a web site. The navigation patterns may capture generic traversal behavior, e.g., a tendency toward backward moves, cycles, etc. The objective of mining profile navigation patterns is to identify browser profiles for web personalization. We give an algorithm for mining such profile navigation patterns. Our method (algorithm PNP) can discover profile navigation patterns efficiently. We also present new inclination measurements to identify the interesting profile navigational patterns. Experimental results show the efficiency and scalability of PNP.
Text Data and Electronic Commerce Applications
Multiagent data warehousing and multiagent data mining for cerebrum/cerebellum modeling
Wen-Ran Zhang
An algorithm named Neighbor-Miner is outlined for multiagent data warehousing and multiagent data mining. The algorithm is defined in an evolving dynamic environment with autonomous or semiautonomous agents. Instead of mining frequent itemsets from customer transactions, the new algorithm discovers new agents and mines agent associations in first-order logic from agent attributes and actions. While the Apriori algorithm uses frequency as an a priori threshold, the new algorithm uses agent similarity as a priori knowledge. The concept of agent similarity leads to the notions of agent cuboid, orthogonal multiagent data warehousing (MADWH), and multiagent data mining (MADM). Based on agent similarities and action similarities, Neighbor-Miner is proposed and illustrated in a MADWH/MADM approach to cerebrum/cerebellum modeling. It is shown that (1) semiautonomous neurofuzzy agents can be identified for uniped locomotion and gymnastic training based on attribute relevance analysis; (2) new agents can be discovered and agent cuboids can be dynamically constructed in an orthogonal MADWH, which resembles an evolving cerebrum/cerebellum system; and (3) dynamic motion laws can be discovered as association rules in first-order logic. Although examples in legged robot gymnastics are used to illustrate the basic ideas, the new approach is generally suitable for a broad category of data mining tasks where knowledge can be discovered collectively by a set of agents from a geographically or geometrically distributed but relevant environment, especially in scientific and engineering data environments.
Fast discovery of structural navigational patterns from web user traversals
Man-Kwan Shan, Hua-Fu Li
With progressive expansion in size and complexity of Web sites on WWW, much research has been done on the discovery of useful user traversal patterns. Most existing approaches focus on finding Web association rules, traversal paths or sequential patterns from Web logs. In this paper, we present a new pattern, Web traversal walk, for analysis of the structural navigation activities of Web users. A Web traversal walk is a structural sequence of forward and backward traversal paths. An efficient algorithm, Fast-Walk, is proposed to discover the Web traversal walks. In Fast-Walk, a tree structure is constructed in memory from Web logs and the frequent Web traversal walks are generated from the tree structure. Experimental results show the efficiency and scalability of Fast-Walk.
Context-sensitive keyword selection using text data mining
Sai-Ming Li, Sanjeev Seereeram, Raman K. Mehra, et al.
Most information retrieval systems rely on the user to provide a set of keywords that the retrieved documents should contain. However, when the objective is to search for documents that are similar to a given document, the system has to choose the keywords from that document first. Automatic selection of keywords is not a trivial task, as one word may be a keyword in one context but a very common word in others, and it requires significant domain-specific knowledge. In this paper we describe a method for automatically choosing keywords from a document within a given corpus using text data mining techniques. The key idea is to score the words within the document based on the clustering result for the entire corpus. We applied the scheme to a Software Trouble Report (STR) corpus and obtained highly relevant keywords and search results.
Novel similarity-based clustering algorithm for grouping broadcast news
The goal of the current paper is to introduce a novel clustering algorithm that has been designed for grouping transcribed textual documents obtained from audio and video segments. Since audio transcripts are normally highly erroneous documents, one of the major challenges at the text processing stage is to reduce the negative impact of errors introduced at the speech recognition stage. Other difficulties come from the nature of conversational speech. In the paper we describe the main difficulties of spoken documents and suggest an approach limiting their negative effects. We also present a clustering algorithm that groups transcripts on the basis of the informative closeness of documents. To carry out such partitioning we give an intuitive definition of the informative field of a transcript and use it in our algorithm. To assess the informative closeness of the transcripts, we apply a chi-square similarity measure, which is also described in the paper. Our experiments with the chi-square similarity measure showed its robustness and high efficacy. In particular, the performance analysis carried out for chi-square and three other similarity measures, Cosine, Dice, and Jaccard, showed that chi-square is more robust to the specific features of spoken documents.
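For reference, here is a toy comparison of the four similarity measures mentioned, computed on two hypothetical term-frequency vectors. The chi-square style measure shown is one common formulation (a symmetric chi-square distance turned into a similarity); the paper's exact definition may differ.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dice(a, b):
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def jaccard(a, b):
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / len(sa | sb)

def chi_square_sim(a, b):
    """A chi-square style measure between two term-frequency vectors: compare
    their normalized distributions term by term (a modelling assumption)."""
    p, q = a / a.sum(), b / b.sum()
    m = (p + q) / 2
    safe_m = np.where(m > 0, m, 1)
    dist = ((p - m) ** 2 / safe_m).sum() + ((q - m) ** 2 / safe_m).sum()
    return 1 / (1 + dist)           # turn the distance into a similarity

# Hypothetical term-frequency vectors for two transcripts over a shared vocabulary.
t1 = np.array([5, 0, 3, 1, 0, 2], dtype=float)
t2 = np.array([4, 1, 2, 0, 0, 3], dtype=float)

for name, f in [("cosine", cosine), ("dice", dice),
                ("jaccard", jaccard), ("chi-square", chi_square_sim)]:
    print(f"{name:10s} {f(t1, t2):.3f}")
```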
Medical, Laboratory, and Natural Resources Applications
Medical data mining
Rajani Katta, Yanqing Zhang
A medical data mining algorithm is proposed based on data mining techniques and relevant intelligent methods such as granular computing, neural computing and soft computing. It could lead to increased understanding of the causes of various diseases leading to better diagnosis.
Application of data mining techniques to identify data anomalies: a case study in the oil and gas industry
Jenifer S. McCormack, Brian Wohlschlaeger, Bryan Lanier
This paper presents the application of the AgentMiner tool suite to improve the efficiency of detecting data anomalies in oil well log and production data sets, a task that has traditionally been done by hand or through the use of database business rules. There was a need to verify the data sets, once cleansed and certified, to ensure that the existing data certification process was effective. There was also a need to identify more complex relational data anomalies that cannot be addressed by simple business rules. Analysis techniques including statistical clustering, correlation, and 3-D data visualization were successfully utilized to identify potential complex data anomalies. A data-preprocessing tool was also applied to automatically detect simple data errors such as missing, out-of-range, and null values. The pre-processing tools were also used to prepare the data sets for further statistical and visualization analyses. To enhance the discovery of data anomalies, two different data visualization tools were applied to the data clusters.
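A small pandas sketch of the simple-error screening step described (missing, null, and out-of-range values). The column names and valid ranges are hypothetical; the relational and cluster-based anomaly checks are not shown.

```python
import pandas as pd

# Hypothetical well-log records with the kinds of simple errors the
# pre-processing tool flags: nulls, missing values, and out-of-range readings.
df = pd.DataFrame({
    "well_id":  ["W1", "W2", "W3", "W4"],
    "depth_ft": [1200.0, None, 3400.0, -50.0],
    "porosity": [0.18, 0.22, 1.70, 0.12],     # valid range assumed to be [0, 1]
})

valid_range = {"depth_ft": (0, 30000), "porosity": (0.0, 1.0)}  # assumed ranges

missing = df.isna()
out_of_range = pd.DataFrame({
    col: ~df[col].between(lo, hi) & df[col].notna()
    for col, (lo, hi) in valid_range.items()
})

print("Rows with missing values:\n", df[missing.any(axis=1)])
print("Rows with out-of-range values:\n", df[out_of_range.any(axis=1)])
```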
Vegetation classification and soil moisture calculation using land surface temperature (LST) and vegetation index (VI)
Liangyun Liu, Bing Zhang, Genxing Xu, et al.
In this paper, the temperature-emissivity separation (TES) method and the normalized difference vegetation index (NDVI) are introduced, and hyperspectral image data are analyzed using the land surface temperature (LST) and NDVI channels acquired by the Operational Modular Imaging Spectrometer (OMIS) over the Beijing Precision Agriculture Demonstration Base in Xiaotangshan Town, Beijing, on 26 April 2001. First, six kinds of ground targets (winter wheat in the booting and jointing stages, bare soil, water in ponds, sullage in dry ponds, and aquatic grass) are well classified using the LST and NDVI channels. Second, a triangle-like scatter plot is built and analyzed using the LST and NDVI channels, which is convenient for extracting information on vegetation growth and soil moisture. Compared with the scatter plot built from the red and near-infrared bands, the spectral distance between different classes is larger and the samples within the same class are more compact. Finally, we design a logarithmic VIT model to estimate the surface soil water content (SWC) using the LST and NDVI channels, which works well; the coefficient of determination, R2, between the measured and estimated surface SWC is 0.634. The map of surface SWC in the wheat area is calculated and illustrated, which is important for scientific irrigation and precision agriculture.
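A tiny numerical sketch of the two quantities the classification and the VIT model are built on: the standard NDVI definition from red and near-infrared reflectance, and a logarithmic LST/NDVI regression for surface soil water content. The reflectance values and regression coefficients below are made-up placeholders, not the paper's fitted model.

```python
import numpy as np

# Hypothetical per-pixel values for one scene.
red = np.array([0.08, 0.12, 0.25, 0.30])      # red-band reflectance
nir = np.array([0.45, 0.40, 0.30, 0.28])      # near-infrared reflectance
lst = np.array([295.0, 298.0, 306.0, 309.0])  # land surface temperature (K)

# Standard NDVI definition used for the abstract's VI channel.
ndvi = (nir - red) / (nir + red)

# A logarithmic VIT-style regression of surface soil water content on LST and
# NDVI; the coefficients below are illustrative assumptions, not the paper's fit.
a, b, c = 2.5, -0.004, 0.3
swc = a + b * lst + c * np.log(ndvi + 1e-6)

print("NDVI:", np.round(ndvi, 3))
print("Estimated surface SWC (illustrative):", np.round(swc, 3))
```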
Evaluation of similarity measures for analysis of databases on laboratory examinations
Xiaoguang Sun, Shoji Hirano, Shusaku Tsumoto
One of the key concepts in data mining is to give a suitable partition of datasets in an automatic way. On one hand, classification methods find the partitions, given by combinations of attribute-value pairs, that best fit the partition given by the target concepts. On the other hand, clustering methods find the partitions that best characterize the given datasets by using a similarity measure. Therefore, the choice of distance or similarity measures is one of the most important research topics in data mining. However, such empirical comparisons have never been studied in the literature. In this paper, several types of similarity measures were compared in the following three clinical contexts: the first is for datasets composed of only categorical attributes; the second is for those with a mixture of categorical and numerical attributes; and the final one is for those with only numerical attributes. Experimental results show that simple similarity measures perform as well as newly proposed measures.
Miscellaneous Applications
Fuzzy propositions weighted by veracities or how to relate fuzzy logic and probability theory for segmentation of ultrasound images
Renaud Debon, Basel Solaiman, Jean-Michel Cauvin, et al.
In medical imaging, and more generally in medical information processing, research is moving towards fusion systems. Nowadays, the steps of information source definition, pertinent data extraction, and fusion need to be conducted as a whole. In this work, our interest is related to esophagus wall segmentation from ultrasound image sequences. We aim to elaborate a general methodology of data mining that coherently links work on data selection and fusion architectures, in order to extract useful information from raw data and to integrate the physician's a priori knowledge efficiently. In the presented method, based on fuzzy logic, some fuzzy propositions are defined using the physicians' a priori knowledge. The use of probabilistic distributions, estimated from a learning base of pathologic and non-pathologic cases, enables the veracity of these propositions to be qualified. This promising idea enables information to be managed through the consideration of both information imprecision and uncertainty. At the same time, the benefit obtained when an a priori knowledge source is injected into a fusion-based decision system can be quantified. Given that, the fuzzification stage is optimized relative to a given criterion using a genetic algorithm. In this manner, fuzzy sets corresponding to the physicians' ambiguous a priori knowledge are defined objectively. At this level, we successively compare the performance obtained when the fuzzy functions are defined empirically and when they are optimized. We conclude this paper with the first results on esophagus wall segmentation and outline some further work.
One-to-one modeling and simulation: a new approach in customer-relationship management for grocery retail
The ever-increasing competition in retail industry puts pressure on retailers to deal with their customers more efficiently. Currently most companies use Customer Relationship Management (CRM) systems to maximize the customer satisfaction level by trying to understand more about their behaviors. However, one disadvantage of the current approaches is that they focus on the segmentation of customers into homogenous groups and they disregard examining the one-to-one relationship of each individual's behavior toward each product. Therefore, individual behavior cannot be captured in detail. Modeling individual behavior for each product enables several strategies of pricing by keeping the customer satisfaction at the maximum level. One example is offering a personal discount on a particular item to a customer who is price sensitive to that particular product. Therefore, you can still sell other products at the non-discounted level to this customer by keeping him satisfied. In this paper, individual pricing approach is discussed. The aim of this study is to develop a conceptual framework to analyze the feasibility of individual pricing. Customer behaviors can be modeled individually with respect to each product for a grocery store. Several factors can be used to determine these behaviors such as customer's need, brand loyalty and price sensitivity. Each customer can be modeled as an adaptive agent using qualitative descriptions of behaviors (i.e., highly price sensitive). Then, the overall shopping behavior can be simulated using a multi-agent Monte-Carlo simulation. It is expected that with this approach, retailers will be able to determine better strategies to obtain more profits, better sales and better customer satisfaction.
Data mining for personal navigation
Relevance is the key in defining what data is to be extracted from the Internet. Traditionally, relevance has been defined mainly by keywords and user profiles. In this paper we discuss a fairly untouched dimension to relevance: location. Any navigational information sought by a user at large on earth is evidently governed by his location. We believe that task oriented data mining of the web amalgamated with location information is the key to providing relevant information for personal navigation. We explore the existential hurdles and propose novel approaches to tackle them. We also present naive, task-oriented data mining based approaches and their implementations in Java, to extract location based information. Ad-hoc pairing of data with coordinates (x, y) is very rare on the web. But if the same co-ordinates are converted to a logical address (state/city/street), a wide spectrum of location-based information base opens up. Hence, given the coordinates (x, y) on the earth, the scheme points to the logical address of the user. Location based information could either be picked up from fixed and known service providers (e.g. Yellow Pages) or from any arbitrary website on the Web. Once the web servers providing information relevant to the logical address are located, task oriented data mining is performed over these sites keeping in mind what information is interesting to the contemporary user. After all this, a simple data stream is provided to the user with information scaled to his convenience. The scheme has been implemented for cities of Finland.
Learning to change taxonomies
Elena Eneva, Valery A. Petrushin
Taxonomies are valuable tools for structuring and representing our knowledge about the world. They are widely used in many domains where information about species, products, customers, publications, etc. needs to be organized. In the absence of standards, many taxonomies of the same entities can co-exist. A problem arises when data categorized in a particular taxonomy needs to be used by a procedure (methodology or algorithm) that uses a different taxonomy. Usually, a labor-intensive manual approach is used to solve this problem. This paper describes a machine learning approach which aids domain experts in changing taxonomies. It allows learning relationships between two taxonomies and mapping the data from one taxonomy into another. The proposed approach uses decision trees and bootstrapping for learning mappings of instances from the source to the target taxonomies. A C4.5 decision tree classifier is trained on a small manually labeled training set and applied to a randomly selected sample from the unlabeled data. The classification results are analyzed, the misclassified items are corrected, and all items are added to the training set. This procedure is iterated until no unlabeled data remains or an acceptable error rate is reached. In the latter case the last classifier is used to label all the remaining data. We test our approach on a database of products obtained from a grocery store chain and find that it performs well, reaching 92.6% accuracy while requiring the human expert to explicitly label only 18% of the entire data.
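A schematic sketch of that bootstrapping loop. Scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5, the product features are random placeholders, and the expert's corrections are simulated with a stand-in ground truth, so this only illustrates the iterate-correct-retrain structure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical product features and source-taxonomy data; y holds target-taxonomy
# labels, initially known only for a small manually labeled subset (~18%).
X = rng.normal(size=(2000, 6))
true_target = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in ground truth
labeled = rng.choice(2000, size=360, replace=False)
y = np.full(2000, -1)
y[labeled] = true_target[labeled]

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
for _ in range(5):                                # bootstrapping iterations
    train = np.where(y != -1)[0]
    clf.fit(X[train], y[train])
    unlabeled = np.where(y == -1)[0]
    if len(unlabeled) == 0:
        break
    batch = rng.choice(unlabeled, size=min(100, len(unlabeled)), replace=False)
    pred = clf.predict(X[batch])                  # predictions shown to the expert
    # In the paper a human expert reviews and corrects the batch; here we
    # simulate that correction with the stand-in ground truth.
    y[batch] = true_target[batch]

remaining = np.where(y == -1)[0]
if len(remaining):                                # final classifier labels the rest
    y[remaining] = clf.predict(X[remaining])
print("accuracy on all items:", (y == true_target).mean())
```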
DMT-TAFM: a data mining tool for technical analysis of futures market
Vladimir Stepanov, Archana Sathaye
Technical analysis of financial markets describes many patterns of market behavior. For practical use, these descriptions need to be adjusted for each particular trading session. In this paper, we develop a data mining tool for technical analysis of the futures markets (DMT-TAFM), which dynamically generates rules based on the notion of price pattern similarity. The tool consists of three main components. The first component provides visualization of data series on charts with different ranges, scales, sizes, and types. The second component constructs pattern descriptions using sets of polynomials. The third component specifies the training set for mining, defines the similarity notion, and searches for sets of similar patterns. DMT-TAFM is useful for preparing the data and then revealing and systematizing statistical information about similar patterns found in any type of historical price series. We performed experiments with our tool on three decades of trading data for a hundred types of futures. Our results for this data set show that we can confirm or refute many well-known patterns based on real data, reveal new ones, and use the set of relatively consistent patterns found during data mining to develop better futures trading strategies.
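A similarity search over polynomial pattern descriptions could look roughly like the sketch below; the normalization, polynomial degree, and Euclidean distance over coefficients are assumptions for illustration, not the specific similarity notion defined in DMT-TAFM.

```python
import numpy as np

def describe(window, degree=3):
    """Summarize a price window by the coefficients of a fitted polynomial."""
    x = np.linspace(0.0, 1.0, len(window))
    y = (window - window.mean()) / (window.std() + 1e-9)   # normalize level and scale
    return np.polyfit(x, y, degree)

def similar_windows(series, query, width, degree=3, top_k=5):
    """Return start indices of the windows whose descriptions are closest to the query."""
    q = describe(np.asarray(query, dtype=float), degree)
    scored = []
    for start in range(len(series) - width + 1):
        w = np.asarray(series[start:start + width], dtype=float)
        scored.append((np.linalg.norm(describe(w, degree) - q), start))
    return [start for _, start in sorted(scored)[:top_k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prices = np.cumsum(rng.normal(size=500)) + 100.0        # synthetic price series
    pattern = prices[120:150]                                # a "known" pattern
    print(similar_windows(prices, pattern, width=30))
```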
Miscellaneous Tools II
Interactive mining of schema for semistructured data
Yubao Liu, Yucai Feng
Semistructured data such as HTML, SGML, and XML documents lack any fixed, rigid schema, but some implicit structure typically appears in the data. The crucial problem in schema mining is to discover these hidden structures of the semistructured data. The huge number of on-line applications makes it important to mine schemas for semistructured data. Since the user may have to dynamically tune the minimum schema support in the course of mining (the minimum support reflects the user's particular interests), we present the problem of interactive mining of schema for semistructured data in this paper. In the course of interactive mining, when the user tunes the old minimum support, one possible way of discovering the interesting schemas is to re-run the schema mining algorithm from scratch with the new minimum support. However, this approach is inefficient because it does not utilize the already mined results. Hence an incremental mining algorithm is presented. In addition, an improved algorithm for finding the maximal schema tree sets is also given. The experimental results show that the incremental algorithm is more efficient than the non-incremental Apriori-like algorithm.
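The incremental idea can be illustrated at a much-simplified level: if the counts from a previous run are cached, re-tuning the threshold reduces to filtering the cache rather than re-scanning the data. The sketch below treats a "schema" as just a root-to-node tag path, which is a deliberate simplification of the schema trees mined in the paper.

```python
from collections import Counter
from xml.etree import ElementTree as ET

def tag_paths(element, prefix=""):
    """Yield root-to-node tag paths of an XML tree (a simplified stand-in for schemas)."""
    path = f"{prefix}/{element.tag}"
    yield path
    for child in element:
        yield from tag_paths(child, path)

class InteractiveSchemaMiner:
    def __init__(self, documents):
        self.documents = documents
        self._counts = None            # cached result of the first mining pass

    def frequent_paths(self, min_support):
        """Return paths occurring in at least min_support documents, reusing cached counts."""
        if self._counts is None:       # first run: scan the data once
            self._counts = Counter()
            for doc in self.documents:
                self._counts.update(set(tag_paths(ET.fromstring(doc))))
        # Later tuning of min_support only re-filters the cache; no re-scan is needed.
        return {p: c for p, c in self._counts.items() if c >= min_support}

if __name__ == "__main__":
    docs = ["<book><title/><author/></book>",
            "<book><title/><year/></book>"]
    miner = InteractiveSchemaMiner(docs)
    print(miner.frequent_paths(2))     # {'/book': 2, '/book/title': 2}
    print(miner.frequent_paths(1))     # cheaper: reuses the cached counts
```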
Software tool for data mining and its applications
Jie Yang, Chenzhou Ye, Nianyi Chen
A software tool for data mining is introduced, which integrates pattern recognition (PCA, Fisher, clustering, hyperenvelope, regression), artificial intelligence (knowledge representation, decision trees), statistical learning (rough sets, support vector machines), and computational intelligence (neural networks, genetic algorithms, fuzzy systems). It consists of nine function modules: pattern recognition, decision trees, association rules, fuzzy rules, neural networks, genetic algorithms, hyperenvelope, support vector machines, and visualization. The principles and knowledge representation of some of these modules are described. The tool is implemented in Visual C++ under Windows 2000. Nonmonotonicity in data mining is dealt with by concept hierarchies and layered mining. The tool has been applied satisfactorily to predicting regularities in the formation of ternary intermetallic compounds in alloy systems and to the diagnosis of brain glioma.
Incremental information mining
S. K. Gupta, P. Suresh, Vasudha Bhatanagar
Decision tree classification is a simple and important mining function. Decision tree algorithms are computationally intensive, yet do not capture evolutionary trends from an incrementally growing data repository. In conventional mining approaches, if two or more datasets are to be merged into a single target dataset, the entire computation for constructing a classifier has to be carried out all over again. Previous work in this field has constructed individual decision tree classifiers and merged them by voted arbitration or by merging the corresponding decision rules. We take a new approach by pre-processing the individual windows of the growing database into summaries that we call Knowledge Concentrates (KCs). KCs are formed offline. In the mining operations, we use the KCs instead of the entire past data, thereby reducing the space and time complexity of the whole mining process. The user dynamically selects the target dataset by identifying the windows of interest. The mining requirement is satisfied by merging the respective KCs and running the decision tree algorithm on the merged KC. The proposed scheme operates in three phases. The first phase is a planning phase in which the dataset domain information is gathered and the data mining goals are defined. The second phase makes a single scan over a window of the database and generates a summary of that window as a knowledge concentrate (KC); we use an efficient trie structure to store the KCs. The third phase merges the desired windows (KCs) and applies the classification algorithm to the aggregate of the KCs to give the final classifier. The salient issue addressed in this work is forming a condensed representation of the database that enables extraction of the patterns which are input to a decision-making algorithm to form the required decision tree. The entire scheme is independent of the decision tree algorithm, in the sense that the user has the flexibility to use any standard decision tree algorithm.
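A toy sketch of the window-summarization idea follows: each window is scanned once and its attribute-value paths are stored in a trie with per-node class counts, and KCs for selected windows are merged by adding counts. The node structure and merge rule here are illustrative assumptions, not the authors' exact KC format.

```python
from collections import defaultdict

class TrieNode:
    """Node of a prefix tree over attribute values, with per-class counts at each node."""
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.class_counts = defaultdict(int)

def build_kc(window):
    """Single scan over a window of (attribute_values, label) records."""
    root = TrieNode()
    for values, label in window:
        node = root
        node.class_counts[label] += 1
        for v in values:               # a fixed attribute order is assumed
            node = node.children[v]
            node.class_counts[label] += 1
    return root

def merge_kcs(dst, src):
    """Merge knowledge concentrate src into dst (in place) by adding counts."""
    for label, c in src.class_counts.items():
        dst.class_counts[label] += c
    for value, child in src.children.items():
        merge_kcs(dst.children[value], child)   # defaultdict creates missing nodes
    return dst

if __name__ == "__main__":
    w1 = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
    w2 = [(("sunny", "mild"), "yes")]
    kc = merge_kcs(build_kc(w1), build_kc(w2))
    print(dict(kc.class_counts))       # {'no': 1, 'yes': 2}
```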
FP-tree approach for mining N-most interesting itemsets
Yin Ling Cheung, Ada Wai Chee Fu
In classical association rule mining, a minimum support threshold is assumed to be available for mining frequent itemsets. However, setting such a threshold is typically hard: if the threshold is set too high, nothing will be discovered, and if it is set too low, too many itemsets will be generated, which also implies inefficiency. In this paper, we handle a more practical problem: roughly speaking, to mine the N k-itemsets with the highest support for each k up to a certain kmax value. We call the results the N-most interesting itemsets. Generally, it is more straightforward for users to determine N and kmax. This approach also provides a solution to an open issue in the problem of subspace clustering. However, without the support threshold in the above problem definition, the subset closure property of the apriori-gen algorithm no longer holds. We propose three new algorithms, LOOPBACK, BOLB, and BOMO, for mining N-most interesting itemsets with variations of the FP-tree approach. Experiments show that all our methods outperform the previously proposed Itemset-Loop algorithm.
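For concreteness, the problem definition (though not the paper's FP-tree-based algorithms LOOPBACK, BOLB, and BOMO) can be stated as the brute-force baseline below, which enumerates all k-itemsets and keeps the N with highest support for each k up to kmax; it is exponential in transaction width and serves only to illustrate what the efficient algorithms compute.

```python
from collections import Counter
from itertools import combinations

def n_most_interesting(transactions, n, kmax):
    """Naive baseline: for each k <= kmax, return the n k-itemsets with highest support."""
    result = {}
    for k in range(1, kmax + 1):
        counts = Counter()
        for t in transactions:
            counts.update(combinations(sorted(set(t)), k))
        result[k] = counts.most_common(n)
    return result

if __name__ == "__main__":
    baskets = [("bread", "milk"), ("bread", "butter", "milk"), ("beer", "bread")]
    for k, items in n_most_interesting(baskets, n=2, kmax=2).items():
        print(k, items)
```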