Proceedings Volume 4384

Data Mining and Knowledge Discovery: Theory, Tools, and Technology III


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 27 March 2001
Contents: 8 Sessions, 34 Papers, 0 Presentations
Conference: Aerospace/Defense Sensing, Simulation, and Controls 2001
Volume Number: 4384

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Clustering and Classification Techniques
  • Image and Web Mining I
  • Image and Web Mining II
  • Data Mining and Knowledge Discovery: Theory and Tools
  • Soft Computing, Rough Sets, and Fuzzy Logic
  • Data Mining and Knowledge Discovery Technology: Applications I
  • Data Mining and Knowledge Discovery Technology: Applications II
  • Association Rules and Time Series Analysis
Clustering and Classification Techniques
New data clustering technique and its applications
Chi-Man Kwan, Roger Xu, Leonard S. Haynes
A new approach to data clustering is presented in this paper. The approach consists of three steps. First, preprocessing of the raw sensor data is performed. Intelligent Automation, Incorporated (IAI) used the Fast Fourier Transform (FFT) in the preprocessing stage to extract the significant frequency components of the sensor signals. Second, Principal Component Analysis (PCA) is used to further reduce the dimension of the outputs of the preprocessing stage. PCA is a powerful technique for extracting the features inside the input signals. The dimensionality reduction can reduce the size of the neural network classifier in the next stage; consequently, the training and recognition time will be significantly reduced. Finally, a neural network classifier using Learning Vector Quantization (LVQ) is used for data classification. The algorithm was successfully applied to two commercial systems at Boeing: Auxiliary Power Units and a solenoid valve system.
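As a rough illustration of the three-stage pipeline described in this abstract (FFT preprocessing, PCA dimensionality reduction, and an LVQ classifier), the following is a minimal, hypothetical sketch; the number of retained frequency bins and components, the prototype scheme, and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def preprocess_fft(signals, n_bins=64):
    """Keep the magnitudes of the first n_bins frequency components."""
    spectra = np.abs(np.fft.rfft(signals, axis=1))
    return spectra[:, :n_bins]

def pca_reduce(features, n_components=8):
    """Project features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T, vt[:n_components]

def train_lvq(x, y, n_epochs=50, lr=0.05):
    """LVQ1 with one prototype per class, pulled toward same-class samples
    and pushed away from samples of other classes."""
    classes = np.unique(y)
    protos = np.array([x[y == c].mean(axis=0) for c in classes])
    for _ in range(n_epochs):
        for xi, yi in zip(x, y):
            j = np.argmin(np.linalg.norm(protos - xi, axis=1))
            sign = 1.0 if classes[j] == yi else -1.0
            protos[j] += sign * lr * (xi - protos[j])
    return protos, classes
```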
Value-balanced agglomerative connectivity clustering
Gunjan K. Gupta, Joydeep Ghosh
In this paper we propose a new clustering framework for transactional data-sets involving large numbers of customers and products. Such transactional data pose particular issues, such as very high dimensionality (greater than 10,000) and sparse categorical entries, that have been dealt with more effectively using graph-based clustering approaches such as ROCK. However, large transactional data sets raise further issues that need to be addressed, such as how to compare diverse products (e.g., milk vs. cars), cluster balancing, and outlier removal. We first propose a new similarity measure that takes the value of the goods purchased into account, and form a value-based graph representation based on this similarity measure. A novel value-based balancing criterion that allows the user to control the balancing of clusters is then defined. This balancing criterion is integrated with a value-based goodness measure for merging two clusters in an agglomerative clustering routine. Since graph-based clustering algorithms are very sensitive to outliers, we also propose a fast, effective, and simple outlier detection and removal method based on under-clustering or over-partitioning. The performance of the proposed clustering framework is compared with leading graph-theoretic approaches such as ROCK and METIS.
Clustering of complex shaped data sets via Kohonen maps and mathematical morphology
Jose Alfredo Ferreira Costa, Marcio Luiz de Andrade Netto
Clustering is the process of discovering groups within the data, based on similarities, with minimal, if any, knowledge of their structure. The self-organizing (or Kohonen) map (SOM) is one of the best known neural network algorithms. It has been widely studied as a software tool for visualization of high-dimensional data. Important features include information compression while preserving the topological and metric relationships of the primary data items. Although Kohonen maps have been applied to clustering data, usually the researcher sets the number of neurons equal to the expected number of clusters, or manually segments a two-dimensional map using some a priori knowledge of the data. This paper proposes techniques for automatically partitioning and labeling SOM networks into clusters of neurons that may be used to represent the data clusters. Mathematical morphology operations, such as the watershed, are performed on the U-matrix, which is a neuron-distance image. The direct application of the watershed leads to an over-segmented image. Markers are therefore used to identify significant clusters, and homotopy modification is used to suppress the others. Markers are automatically found by performing a multilevel scan of connected regions of the U-matrix. Each cluster of neurons is a sub-graph that defines, in the input space, complex and non-parametric geometries which approximately describe the shape of the clusters. The process of map partitioning is extended recursively. Each cluster of neurons gives rise to a new map, which is trained with the subset of data that was classified to it. The algorithm dynamically produces a hierarchical tree of maps, which explains the cluster structure at different levels of granularity. The distributed, multiple-prototype cluster representation enables the discovery of clusters even when there are two or more non-separable pattern classes.
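For readers unfamiliar with the U-matrix on which the watershed segmentation operates, the following minimal sketch computes it from a trained SOM codebook; the grid layout and 4-neighborhood are illustrative assumptions, and the watershed, marker extraction, and recursive partitioning described in the abstract are not shown.

```python
import numpy as np

def u_matrix(weights):
    """weights: (rows, cols, dim) array of trained SOM codebook vectors.
    Each U-matrix cell is the mean distance of a neuron to its grid neighbours."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            umat[r, c] = np.mean(dists)
    return umat
```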
Efficient boundary hunting via vector quantization
Claudia Diamantini, Maurizio Panti
A great amount of information about a classification problem is contained in those instances falling near the decision boundary. This intuition dates back to the earliest studies in pattern recognition, and to the more recent adaptive approaches to so-called boundary hunting, such as the work of Aha et al. on Instance Based Learning and the work of Vapnik et al. on Support Vector Machines. The latter work is of particular interest, since theoretical and experimental results ensure the accuracy of boundary reconstruction. However, its optimization approach has heavy computational and memory requirements, which limit its application to huge amounts of data. In this paper we describe an alternative approach to boundary hunting based on adaptive labeled quantization architectures. The adaptation is performed by a stochastic gradient algorithm for the minimization of the error probability. Error probability minimization guarantees an accurate approximation of the optimal decision boundary, while the use of a stochastic gradient algorithm defines an efficient method to reach such an approximation. Comparisons to Support Vector Machines are considered in the paper.
Extraction and optimization of classification rules for continuous or mixed-mode data using neural nets
Dianhui Wang, T. S. Dillon
Extracting and optimizing rules directly from continuous or mixed-mode data for pattern classification problems is a challenging problem. Self-organizing neural nets are employed to initialize the rules. A regularization model which trades off misclassification rate, recognition rate, and generalization ability is first presented for refining the initial rules. To generate rules for patterns with lower probability density but considerable conceptual importance, an approach that iteratively resolves the clustering part for a filtered set of data is used. The methodology is evaluated using the Iris data and demonstrates the effectiveness of the technique.
Image and Web Mining I
Fuzzy feature-based image mining in remote sensing
Jiang Li, Ram Mohan Narayanan, William J. Waltman, et al.
The problem of image mining combines the areas of content-based image retrieval (CBIR), image understanding, data mining, and databases. Image mining in remote sensing is more challenging due to its multi-spectral and spatio-temporal characteristics. To deal with phenomena that have imprecise interpretations in remote sensing applications, images should be identified by the similarity of their attributes rather than by exact matching. Fuzzy spatio-temporal objects are modeled by spatial feature values combined with geographic temporal metadata and climatic data. This paper focuses on the implementation of a remotely sensed image database with fuzzy characteristics, and its application to data mining. A comprehensive series of calibrated, geo-registered, daily observations and biweekly maximum NDVI composite AVHRR images are processed and used to build the database. The particularity of the NDVI composite images on which our experiments are conducted is that they cover large geographic areas and are suitable for observing seasonal changes in biomass (greenness). Based on the characterization of land cover and statistical analysis of climatic data related to NDVI, spatial and temporal data mining tasks such as abnormality detection and similar time sequence detection were carried out by fuzzy object queries.
Retrieval of multi- and hyperspectral images using an interactive relevance feedback form of content-based image retrieval
Irwin E. Alber, Morton S. Farber, Nancy Yeager, et al.
This paper demonstrates the capability of a set of image search algorithms and display tools to search large databases for multi- and hyperspectral image cubes most closely matching a particular query cube. An interactive search and analysis tool is presented and tested based on a relevance feedback approach that uses the human-in-the-loop to enhance a content-based image retrieval process to rapidly find the desired set of image cubes.
Fuzzy, crisp, and human logic in e-commerce marketing data mining
Kelda L. Hearn, Yanqing Zhang
In today's business world there is an abundance of available data and a great need to make good use of it. Many businesses would benefit from examining customer habits and trends and making marketing and product decisions based on that analysis. However, the process of manually examining data and making sound decisions based on it is time consuming and often impractical. Intelligent systems that can make judgments similar to human judgments are sorely needed. Thus, systems based on fuzzy logic present themselves as an option to be seriously considered. The work described in this paper attempts to make an initial comparison between fuzzy logic and more traditional hard or crisp logic to see which would make a better substitute for human intervention. In this particular case study, customers are classified into categories that indicate how desirable each customer would be as a prospect for marketing. This classification is based on a small set of customer data. The results from these investigations make it clear that fuzzy logic is better able to think for itself and make decisions that more closely match human decisions, and is therefore significantly closer to human logic than crisp logic.
Image and Web Mining II
eShopper modeling and simulation
Valery A. Petrushin
The advent of e-commerce gives an opportunity to shift the paradigm of customer communication into a highly interactive mode. The new generation of commercial Web servers, such as Blue Martini's server, combines the collection of data on customer behavior with real-time processing and dynamic tailoring of a feedback page. New opportunities for direct product marketing and cross-selling are arriving. The key problem is what kind of information we need to achieve these goals, or in other words, how we model the customer. The paper is devoted to customer modeling and simulation. The focus is on modeling an individual customer. The model is based on the customer's transaction data, clickstream data, and demographics. The model includes the hierarchical profile of a customer's preferences for different types of products and brands; consumption models for the different types of products; the current focus, trends, and stochastic models for time intervals between purchases; product affinity models; and some generalized features, such as purchasing power, sensitivity to advertising, price sensitivity, etc. This type of model is used for predicting the date of the next visit, overall spending, and spending for different types of products and brands. For some types of stores (for example, a supermarket) and stable customers, it is possible to forecast the shopping lists rather accurately. The forecasting techniques are discussed. The forecasting results can be used for on-line direct marketing, customer retention, and inventory management. The customer model can also be used as a generative model for simulating the customer's purchasing behavior in different situations and for estimating a customer's features.
Web caching and prefetching: a data mining approach
Amidha Shyamsukha, Archana Sathaye, Arun Swami
With the increase in popularity of the Internet, the latency experienced by an individual while accessing the Web is increasing. In this paper, we investigate one approach to reducing latency by increasing the hit rate of a web cache. To this effect, we developed a predictive model for prefetching and a modified Least Recently Used (LRU) method called AssocLRU. This paper investigates the application of a data mining technique, association rules, to the web domain. The association rules predict the URLs a user might reference next, and this knowledge is used in our web caching and prefetching model. We developed a trace-driven cache simulator to compare the performance of our predictive model with the widely used replacement policy LRU. The traces used in our experiments were traces of Web proxy activity taken at Virginia Tech and the EPA HTTP traces. Our results show that our predictive prefetching model using association rules achieves a better hit rate than both LRU and AssocLRU.
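A minimal sketch, under illustrative assumptions, of how pairwise association rules over request sequences could drive prefetching in the spirit of the model described above; the rule format, support, and confidence thresholds are hypothetical, and the paper's actual mining procedure, AssocLRU policy, and trace-driven simulator are not reproduced here.

```python
from collections import defaultdict

def mine_pair_rules(sessions, min_support=3, min_conf=0.5):
    """Mine simple 'A -> B' rules from consecutive URL requests in sessions."""
    pair_counts = defaultdict(int)
    item_counts = defaultdict(int)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            pair_counts[(a, b)] += 1
            item_counts[a] += 1
    rules = {}
    for (a, b), n in pair_counts.items():
        conf = n / item_counts[a]
        if n >= min_support and conf >= min_conf:
            rules.setdefault(a, []).append((b, conf))
    return rules

def prefetch_candidates(rules, current_url):
    """URLs to prefetch after current_url, ordered by rule confidence."""
    return [b for b, _ in sorted(rules.get(current_url, []),
                                 key=lambda t: -t[1])]
```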
Data Mining and Knowledge Discovery: Theory and Tools
Knowledge discovery through games and game theory
James F. Smith III, Robert D. Rhyne
A fuzzy logic based expert system has been developed that automatically allocates electronic attack (EA) resources in real-time over many dissimilar platforms. The platforms can be very general, e.g., ships, planes, robots, land based facilities, etc. The potential foes the platforms deal with can also be general. The initial version of the algorithm was optimized using a genetic algorithm employing fitness functions constructed based on expertise. A new approach is being explored that involves embedding the resource manager in an electronic game environment. The game allows a human expert to play against the resource manager in a simulated battlespace, with each of the defending platforms being exclusively directed by the fuzzy resource manager and the attacking platforms being controlled by the human expert or operating autonomously under their own logic. This approach automates the data mining problem. The game automatically creates a database reflecting the domain expert's knowledge, and it calls a data mining function, a genetic algorithm, to mine the database as required. The game allows easy evaluation of the information mined in the second step. The measure of effectiveness (MOE) for re-optimization is discussed. The mined information is shown to be extremely valuable through demanding scenarios.
Discovery of diagnostic knowledge from multisensor data
Wojciech A. Moczulski, Jan M. Zytkow
The paper deals with discovering qualitative and functional dependencies among attributes that describe a complex technical object. The database contains data which are values of control parameters applied in the experiment and multiple features of vibration signals. These signals can be acquired by a multi-sensor measuring system. Information carried by signals acquired from different sensors is in some sense complementary. However, since correlation between signals observed by some sensors is likely, some redundancy in the data may be present. Since redundancy may yield reliability and better quality of predictions, it is reasonable to take it into consideration in the model. The approach depends on selecting the right combination of attributes and then on recursive application of the Equation Finder in order to find functional equations containing control and dependent attributes. Further, the equations may be inverted, providing the opportunity to predict values of the control attributes, which is the task of diagnosing the object. Such knowledge may then be applied in a diagnostic expert system.
Schema extraction and levelization for XML data
Jong P. Yoon, Sung-Rim Kim
XML is a new standard for representing and exchanging information on the Internet. An XML document is data that is tagged by XML elements. Such XML data can be retrieved by more than just a Boolean combination of keywords on the Internet: keyword-based information retrieval does not precisely satisfy user requests, partly because those requests cannot be properly conveyed, so either too many or too few matches are produced. It is not trivial to formulate what to retrieve so as to obtain a good-sized query result. In conventional approaches, a database schema is useful both for users formulating queries and for query processing. Likewise, this paper proposes a method of schema extraction for XML data collections. Obtaining one single schema is not sufficient to serve good-sized information retrieval and to adapt to the various requests from Internet users. To support this, schemas are levelized with respect to the frequency of topological data structures in a database. The topological structural information of these schemas is used to formulate queries and, further, to rewrite queries for relaxation and restriction. Without modification, the method proposed in this paper can be used not only for multimedia XML data collections but also for general XML databases.
Visualization for enhancing the data mining process
Claudio J. Meneses, Georges G. Grinstein
Visualization has proved to be a suitable paradigm for the analysis and exploration of datasets. In the data mining cycle, visualization has mainly been focused on data visualization and output generation. However, besides datasets, many other entities need to be explored and understood by users and analysts. In this paper, we describe the role of visualization in the data mining process, and we present a model to support the interaction between users and data mining entities. We discuss visualizations of datasets, parameter spaces of data mining algorithms, models induced from datasets, and patterns generated by the application of data mining algorithms to datasets. We have developed a Java-based testbed that implements the extended data mining model with visual support to interact with datasets, models, parameter spaces, and patterns. Experimental results based on several public datasets, data mining algorithms, multidimensional visualization techniques, and other novel visualizations clearly show the benefits of integrating visualization into the data mining process.
Toward ubiquitous mining of distributed data
Rajeev Ayyagari, Byong-Hoon Park, Daryl Hershberger, et al.
The role of data-centric information is becoming increasingly important in our everyday professional and personal lives. The advent of laptops, palmtops, handhelds, and wearable computers is also making ubiquitous access to large quantities of data possible. Advanced analysis of distributed data for extracting useful knowledge is the next natural step in the world of ubiquitous computing. However, this will not come for free; it will introduce additional costs due to communication, computation, and security, among others. Distributed data mining techniques offer a technology to analyze distributed data while minimizing this cost in order to maintain the ubiquitous presence. This paper adopts the Collective Data Mining approach, which offers a collection of different scalable and distributed data analysis techniques. It particularly focuses on two collective techniques for predictive data mining, presents some experimental results, and points the reader toward more extensive documentation of the technology.
Data mining of text as a tool in authorship attribution
Ari J. E. Visa, Jarmo Toivonen, Sami Autio, et al.
It is common for text documents to be characterized and classified by keywords that their authors assign to them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched against the document database of the monitored document flow. The new methodology is capable of extracting the meaning of a document to a certain degree. Our claim is that the new methodology is also capable of authenticating authorship. To verify this claim, two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected: William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Each text was in turn used as a prototype, and the two nearest matches to the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors is examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and for Poe; for Shaw, one text was confused with Poe. In the second test, the authors of the Reuters-21578 financial news were identified relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.
Efficiently mining maximal frequent patterns: fast-miner
Michael J. Dewsnip, Malika Mahoui
The problem of finding maximal patterns in databases has been an intensive research area in recent years. The Max-Miner algorithm has been presented as an efficient pattern-mining algorithm which extracts maximal frequent itemsets from databases. Although Max-Miner produces only the set of maximal patterns, it does generate many candidate maximal patterns that need to be discarded at the end of the mining process. In this paper, we propose a set of enhancements to the Max-Miner algorithm in order to address this issue. The new version of the algorithm uses new pruning strategies combined with adequate data structures to both speed up the process of counting the support of itemsets and avoid the processing of a large number of non-maximal patterns. These new features translate directly into a considerable gain in performance. The proposed algorithm also has other important features, such as requiring a constant number of database passes and supporting a pipeline structure which enables patterns to be output as soon as they are identified as maximal.
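To illustrate what "maximal frequent itemsets" means here, the sketch below naively enumerates frequent itemsets level by level and then keeps only those not contained in a larger frequent itemset; it is a hypothetical illustration of maximality only, not the Max-Miner algorithm or the enhancements proposed in the paper.

```python
def frequent_itemsets(transactions, min_support):
    """transactions: list of sets of items. Naive level-wise enumeration."""
    items = {i for t in transactions for i in t}
    frequent, size = [], 1
    current = [frozenset([i]) for i in items]
    while current:
        counted = [(s, sum(1 for t in transactions if s <= t)) for s in current]
        kept = [s for s, n in counted if n >= min_support]
        frequent.extend(kept)
        size += 1
        # candidate itemsets one item larger, built from the kept sets
        current = list({a | b for a in kept for b in kept if len(a | b) == size})
    return frequent

def maximal_only(frequent):
    """Keep only itemsets with no frequent proper superset."""
    return [s for s in frequent if not any(s < t for t in frequent)]
```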
Soft Computing, Rough Sets, and Fuzzy Logic
Application of preprocessing filtering on Decision Tree C4.5 and rough set theory
Joseph Chi Chung Chan, Tsau Young Lin
This paper compares two artificial intelligence methods, the Decision Tree C4.5 and Rough Set Theory, on stock market data. The Decision Tree C4.5 is reviewed together with Rough Set Theory. An enhanced window application is developed to facilitate pre-processing filtering by introducing feature (attribute) transformations, which allow users to input formulas and create new attributes. The application also produces three varieties of data set, with delaying, averaging, and summation. The results demonstrate the improvement from applying feature (attribute) transformations in pre-processing for Decision Tree C4.5. Moreover, the comparison between Decision Tree C4.5 and Rough Set Theory is based on clarity, automation, accuracy, dimensionality, raw data, and speed, and is supported by the rule sets generated by both algorithms on three different sets of data.
System to build fuzzy logic models from databases and application to multisensor data
This paper presents the results of a trial application of a data mining algorithm to a multi-sensor data base. The data base structure is assumed to consist of a single dependent variable and an unspecified number of independent variables (if the data base contains more than one dependent variable, a separate model is built for each dependent variable). Given a set of values for the independent variables, the fuzzy logic model estimates the value of the dependent variable, and the error in the estimated value. The algorithm re-organizes the data records into a multi-dimensional partition tree. The tree is binary (each node is partitioned into exactly two disjoint nodes) and unbalanced (the two child nodes do not have the same number of members). The partitioning algorithm is greedy: each node is partitioned independently, and at each node the algorithm searches to find the independent variable and partition threshold that best accounts for variance in the dependent variable. Partitioning stops when the variance in the dependent variable at a node is less than some user-specified threshold. Fuzzy logic rules are constructed from the leaf nodes. The algorithm compresses the data base into a set of fuzzy logic rules. The set of fuzzy logic rules is a model of the information in the data base. We conducted meta-analysis of the relationships between the independent and dependent variables by studying how the independent variables are used in the fuzzy logic model, i.e., the number of rules that use each independent variable and the number of data records in the training data to which those rules apply.
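The greedy, variance-driven binary partitioning described above can be sketched roughly as follows; the node representation, stopping parameters, and the omission of the fuzzification of leaf rules are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def build_partition_tree(X, y, var_threshold=0.01, min_size=5):
    """Greedy, unbalanced binary partition tree: split each node on the
    variable/threshold that best reduces variance in the dependent variable."""
    if len(y) < min_size or np.var(y) < var_threshold:
        return {"leaf": True, "estimate": float(np.mean(y)), "error": float(np.std(y))}
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all() or (~left).all():
                continue
            # size-weighted variance of the two child nodes
            score = left.sum() * np.var(y[left]) + (~left).sum() * np.var(y[~left])
            if best is None or score < best[0]:
                best = (score, j, thr, left)
    if best is None:
        return {"leaf": True, "estimate": float(np.mean(y)), "error": float(np.std(y))}
    _, j, thr, left = best
    return {"leaf": False, "var": j, "thr": float(thr),
            "lo": build_partition_tree(X[left], y[left], var_threshold, min_size),
            "hi": build_partition_tree(X[~left], y[~left], var_threshold, min_size)}
```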
Statistical extension of rough set rule induction
Rough set based rule induction methods have been applied to knowledge discovery in databases. The empirical results obtained show that they are very powerful and that some important knowledge has been extracted from datasets. However, the quantitative evaluation of induced rules is based not on statistical evidence but on rather naive indices, such as conditional probabilities and functions of conditional probabilities. In this paper, we introduce a new approach to the quantitative evaluation of induced rules, which can be viewed as a statistical extension of rough set methods. For this extension, the chi-square distribution and the F-distribution play an important role in statistical evaluation.
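As a minimal illustration of evaluating a single rule "if A then D" against the chi-square distribution, the sketch below applies a chi-square test to the rule's 2x2 contingency table; the table layout and example counts are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def rule_chi_square(n_ad, n_a_not_d, n_not_a_d, n_not_a_not_d):
    """Chi-square statistic and p-value for the association between
    a rule's antecedent A and its decision D."""
    table = np.array([[n_ad, n_a_not_d],
                      [n_not_a_d, n_not_a_not_d]])
    stat, p_value, _, _ = chi2_contingency(table)
    return stat, p_value

# hypothetical counts: 40 records match A and D, 10 match only A,
# 15 match only D, 135 match neither
print(rule_chi_square(40, 10, 15, 135))
```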
New soft computing model for data mining
Sayee Sumathi, S. N. Sivanandam, Suresh Babu
Information and energy are at the core of everything around us. Our entire existence is a process of gathering, analyzing, understanding, and acting on information. For many applications dealing with large amounts of data, pattern classification is a key element in arriving at the solution. Engineering applications such as SONAR, RADAR, SEISMIC, and medical diagnosis require the ability to accurately classify the recorded data for controlling, tracking, and decision making. Although modern technologies enable storage of large streams of data, we do not yet have a technology to help us understand, analyze, or even visualize the hidden information in the data. Data mining is now the emerging field attracting all research communities. Pattern classification is one particular category of data mining, which enables the discovery of knowledge from Very Large Databases (VLDB). Modern developments in hardware and software technologies have paved the way for developing software for analyzing and visualizing data, based on the application of the data mining concept. Artificial neural networks, which offer better noise immunity and shorter training time, are used to mine the database. The paper aims mainly at classification accuracy with reduced learning time using self-organizing neural networks. The newness of the Adaptive Resonance Theory (ART) concept makes it the best and most efficient approach for classification.
Algebra view and information view of rough sets theory
Rough Set theory is a valid mathematical theory developed in recent years, which has the ability to deal with imprecise, uncertain, and vague information. It has been applied successfully in such fields as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Many researchers have studied rough sets from different views. In this paper, the algebra view and the information view of rough set theory are analyzed and compared with each other systematically. Some equivalence relations, and other kinds of relations such as inclusion relations, result from this comparative study. For example, the reduction under the algebra view is equivalent to the reduction under the information view if the decision table is consistent; otherwise, the reduction under the information view includes the reduction under the algebra view. These results will be useful for designing heuristic reduction algorithms.
Data Mining and Knowledge Discovery Technology: Applications I
Knowledge discovery in scientific data using hierarchical modeling in dimensional analysis
Steffen Brueckner, Stephan Rudolph
In the automotive and the aerospace industry large amounts of expensively gathered experimental data are stored in huge databases. The real worth of these databases lies not only in easy data access, but also in the additional possibility of extracting the engineering knowledge implicitly contained in these data. As analytical modeling techniques in engineering are usually limited in model complexity, data driven techniques gain more and more importance in this kind of modeling. Using additional engineering knowledge such as dimensional information, the data driven modeling process has a great potential for saving modeling as well as experimental effort and may therefore help to generate financial benefit. In a technical context, knowledge is often represented as numerical attribute-value pairs with corresponding measurement units. The database fields form the so-called relevance list which is the only information needed to find the set of dimensionless parameters for the problem. The Pi-Theorem of Buckingham guarantees that for each complete relevance list a set of dimensionless groups exists. The number of these dimensionless parameters is less than the number of dimensional parameters in the dimensional formulation, thus a dimensionality reduction can easily be accomplished. Additionally, dimensional analysis allows a hierarchical modeling technique, first creating models of subsystems and then aggregating them consecutively into the overall model using coupling numbers. This paper gives a brief introduction into dimensional analysis and then shows the procedure of hierarchical modeling, its implications, as well as its application to knowledge discovery in scientific data. The proposed method is illustrated in a simplified example from the aerospace industry.
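A small sketch of the dimensionless-group construction behind the Pi-Theorem mentioned above: build the dimension matrix of a relevance list (exponents of the base units for each variable) and read the exponents of the dimensionless groups off its null space. The example variables, units, and use of SciPy are illustrative assumptions, not the paper's relevance list.

```python
import numpy as np
from scipy.linalg import null_space

# columns: variables; rows: exponents of base dimensions (M, L, T)
# hypothetical relevance list: force F, density rho, velocity v, length l
dim_matrix = np.array([
    #  F     rho    v     l
    [ 1.0,   1.0,  0.0,  0.0],   # mass M
    [ 1.0,  -3.0,  1.0,  1.0],   # length L
    [-2.0,   0.0, -1.0,  0.0],   # time T
])

# each null-space column holds the exponents of one dimensionless group;
# here 4 variables - rank 3 = 1 group, proportional to F / (rho * v**2 * l**2)
pi_exponents = null_space(dim_matrix)
print(pi_exponents[:, 0])
```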
Asymmetric threat data mining and knowledge discovery
John F. Gilmore, Michael A. Pagels, Justin Palk
Asymmetric threats differ from the conventional force-on-force military encounters that the Defense Department has historically been trained to engage. Terrorism by its nature is now an operational activity that is neither easily detected nor countered, as its very existence depends on small covert attacks exploiting the element of surprise. But terrorism does have defined forms, motivations, tactics, and organizational structure. Exploiting a terrorism taxonomy provides the opportunity to discover and assess knowledge of terrorist operations. This paper describes the Asymmetric Threat Terrorist Assessment, Countering, and Knowledge (ATTACK) system. ATTACK has been developed to (a) data mine open source intelligence (OSINT) information from web-based newspaper sources, video news web casts, and actual terrorist web sites, (b) evaluate this information against a terrorism taxonomy, (c) exploit country/region specific social, economic, political, and religious knowledge, and (d) discover and predict potential terrorist activities and association links. Details of the asymmetric threat structure and the ATTACK system architecture are presented, with results of an actual terrorist data mining and knowledge discovery test case shown.
Medical knowledge discovery in hospital information systems
Since the early 1980s, the rapid growth of hospital information systems has led to large amounts of laboratory examinations being stored as databases. Thus, it is highly expected that knowledge discovery and data mining (KDD) methods will find interesting patterns in these databases, as a reuse of stored data, and will be important for medical research and practice, because human beings cannot deal with such a huge amount of data. However, there are still few empirical approaches which discuss the whole data mining process from the viewpoint of medical data. In this paper, the KDD process for a hospital information system is presented using two medical datasets. This empirical study shows that preprocessing and data projection are the most time-consuming processes, which very little data mining research has yet discussed, and that the application of rule induction methods is much easier than preprocessing.
Data Mining and Knowledge Discovery Technology: Applications II
Multimedia agent monitoring and assessment system
John F. Gilmore, Michael A. Pagels, Justin Palk
The explosion of Information Technology (IT) in the commercial sector in the 1990's has led to billion dollar corporations overtaking the US Government (e.g., DARPA, NSF) as the leaders in IT research and development. The tenacity of the IT industry in accelerating technology development, in response to commercial demands, has actually provided government organizations with a unique opportunity to incorporate robust commercial IT into their individual applications. This development allows government agencies to focus their limited funds on the application aspects of their problems by leveraging commercial information technology developments. This paradigm applies directly to counterdrug enforcement and support. This paper describes a system that applies the state-of-the-art in information technology to news and information exploitation to produce a Multi-media Agent Monitoring and Assessment (MAMA) system capable of tracking information for field agent use, identifying assets of organizations and individuals for seizure, and disrupting drug shipping routes.
Fractal and multidimensional analysis of data from delay complex systems
A method of multifractal analysis for multidimensional attractors is proposed; adaptive segmentation of phase trajectories is used analogously to the generalized correlation integral approach. A rarefied sequence of points of the phase trajectory forms the centers of the segmentation cells; the upper bound of such rarefying is estimated from a statistical analysis of quasi-periodicity along the phase trajectories of the multidimensional attractor. This method reduces the required computation time in comparison with the traditional correlation integral approach and provides good convergence. Since the attractor reconstruction was implemented in a Takens phase space, analytical approaches for the product scheme of functional matrices derived from delay differential iterations are developed for such a phase space. Numerical results for the eigenvalue evolution of the obtained functional matrices and for the multifractal investigation are presented.
Project X: competitive intelligence data mining and analysis
John F. Gilmore, Michael A. Pagels, Justin Palk
Competitive Intelligence (CI) is a systematic and ethical program for gathering and analyzing information about your competitors' activities and general business trends to further your own company's goals. CI allows companies to gather extensive information on their competitors and to analyze what the competition is doing in order to maintain or gain a competitive edge. In commercial business this potentially translates into millions of dollars in annual savings or losses. The Internet provides an overwhelming portal of information for CI analysis. The problem is how a company can automate the translation of voluminous information into valuable and actionable knowledge. This paper describes Project X, an agent-based data mining system specifically developed for extracting and analyzing competitive information from the Internet. Project X gathers CI information from a variety of sources including online newspapers, corporate websites, industry sector reporting sites, speech archiving sites, video news casts, stock news sites, weather sites, and rumor sites. It uses individual industry specific (e.g., pharmaceutical, financial, aerospace, etc.) commercial sector ontologies to form the knowledge filtering and discovery structures/content required to filter and identify valuable competitive knowledge. Project X is described in detail and an example competitive intelligence case is shown demonstrating the system's performance and utility for business intelligence.
Decomposition in data mining: a medical case study
Decomposition is a tool for managing complexity in data mining and enhancing the quality of knowledge extracted from large databases. A typology of decomposition approaches applicable to data mining is presented. One of the decomposition approaches, the structure rule-feature matrix, is used as the backbone of a system for informed decision- making. Such a system can be implemented as a decision table, a decision map, or a decision atlas. The ideas presented in the paper are illustrated with examples and a medical case study.
Association Rules and Time Series Analysis
Mining association rules between low-level image features and high-level concepts
Ishwar K. Sethi, Ioana L. Coman, Daniela Stan
In image similarity retrieval systems, color is one of the most widely used features. Users who are not well versed in the image domain characteristics might be more comfortable working with an image retrieval system that allows specification of a query in terms of keywords, thus eliminating the usual intimidation of dealing with very primitive features. In this paper we present two approaches to automatic image annotation, by finding the rules underlying the links between the low-level features and the high-level concepts associated with images. One scheme uses global color image information and classification tree based techniques. Through this supervised learning approach we are able to identify relationships between global color-based image features and some textual descriptors. In the second approach, using low-level image features that capture local color information and a k-means based clustering mechanism, images are organized in clusters such that similar images are located in the same cluster. For each cluster, a set of rules is derived to capture the association between the localized color-based image features and the textual descriptors relevant to the cluster.
Association mining of dependency between time series
Time series analysis is considered a crucial component of strategic control over a broad variety of disciplines in business, science, and engineering. Time series data is a sequence of observations collected over intervals of time. Each time series describes a phenomenon as a function of time. Analysis of time series data includes discovering trends (or patterns) in a time series sequence. In the last few years, data mining has emerged and been recognized as a new technology for data analysis. Data mining is the process of discovering potentially valuable patterns, associations, trends, sequences, and dependencies in data. Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. In this paper, we adapt and innovate data mining techniques to analyze time series data. Using data mining techniques, maximal frequent patterns are discovered and used in predicting future sequences or trends, where trends describe the behavior of a sequence. In order to include different types of time series (e.g., irregular and non-systematic), we consider past frequent patterns of the same time sequences (local patterns) and of other dependent time sequences (global patterns). We use the word 'dependent' instead of the word 'similar' to emphasize real-life time series where two time series sequences could be completely different (in values, shapes, etc.), but still react to the same conditions in a dependent way. In this paper, we propose the Dependence Mining Technique that can be used in predicting time series sequences. The proposed technique consists of three phases: (a) for all time series sequences, generate their trend sequences; (b) discover maximal frequent trend patterns and generate pattern vectors (to keep information on frequent trend patterns); and (c) use trend pattern vectors to predict future time series sequences.
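Phase (a) of the proposed technique, turning a numeric time series into a trend sequence, might look like the following minimal sketch; the symbol alphabet and tolerance are illustrative assumptions, and the frequent-trend-pattern mining and prediction phases are not shown.

```python
def trend_sequence(values, tolerance=0.0):
    """Convert a numeric series into a string of trend symbols."""
    symbols = []
    for prev, curr in zip(values, values[1:]):
        delta = curr - prev
        if delta > tolerance:
            symbols.append("U")    # upward trend
        elif delta < -tolerance:
            symbols.append("D")    # downward trend
        else:
            symbols.append("S")    # steady
    return "".join(symbols)

# e.g. trend_sequence([10, 12, 12, 9, 11]) -> "USDU"
```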
Bitmap approach to trend clustering for prediction in time series databases
Jong P. Yoon, Yixin Luo, Junghyun Nam
This paper describes a bitmap approach to clustering and prediction of trends in time-series databases. Similar trend patterns, rather than similar data patterns, are extracted from the time-series database. We consider four types of matches: (1) exact match, (2) similarity match, (3) exact match by shift, and (4) similarity match by shift. Each pair of time-series data may be matched in one of these four types if the pair is similar, according to a similarity (or sim) notion over a threshold. Matched data can be clustered in the same way. To improve performance, we use the notion of the center of a cluster; the radius of a cluster is used to determine whether a given time-series data is included in the cluster. We also use a new notion of dissimilarity, called dissim, to make accurate clusters. By using both notions, sim and dissim, a time-series data is more likely to be placed in one cluster rather than in another: the data is similar to one cluster while it is dissimilar to another. For a trend sequence, the cluster that is dissimilar to that sequence is called the dissimilar-cluster. The contributions of this paper include (1) clustering by using not only similarity match but also dissimilarity match, which prevents positive and negative failures; (2) prediction by using not only similar trend sequences but also dissimilar trend sequences; and (3) a bitmap approach that improves the performance of clustering and prediction.
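A rough sketch of the sim/dissim idea for cluster assignment described above, under the simplifying assumption that trend sequences are equal-length symbol strings compared position by position; the bitmap encoding, cluster radius, and shift-based matches from the paper are not reproduced, and the thresholds are hypothetical.

```python
def sim(seq, center):
    """Fraction of positions at which two equal-length trend sequences agree."""
    return sum(a == b for a, b in zip(seq, center)) / len(center)

def assign_cluster(seq, centers, sim_threshold=0.8, dissim_threshold=0.4):
    """Assign seq to the cluster it is similar to, provided it is
    dissimilar to every other cluster; otherwise return None."""
    best, best_sim = None, 0.0
    for name, center in centers.items():
        s = sim(seq, center)
        if s > best_sim:
            best, best_sim = name, s
    others = [sim(seq, c) for n, c in centers.items() if n != best]
    if best_sim >= sim_threshold and all(s <= dissim_threshold for s in others):
        return best
    return None
```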
Fuzzy-logic-based method for chaotic time series prediction
Mo Wang, Guang Rong, S.Y. Liao
In this paper, a fuzzy logic based method for single or multi-dimensional Chaotic Time Series (CTS, hereafter) prediction is proposed. The fundamental characteristic of CTS is that it demonstrates both stochastic behavior in the time domain and deterministic behavior in phase space. The motivation of this research is twofold: (1) the embedded phase space track of CTS data has proven to be a quantitative analysis of a dynamic system in different embedding dimensions; (2) Fuzzy Logic (FL) not only has the capability of handling a much more complex system, but its superiority in time convergence has also proven to be a valuable asset for time critical applications. The process of using the proposed method for CTS prediction includes the following steps: (1) reconstructing a phase space using the CTS; (2) using known phase space points to construct the input-output pairs; (3) using a fuzzy system to predict the unknown embedding phase space points; (4) predicting the CTS data by converting the phase space points back to the time domain. A C++ program is written to simulate the process. The simulation results show that the proposed method is simple, practical, and effective.
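Step (1) of the process, phase-space reconstruction by delay embedding, together with the construction of input-output pairs in step (2), can be sketched as below; the embedding dimension, delay, and prediction horizon are illustrative assumptions, and the fuzzy system of step (3) is not shown.

```python
import numpy as np

def delay_embed(series, dim=3, tau=1):
    """Matrix of delay vectors [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    series = np.asarray(series)
    n = len(series) - (dim - 1) * tau
    return np.column_stack([series[i * tau: i * tau + n] for i in range(dim)])

def training_pairs(series, dim=3, tau=1, horizon=1):
    """Input-output pairs: embedded point -> value 'horizon' steps ahead."""
    series = np.asarray(series)
    embedded = delay_embed(series, dim, tau)
    x = embedded[:-horizon]
    y = series[(dim - 1) * tau + horizon:]
    return x, y
```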
Multichannel art network with rule extraction: as a data mining tool
Sayee Sumathi, S. N. Sivanandam, Suresh Babu
Information and energy are at the core of everything around us. Our entire existence is a process of gathering, analyzing, understanding, and acting on information. For many applications dealing with large amounts of data, pattern classification is a key element in arriving at the solution. Engineering applications, especially medical diagnosis, require the ability to accurately classify the recorded data for controlling, tracking, and decision making. Although modern technologies enable storage of large streams of data, we do not have the technology to help us understand, analyze, or even visualize the hidden information in the data. Data mining is the emerging field that helps us to understand and analyze the hidden information in Very Large Databases (VLDB). In this paper, two important mining tools, neural networks and genetic algorithms, have been used for mining the database through pattern classification. The processing methodology consists of three major phases: network construction and training, pruning, and rule extraction and validation. The networks used are Adaptive Resonance Theory 1.5, Adaptive Resonance Theory 3, and Multi Channel ART (MART). These networks belong to the Adaptive Resonance Theory (ART) family, which is a self-organizing neural network capable of clustering arbitrary sequences of input patterns into stable recognition codes. The pruning phase aims at removing redundant links and units without increasing the classification error rate of the network. The pruning methods adopted are local pruning and threshold pruning.