Like other data-rich disciplines such as physics, biology, geology, and oceanography, astronomy is facing a data avalanche due to advances in telescope and detector technology, the exponential increase in computing capabilities, improvements in data-collection methods, and successful applications of theoretical simulations. As the era approaches in which data covers the full range of wavelengths from radio to gamma-rays, the expected data volumes will add up to terabytes, soon to be followed by petabytes.
Proper management and processing of massive data sets requires efficient federation of database technologies. However, mining knowledge from huge data volumes is the ultimate goal and development of data-mining techniques is therefore critical. Knowledge discovery in databases (KDD) is the process of extracting useful knowledge from data. Data mining, the application of specific algorithms to discover rare or previously unknown types of object or phenomenon, is a particular step in the process.1 KDD is inherently interactive and iterative, as shown in Figure 1. Common KDD functions are classification, cluster analysis, and regression.
In classification, one develops a description or model for each class in a data set whose labels are discrete integers. (Cluster analysis, by contrast, works without labels and is sometimes called ‘unsupervised classification’.) Classification is used to organize future test data, to better understand each data class, and to predict certain properties and behaviors. In astronomy it is typically based on spectra or images and may be used, for example, to describe galaxies by morphology.
Figure 1. Knowledge discovery in databases.
Cluster (or clustering) analysis is a multivariate procedure that places objects into more or less homogeneous groups such that the relationship between groups is revealed. It lacks an underlying body of statistical theory and is heuristic in nature, requiring decisions to be made by individual users (which can strongly affect results). Cluster analysis is used to group objects on objective rather than subjective criteria and can help astronomers find unusual objects within a flood of data. Examples include discoveries of high-redshift quasars, type-2 quasars (highly luminous active galactic nuclei whose centers are obscured by gas and dust), and brown dwarfs.
In regression analysis the target values are real and continuous rather than discrete. An algorithm that can handle both real and integer targets can therefore be used for both classification and regression. Discoveries in astronomy from regression include the Hertzsprung-Russell diagram and Hubble's law relating a galaxy's recessional velocity to its distance. Problems in astronomy that can be solved by regression include photometric or spectral redshift measurements of galaxies and quasars and physical parameter estimations of stars.
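As a rough illustration of regression with a continuous target, the sketch below fits a straight line through the origin, the form taken by Hubble's law, using ordinary least squares. The distances, the scatter values, and the slope of 70 km/s/Mpc used to generate the velocities are all invented for the example and are not measurements; the function name `fit_through_origin` is likewise hypothetical.

```python
# Illustrative only: synthetic distances (Mpc) and velocities (km/s).
# The velocities are generated from an ASSUMED slope of 70 km/s/Mpc
# plus made-up scatter; none of these numbers are real observations.

def fit_through_origin(x, y):
    """Least-squares slope for the one-parameter model y = slope * x."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x)
    return numerator / denominator

distances = [10.0, 50.0, 100.0, 200.0, 400.0]
scatter = [30.0, -20.0, 10.0, -40.0, 20.0]
velocities = [70.0 * d + e for d, e in zip(distances, scatter)]

# Recover the slope from the noisy data, then use it to predict the
# velocity of a hypothetical galaxy at 150 Mpc.
slope = fit_through_origin(distances, velocities)
predicted_velocity = slope * 150.0
print(f"fitted slope: {slope:.2f} km/s/Mpc")
print(f"predicted velocity at 150 Mpc: {predicted_velocity:.0f} km/s")
```

The same pattern, learning a mapping from inputs to a real-valued target and then applying it to new inputs, underlies applications such as photometric-redshift estimation, although practical methods are far more elaborate.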
KDD is a new and growing field which can address many of the problems facing modern astronomy. Many knowledge-discovery methods are in use and under development, some generic while others remain domain specific. Six common, essential elements qualify a data-mining approach as a KDD technique. All KDD methods2,3 share the same principles of efficiency, accuracy, comprehensibility, automation, and generalization, taking the shortest time possible to learn.
Data-mining algorithms are a core part of KDD. They can be supervised, semisupervised, or unsupervised. Supervised learning uses labeled training data to infer a model, which is then applied to test data. Unsupervised learning works directly on unlabeled data, with no training set. In other words, supervised learning requires labeled input data, while unsupervised learning does not. The semisupervised approach uses a combination of labeled and unlabeled data to train a classifier. A large amount of unlabeled data can often be supplemented with a small amount of labeled data to construct a useful classifier.
Generally, supervised-learning algorithms achieve a higher success rate than unsupervised approaches, as judged by the value of the resulting knowledge. Either approach may first require reduction of high dimensionality, which relies on feature selection and extraction to remove irrelevant or redundant variables. Feature-selection methods include the filter, wrapper, and embedded methods.4
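A filter method scores each feature independently of any learning algorithm and keeps the best-scoring ones. The sketch below is one simple possibility, ranking features by the absolute Pearson correlation with the target; the feature names (`color`, `noise`, `flag`), the values, and the function names are hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def filter_select(columns, target, k):
    """Keep the k features with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Made-up data: one informative feature, one noisy one, one constant flag.
target = [0.1, 0.2, 0.3, 0.4, 0.5]
columns = {
    "color": [0.2, 0.4, 0.6, 0.8, 1.0],
    "noise": [0.3, -0.1, 0.4, -0.2, 0.1],
    "flag": [1, 1, 1, 1, 1],
}

selected, scores = filter_select(columns, target, k=1)
print(selected, scores)
```

Wrapper and embedded methods differ in that they involve the learning algorithm itself in the selection, which this sketch does not attempt.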
Learning algorithms, which can be realized using different approaches, are complex and generally considered the hardest part of any KDD technique.5,6 Classification and regression are normally performed by supervised-learning techniques. Many algorithms, such as k-nearest neighbors, support-vector machines, neural networks, naïve Bayes, decision trees, decision rules, metalearning, genetic algorithms, fuzzy sets, rough sets, and ensembles of classifiers, have been applied to solve classification problems. Frequently used regression methods include locally weighted, kernel, and projection-pursuit regression, k-nearest neighbors, and neural networks.
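To make the supervised setting concrete, here is a minimal sketch of k-nearest-neighbors classification using only the Python standard library. The two-dimensional feature vectors and the integer class labels (0 and 1) are invented for the example; in practice the features might be colors or spectral indices, and a tested library implementation would normally be used.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Assign `query` the majority label among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Made-up training set: two well-separated classes labeled 0 and 1.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (1.0, 1.1), (1.1, 0.9), (0.9, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.2, 0.2)))
print(knn_predict(train_x, train_y, (1.0, 1.0)))
```

Replacing the majority vote with the mean of the neighbors' target values turns the same idea into a k-nearest-neighbors regressor, which is why the method appears in both lists above.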
Cluster analysis is usually realized by unsupervised-learning techniques. It groups objects of similar kinds into categories and sorts different objects into groups by their degree of association. It uses a number of different algorithms, such as K-means, K-medoids, AutoClass, self-organizing maps, principal-component analysis, and expectation maximization.
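As an illustration of unsupervised grouping, the sketch below implements the basic K-means iteration (assign each point to its nearest center, then move each center to the mean of its members) in plain Python. The points and the choice of initial centers are assumptions made for the example; the result of K-means generally depends on initialization, which is one of the user decisions that can affect clustering results.

```python
import math

def kmeans(points, initial_centers, n_iter=20):
    """Basic K-means (Lloyd's algorithm) with caller-supplied initial centers."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members;
        # a center with no members is left where it is.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers

# Made-up data: two compact groups, with one initial center taken from each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, [points[0], points[3]])
print(labels)
print(centers)
```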
Outlier detection aims to detect objects that behave in an unexpected way or have abnormal properties. It can find rare, unknown, or bad data. The techniques used are commonly divided into six categories: distribution-, depth-, distance-, clustering-, density-, and deviation-based methods.7
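A distance-based approach can be sketched as follows: score each object by the distance to its k-th nearest neighbor, and flag objects whose score is much larger than is typical for the sample. The threshold rule used here (a multiple of the median score), the parameter values, and the data are assumptions for illustration only; other scoring and thresholding choices are equally possible.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores

def flag_outliers(points, k=2, factor=3.0):
    """Return indices whose score exceeds `factor` times the median score."""
    scores = knn_distance_scores(points, k)
    median = sorted(scores)[len(scores) // 2]
    return [i for i, s in enumerate(scores) if s > factor * median]

# Made-up data: five clustered points and one isolated point at (8, 8).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
print(flag_outliers(points))
```

Whether a flagged object is a rare source, a new type of object, or simply bad data cannot be decided by the score alone and still requires inspection.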
The future of KDD
Automation of KDD would offer many advantages. Numerous projects are currently underway to achieve this goal, such as the International Virtual Observatory Alliance (IVOA),8 as well as the GRIST9 and astrostatistics10 programs. In addition, we previously proposed an architecture for multiwavelength data mining.11 In this system, users with no database knowledge may create their own databases and federate multiwavelength data using automated database-creation and cross-match tools. The use of such data-mining tools will enable scientists to work with large data samples. For example, a recently designed automatic system for photometric-redshift estimation will become an essential tool to automatically determine the physical parameters of galaxies, quasars, and stars.12
Progress in this field requires international collaboration of experts from various disciplines, including computer scientists, database and data-mining specialists, statisticians, and astronomers. Only then will the astronomical community (and other data-rich sciences) share in the intellectual prosperity afforded by optimal investigation of the available data. We believe that our work on the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) project at the National Astronomical Observatories of the Chinese Academy of Sciences will be a successful example of how to integrate data acquisition and knowledge retrieval.
This article is funded by the National Natural Science Foundation of China under grant No. 10778724 and by Chinese National 863 project No. 2006AA01A120.
Yanxia Zhang, Yongheng Zhao
National Astronomical Observatories
Chinese Academy of Sciences
Yanxia Zhang is an associate professor. She specializes in the study of multiwavelength astronomy and in data-mining algorithms.
Yongheng Zhao is project manager of the LAMOST project. A professor since 1996, he specializes in the study of high-energy astrophysics, and data mining and analysis in astronomy.