The Astronomical Society of the Pacific

 


Digital Dig - Data Mining in Astronomy

 

by Matthew Woodard

Introduction

The sky has always offered a wealth of information to observers. Every night the stories of countless stars, galaxies, quasars, planets, and other phenomena appear in the sky. The rich history of our universe has been playing itself out for human eyes since the beginning of history. Never before, however, have scientists had the ability to capture so much information about the universe. As time progresses, our records of the sky grow continuously larger. The amount of data, in terms of pixels captured, doubles every year (Szalay website). The growth is now so fast that data accumulation far exceeds the ability of humans to consider it all. Computers and magnetic storage devices make it possible to store terabytes (2^40 bytes) of information, with petabytes (2^50 bytes) on the way (Djorgovski, Mahabal, et al. 52). While the computer has helped widen the gap between information capture and study, it also offers an opportunity to make better use of the data. The computer's ability to execute algorithms rapidly lets it examine the huge amounts of astronomical data available. By using data analysis techniques such as "knowledge discovery," astronomers can use computers to find new phenomena, relationships, and useful knowledge about the universe, and ultimately reduce the gap between data capture and analysis.

The successful use of computers to analyze data, however, requires many steps. First, raw data is transformed into a catalog that records positions, fluxes, and shapes of stellar objects. This method of cataloging information has the advantages of reducing data sizes, retaining useful information, and presenting the data in a way that is accessible for further analysis by computers. After the catalog is created it must be stored so it can be studied. Currently catalogs are maintained as databases. Finally, knowledge must be extracted from the catalog either by human observation, or by using computerized tools to assist in the discovery of meaningful knowledge. The ability to consider multiple surveys at different wavelengths simultaneously offers another frontier to which computers can be applied. Currently there are tools available for considering data from multiple surveys, and future astronomers can look forward to new seamless ways of examining surveys. This sort of network, in which multiple data sets are presented as a single source, is the idea behind the National Virtual Observatory, which aims to unify all astronomical catalogs through one interface. By using computers to explore digital surveys of the sky, astronomers can effectively mine the huge amount of data they are accruing, and ultimately make new and interesting discoveries.

Digital Catalogs

Digital catalogs of the sky, also called surveys, are analogous to non-digital records of the sky. A digital catalog lists the positions of objects in the sky, as well as several features of each object. If a star or galaxy is well known, its name may also be included. The Sloan Digital Sky Survey, for example, extracts "400 attributes for each celestial object." (Szalay, Gray, et al. 1) The primary information is the flux of an object, measured in five bands: ultraviolet, green, red, and two near-infrared bands (Szalay, Gray, et al. 5). This information is then made available for exploration by scientists. The main difference between a digital and a traditional catalog of the sky is that traditional catalogs are compiled by humans who observe the sky, whereas digital catalogs are made by computers that examine images of the sky. Several of the major catalogs include the Faint Images of the Radio Sky at Twenty centimeters (FIRST), the Digital Palomar Sky Survey (DPOSS), the Two Micron All Sky Survey (2MASS), and the Sloan Digital Sky Survey (SDSS). While most catalogs record only images of the sky, there are spectroscopic surveys as well (Djorgovski, Brunner 1). Most catalogs are produced from sky surveys that take pictures of the sky. If complete photographic records of the sky already exist, why would scientists want to make digital catalogs of the information?

Most digital catalogs represent the sky as a series of objects (stars or galaxies) or entries in a database, even though the original sources for the catalogs were images. This format offers several advantages for scientists. First, digital catalogs of the sky are much smaller than the raw image input. For example the raw images from the FIRST survey consume 250 gigabytes (250 * 2 ^ 30 bytes) of disk space while the corresponding catalog only takes 78 megabytes of storage (78 * 2 ^ 20 bytes), a sizeable difference (Kamath et al. 14). Another advantage of the cataloging process is that it presents only interesting information about the sky. "[The images of the FIRST survey] are mainly noise, with very few 'interesting' pixels corresponding to the radio sources…On the other hand we have the 78 Megabyte catalog, where each entry contains information on only a part of a radio source." (Kamath et al. 14) The catalog in essence tries to reduce the information to only those areas that are relevant sources of information. This transformation has another useful result — the data is now in a format where a computer can easily compare multiple sources, allowing the use of methods like data mining to assist research. In order to create these useful catalogs, scientists must create algorithms to extract useful information from the original images.

The first step in cataloging the sky is acquiring the data to process. Currently, data is being collected in two ways: by scanning old photographic plates, and by using instruments that pipe their data directly into computers. The Digital Palomar Sky Survey (DPOSS) uses plates exposed at the Mount Palomar Schmidt Telescope. Each plate covered 36 degrees of the night sky and there were three exposures, in blue, red, and near infrared. These images were scanned and resolved into pixelated images with each pixel representing a 1 arc second square (Djorgovski, Gal, et al. 3). A different tack was taken for the Sloan Digital Sky Survey. In this project the images were obtained by using hardware that measured fluxes at five different colors directly from the telescope. To properly form the image, the SDSS telescope considered each section of the sky on two separate nights, forming a composite image. The SDSS also obtained spectrograms; however, there are only fifty thousand spectrograms for fourteen million objects (Szalay, Gray, et al. 1). The DPOSS and SDSS represent two Earth-based means of recording surveys, but satellite sources are also used. The Roentgen Satellite (ROSAT), for example, has produced the ROSAT survey with information on the X-ray sources in the sky (Page, Denby 351). Although all of these surveys use different data gathering techniques, they all result in raw pixelated images, which are usually processed further to create the catalog.

The next step in constructing a catalog is to convert the raw pixel images to catalog entries. To perform the transformation, computers need to be able to distinguish stellar objects from each other, from earth-based objects such as planes, and from the surrounding sky. These programs have to perform several steps. The first step is to decide which pixels belong to stars or galaxies, and which are noise. One way to detect objects is to "require a certain minimum on adjacent pixels above some signal to noise threshold for detection." (Djorgovski, Brunner 2) After locating significant pixels, the programs must group the pixels into objects while making sure that they separate, or "deblend," overlapping sources (Djorgovski, Brunner 2). Finally, the programs should record the relevant modeling information from each object and possibly classify it as either a star or a galaxy (Djorgovski, Brunner 1). In practice it is known that to be useful this classification needs accuracies of "better than 90%," but that "accuracies higher than 95% are generally hard to achieve." (Djorgovski, Brunner 3) Several different approaches exist for implementing this procedure.
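The detection and grouping steps can be illustrated with a minimal Python sketch. It is not the code of any survey pipeline: the function name `detect_objects`, the fixed brightness threshold, the four-neighbor rule for adjacency, and the toy image are all assumptions made for illustration, and deblending is left out entirely.

```python
from collections import deque

def detect_objects(image, threshold, min_pixels=2):
    """Group adjacent above-threshold pixels into candidate objects.

    image is a list of rows of brightness values (a hypothetical format).
    Groups with fewer than min_pixels pixels are treated as noise.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill over the four direct neighbors.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                objects.append(pixels)
    return objects

# Toy image: one three-pixel source and one isolated bright pixel.
image = [
    [1, 1, 1, 1, 1],
    [1, 9, 8, 1, 1],
    [1, 7, 1, 1, 6],
    [1, 1, 1, 1, 1],
]
found = detect_objects(image, threshold=5, min_pixels=2)
```

Here the isolated bright pixel is discarded as noise because it has no bright neighbors, which is the point of requiring a minimum number of adjacent pixels.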

An example of this process is The Neural Extractor (NEXT), which uses an artificial neural network to classify objects in the sky (Tagliaferri, Longo, Iovane 107). An artificial neural network is a computer algorithm that uses an extremely simple model of the brain to classify information. The neural network is composed of sets of nodes connected to each other by weighted links. Every time information is passed between the nodes, the signal is amplified or suppressed according to the strength of the link between the two nodes. By adjusting the strength of the links, the network can learn to produce correct output (Russell, Norvig 567). NEXT uses several neural networks to detect objects. First, pixels are labeled as either background or significant pixels; this is done to remove noise from the data. Each pixel is considered along with adjacent pixels "since the attribution of a pixel to either the 'background' or the 'object' classes depends on the pixel value and the values of the adjacent pixels." (Tagliaferri, Longo, Iovane 108) The procedure continues by grouping adjacent pixels into objects. After determining objects, NEXT deblends overlapping objects into their own unique objects. Deblending is done by separating each peak in light magnitude into its own object. Finally, NEXT extracts the information for each object and attempts to classify it as either a galaxy or a star by using another neural network. The steps taken by NEXT are similar to the steps taken by other classifiers.
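The idea of learning by adjusting link strengths can be shown with the simplest possible case, a single node (a perceptron). The sketch below is not NEXT's network; the two input features (a pixel's value and the mean of its neighbors), the training values, and the learning rate are invented for illustration.

```python
def predict(weights, bias, features):
    """Return 1 ('object') if the weighted sum of inputs is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0

def train(samples, labels, epochs=50, rate=0.5):
    """Perceptron rule: nudge each link weight in proportion to the error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

# Hypothetical training data: (pixel value, mean of adjacent pixels),
# labeled 1 for 'object' and 0 for 'background'.
samples = [(0.9, 0.8), (0.8, 0.7), (0.1, 0.2), (0.2, 0.1), (0.7, 0.9), (0.15, 0.1)]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = train(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
```

A real network chains many such nodes together, but the principle is the same: wrong answers strengthen or weaken the links until the outputs match the known labels.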

The Sky Server Project at Fermilab, which catalogs the SDSS, uses a similar set of steps to classify objects in the sky. The raw input for the sky server is a set of color magnitude readings in five wavelength bands from sensors on the telescope. Some spectroscopic readings are done as well (Szalay, Gray 5). Because many stars and galaxies overlap each other the individual objects must be deblended. Roughly 80% of the objects are singular; the rest are deblended (Szalay, Gray 4). The SDSS is a five year project to be finished in 2006. After the catalog has been captured, it must be stored in a database so the information can be accessed.

Storage

Storage of catalogs has been done primarily on traditional database management systems; however, to improve the retrieval speed of the databases, astronomers have employed several new techniques. Most of the catalogs are stored in relational databases, where the information is stored as tables, since there are many existing software packages to deal with this sort of data (Djorgovski, Brunner 4). One change that has simplified selecting regions of the sky is the introduction of the HTM system of positioning. HTM, or Hierarchical Triangular Mesh, is based on splitting the celestial sphere into 8 parts by quartering the hemispheres. Each section can be thought of as a triangle projected onto the celestial globe. These triangles are then further subdivided by connecting the midpoints of the triangle's sides, producing four new regions, which in turn are subdivided. This scheme simplifies representations of the sky and reduces the time taken to query any one section (Brunner 6). The naming of the sections reflects this hierarchy of divisions on the celestial sphere. The base regions for the Northern hemisphere are N0, N1, N2, N3. Successive selections are represented by adding digits to specify the next triangle to select. To select the lower-left triangle of the N0 region one would simply ask for N01 (Brunner 6). The HTM system of locating an object exists for ease of data retrieval by the computer. When the data is grouped by HTM coordinates, objects near each other on the celestial sphere will be near each other on the computer disk, speeding up searches for objects within certain angles of each other (Szalay, Gray 7). Another common approach is to store objects' positions as Cartesian coordinates on a unit sphere, since computers work faster with Cartesian than with spherical coordinate systems. With these modifications, queries of sky surveys become much simpler to implement (Brunner 6). Most systems allow users to enter right ascension and declination to find objects, so the HTM system is fairly transparent to astronomers. After the data has been stored, it will need to be explored to discover relevant information.
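Both storage ideas, Cartesian unit vectors and recursive triangle subdivision, can be sketched in a few lines of Python. This is an illustration only: the root triangle chosen below, the order in which the four child triangles are numbered, and the function names are assumptions, and they are not guaranteed to match the official HTM definition of N0 or of N01.

```python
import math

def radec_to_unit_vector(ra_deg, dec_deg):
    """Convert right ascension and declination (degrees) to (x, y, z)."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _midpoint(a, b):
    """Midpoint of two unit vectors, pushed back onto the unit sphere."""
    m = tuple(x + y for x, y in zip(a, b))
    norm = math.sqrt(_dot(m, m))
    return tuple(x / norm for x in m)

def _contains(triangle, p):
    """True if p lies inside a counterclockwise spherical triangle."""
    a, b, c = triangle
    eps = -1e-12  # small tolerance for points on an edge
    return (_dot(_cross(a, b), p) >= eps
            and _dot(_cross(b, c), p) >= eps
            and _dot(_cross(c, a), p) >= eps)

def _children(triangle):
    """Split a triangle into four by connecting the midpoints of its sides."""
    v0, v1, v2 = triangle
    w0, w1, w2 = _midpoint(v1, v2), _midpoint(v0, v2), _midpoint(v0, v1)
    # Assumed numbering: 0-2 are the corner triangles, 3 is the middle one.
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]

def triangle_name(point, root, root_name, depth):
    """Append one digit per subdivision level, as in N0 -> N01 -> N012."""
    name, triangle = root_name, root
    for _ in range(depth):
        for digit, child in enumerate(_children(triangle)):
            if _contains(child, point):
                name += str(digit)
                triangle = child
                break
        else:
            raise ValueError("point is outside the root triangle")
    return name

# Hypothetical root triangle covering one octant of the northern sky.
root = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
name = triangle_name(radec_to_unit_vector(1.0, 1.0), root, "N0", depth=3)
```

Because nearby positions share a long common prefix in their names, sorting the catalog by name tends to keep neighbors on the sky close together in storage. The inside test uses only multiplications and additions on the Cartesian coordinates, which is one reason unit vectors are convenient for computers.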

Data Mining

Once the data has been cataloged, the most important phase begins, namely using the data in the catalog to obtain new astronomical knowledge. Of course, the data can be studied by astronomers unaided by computers; however, the digital format of the data makes it possible for computers to automate the search to assist in knowledge discovery. This process of using computers to extract useful information from a database is called "knowledge discovery," or simply data mining. Data mining can be described as "an information extraction activity whose goal is to discover hidden facts contained in databases." (Borne 1) Common tasks in astronomical data mining are using known properties to find a specific type of object, finding new phenomena by clustering objects and considering the outliers, using predicted properties of a theoretical model to search for candidates, and searches for "one of a kind" events (Borne 2). Many of these techniques have been used to direct research in astronomy.

An example of using the data mining technique is the search for bent-double radio galaxies at the Lawrence Livermore National Laboratory. Scientists used the FIRST catalog of radio emissions at twenty centimeters to search for bent-double galaxies. To find these galaxies, scientists first grouped catalog entries into 'radio sources' that represented sets of galaxies where all members were at most 0.96 arc minutes from their closest neighbors (Kamath 14). Scientists discarded all one-galaxy sets since they could not be bent doubles, and then removed groups with four or more galaxies for individual consideration, since groups with more than three galaxies "are 'interesting' to the astronomers, regardless of whether they are bent-doubles or not." (Kamath 15) The scientists then used a "decision tree" to classify the galaxies as bent doubles or not.
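The grouping step can be sketched as a "friends of friends" search. The sketch assumes that grouping is transitive (an entry joins a group if it lies within 0.96 arc minutes of any member), which is one reading of the description above and not necessarily the exact procedure used with FIRST. The positions and function names are invented for illustration.

```python
import math

def separation_arcmin(a, b):
    """Angular separation between two (ra, dec) positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (*a, *b))
    # Haversine formula, which stays accurate for very small angles.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 60

def group_sources(positions, link_arcmin=0.96):
    """Link entries closer than link_arcmin and return groups of indices."""
    parent = list(range(len(positions)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if separation_arcmin(positions[i], positions[j]) <= link_arcmin:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(positions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical catalog entries as (ra, dec) in degrees.
positions = [(10.0, 0.0), (10.01, 0.0), (10.02, 0.0),
             (11.0, 0.0), (12.0, 5.0), (12.0, 5.01)]
groups = group_sources(positions)
# Keep groups of two or three entries as bent-double candidates.
candidates = [g for g in groups if 2 <= len(g) <= 3]
```

Comparing every pair of entries is fine for a handful of positions; a full catalog would need a spatial index so that only nearby entries are compared.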

Decision trees are classifiers that use a set of properties to decide how to classify something. A decision tree is analogous to a dichotomous taxonomic key; both attempt to ask meaningful questions to quickly determine how to classify something. In a decision tree, the selection of these 'questions' is done by "[splitting] the data set on the basis of a parameter and a value." (Voisin, Donas 37) An ideal question would put all bent-double galaxies into one group and the remaining data into another group. How mixed the classes remain within each group is measured by entropy; a group containing only one class has an entropy of zero. For each split it makes, a decision tree attempts to choose the question that reduces entropy the most (Voisin, Donas 37). In order to measure the entropy of the questions asked, the decision tree must already know the correct answers. Clearly, this is unrealistic because, if we knew the classifications, we wouldn't be classifying the set to begin with. To estimate the entropy, questions are determined by considering a subset of the total data where the correct classification is already known. This is referred to as a training set. If there is any statistical bias in the training set, the tree will be biased in its classifications as well. With a good training set, a decision tree can classify objects with fairly high accuracy.
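Entropy can be made concrete with a short calculation. The sketch below uses the standard Shannon entropy and scores a candidate question (a parameter and a threshold value) by how much it lowers the entropy of the training set; the feature values and labels are invented, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for an even two-way mix."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy removed by asking 'is the value at most threshold?'."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical training set: one measured property per source, with known classes.
values = [1, 2, 8, 9]
labels = ["non-bent", "non-bent", "bent", "bent"]

good_split = information_gain(values, labels, threshold=5)
poor_split = information_gain(values, labels, threshold=1.5)
```

The threshold of 5 separates the two classes completely and removes all of the entropy, so a tree builder would prefer it to the threshold of 1.5, which leaves one group still mixed.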

To find bent-doubles, the scientists built decision trees to classify each radio source. The properties the scientists considered included area of the galaxies, peak fluxes in the source, the angles between major axes of the ellipses representing each galaxy, the eccentricity of these ellipses, and other considerations (Kamath 15). Scientists created a training set of 195 radio sources which were either classified as bent-doubles or not by FIRST scientists (Kamath 17). Multiple trees were built from this data, with each tree only considering some percent of the training set (Kamath 17). To classify radio sources, all the trees would classify the source as a bent double or not. All of the trees cast a vote and the majority classification was then chosen as the correct classification (Kamath 17). The scientists found that this method of using multiple decision trees was more accurate than using a single tree (Kamath 17). The search drew heavily on the correct choice of properties to consider. In fact, the scientists considered that "the results are much more dependent on the features [they considered] than on the particular classifier." (Kamath 16) This sort of search for objects with known properties is only one type of data mining.
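The voting step is simple to express in code. In the sketch below each "tree" is stood in for by a one-question rule; the feature names, thresholds, and example source are hypothetical and are not taken from the FIRST study.

```python
from collections import Counter

def majority_vote(trees, source):
    """Ask every tree for a classification and return the most common answer."""
    votes = Counter(tree(source) for tree in trees)
    return votes.most_common(1)[0][0]

# Three stand-in classifiers; each returns True for 'bent-double'.
trees = [
    lambda s: s["angle"] < 150,      # hypothetical rule on the opening angle
    lambda s: s["peak_flux"] > 2.0,  # hypothetical rule on peak flux
    lambda s: s["area"] > 30,        # hypothetical rule on area
]

source = {"angle": 120, "peak_flux": 1.0, "area": 45}
is_bent_double = majority_vote(trees, source)
```

Two of the three stand-in trees vote "bent-double" here, so the ensemble overrules the one dissenting tree. Using an odd number of trees avoids ties when there are only two classes.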

Another type of data mining used in astronomy is the search for statistically similar groups in the catalogs. Clustering algorithms try to group objects into sets that have similar properties. Scientists can then look for previously unknown relationships in the data. The properties of clustered objects "may be correlated, … some of these correlations may be already known and some as yet unknown." (Djorgovski, Mahabal, et al. 46) So clustering could reveal new relationships among celestial objects. Clustering can answer such questions as, "Are there any previously unknown classes of objects… are there rare outliers, … are there interesting correlations among properties of objects?" (Djorgovski, Mahabal, et al. 50) By performing clustering, scientists can consider why the various relationships among different objects exist, or consider outliers, which could represent previously unknown, unique objects. An example of this was the discovery of high-redshift (high-z) quasars in both the SDSS and DPOSS catalogs (Djorgovski, Mahabal, et al. 55). This method of clustering objects could be a tremendous resource for scientists. In most catalogs, however, the information available covers only a few spectral bands; to be truly useful, multiple surveys at different wavelengths have to be considered.
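The outlier side of this idea can be sketched very simply. The example below is not a clustering algorithm and is not the method used in the quasar searches; it substitutes a basic robust statistic (distance from the median, measured in units of the median absolute deviation) to flag an object whose single measured property sits far from the bulk of the catalog. The color values and the cutoff are invented.

```python
import statistics

def find_outliers(values, cutoff=5.0):
    """Return indices of values far from the median, in units of the MAD."""
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    mad = statistics.median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, d in enumerate(deviations) if d / mad > cutoff]

# Hypothetical color indices for six objects; the last one is unusual.
colors = [0.50, 0.60, 0.55, 0.52, 0.58, 3.00]
outliers = find_outliers(colors)
```

In practice the same question is asked in many dimensions at once, using hundreds of attributes per object, which is where genuine clustering methods become necessary.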

Cross Correlations

Since many scientists want to examine the same objects recorded from different surveys, there is a growing attempt to correlate objects in different catalogs. The problem of cross-correlating requires a reliable way to associate objects in different databases. One obvious criterion is spatial proximity; however, other statistical properties could be involved as well. One catalog of this sort is the NASA/IPAC Extragalactic Database (NED), which correlates sources from the 2MASS survey and several other surveys of extragalactic objects. NED scientists note that the problem of correlating objects across databases is difficult, because "observations taken with different telescopes and at various wavelengths often differ in substantial ways." (Mazzarella 21) Problems include variable resolution, differing positional coordinates, and changing sizes and appearances at different wavelengths (Mazzarella 20). To resolve these problems, NED uses statistical associations to correlate objects. These associations are updated as new data become available (Mazzarella 21). A similar service is offered by the Centre de Données de Strasbourg, whose SIMBAD program offers information on sources within our galaxy (Egret, Bonarel 166). Another correlation service is Vizier, which gives catalog information in tabular forms. Vizier currently accesses over 3,000 catalogs and databases (Genova, et al. 145). All of these services offer the user access to multiple surveys; however, the task of integrating all the catalogs into one source is far from over.
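Matching by spatial proximity alone can be sketched as a nearest-neighbor search within a fixed radius. This is a deliberately naive illustration, not the statistical association used by NED; the catalog contents, object names, and the two arc second radius are assumptions.

```python
import math

def separation_arcsec(a, b):
    """Angular separation between two (ra, dec) positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (*a, *b))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def cross_match(catalog_a, catalog_b, radius_arcsec):
    """For each object in catalog_a, find the closest catalog_b object in range."""
    matches = {}
    for name_a, pos_a in catalog_a.items():
        best = None
        for name_b, pos_b in catalog_b.items():
            sep = separation_arcsec(pos_a, pos_b)
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            matches[name_a] = best[0]
    return matches

# Two hypothetical catalogs keyed by object name, positions in degrees.
optical = {"opt1": (150.0, 2.0), "opt2": (150.1, 2.1)}
infrared = {"ir1": (150.0002, 2.0001), "ir2": (151.0, 3.0)}

matches = cross_match(optical, infrared, radius_arcsec=2.0)
```

The difficulties described above show up directly in the choice of radius: too small and real counterparts are missed when surveys differ in resolution, too large and unrelated neighbors are matched.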

One new project that aims to unify the many catalogs is the National Virtual Observatory. "The NVO would be a 'Rosetta Stone;' linking the archival data sets of space and ground-based observatories, the catalogs of multi-wavelength surveys, and the computational resources necessary to support comparison and cross-correlation among these resources." (Szalay website) The NVO is envisioned as a way to integrate both existing catalogs and new ones (Szalay website). The coordinators of the NVO feel that it could be used to address many problems, such as matching infrared data with visible spectra to search for gravitational lenses, searching for rare objects, using X-ray imaging to search for Active Galactic Nuclei (AGNs), searching for extra-solar planets, and providing data for models of galactic interactions (Szalay website). The search for AGNs is particularly interesting because it requires several different wavelengths. The NVO has made this project a keystone of its research to demonstrate the benefits of a virtual observatory (Szalay website). In order to offer these types of services, the NVO must overcome several hurdles.

One of the challenges the NVO must handle is the unification of data from widely different sources. Most surveys have their own types of data and formats for transmitting them. To overcome this problem, the NVO will use XML as its primary means of data transmission. XML is a metadata language in which the data description is embedded in the same document as the data itself. XML is associated with technologies that make it easier to transform to other useful formats (Szalay website). The NVO also plans to use an image format called FITS since it is widely used in astronomy (Szalay website). Another problem is that, with such large datasets, downloading information will be very time consuming. To alleviate this problem, computer programs could be sent to the data instead of vice versa. The NVO is trying to make this possible by ensuring "agent code portability." (Szalay website) Since the NVO will be responsible for a large amount of work, it must have tremendous computing power to handle these computing demands (Szalay website). All of these issues must be addressed by the NVO in some manner. Regardless of these hurdles, the NVO offers extraordinary potential for integrating astronomical studies of our universe.
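What it means for the data description to travel with the data can be seen in a small example. The XML document below is invented for illustration and is not an NVO format; the element and attribute names are assumptions. It is read with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# A hypothetical self-describing catalog fragment: the <field> elements
# describe each column and its unit, and the <object> elements carry the data.
document = """
<catalog survey="EXAMPLE">
  <field name="ra" unit="deg"/>
  <field name="dec" unit="deg"/>
  <field name="flux" unit="mJy"/>
  <object><ra>150.0</ra><dec>2.0</dec><flux>3.2</flux></object>
  <object><ra>150.1</ra><dec>2.1</dec><flux>0.9</flux></object>
</catalog>
"""

root = ET.fromstring(document)

# Read the description first, then the data it describes.
units = {f.get("name"): f.get("unit") for f in root.findall("field")}
objects = [{child.tag: float(child.text) for child in obj}
           for obj in root.findall("object")]
```

A program that has never seen this survey can still discover that the flux column is in millijanskys, because the units arrive in the same document as the values.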

Conclusions

Modern astronomy is gathering more information than at any other period of its history. The ability to analyze and draw conclusions from this massive data will be an ever-constant challenge for astronomers. However, the use of computers to automate this task can greatly improve the success of astronomers in comprehending this data. Many of the crucial tools are already in place. The transformation of raw telescope data into catalogs extracts useful information from the large areas of noise in the original data set. This transformation also creates a record that can be analyzed by other computer programs. The steps in transforming raw data to catalogs are demonstrated by the NEXT program and the SDSS. The accuracy of this transformation is crucial since it is the basis of all other work done. The next step involves using knowledge discovery techniques such as classifying and grouping to search for new types of phenomena and to discover statistical correlations among groups of objects. From these steps, data mining techniques can provide astronomers with new information by finding rare events, locating candidates for further research, and by discovering unknown statistical relations. This type of exploration will enable scientists to study a much larger set of data than they could examine otherwise. Further gains can be made by combining data from multiple wavelengths and independent surveys. Several tools such as NED, SIMBAD, and Vizier make comparison possible today, while future astronomers will have the NVO as a resource for studying the sky. With all of these tools in place, or being developed, the time is ripe for extensive use of computers in the extraction of useful knowledge from these huge data sets. In the future, a data mining resource such as the NVO will provide a much deeper knowledge of our universe, and allow us to fully utilize those fractions of the sky that we can capture. Data mining will likely help to close the gap between the data we record and the knowledge contained therein.

MATTHEW WOODARD is a software developer living in Chicago. He graduated from the University of Wisconsin with a BA in computer science in 2002. This paper was written as an honors project for Astronomy 100. Matthew can be reached at themattwoodard@hotmail.com.

Works Cited

Borne, Kirk D. Data Mining in Astronomical Databases. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc (April 2002).

Brunner, Robert J. Panchromatic Mining for Quasars: An NVO Keystone Science Application. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

Djorgovski, S. G., A. A. Mahabal, et al. Searches for Rare and New Types of Objects. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc (April 2002).

Djorgovski, S. G., Robert J Brunner. Digital Sky Surveys: Software Tools and Technologies. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc (April 2002).

Egret, Daniel. et al. A Global Perspective on Astronomical Data and Information: the Strasbourg Astronomical Data Center (CDS). Information and Online Data in Astronomy. Eds. Egret, Daniel. Albrecht, Miguel A. Kluwer Academic Publishers, 1995.

Genova, François, et al. Information Integration and Retrieval: the CDS Hub. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

Kamath, Chandrika, et al. Using Data Mining to Find Bent-Double Radio Galaxies in the FIRST Survey. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

Mazzarella, Joseph M. et al. Capabilities of the NASA/IPAC Extragalactic Database in the Era of a Global Virtual Observatory. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

Page, C.G., M. Denby. The ROSAT Extreme Ultra-Violet Sky Survey. Astronomical Data Analysis Software and Systems I. Eds. Diana M. Worrall, Chris Biemesderfer, Jeanette Barnes. Astronomical Society of the Pacific, 1992.

Russell, Stuart, Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

Szalay, Alexander S. The National Virtual Observatory Website. http://www.us-vo.org/docs/nvo-proj.html (April 15, 2002). Astronomical Society of the Pacific, 2001.

Szalay, Alexander S., Jim Gray, et al. The SDSS SkyServer — Public Access to the Sloan Digital Sky Server Data. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc (April 2002).

Tagliaferri, Roberto. Giuseppe Longo, Gerardo Iovane. Extraction of Catalogs from Astronomical Images. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

Voisin, Bruno. José Donas. Data Mining for Multi-wavelength Cross-referencing. Astronomical Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International Society for Optical Engineering, 2001.

 
 

Copyright ©2001-2008 Astronomical Society of the Pacific