Introduction
The
sky has always offered a wealth of information to observers. Every
night the stories of countless stars, galaxies, quasars, planets,
and other phenomena appear in the sky. The rich history of our
universe has been playing itself out for human eyes since the
beginning of history. However, never before in human history have
scientists had the ability to capture so much information about
the universe. As time progresses, our records of the sky grow
continuously larger. The amount of data in terms of pixels captured
doubles every year (Szalay website). The growth is now so fast
that data accumulation far exceeds the ability of humans to consider
it all. The introduction of the computer and magnetic storage
devices makes possible the storage of terabytes (2^40 bytes) of
information with petabytes (2^50 bytes) of information on the
way (Djorgovski, Mahabal, et al. 52). And while the computer is
widening the gap between information capture and study, it
also offers an opportunity to make better use of the data. The
computer's ability to execute algorithms rapidly allows it to
examine the huge amounts of astronomical data available.
By using data analysis techniques such as "knowledge discovery,"
astronomers can use computers to find new phenomena, relationships,
and useful knowledge about the universe, and ultimately reduce
the gap between data capture and analysis.
The
successful use of computers to analyze data, however, requires
many steps. First, raw data is transformed into a catalog that
records positions, fluxes, and shapes of celestial objects. This
method of cataloging information has the advantages of reducing
data sizes, retaining useful information, and presenting the data
in a way that is accessible for further analysis by computers.
After the catalog is created it must be stored so it can be studied.
Currently catalogs are maintained as databases. Finally, knowledge
must be extracted from the catalog either by human observation,
or by using computerized tools to assist in the discovery of meaningful
knowledge. The ability to consider multiple surveys at different
wavelengths simultaneously offers another frontier to which computers
can be applied. Currently there are tools available for considering
data from multiple surveys, and future astronomers can look forward
to new seamless ways of examining surveys. This sort of network,
in which multiple data sets are presented as a single source,
is the idea behind the National Virtual Observatory, which aims
to unify all astronomical catalogs through one interface. By using
computers to explore digital surveys of the sky, astronomers can
effectively mine the huge amount of data they are accruing, and
ultimately make new and interesting discoveries.
Digital Catalogs
Digital
catalogs of the sky, also called surveys, are analogous to non-digital
records of the sky. A digital catalog lists the positions of objects
in the sky, as well as several features of each object. If a
star or galaxy is well known, its name may also be included. The
Sloan Digital Sky Survey, for example, extracts "400 attributes
for each celestial object." (Szalay, Gray, et al. 1) The
primary information is the flux of an object, measured in five
bands: ultraviolet, green, red, and two near-infrared bands (Szalay, Gray, et al.
5). This information is then made available for exploration by
scientists. The main difference between a digital and a traditional
catalog of the sky is that traditional catalogs are compiled by
humans who observe the sky, whereas digital catalogs are made
by computers that observe images of the sky. The major
catalogs include the Faint Images of the Radio Sky at Twenty centimeters
(FIRST), the Digitized Palomar Observatory Sky Survey (DPOSS), the Two Micron
All Sky Survey (2MASS), and the Sloan Digital Sky Survey (SDSS).
While most catalogs record only images of the sky, there are spectroscopic
surveys as well (Djorgovski, Brunner 1). Most catalogs are produced
from sky surveys that take pictures of the sky. If complete photographic
records of the sky already exist, why would scientists want to
make digital catalogs of the information?
Most
digital catalogs represent the sky as a series of objects (stars
or galaxies) or entries in a database, even though the original
sources for the catalogs were images. This format offers several
advantages for scientists. First, digital catalogs of the sky
are much smaller than the raw image input. For example, the raw
images from the FIRST survey consume 250 gigabytes (250 * 2^30
bytes) of disk space while the corresponding catalog takes only
78 megabytes of storage (78 * 2^20 bytes), a sizeable difference
(Kamath et al. 14). Another advantage of the cataloging process
is that it presents only interesting information about the sky.
"[The images of the FIRST survey] are mainly noise, with
very few 'interesting' pixels corresponding to the radio sources…On
the other hand we have the 78 Megabyte catalog, where each entry
contains information on only a part of a radio source." (Kamath
et al. 14) The catalog in essence tries to reduce the information
to only those areas that are relevant sources of information.
This transformation has another useful result — the data
is now in a format where a computer can easily compare multiple
sources, allowing the use of methods like data mining to assist
research. In order to create these useful catalogs, scientists
must create algorithms to extract useful information from the
original images.
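The size reduction quoted above can be checked with a few lines of arithmetic. This is only a worked example using the two figures given in the text; the variable names are illustrative.

```python
# Sizes quoted in the text for the FIRST survey, in binary units.
raw_images_bytes = 250 * 2**30   # 250 gigabytes of raw images
catalog_bytes = 78 * 2**20       # 78 megabytes of catalog entries

# How many times smaller the catalog is than the images it came from.
reduction_factor = raw_images_bytes / catalog_bytes
print(f"The catalog is about {reduction_factor:,.0f} times smaller than the images.")
```

In other words, the catalog is smaller than the raw images by a factor of roughly 3,300.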
The
first step in cataloging the sky is acquiring the data to process.
Currently, data is being collected in two ways, by scanning old
photographic plates, and by using instruments that pipe their
data directly into computers. The Digitized Palomar Observatory Sky
Survey (DPOSS) uses plates exposed at the Mount Palomar Schmidt Telescope.
Each plate covered 36 degrees of the night sky and there were
three exposures in blue, red, and near infrared. These images
were scanned and resolved into pixelated images with each pixel
representing a one-arcsecond square (Djorgovski, Gal, et al. 3).
A more direct tack was taken for the Sloan Digital Sky Survey. In
this project the images were obtained by using hardware that measured
fluxes at five different colors directly from the telescope. To
properly form the image, the SDSS telescope considered each section
of the sky on two separate nights, forming a composite image.
The SDSS also obtained spectrograms; however, there are only fifty
thousand spectrograms for fourteen million objects (Szalay, Gray,
et al. 1). The DPOSS and SDSS represent two Earth-based means
of recording surveys, but satellite sources are also used. The
Roentgen Satellite (ROSAT), for example, has produced a survey
with information on the X-ray sources in the sky (Page, Denby
351). Although all of these surveys use different data gathering
techniques, they all result in raw pixelated images, which are
usually processed further to create the catalog.
The
next step in constructing a catalog is to convert the raw pixel
images to catalog entries. To perform the transformation, computers
need to be able to distinguish celestial objects from each other,
from Earth-based objects such as planes, and from the surrounding
sky. These programs have to perform several steps. The first step
is to decide which pixels are relevant stars or galaxies, and
which are noise. One way to detect objects is to "require
a certain minimum on adjacent pixels above some signal to noise
threshold for detection." (Djorgovski, Brunner 2) After locating
significant pixels, the programs must group the pixels into objects
while making sure that they separate or "deblend" overlapping
sources. (Djorgovski, Brunner 2) Finally, the program should record
the relevant modeling information from each object and possibly
classify it as either a star or a galaxy (Djorgovski, Brunner
1). To be useful in practice, this classification
needs accuracies of "better than 90%," but "accuracies
higher than 95% are generally hard to achieve." (Djorgovski,
Brunner 3) Several different approaches exist for implementing
this procedure.
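The first two steps described above, thresholding and grouping adjacent pixels, can be sketched in a few lines. This is a minimal illustration and not the method of any particular survey: the function name, the fixed threshold, the 4-connected neighbourhood, and the toy image are all assumptions, and no deblending is attempted.

```python
from collections import deque

def detect_objects(image, threshold, min_pixels):
    """Group above-threshold pixels into objects of at least min_pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Flood fill outward from this pixel over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Require a minimum number of adjacent bright pixels for a detection.
            if len(pixels) >= min_pixels:
                objects.append(pixels)
    return objects

# Toy image: one four-pixel source and one isolated bright pixel (noise).
image = [
    [1, 1, 1, 1, 1, 1],
    [1, 9, 8, 1, 1, 1],
    [1, 7, 9, 1, 1, 6],
    [1, 1, 1, 1, 1, 1],
]
found = detect_objects(image, threshold=5, min_pixels=3)
print(len(found), "object(s) detected")
```

The isolated bright pixel is rejected because it has no bright neighbours, which is the point of requiring a minimum number of adjacent pixels.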
An
example of this process is The Neural Extractor (NEXT), which
uses an artificial neural network to classify objects in the sky
(Tagliaferri, Longo, Iovane 107). An artificial neural network
is a computer algorithm that uses an extremely simple model of
the brain to correctly classify information. The neural network
is composed of sets of nodes interconnected to each other by weighted
links. Every time information is passed between the nodes, the
signal is amplified or suppressed by the strength of the link
between the two nodes. By adjusting the strength of the links,
the network can learn to produce correct output (Russell, Norvig
567). NEXT uses several neural networks to detect objects. First,
pixels are labeled as either background or as significant pixels;
this is done to remove noise from the data. Each pixel is considered
along with adjacent pixels "since the attribution of a pixel
to either the 'background' or the 'object' classes depends on
the pixel value and the values of the adjacent pixels." (Tagliaferri,
Longo, Iovane 108) The procedure continues by grouping adjacent
pixels into objects. After determining objects, NEXT deblends
overlapping objects into their own unique objects. Deblending
is done by separating each peak in light magnitude into its own
object. Finally NEXT extracts the information for each object
and attempts to classify it as either a galaxy or a star by using
another neural network. The steps taken by NEXT are similar to
the steps taken by other classifiers.
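The idea of weighted links that are adjusted until the output is correct can be shown with the smallest possible network, a single artificial neuron trained with the classic perceptron rule. This is only a sketch and not NEXT's actual architecture: the two inputs (a pixel's brightness and the mean brightness of its neighbours, both scaled to the range 0 to 1) and the training values are invented for illustration.

```python
def predict(weights, bias, inputs):
    """One artificial neuron: a weighted sum of the inputs, then a step function."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(samples, learning_rate=0.1, max_epochs=1000):
    """Adjust the link weights whenever the neuron gives a wrong answer."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, label in samples:
            error = label - predict(weights, bias, inputs)
            if error:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every training sample is now classified correctly
            break
    return weights, bias

# (pixel value, mean of adjacent pixels) -> 1 for "object", 0 for "background".
# Hypothetical values; the bright pixel with dark neighbours is treated as noise.
samples = [
    ((0.9, 0.8), 1), ((0.8, 0.7), 1), ((0.7, 0.9), 1),
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.2), 0), ((0.9, 0.1), 0),
]
weights, bias = train(samples)
for inputs, label in samples:
    print(inputs, "->", predict(weights, bias, inputs), "(expected", label, ")")
```

A real network chains many such neurons together, but the learning principle is the same: the weights are nudged after each mistake until the training examples are reproduced.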
The
SkyServer project at Fermilab, which catalogs the SDSS, uses
a similar set of steps to classify objects in the sky. The raw
input for the SkyServer is a set of color magnitude readings
in five wavelength bands from sensors on the telescope. Some spectroscopic
readings are done as well (Szalay, Gray 5). Because many stars
and galaxies overlap each other the individual objects must be
deblended. Roughly 80% of the objects are singular; the rest are
deblended (Szalay, Gray 4). The SDSS is a five-year project to
be finished in 2006. After the catalog has been captured, it must
be stored in a database so the information can be accessed.
Storage
Storage
of catalogs has been done primarily on traditional database management
systems; however, to improve the retrieval speed of the databases,
astronomers have employed several new techniques. Most of the
catalogs are stored in relational databases, which hold the information
as tables, because many existing software packages can
handle this sort of data (Djorgovski, Brunner 4). One change
that has simplified selecting regions of the sky is the introduction
of the HTM system of positioning. HTM, or Hierarchical Triangular
Mesh, is based on splitting the celestial sphere into 8 parts
by quartering the hemispheres. Each section can be thought of
as a triangle projected onto the celestial globe. These triangles
are then further subdivided by connecting the midpoints of the
triangle, producing four new regions, which in turn are subdivided.
This scheme simplifies representations of the sky and reduces the time
taken to query any one section (Brunner 6). The naming of the
sections reflects this hierarchy of divisions on the celestial
sphere. The base regions for the Northern hemisphere are N0, N1,
N2, N3. Successive selections are represented by adding digits
to specify the next triangle to select. To select the lower-left
triangle of the N0 region one would simply ask for N01 (Brunner
6). The HTM system exists for ease of data
retrieval by the computer. When the data is grouped by HTM coordinates,
objects near each other on the celestial sphere will be near each
other on the computer disk, speeding up searches for objects within
certain angles of each other (Szalay, Gray 7). Another common
approach is to store objects' positions as Cartesian coordinates
on a unit sphere since computers work faster with Cartesian than
spherical coordinate systems. With these modifications, queries
of sky surveys become much simpler to implement (Brunner 6). Most
systems allow users to enter right ascension and declination to
find objects, so the HTM system is fairly transparent to astronomers.
After the data has been stored, it will need to be explored to
discover relevant information.
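The recursive naming scheme can be sketched as follows. The code converts a right ascension and declination to a Cartesian unit vector, finds which of the eight base triangles contains it, and then repeatedly picks one of four child triangles formed from the edge midpoints. The vertex ordering and child numbering used here follow the published HTM description as best recalled and should be treated as an assumption; a production index may differ in detail.

```python
import math

def radec_to_xyz(ra_deg, dec_deg):
    """Convert right ascension and declination (degrees) to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def midpoint(a, b):
    """Midpoint of two unit vectors, pushed back onto the unit sphere."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    norm = math.sqrt(dot(s, s))
    return (s[0] / norm, s[1] / norm, s[2] / norm)

def contains(triangle, p, eps=-1e-12):
    """True if p lies on the inner side of all three edges of the triangle."""
    a, b, c = triangle
    return (dot(cross(a, b), p) >= eps and dot(cross(b, c), p) >= eps
            and dot(cross(c, a), p) >= eps)

# Six octahedron vertices and the eight base triangles (assumed ordering).
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
ROOTS = {
    "S0": (V[1], V[5], V[2]), "S1": (V[2], V[5], V[3]),
    "S2": (V[3], V[5], V[4]), "S3": (V[4], V[5], V[1]),
    "N0": (V[1], V[0], V[4]), "N1": (V[4], V[0], V[3]),
    "N2": (V[3], V[0], V[2]), "N3": (V[2], V[0], V[1]),
}

def htm_name(ra_deg, dec_deg, depth):
    """Name of the triangle containing the position after `depth` subdivisions."""
    p = radec_to_xyz(ra_deg, dec_deg)
    name, tri = next((n, t) for n, t in ROOTS.items() if contains(t, p))
    for _ in range(depth):
        v0, v1, v2 = tri
        w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        for digit, child in enumerate(children):
            if contains(child, p):
                name, tri = name + str(digit), child
                break
    return name

print(htm_name(45.0, 35.0, 3))
print(htm_name(10.0, 89.0, 1))
```

Because every extra digit narrows the region by a factor of four, two objects whose names share a long prefix are necessarily close together on the sky, which is what makes the scheme useful for disk layout and region queries.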
Once
the data has been cataloged, the most important phase begins,
namely using the data in the catalog to obtain new astronomical
knowledge. Of course, the data can be studied by astronomers unaided
by computers; however, the digital format of the data makes it
possible for computers to automate the search to assist in knowledge
discovery. This process of using computers to extract useful information
from a database is called "knowledge discovery," or
simply data mining. Data mining can be described as "an information
extraction activity whose goal is to discover hidden facts contained
in databases." (Borne 1) Common tasks in astronomical data
mining are using known properties to find a specific type of object,
finding new phenomena by clustering objects and considering the
outliers, using predicted properties of a theoretical model to
search for candidates, and searches for "one of a kind"
events (Borne 2). Many of these techniques have been used to direct
research in astronomy.
An
example of using the data mining technique is the search for bent-double
radio galaxies at the Lawrence Livermore National Laboratory.
Scientists used the FIRST catalog of radio emissions at twenty
centimeters to search for bent-double galaxies. To find these
galaxies, scientists first grouped catalog entries into 'radio
sources' that represented sets of galaxies where all members were
at most 0.96 arc minutes from their closest neighbors (Kamath
14). Scientists discarded all one-galaxy sets since they could
not be bent doubles, and then removed groups with four or more
galaxies for individual consideration, since groups with more
than three galaxies "are 'interesting' to the astronomers,
regardless of whether they are bent-doubles or not." (Kamath
15) The scientists then used a "decision tree" to classify
the galaxies as bent doubles or not.
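The grouping and triage steps described above can be sketched as a simple linking procedure. This is an approximation for illustration, not the Livermore team's code: positions are given as flat (x, y) offsets in arc minutes rather than true sky coordinates, entries are linked whenever they lie within the quoted 0.96 arc minutes of each other, and the function names and sample positions are invented.

```python
import math

LINK_ARCMIN = 0.96  # linking distance quoted in the text

def group_entries(positions, link=LINK_ARCMIN):
    """Link entries closer than `link`; return the groups as lists of indices."""
    parent = list(range(len(positions)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= link:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(positions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def triage(groups):
    """Drop singles, keep groups of two or three, set larger groups aside."""
    candidates = [g for g in groups if 2 <= len(g) <= 3]
    set_aside = [g for g in groups if len(g) >= 4]
    return candidates, set_aside

# Hypothetical catalog positions, in arc minutes on a small flat patch of sky.
positions = [
    (0.0, 0.0), (0.5, 0.0), (1.0, 0.2),                   # a group of three
    (10.0, 10.0),                                         # an isolated entry
    (20.0, 20.0), (20.3, 20.0), (20.6, 20.0), (20.9, 20.1),  # a group of four
    (30.0, 30.0), (30.4, 30.2),                           # a pair
]
candidates, set_aside = triage(group_entries(positions))
print("to classify:", candidates)
print("for individual inspection:", set_aside)
```

Only the groups of two and three entries would go on to the classifier; the larger group is kept for individual consideration and the isolated entry is dropped.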
Decision trees are classifiers that use a set of properties to
decide how to classify something. A decision tree is analogous
to a dichotomous taxonomic key; both attempt to ask meaningful questions
to quickly determine how to classify something. In a decision
tree, the selection of these 'questions' is done by "[splitting]
the data set on the basis of a parameter and a value." (Voisin,
Donas 37) An ideal question would put all bent-double galaxies
into one group and the remaining data into another group. How
mixed the resulting groups are is measured by entropy, which is
zero when every group contains a single class. A decision tree
attempts to minimize the entropy left after each split it makes
(Voisin, Donas 37). In order to measure the entropy of the questions
asked, the decision tree must already know the correct answers. Clearly,
this is unrealistic because, if we knew the classifications, we
wouldn't be classifying the set to begin with. To estimate the
entropy, questions are determined by considering a subset of the
total data where the correct classification is already known.
This is referred to as a training set. If there is any statistical
bias in the training set, the tree will be biased in its classifications
as well. With a good training set, a decision tree can classify
objects with fairly high accuracy.
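As a worked illustration of how a 'question' is chosen from a training set, the sketch below computes the Shannon entropy of a set of labels and searches for the threshold on a single feature that most reduces it (the usual information-gain criterion). The feature, its values, and the function names are hypothetical; a real tree would repeat this search over many features and then recurse on each group.

```python
import math

def entropy(labels):
    """Shannon entropy in bits; zero when every label is the same."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result

def best_split(values, labels):
    """Return (threshold, gain) for the question 'value <= threshold?' that
    removes the most entropy from the labelled training set."""
    parent = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        remaining = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = parent - remaining
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical training set: one feature per source and a known classification.
angles = [10, 20, 30, 150, 160, 170]
is_bent_double = [True, True, True, False, False, False]
threshold, gain = best_split(angles, is_bent_double)
print(f"ask: angle <= {threshold}?  (information gain {gain:.2f} bits)")
```

In this toy set a single question separates the two classes completely, so all of the original entropy is removed; real data rarely splits so cleanly, which is why trees ask a sequence of questions.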
To
find bent-doubles, the scientists built decision trees to classify
each radio source. The properties the scientists considered included
area of the galaxies, peak fluxes in the source, the angles between
major axes of the ellipses representing each galaxy, the eccentricity
of these ellipses, and other considerations (Kamath 15). Scientists
created a training set of 195 radio sources, each of which FIRST
scientists had classified as a bent-double or not (Kamath
17). Multiple trees were built from this data, with each tree
considering only a portion of the training set (Kamath 17).
To classify a radio source, every tree would label the source
as a bent double or not. Each tree thus cast a vote, and the
majority classification was then chosen as the correct classification
(Kamath 17). The scientists found that this method of using multiple
decision trees was more accurate than using a single tree (Kamath
17). The search drew heavily on the correct choice of properties
to consider. In fact, the scientists considered that "the
results are much more dependent on the features [they considered]
than on the particular classifier." (Kamath 16) This sort
of search for objects with known properties is only one type of
data mining.
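The voting scheme can be sketched with a deliberately simplified stand-in for each tree. Here every "tree" is a single threshold learned from a random portion of the training set, and the final answer is the majority vote. The feature values, the 70 percent sampling fraction, and the assumption that bent-doubles have the lower feature values are all invented for illustration and are not taken from the study.

```python
import random
from collections import Counter

def train_stump(samples):
    """Stand-in for a full decision tree: one threshold on one feature.
    Assumes (for this toy data only) that bent-doubles have the lower values."""
    positives = [x for x, is_bent in samples if is_bent]
    negatives = [x for x, is_bent in samples if not is_bent]
    threshold = (max(positives) + min(negatives)) / 2
    return lambda x: x <= threshold

def train_ensemble(samples, n_trees, fraction, seed=0):
    """Train each classifier on a different random portion of the training set."""
    rng = random.Random(seed)
    size = int(len(samples) * fraction)
    return [train_stump(rng.sample(samples, size)) for _ in range(n_trees)]

def majority_vote(ensemble, x):
    """Every classifier casts one vote; the most common answer wins."""
    votes = Counter(tree(x) for tree in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (feature value, classified as bent-double?).
training = [(10, True), (15, True), (20, True), (25, True), (30, True),
            (150, False), (155, False), (160, False), (165, False), (170, False)]
ensemble = train_ensemble(training, n_trees=5, fraction=0.7)
print(majority_vote(ensemble, 40))
print(majority_vote(ensemble, 140))
```

Using an odd number of voters avoids ties, and because each classifier sees a slightly different sample, an unlucky split in one of them can be outvoted by the rest.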
Another
type of data mining used in astronomy is the search for statistically
similar groups in the catalogs. Clustering algorithms try to group
objects into sets that have similar properties. Scientists can
then look for previously unknown relationships in the data. The
properties examined by clustering "may be correlated, … some
of these correlations may be already known and some as yet unknown."
(Djorgovski, Mahabal, et al. 46) So clustering could reveal new
relationships among celestial objects. Clustering can answer such
questions as, "Are there any previously unknown classes of
objects… are there rare outliers, … are there interesting correlations
among properties of objects?" (Djorgovski, Mahabal, et al.
50) By performing clustering, scientists can consider why the
various relationships among different objects exist, or consider
outliers which could represent previously unknown, unique objects.
An example of this was the discovery of high-redshift (high-z) quasars in both
the SDSS and DPOSS catalogs (Djorgovski, Mahabal, et al. 55).
This method of clustering objects could be a tremendous resource
for scientists. In most catalogs, however, the information available
covers only a few spectral bands; to be truly useful, multiple
surveys at different wavelengths have to be considered.
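A minimal sketch of clustering followed by a search for outliers is given below. It uses the textbook k-means procedure, which is only one of many clustering algorithms and is not necessarily the one used in the cited searches. The two "color" measurements per object, the starting centroids, and the rule of flagging anything farther than three times the median distance from its cluster center are assumptions chosen for illustration.

```python
import math
from statistics import median

def kmeans(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, then move each centroid to
    the mean of its members; repeat until nothing changes."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def find_outliers(centroids, clusters, factor=3.0):
    """Flag points much farther from their cluster center than is typical."""
    distances = [(math.dist(p, c), p)
                 for c, members in zip(centroids, clusters) for p in members]
    cutoff = factor * median(d for d, _ in distances)
    return [p for d, p in distances if d > cutoff]

# Hypothetical objects described by two color measurements each.
points = [
    (0.20, 0.30), (0.22, 0.28), (0.18, 0.33), (0.21, 0.31),   # one population
    (1.00, 0.90), (1.02, 0.88), (0.98, 0.93), (1.01, 0.91),   # another population
    (0.30, 2.50),                                             # an unusual object
]
centroids, clusters = kmeans(points, centroids=[points[0], points[4]])
print("outliers worth a closer look:", find_outliers(centroids, clusters))
```

The flagged object is the kind of candidate an astronomer would then examine individually, since it belongs to neither of the well-populated groups.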
Since many scientists want to examine the same objects recorded
from different surveys, there is a growing attempt to correlate
objects in different catalogs. The problem of cross-correlating
requires a reliable way to associate objects in different databases.
One obvious criterion is spatial proximity; however, other statistical
properties could be involved as well. One catalog of this sort
is the NASA/IPAC Extragalactic Database (NED), which correlates
sources from the 2MASS survey and several other surveys of extragalactic
objects. NED scientists note that the problem of correlating objects
across databases is difficult, because "observations taken
with different telescopes and at various wavelengths often differ
in substantial ways." (Mazzarella 21) Problems include variable
resolution, differing positional coordinates, and changing sizes/appearances
due to different wavelengths (Mazzarella 20). To resolve these
problems, NED uses statistical associations to correlate objects.
These associations are updated as new data are discovered (Mazzarella
21). A similar service is offered by the Centre de Données de
Strasbourg, whose SIMBAD program offers information on sources
within our galaxy (Egret, Bonarel 166). Another correlation service
is Vizier, which gives catalog information in tabular forms. Vizier
currently accesses over 3,000 catalogs/databases (Genova, et al.
145). All of these services offer the user access to multiple
surveys; however, the task of integrating all the catalogs into
one source is far from over.
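The simplest association criterion, spatial proximity, can be sketched as a nearest-neighbour match within a fixed radius. This is a bare-bones illustration and not how NED, SIMBAD, or Vizier actually associate sources; the catalog names, positions, and the two-arcsecond radius are invented, and a real service would also weigh positional uncertainties and other properties.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

def cross_match(catalog_a, catalog_b, radius_arcsec):
    """For each source in catalog_a, find the closest catalog_b source
    within the match radius, if there is one."""
    matches = {}
    for name_a, (ra_a, dec_a) in catalog_a.items():
        best = None
        for name_b, (ra_b, dec_b) in catalog_b.items():
            sep = angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best:
            matches[name_a] = best[0]
    return matches

# Hypothetical entries: name -> (right ascension, declination) in degrees.
optical = {"opt-1": (150.0000, 2.0000), "opt-2": (150.1000, 2.1000)}
radio = {"rad-1": (150.0003, 2.0002), "rad-2": (151.0000, 3.0000)}
print(cross_match(optical, radio, radius_arcsec=2.0))
```

Comparing every pair of sources is far too slow for catalogs with millions of entries, which is one reason spatial indexes are used to narrow the search to nearby regions first.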
One
new project that aims to unify the many catalogs is the National
Virtual Observatory. "The NVO would be a 'Rosetta Stone;'
linking the archival data sets of space and ground-based observatories,
the catalogs of multi-wavelength surveys, and the computational
resources necessary to support comparison and cross-correlation
among these resources." (Szalay website) The NVO is envisioned
as both a way to integrate old catalogs and to integrate new ones
as well (Szalay website). The coordinators of the NVO feel that
it could be used to address many problems, such as matching infrared
data with visible spectra to search for gravitational lenses,
searching for rare objects, using X-ray imaging to search for
Active Galactic Nuclei (AGNs), searching for extra-solar planets, and
providing data for models of galactic interactions (Szalay website).
The search for AGNs is particularly interesting because it requires
several different wavelengths. The NVO has made this project a
keystone of its research to demonstrate the benefits of a virtual
observatory (Szalay website). In order to offer these types of
services, the NVO must overcome several hurdles.
One
of the challenges the NVO must handle is the unification of data
from widely different sources. Most surveys have their own types
of data and formats for transmitting them. To overcome this problem,
the NVO will use XML as its primary means of data transmission.
XML is a markup language in which the data description is embedded
in the same document as the data itself. XML is associated with
technologies that make it easier to transform to other useful
formats (Szalay website). The NVO also plans to use an image format
called FITS since it is widely used in astronomy (Szalay website).
Another problem is that, with such large datasets, downloading
information will be very time consuming. To alleviate this problem,
computer programs could be sent to the data instead of vice versa.
The NVO is trying to make this possible by ensuring "agent
code portability." (Szalay website) Since the NVO will be
responsible for a large amount of work, it must have tremendous
computing power to meet these demands (Szalay website).
All of these issues must be addressed by the NVO in some manner.
Regardless of these hurdles, the NVO offers extraordinary potential
for integrating astronomical studies of our universe.
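To show what it means for the description of the data to travel in the same document as the data, the sketch below parses a small XML fragment with Python's standard library. The element and attribute names are invented for this example and are not the NVO's actual format; the point is only that the field names and units can be read from the document itself instead of being known in advance.

```python
import xml.etree.ElementTree as ET

# A hypothetical, self-describing catalog fragment (not a real NVO schema).
document = """
<catalog survey="EXAMPLE">
  <field name="ra" unit="deg"/>
  <field name="dec" unit="deg"/>
  <field name="flux" unit="mJy"/>
  <row><value>150.0003</value><value>2.0002</value><value>12.5</value></row>
  <row><value>151.2000</value><value>3.1000</value><value>4.1</value></row>
</catalog>
"""

root = ET.fromstring(document)

# The column names and units come from the document, not from the program.
fields = [(f.get("name"), f.get("unit")) for f in root.findall("field")]

# Pair each value with its field name to rebuild the table rows.
rows = [
    {name: float(v.text) for (name, _), v in zip(fields, row.findall("value"))}
    for row in root.findall("row")
]

for name, unit in fields:
    print(f"{name} [{unit}]")
print(rows)
```

A program written this way can read a catalog from a survey it has never seen before, provided the survey describes its columns in the agreed manner.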
Conclusions
Modern
astronomy is gathering more information than at any other period
of its history. The ability to analyze and draw conclusions from
this mass of data will be a constant challenge for astronomers.
However, the use of computers to automate this task can greatly
improve the success of astronomers in comprehending this data.
Many of the crucial tools are already in place. The transformation
of raw telescope data into catalogs extracts useful information
from the large areas of noise in the original data set. This transformation
also creates a record that can be analyzed by other computer programs.
The steps in transforming raw data to catalogs are demonstrated
by the NEXT program and the SDSS. The accuracy of this transformation
is crucial since it is the basis of all other work done. The next
step involves using knowledge discovery techniques such as classifying
and grouping to search for new types of phenomena and to discover
statistical correlations among groups of objects. From these steps,
data mining techniques can provide astronomers with new information
by finding rare events, locating candidates for further research,
and by discovering unknown statistical relations. This type of
exploration will enable scientists to study a much larger set
of data than they could examine otherwise. Further gains can be
made by combining data from multiple wavelengths and independent
surveys. Several tools such as NED, SIMBAD, and Vizier make comparison
possible today, while future astronomers will have the NVO as
a resource for studying the sky. With all of these tools in place,
or being developed, the time is ripe for extensive use of computers
in the extraction of useful knowledge from these huge data sets.
In the future, a data mining resource such as the NVO will provide
a much deeper knowledge of our universe, and allow us to fully
utilize those fractions of the sky that we can capture. Data mining
will likely help to close the gap between the data we record and
the knowledge contained therein.
MATTHEW
WOODARD is a software developer living in Chicago.
He graduated from the University of Wisconsin with a BA in
computer science in 2002. This paper was written as an honors
project for Astronomy 100. Matthew can be reached at themattwoodard@hotmail.com.
Works Cited
Borne, Kirk D. Data Mining in Astronomical
Databases. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc
(April 2002).
Brunner, Robert J. Panchromatic Mining for
Quasars: An NVO Keystone Science Application. Astronomical
Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc
Starck, Fionn D. Murtagh. International Society for Optical Engineering,
2001.
Djorgovski, S. G., A. A. Mahabal, et al. Searches
for Rare and New Types of Objects. NASA Astronomical Data
Center. http://adc.gsfc.nasa.gov/adc
(April 2002).
Djorgovski, S. G., Robert J Brunner. Digital
Sky Surveys: Software Tools and Technologies. NASA Astronomical
Data Center. http://adc.gsfc.nasa.gov/adc
(April 2002).
Egret, Daniel. et al. A Global Perspective
on Astronomical Data and Information: the Strasbourg Astronomical
Data Center (CDS). Information and Online Data in Astronomy.
Eds. Egret, Daniel. Albrecht, Miguel A. Kluwer Academic Publishers,
1995.
Genova, François, et al. Information Integration
and Retrieval: the CDS Hub. Astronomical Data Analysis. 2-3
August 2001 San Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh.
International Society for Optical Engineering, 2001.
Kamath, Chandrika, et al. Using Data Mining
to Find Bent-Double Radio Galaxies in the FIRST Survey. Astronomical
Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc
Starck, Fionn D. Murtagh. International Society for Optical Engineering,
2001.
Mazzarella, Joseph M. et al. Capabilities of
the NASA/IPAC Extragalactic Database in the Era of a Global Virtual
Observatory. Astronomical Data Analysis. 2-3 August 2001 San
Diego USA. Chairs Jean-Luc Starck, Fionn D. Murtagh. International
Society for Optical Engineering, 2001.
Page, C.G., M. Denby. The ROSAT Extreme Ultra-Violet
Sky Survey. Astronomical Data Analysis Software and Systems
I. Eds. Diana M. Worrall, Chris Biemesderfer, Jeanette Barnes.
Astronomical Society of the Pacific, 1992.
Russell, Stuart, Peter Norvig. Artificial Intelligence:
A Modern Approach. Prentice Hall, 1995.
Szalay, Alexander S. The National Virtual Observatory
Website. http://www.us-vo.org/docs/nvo-proj.html
(April 15, 2002). Astronomical Society of the Pacific, 2001.
Szalay, Alexander S., Jim Gray, et al. The
SDSS SkyServer — Public Access to the Sloan Digital Sky
Server Data. NASA Astronomical Data Center. http://adc.gsfc.nasa.gov/adc
(April 2002).
Tagliaferri, Roberto. Giuseppe Longo, Gerardo
Iovane. Extraction of Catalogs from Astronomical Images. Astronomical
Data Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc
Starck, Fionn D. Murtagh. International Society for Optical Engineering,
2001.
Voisin, Bruno. José Donas. Data Mining
for Multi-wavelength Cross-referencing. Astronomical Data
Analysis. 2-3 August 2001 San Diego USA. Chairs Jean-Luc Starck,
Fionn D. Murtagh. International Society for Optical Engineering,
2001.