I am sorry that I had to miss the DS4 meeting in Leicester last week, but since I and a few others in
AstroGrid have done a bit of work in this area before, here are a few half-baked thoughts and comments.
Cross-matching has been discussed by both the DS4 and DS6 groups, so efforts obviously need to be coordinated. It's so important a topic that we must make sure that it doesn't fall between the cracks. My own view is that cross-matching probably involves one or more tools, so it is perhaps more appropriate for it to come under DS4, but it's obviously an important step in many data mining investigations.
Scalability: I agree with Mark Taylor that it is very important to get the scientific requirements established, as cross-matching on small, medium, or large scales requires rather different techniques. Most of what I was working on was trying to see how cross-matching could be done on large catalogues, up to a billion rows in size, since the VO efforts were set up to cope with the so-called data explosion. But there are scientific requirements, I am sure, to do cross-matching on all sizes from dozens of rows upwards, which may mean that we need to develop more than one tool to cover the range.
For the larger cases, it seemed sensible to try to use a relational DBMS, and one with a spatial index to make it efficient at locating and handling positions on the celestial sphere. I found that R-trees and similar spatial indexing techniques, which are available for most DBMS, including free ones like MySQL and Postgres, work quite well in practice. Since then Dave Abel et al. of CSIRO have proposed the sort/sweep algorithm, see
http://www3.ict.csiro.au/vgn/images/portal/cit_16537/16/6/270267ICT_pdf_1091077340525.pdf for the report, and
http://www3.ict.csiro.au/vgn/images/portal/cit_16537/15/28/269690ICT_pdf_1091075813378.pdf for a report on benchmarking. This algorithm is undoubtedly faster when cross-matching two large catalogues, but when one catalogue is significantly smaller than the other, the R-tree method is faster. This is fairly easy to understand in terms of the I/O requirements and the relative speeds of sequential disc reads versus disc seeks: the sweep has to read both tables through sequentially, whereas the cost of indexed look-ups grows with the number of rows in the smaller table. The break-even point probably comes when one table is perhaps a hundred times smaller than the other. I have written more on this in various documents linked from
http://wiki.astrogrid.org/bin/view/Astrogrid/DataDocs
For the smaller cases, where both catalogues will fit into memory, tools like TOPCAT already perform cross-matching with reasonable performance.
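To make the sort-then-sweep idea concrete for the in-memory case, here is a minimal sketch in Python. It is only an illustration of the general approach (sort both lists by declination, then sweep a moving window), not the algorithm as published by Abel et al., and not how TOPCAT does it. The function names, the (id, ra, dec) row layout, and the single fixed match radius are all assumptions made for the example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def sweep_crossmatch(cat_a, cat_b, radius_deg):
    """Return every (id_a, id_b, separation_deg) pair closer than radius_deg.

    cat_a and cat_b are lists of (id, ra_deg, dec_deg) tuples (hypothetical layout).
    """
    # Sort both catalogues by declination so that they can be swept together.
    a_sorted = sorted(cat_a, key=lambda row: row[2])
    b_sorted = sorted(cat_b, key=lambda row: row[2])
    matches = []
    start = 0  # first row of B that can still match the current or a later A row
    for id_a, ra_a, dec_a in a_sorted:
        # Drop B rows too far south; they cannot match any later A row either.
        while start < len(b_sorted) and b_sorted[start][2] < dec_a - radius_deg:
            start += 1
        # Test only the B rows inside the declination window around this A row.
        j = start
        while j < len(b_sorted) and b_sorted[j][2] <= dec_a + radius_deg:
            id_b, ra_b, dec_b = b_sorted[j]
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg:
                matches.append((id_a, id_b, sep))
            j += 1
    return matches
```

The declination window is a safe pre-filter because the great-circle separation between two points is never smaller than their difference in declination. The haversine formula also copes with pairs that straddle RA = 0/360 degrees.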
Uniqueness of matching: if the error region is large then one may not get a unique match. Having a background in X-ray astronomy, where error regions were large until recently, I'm well aware of this problem. It seems to me that the cross-matching tool can ignore this problem, provided that it includes in the output table all possible matches on positional grounds and all useful columns from the two input catalogues; the resolution of ambiguous cases can then be left to a later stage for the user to deal with, as all the necessary information is there. I'm in favour of keeping tools simple and modular, and this restriction certainly helps, as it means that the cross-matcher only needs to consider positional information.
Unmatched sources: again this often happens in the X-ray field as some X-ray sources appear to have no optical counterpart at all. This information is often scientifically important, so the tool must allow unmatched sources to be listed in the output, i.e. you want what database people call a LEFT OUTER JOIN as an option. This needs, obviously, a standard way of representing NULL values. FITS and VOTable support nulls fairly well; with text formats such as CSV there is really no standard.
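As a rough sketch of what such output could look like, the Python example below uses an in-memory SQLite database to run a positional LEFT OUTER JOIN: every candidate counterpart inside a source's error radius produces one output row, and a source with no counterpart still appears once, with NULLs (None in Python) in the counterpart columns. The table and column names, the per-source error radius err_deg, and the helper sep_deg registered as an SQL function are all assumptions made for this illustration; it does a brute-force comparison and makes no use of a spatial index.

```python
import math
import sqlite3


def sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees; returns None if any input is NULL."""
    if None in (ra1, dec1, ra2, dec2):
        return None
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def left_outer_crossmatch(xray_rows, optical_rows):
    """Match (id, ra, dec, err_deg) rows against (id, ra, dec, mag) rows.

    Returns (xray_id, err_deg, optical_id, mag, separation_deg) tuples; the
    last three fields are None for sources with no positional match.
    """
    conn = sqlite3.connect(":memory:")
    # Make the Python separation function callable from SQL.
    conn.create_function("sep_deg", 4, sep_deg)
    conn.execute(
        "CREATE TABLE xray (id TEXT, ra_deg REAL, dec_deg REAL, err_deg REAL)")
    conn.execute(
        "CREATE TABLE optical (id TEXT, ra_deg REAL, dec_deg REAL, mag REAL)")
    conn.executemany("INSERT INTO xray VALUES (?, ?, ?, ?)", xray_rows)
    conn.executemany("INSERT INTO optical VALUES (?, ?, ?, ?)", optical_rows)
    rows = conn.execute(
        """
        SELECT x.id, x.err_deg, o.id, o.mag,
               sep_deg(x.ra_deg, x.dec_deg, o.ra_deg, o.dec_deg)
        FROM xray AS x
        LEFT OUTER JOIN optical AS o
          ON sep_deg(x.ra_deg, x.dec_deg, o.ra_deg, o.dec_deg) <= x.err_deg
        ORDER BY x.id, o.id
        """
    ).fetchall()
    conn.close()
    return rows
```

With two X-ray sources, one of which has two optical candidates inside its error circle and the other none, the result has three rows: two for the ambiguous source and one for the unmatched source with None in the optical columns. Choosing between the two candidates is then left to a later step, which can use any of the columns carried through.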
Vizier: I see that CDS are working on implementing cross-matching within Vizier - this would be very welcome, and we ought to see to what extent the cross-matching work within VOTech can be aligned with this, or whether we can assist. What is needed, I assume, is the ability
- to cross-match any Vizier catalogue with any other one in the same system
- for the user to upload a catalogue in some suitable format and have that cross-matched with one (or more) standard catalogues resident in Vizier.
The latter is simple in theory but involves a host of problems in practice, because of the variety of data formats in use and ways of expressing positions and errors in catalogues. Mark Taylor points out that the Starlink STIL library (and his tablecopy utility based on it) can handle many data formats including FITS tables, VOTable, and plain text.
Distributed cross-match: again highly desirable on scientific grounds, as the tables to be matched often reside on different servers. This involves extra problems:
- DBMS rarely work well over the wide area network,
- many different DBMS are used by astronomers, most pairs of which do not interwork well.
We may be able to cope if we can export a table (or section of a table) in some standard format (such as FITS or VOTable), ingest it easily into the other DBMS server, and do the cross-match there on two tables held locally. Trying to do anything else seems to me to be rather ambitious at present.
--
ClivePage - 01 Mar 2005