r11 - 14 Sep 2005 - 11:39:23 - PatricioOrtiz

Requirements for a Cross Matching Tool


Context

The DS4 group suggests focusing initially on a cross matching tool, resulting in an initial prototype service in iteration #2. This functionality is also relevant to data mining (DS6) and a possible ADQL extension. This page should help capture scientific as well as technical requirements, complementing the DS4 forum.


Homework for Partner Organizations

We need to investigate the following aspects for each organization, but take advice from other parties as well. Read ref. [ 1 ], slides 13ff, which is the original proposal and starting point.
  1. possible use cases for the x-match service
  2. available data sets for testing and analysis
  3. desired distance functions (parameters, algorithms)
  4. identification of local expertise, users, needs
  5. other?

@ESO
  • ad 2) various catalogues from the EIS group etc.
  • ad 3) distance function: error ellipse (a+b) for non symmetric PSF, S/N ratio, statistical probability; return non-matching, best-matching, all-matching

Requirements for cross-match

This section does not address the problem of cross-match at the pixel (image) level. It is assumed that the cross-match operation is done on (extended or point) sources that have been detected/extracted from the images, and stored as a list (whether it be in a file -TSV, FITS, VOTable- or a DBMS or other).

The following describes the different technical schemes for the cross-match, and the possible methods to associate sources.

Different cases

The problem of cross-match can be split into 3 main cases:
  1. local data vs local data
  2. local data vs remote data
  3. remote data vs remote data

Local data can be a file located on the user's desktop, but is not necessarily restricted to this. It only needs to be an easily accessible dataset (e.g. a URL pointing to a downloadable VOTable on some distant server).

Remote data, on the contrary, means data that are not readily available to the user (because they are stored in a DBMS on some remote server, or because the data volume is so large that it can't be transferred over the network).

Local vs local
Case 1 can be solved either by:
  • running the cross-match on the user's desktop
  • sending (pointers to) the files to a remote cross-match service

There are already existing tools for running the cross-match locally (Aladin, TopCat). The limitation here is mainly the available memory. A decent machine (year 2005) can cross-match lists of 10^5 sources in a few seconds.

There are two limitations with a remote cross-match service: bandwidth for transferring the files, and memory on the server. For VOTable transfer, there can be large overheads, and performance better than local computing is unlikely. The interfaces with a remote service also need to be clearly identified (as was demonstrated in the AVO 2005 demo).

Comment from Clive Page: the limitation on memory only applies when an in-memory algorithm is being used. This is the case for TOPCAT, I understand; is it for Aladin also? This need not be a problem: any DBMS with a spatial index such as MySQL or Postgres can be used to perform the cross-match from two tables which are on disc, with no memory limitation. Speed is at least 10,000 sources/sec on the smaller table - so tables of up to a million objects can be cross-matched in interactive jobs.

Comment from PatricioOrtiz: The number quoted above should be much better these days. I saw those cross match times 5 years ago. What hasn't changed much though is I/O rate, and that's what will differentiate efficient from inefficient methods. Assuming a good indexation scheme exists (as Clive pointed out), the handling in memory should be quite reasonable.
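
As a rough illustration of why an indexation scheme matters, the sketch below buckets list B into declination zones so that each target is compared only with sources in neighbouring zones. It is a minimal in-memory sketch in Python, not the method used by Aladin, TOPCAT or any DBMS; the function names and the choice of a zone height equal to the search radius are assumptions made for this example.

```python
import math
from collections import defaultdict

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def build_zone_index(sources, zone_height_deg):
    """Map each declination zone number to the row indices it contains."""
    index = defaultdict(list)
    for row, (_ra, dec) in enumerate(sources):
        index[math.floor(dec / zone_height_deg)].append(row)
    return index

def zone_crossmatch(list_a, list_b, radius_deg):
    """Return (i, j, separation) for all pairs closer than radius_deg."""
    # Assumption: zone height = search radius, so a counterpart can only
    # sit in the target's own zone or in one of the two adjacent zones.
    index = build_zone_index(list_b, radius_deg)
    pairs = []
    for i, (ra, dec) in enumerate(list_a):
        zone = math.floor(dec / radius_deg)
        for z in (zone - 1, zone, zone + 1):
            for j in index.get(z, ()):
                s = angular_separation_deg(ra, dec, *list_b[j])
                if s <= radius_deg:
                    pairs.append((i, j, s))
    return pairs

list_a = [(10.0, 20.0)]
list_b = [(10.0, 20.0005), (50.0, 20.0), (10.0, 20.1)]
pairs = zone_crossmatch(list_a, list_b, radius_deg=0.001)
```

Zoning in declination alone still compares a target with every source in the same band of sky; a real index would also partition in right ascension.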

Local vs remote
In case 2, the user wants to cross-match one local file with a remote list (some resource in a DBMS in a data center, a very large catalogue in VizieR, ...).

The computation needs to be done on the server's side, and the result can be either:

  • returned to the user
  • stored in some dedicated storage (ftp, mySpace)

The cross-match method can range between two extremes:

  • the simplest is to perform one cone-search-like query for each target source (this was adopted for the VizieR query by list in the AVO 2005 demo).
  • the best is to ingest the user's file into the same DBMS as the remote list, and run a dedicated algorithm.
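
The simplest extreme can be sketched as one query URL per target. The parameter names RA, DEC and SR (decimal degrees) follow the Simple Cone Search convention; the base URL is a hypothetical placeholder, and this is not the code used in the AVO 2005 demo.

```python
from urllib.parse import urlencode

def cone_search_urls(base_url, targets, radius_deg):
    """Yield one cone-search URL per (ra, dec) target, in decimal degrees."""
    for ra, dec in targets:
        query = urlencode({"RA": ra, "DEC": dec, "SR": radius_deg})
        yield base_url + "?" + query

# Hypothetical service address, for illustration only.
targets = [(10.5, -3.25), (200.0, 45.0)]
urls = list(cone_search_urls("http://example.org/cone", targets, 0.001))
```

The number of requests grows linearly with the size of the user's list, which is why this approach sits at the simple but slow end of the range.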

The size of the local file should be on the order of 10^3 sources on average, and up to 10^5 sources (usual figures for VizieR usage). The result should be retrieved in no longer than a few minutes (in order to avoid timeouts in the case of direct transfer).

Comment from Clive Page: do we know that 10^5 is really the upper limit for demand from astronomers, or could it be that larger datasets are simply too difficult to handle with current facilities like VizieR?

Comment from PatricioOrtiz: I think larger local catalogues can be expected; after all, such a "targets" catalogue could be created as a subset of a very large one (> 10^8). HTTP based transfers will likely time out (either in long uploads or long downloads), so it might be better to let the server fetch the data from a URL using FTP (or something better).

Remote vs remote
Case 3 involves only remote data. These data might be located on different servers.

Because the processing is done remotely, there are two points to identify:

  • how the query is expressed (using ADQL/VOQL or something else?)
  • how the result is retrieved (directly or ftp/mySpace storage)

This case is the most difficult. There might be different implementations of the cross-match for different server architectures. The adopted solution must be scalable to very large datasets (up to a billion rows), and allow an efficient parsing of constraints/filtering. This efficient filtering is critical when very large catalogues are involved. For example, one could cross-match sources from USNOB and 2MASS, with a cut on the optical magnitude, a selection on 2MASS flags, and a final filtering on an optical-IR color: the constraints must be applied at the proper level in the processing, in order to optimize the result.

The baseline for this case is to transfer the query, not the whole original data. When different servers are involved, solutions 'a la Skynode' might be used, but more sophisticated xMatch methods are desirable.

Comment by PatricioOrtiz: I believe that this kind of query requires a bit more than what we have. It requires decision making in a dynamic way. Say we have two Very Large Catalogues (VLCs) located in different places, each having a cross match service which accepts data import... to which one do we send the query? To the one with the largest number of records or to the one with the largest number of selected records? Will there be a limit to the number of selected records? Instead of launching the query on both sides, one could ask for an estimate on both sides (which will give us a rough idea of how many sources will be selected without actually doing the selection). Based on those numbers, the system (agent?) works out which server should receive the query to produce a table (VOT binary?) and place it in an FTP area, and which server should receive the full cross correlation request, including the URL of the table to use as either targets or matches.
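
A minimal sketch of that decision step follows, in Python. The rule used here (the server with the smaller estimated selection exports its table, the other runs the cross-match) is only one possible policy and is an assumption of this example, as are the function name and the server names.

```python
def plan_remote_crossmatch(estimates):
    """Pick roles from estimated selection sizes.

    estimates maps a server name to the estimated number of rows its
    selection would return. Assumed policy: the smaller selection is
    exported, and the other server runs the cross-match.
    """
    if len(estimates) != 2:
        raise ValueError("this sketch handles exactly two servers")
    exporter, matcher = sorted(estimates, key=estimates.get)
    return {"export_from": exporter, "match_on": matcher}

# Hypothetical estimates returned by the two services.
plan = plan_remote_crossmatch({"usnob": 5_000_000, "twomass": 20_000})
```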

Association methods

The problem is to return (possibly combined) information, resulting from the comparison of information in two lists of sources:
  • list-A, which are the target sources;
  • list-B, which are possibly associated sources.

Comment by Clive Page: I think it is worth listing here the need for what DBMS people call the LEFT OUTER JOIN, i.e. in some cases astronomers will only be interested in sources which match, in others they will also be interested in those which have no counterpart in the second table. There may also be a case where they are only interested in the latter, indeed the x syntax of the VOQL is designed to cater for this case exclusively.
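
The three outputs mentioned in this comment can be illustrated with SQLite from Python. The table and column names (list_a, pairs, a_id, b_id) are made up for this example, and the pairs table stands in for the result of whatever positional match has already been done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE list_a (a_id INTEGER PRIMARY KEY);
    CREATE TABLE pairs (a_id INTEGER, b_id INTEGER);
    INSERT INTO list_a VALUES (1), (2), (3);
    INSERT INTO pairs VALUES (1, 10), (1, 11), (3, 30);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Only sources with a counterpart (INNER JOIN).
matched = run("SELECT a.a_id, p.b_id FROM list_a a "
              "JOIN pairs p ON p.a_id = a.a_id ORDER BY a.a_id, p.b_id")

# Every source from list A, with NULL where no counterpart exists.
everything = run("SELECT a.a_id, p.b_id FROM list_a a "
                 "LEFT OUTER JOIN pairs p ON p.a_id = a.a_id "
                 "ORDER BY a.a_id, p.b_id")

# Only the sources without a counterpart.
unmatched = run("SELECT a.a_id FROM list_a a "
                "LEFT OUTER JOIN pairs p ON p.a_id = a.a_id "
                "WHERE p.b_id IS NULL")
```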

Simple positional cross-match
The simplest case for associating sources is to compute their angular separation s (distance along a great circle on the sky). Each target source A_i from list-A can then be:
  1. Associated to all sources from list-B where the angular separation s(A_i,B) is compatible with some thresholds s_MIN < s(A_i,B) < s_MAX. This method returns all the matching pairs, usually repeating information for target A_i for each associated source from B (with information from list B).
  2. Associated to the best match from list-B, i.e. the source from list B where the angular separation s(A_i,B) is minimum, and compatible with some thresholds s_MIN < s(A_i,B) < s_MAX (this corresponds to the nearest neighbour method: the result is the merging of measurements in lists A and B for the best pairs).
  3. Selected if no source from list-B matches s_MIN < s(A_i,B) < s_MAX. This method is useful to search for undetected sources (dropouts), or to ensure that every source from list-A is still present in the output.

A general tool should allow for an output consisting of ((1 OR 2) AND/OR 3).
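
The three methods and their combination can be sketched with a brute-force loop in Python. This is an illustration under stated assumptions, not a proposed implementation: the function names, the mode strings and the keep_unmatched flag are invented here, and the thresholds are strict as written above, so an exact coincidence (s = 0) is kept only if s_min is negative.

```python
import math

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def associate(list_a, list_b, s_min, s_max, mode="all", keep_unmatched=False):
    """Return rows (i, j, s); unmatched targets appear as (i, None, None).

    mode "all" is method 1, "best" is method 2, "none" returns no matched
    rows; keep_unmatched=True adds method 3 to the output.
    """
    rows = []
    for i, (ra, dec) in enumerate(list_a):
        hits = []
        for j, (rb, db) in enumerate(list_b):
            s = separation_deg(ra, dec, rb, db)
            if s_min < s < s_max:
                hits.append((i, j, s))
        if hits and mode == "all":
            rows.extend(hits)
        elif hits and mode == "best":
            rows.append(min(hits, key=lambda hit: hit[2]))
        elif not hits and keep_unmatched:
            rows.append((i, None, None))
    return rows

list_a = [(10.0, 20.0), (30.0, -5.0)]
list_b = [(10.0, 20.0002), (10.0, 20.0005), (100.0, 0.0)]
all_pairs = associate(list_a, list_b, 0.0, 0.001, mode="all")
best_pairs = associate(list_a, list_b, 0.0, 0.001, mode="best")
dropouts = associate(list_a, list_b, 0.0, 0.001, mode="none", keep_unmatched=True)
```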

Comment from Clive Page: I am not convinced that case (2) has much demand: if there really are two or more sources in catalogue two which appear to be within the error region of my source in catalogue one then I think most astronomers would want to know about the ambiguity and have both (all) results returned in the output. As an X-ray astronomer I know that the optical object which was the nearest neighbour was rarely the right one to choose. It seems to me that only cases (1) and (3) are useful, i.e. the INNER and LEFT OUTER JOIN as noted above. But perhaps other astronomers disagree...

Comment from PatricioOrtiz: Agreed. Additionally, cross matching routines can be used to detect density enhancements or to locate as many sources of a given type as possible within a certain region. However, if someone wants a "best match" (whatever that means), should we prevent it? Perhaps we could think of how to tell the probability of this best match being unique.

Another possibility is that the user is only interested in the number of matches (say that I'm looking for density enhancements)

Positional cross-match with errors
A slightly more evolved tool could take into account positional errors (or source extension) in the estimation of the separation. When information is available on the position accuracy (error circle/ellipse), it could be used to determine associations based on a number of sigmas of these errors (the SkyNode sort of does this kind of estimation, but it can't be precisely controlled because this is in a black box).

The same algorithm could be used to deal with extended sources, or when the cross-match is done between catalogues with very different resolutions (e.g. old X-ray catalogues with large positional uncertainties). Searching for all associations within a given ellipse would allow hierarchical cross-match (when one extended source gets resolved by new observations or observations at different wavelengths).
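
For the simplest case of error circles, a number-of-sigmas criterion could look like the Python sketch below. It assumes independent, circular Gaussian position errors that are combined in quadrature; error ellipses would need the full covariance and are not handled. The function names and the default of 3 sigma are assumptions of this example.

```python
import math

def normalized_separation(sep_arcsec, sigma_a_arcsec, sigma_b_arcsec):
    """Separation in units of the combined 1-sigma positional error."""
    return sep_arcsec / math.hypot(sigma_a_arcsec, sigma_b_arcsec)

def within_n_sigma(sep_arcsec, sigma_a_arcsec, sigma_b_arcsec, n_sigma=3.0):
    """True if the pair is compatible at the chosen number of sigmas."""
    return normalized_separation(sep_arcsec, sigma_a_arcsec, sigma_b_arcsec) <= n_sigma

# A 15 arcsec offset with 3 and 4 arcsec errors is exactly a 3-sigma pair.
accepted = within_n_sigma(15.0, 3.0, 4.0)
rejected = within_n_sigma(16.0, 3.0, 4.0)
```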

Comments from PatricioOrtiz: And let's not forget different geometries (shapes): elliptical (circle), rectangle (square), polygon. CCDs and photographic plates are rectangular ones; molecular clouds are probably better described by a polygon (vertices in RA,DEC or glon, glat).

Advanced positions handling
The cross-match tool could also deal with different coordinate systems (e.g. cross-match one catalogue with positions in equatorial coordinates with another catalogue with galactic coordinates), different equinox. Also, stellar proper motions could be used for comparing catalogues at different epochs.

Comments by PatricioOrtiz: inclusion of precession is the very next step: we're "receding" from J2000.

Proper motions are fine, but how do we account for the errors in proper motion?
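
One common first-order answer is to grow the positional error with the epoch difference, adding the proper-motion error term in quadrature. The Python sketch below assumes linear motion, uncorrelated errors, proper motion in RA already multiplied by cos(dec), and units of mas and mas/yr; all function and parameter names are invented for this example.

```python
import math

MAS_PER_DEG = 3.6e6  # milliarcseconds per degree

def propagate_position(ra_deg, dec_deg, pm_ra_cosdec_mas_yr, pm_dec_mas_yr, dt_yr):
    """Move a position by dt_yr years using a linear proper motion."""
    d_dec = pm_dec_mas_yr * dt_yr / MAS_PER_DEG
    # Convert the on-sky RA motion back to a change in the RA coordinate.
    d_ra = (pm_ra_cosdec_mas_yr * dt_yr / MAS_PER_DEG
            / math.cos(math.radians(dec_deg)))
    return (ra_deg + d_ra) % 360.0, dec_deg + d_dec

def propagated_sigma_mas(sigma_pos_mas, sigma_pm_mas_yr, dt_yr):
    """Positional error after dt_yr years, per coordinate."""
    return math.hypot(sigma_pos_mas, sigma_pm_mas_yr * dt_yr)

ra, dec = propagate_position(10.0, 60.0, 1000.0, -500.0, 10.0)
sigma = propagated_sigma_mas(100.0, 10.0, 10.0)
```

The enlarged error can then be fed into the same number-of-sigmas test used for positional errors. The linear approximation breaks down close to the poles and for very long time baselines.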

Generalized 'distance'
Very often, it is useful to use extra information (other than position) in order to improve the quality of the cross match. But because of the huge number of possible cases, they can't be all predefined. It is therefore highly desirable to let the user express a customized 'distance' function, for example something like d=sqrt((angular_separation/positional_error)^2 + (magnitude_difference/magnitude_error)^2).

This could be implemented by re-using the syntax used in Aladin filters, or new column computation.
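
Written out in Python, the example distance above is a one-line function. The sketch only shows the kind of expression a user might supply; how such a function would be passed to a service is left open, and the function name is an assumption.

```python
import math

def generalized_distance(angular_separation, positional_error,
                         magnitude_difference, magnitude_error):
    """d = sqrt((separation/error)^2 + (magnitude difference/error)^2)."""
    return math.sqrt((angular_separation / positional_error) ** 2
                     + (magnitude_difference / magnitude_error) ** 2)

# 3 sigma away in position and 4 sigma away in magnitude.
d = generalized_distance(3.0, 1.0, 0.4, 0.1)
```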

Comment from Clive Page: I agree that this is something to aim for eventually, but it seems to me that distance is nearly always the main criterion, and often the only one. Thus it is possible to use a simple cross-match service which works on positional overlaps only in a more advanced way simply by sending all the results to the user, who can then run a further selection process on the results, using SQL-like syntax (as in VOQL) to compute a more advanced distance or likelihood function. This is, of course, particularly likely to be useful when the positional match alone is ambiguous. I'd be in favour of keeping the initial cross-match function simple by ignoring this case, as it can be dealt with by subsequent tools.

Comments from PatricioOrtiz: I kind of disagree with Clive here. As I said before, cross matching methods can be used to locate not just matches, but "similar" sources in a vicinity of the main source. IMHO, what we need is to extend the match criteria. So far, we only have angular separation, but if we could have any other criteria (like "distance in magnitude" or "distance in colour" or "distance in whatever", where distance means the absolute value of the target value minus the value of the potential match, e.g. |Q_t - Q_m|), then we could let users explore even in large datasets. The problem I see with Clive's approach is -again- in VLCs. If I search for stars with magnitudes within 0.1 from the target in a radius of 2 degrees in USNOB, the number is much smaller than just stars within a 2deg radius. With a purely positional service, the file to download and work with will be so large that it will likely discourage users from attempting this.
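
The extended criteria described in this comment amount to a per-quantity tolerance test applied on the server before results are returned. A minimal Python sketch follows, with invented names and dictionaries standing in for catalogue rows.

```python
def passes_criteria(target, candidate, tolerances):
    """True if |Q_t - Q_m| <= tolerance for every listed quantity Q."""
    return all(abs(target[q] - candidate[q]) <= tol
               for q, tol in tolerances.items())

target = {"mag": 15.0}
candidates = [{"id": "b1", "mag": 15.05}, {"id": "b2", "mag": 16.2}]
# Keep only candidates within 0.1 mag of the target.
similar = [c["id"] for c in candidates
           if passes_criteria(target, c, {"mag": 0.1})]
```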

Likelihood
Estimating the likelihood of associations done during the cross-match is scientifically very useful. It allows users to aim for more completeness (getting as many associations as possible) or more reliability (getting only very reliable associations).

The likelihood could be estimated in the case of simple positional cross match. However, this might significantly decrease performance, as it implies, for example, estimating the local source density in the catalogues...

In the case of complex distance functions, the user could also provide his own likelihood function.

Comments from PatricioOrtiz: unless one has an idea of the local density beforehand (it doesn't change). This is related to the indexation scheme used.
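
If a local density is available, a simple estimate of the chance of a spurious association is the probability of finding at least one unrelated source within the search radius. The Python sketch below assumes unrelated sources are scattered uniformly (Poisson statistics) with a known surface density, and a radius small enough for a flat-sky area; it is one textbook estimate, not a likelihood definition adopted on this page.

```python
import math

def chance_match_probability(density_per_sq_deg, radius_deg):
    """P(at least one unrelated source within radius) = 1 - exp(-rho*pi*r^2)."""
    expected = density_per_sq_deg * math.pi * radius_deg ** 2
    # expm1 keeps precision when the expected count is very small.
    return -math.expm1(-expected)

# 1000 sources per square degree, 3.6 arcsec (0.001 deg) search radius.
p = chance_match_probability(1000.0, 0.001)
```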

Metadata curation and automated processing

When the cross-match is performed on VOTable files, the metadata can be used to detect automatically the proper columns (e.g. where the positions are located). This is already done using the UCD in Aladin. A careful metadata curation is necessary during the cross-match:
  • the UCD in the output VOTable might need to be changed from their original value (if each original catalogue contains a 'phot.mag;em.opt.V;meta.main', there can be only one 'meta.main' in the resulting catalogue).
  • this would allow fine tuning of the output: the user could ask to cross-match list A with list B (in that order, using A as reference positions), but require that the reference positions in the output be that of list B.
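
The first point can be sketched as a pass over the merged column descriptions. In the Python below, columns are plain dictionaries with a "source" key naming the list they came from; that structure, the function name and the reference_list parameter are assumptions of this example, not part of VOTable or of Aladin.

```python
def curate_main_flags(columns, reference_list):
    """Keep 'meta.main' only on columns coming from reference_list."""
    curated = []
    for col in columns:
        words = col["ucd"].split(";")
        if col["source"] != reference_list and "meta.main" in words:
            words = [w for w in words if w != "meta.main"]
        curated.append({**col, "ucd": ";".join(words)})
    return curated

merged = [
    {"name": "Vmag_A", "source": "A", "ucd": "phot.mag;em.opt.V;meta.main"},
    {"name": "Vmag_B", "source": "B", "ucd": "phot.mag;em.opt.V;meta.main"},
]
# The user asked for list B to provide the reference quantities.
curated = curate_main_flags(merged, reference_list="B")
```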

Comments from PatricioOrtiz: Although I fully support the use of UCDs :-), there might be instances where it is best to let the user pick the columns by hand. I've seen this in the time domain, where a time is a time, whether it is t_start or t_end.


Short Range Plan

  1. SebastienDerriere & MarkusDolensky will collect reqs. prior to TAP meeting in March
  2. design on paper in iteration 1
  3. prototype implementation in iteration 2


Related Material

  1. Original Proposal (slides 13ff): http://eurovotech.org/twiki/pub/VOTech/DS4PlanningStage01/DS4_CDS.pdf
  2. CDS Cross Matcher: http://www.euro-vo.org/twiki/bin/view/Avo/CDSXMatchPlugin
  3. DB Technology Reports by Clive Page: http://www.euro-vo.org/twiki/bin/view/Avo/WorkPackageThreeThree (from 2002), http://wiki.astrogrid.org/pub/Astrogrid/DataDocs/avodbms4.pdf (AVO report, 2004) and http://wiki.astrogrid.org/bin/view/Astrogrid/DataDocs which links to other documents on the AstroGrid wiki.
  4. Original Cross Match Specification for AVO Demo 2004: http://www.euro-vo.org/twiki/bin/view/Avo/GeomXMatch
  5. Benchmark Catalogue Cross Matching by Power & Devereux, 2004: http://www3.ict.csiro.au/vgn/images/portal/cit_16537/15/28/269690ICT_pdf_1091075813378.pdf
  6. Some comments from Clive Page CrossMatchReq2
  7. Some thoughts from BobMann from the AstroGrid Phase A study: http://wiki.astrogrid.org/bin/view/Astrogrid/WPA5AssociationMethods
  • ...


-- MarkusDolensky - 24 Feb 2005

-- Additional comments by ClivePage - 7 March 2005
