r29 - 01 Oct 2008 - 13:36:39 - BriceGassmann

Introduction

The Data Extraction Tool is the next step after you have found relevant resources with the Registry Query Tool. It helps you extract tabular data from those resources and transform it into a uniform schema (same units, same column names, and so on). In addition, you can filter the sources to keep in the output, generate new columns by combining input columns, and define rules to generate unique astronomical source identifiers. You can also choose the equinox of the output coordinates (B1950 or J2000), provided equatorial coordinates are available in the input resource.

First step

The first step is to select the resources that must be processed and to define a uniform schema for the output (column names, units...).

Snapshot

DataExtractionToolScreenshot.png

Resources selection

The first step is to choose which resources must be processed. There are two ways to do so:
  • Load a workspace (it is shared by the two tools; the resources that have been marked as relevant will be automatically loaded in the processing list)
  • Load from a file containing the VO resource identifiers, one per line
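
As an illustration of the second option, here is a minimal Python sketch of reading such a file. The function name read_resource_identifiers is hypothetical, the identifiers are made up, and ignoring blank lines is an assumption; the tool's own parser may behave differently.

```python
import io
from typing import Iterable, List


def read_resource_identifiers(lines: Iterable[str]) -> List[str]:
    """Return one identifier per non-empty line, stripped of whitespace."""
    identifiers = []
    for line in lines:
        identifier = line.strip()
        if identifier:  # assumption: blank lines are ignored
            identifiers.append(identifier)
    return identifiers


# In-memory stand-in for a file on disk; the identifiers are invented.
sample = io.StringIO(
    "ivo://example.org/catalogue/a\n"
    "\n"
    "ivo://example.org/catalogue/b\n"
)
resource_ids = read_resource_identifiers(sample)
```

With a real file, the same function could be called on an open file object, since iterating over it also yields one line at a time.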

Features of the resource list

The features for manipulating the list of resources are the same as in the Registry Query Tool. See the corresponding section of the Registry Query Tool documentation for more details.

Uniform output schema definition

An ordered list of columns must be defined for the output. This is called the "output schema". A column is defined by four parameters:

  • The name: this parameter is mandatory and describes the name of the output column
  • The unit: this parameter describes the unit for the data of the output column. This means that the tool will always try to convert the input values for this column into this unit.
  • The UCD: this parameter describes the UCD of the output column. The tool will highlight all columns of the input table having this UCD to help the user choose the right one.
  • The format: this parameter describes the decimal format of the output column. The tool will always try to convert the output value for this column to this format.
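
To make the four parameters concrete, the schema can be pictured as an ordered list of small records. The following Python sketch is only an illustration: the class OutputColumn, the helper highlight_candidates, the sample column names and the exact-match rule for UCDs are all assumptions, not the tool's internal model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OutputColumn:
    name: str                    # mandatory
    unit: Optional[str] = None   # target unit for conversions
    ucd: Optional[str] = None    # used to highlight matching input columns
    fmt: Optional[str] = None    # decimal format, e.g. "%10.5"

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("an output column needs a name")


def highlight_candidates(input_ucds: Dict[str, str], column: OutputColumn) -> List[str]:
    """Names of input columns whose UCD equals the output column's UCD."""
    if column.ucd is None:
        return []
    return [name for name, ucd in input_ucds.items() if ucd == column.ucd]


# The output schema is an ordered list of columns (illustrative values).
output_schema = [
    OutputColumn(name="RA", unit="deg", ucd="pos.eq.ra;meta.main", fmt="%10.5"),
    OutputColumn(name="DEC", unit="deg", ucd="pos.eq.dec;meta.main", fmt="%10.5"),
    OutputColumn(name="Flux", unit="Jy", ucd="phot.flux.density"),
]

# UCDs of a made-up input table.
input_table = {"RAJ2000": "pos.eq.ra;meta.main", "S1400": "phot.flux.density"}
candidates = highlight_candidates(input_table, output_schema[2])
```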

The "Output schema" section of the tool allows the user to create and update it:

  • Create: the Name and optionally the UCD, Unit and Format fields of the output column must be filled. The [...] button must be clicked to add this output column to the schema.
  • Update: if the user selects one of the output columns, its information is displayed in the fields. Updating them and clicking the [...] button updates the selected column.

Construction of the uniformisation form

Clicking the "Generate the uniformisation form" button at the bottom of the window makes the tool construct the uniformisation form for the resources selected in the list, according to the defined output schema.

Second step

During the second step you interact with a form to customize the uniformisation of the resources. Some items are filled in automatically, but the user can provide additional constraints and customizations (drop constraints, column generation, and so on). The following features are available:

Snapshot

snapshot2.jpg

Columns selection

For each resource and for each column of the output schema, you can select which original column of the input resource is mapped to the target column. The selection is presented as a scrolling list; to help with the mapping, the original columns that match the UCD are colored in red. If no column of the original resource can be mapped, you can generate a column with the arithmetic form (Edit16.gif button).

The About16.gif button provides access to information about the selected input column (name, UCD, unit and description).

Arithmetic expression

DataExtractionToolSnapshotWiki3.jpg

The user can combine input column values to generate output values. This can be done with the arithmetic form above. Here is the general syntax:

condition_1  {aritExpression_1}
condition_2  {aritExpression_2}
condition_3  {aritExpression_3}
...
condition_n  {aritExpression_n}
{default_aritExpression}

Algorithm:

  • Conditions are tested in order. As soon as a condition_i is satisfied, the output value is calculated from aritExpression_i and no other condition is tested.
  • If no condition is satisfied, the output value is calculated from default_aritExpression.
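
The first-match behaviour described above can be sketched in Python as follows. This is not the tool's parser: conditions and expressions are written directly as Python callables, and the names evaluate, rules and default, as well as the sample rules, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Rule = Tuple[Callable[[Row], bool], Callable[[Row], float]]


def evaluate(rules: List[Rule], default: Callable[[Row], float], row: Row) -> float:
    """Return the expression of the first satisfied condition, else the default."""
    for condition, expression in rules:
        if condition(row):
            return expression(row)  # later conditions are not tested
    return default(row)


# Two illustrative rules plus a default, in the same order as they would
# be written in the arithmetic form.
rules: List[Rule] = [
    (lambda r: r["Flux"] > 10, lambda r: r["FluxDen"] * 300),
    (lambda r: r["Flux"] > 5, lambda r: r["FluxDen"] * 500),
]
default = lambda r: 1.0

first = evaluate(rules, default, {"Flux": 20.0, "FluxDen": 2.0})   # rule 1 wins
second = evaluate(rules, default, {"Flux": 7.0, "FluxDen": 2.0})   # rule 2
fallback = evaluate(rules, default, {"Flux": 1.0, "FluxDen": 2.0})  # default
```

Note that the first row also satisfies the second condition, but only the first matching rule is applied, so the order of the rules matters.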

Specification of the condition:

  • can contain input column names (Flux must be written ${Flux} for example)
  • can contain the usual logical and comparison operators: ||, &&, =, !=, <, >, <=, >=

Specification of the arithmetic expression:

  • can contain input column names (Flux must be written ${Flux} for example)
  • can contain the classical arithmetic operators: +, -, *, /, ^, (, )
  • some mathematical functions are supported: cos, sin, tan, acos, asin, atan, ln, log, abs, deg2rad, rad2deg, sqrt, exp
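
For readers who want to check an expression by hand, the supported functions have close counterparts in Python's math module. The mapping below is a sketch under two assumptions that the documentation does not confirm: that log is the base-10 logarithm (ln being the natural one) and that the trigonometric functions work in radians. The ^ operator corresponds to ** in Python.

```python
import math

# Hypothetical mapping from the tool's function names to Python callables.
FUNCTIONS = {
    "cos": math.cos, "sin": math.sin, "tan": math.tan,
    "acos": math.acos, "asin": math.asin, "atan": math.atan,
    "ln": math.log,            # natural logarithm
    "log": math.log10,         # assumption: base 10
    "abs": abs,
    "deg2rad": math.radians,
    "rad2deg": math.degrees,
    "sqrt": math.sqrt,
    "exp": math.exp,
}

# Hand evaluation of  ${Flux}^2 * cos(deg2rad(${Angle}))
# for a row with Flux = 3 and Angle = 60 (made-up column names and values).
flux, angle = 3.0, 60.0
value = flux ** 2 * FUNCTIONS["cos"](FUNCTIONS["deg2rad"](angle))
```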

Note about the input columns: the values are taken "as is", in their original unit; there is no unit management for the moment.

Here is a complete example:

${Flux}>10 || ${Flux}<3 {${FluxDen}*300}
${flux}<5 &&  ${flux}<6 {${FluxDen}*500}
{1}

Decimal format

The format attached to the output schema column defines the decimal format for these values. The general syntax is:

%nb_digits1.nb_digits2

  • nb_digits1: the total number of characters, including the decimal separator
  • nb_digits2: the number of digits after the decimal separator

If the formatted number is shorter than the total width, spaces are added at the beginning.
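
The pattern resembles the width and precision of a C-style fixed-point format. As a rough illustration, the Python sketch below applies such a pattern with Python's own formatting; the function name format_decimal is hypothetical, and Python rounds the last digit, which may or may not be what the tool does.

```python
def format_decimal(value: float, pattern: str) -> str:
    """Apply a '%width.decimals' pattern such as '%8.3' to a number."""
    width, decimals = pattern.lstrip("%").split(".")
    # Fixed-point notation, right-aligned and padded with leading spaces.
    return f"{value:{int(width)}.{int(decimals)}f}"


padded = format_decimal(3.14159, "%8.3")   # 8 characters, 3 decimals
short = format_decimal(12.5, "%6.2")       # 6 characters, 2 decimals
```

Here padded is "   3.142" (three leading spaces) and short is " 12.50" (one leading space).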

Astronomical identifiers management

The output columns corresponding to astronomical identifiers can be generated with a special pattern. It can be useful if they have not been defined in the input table or if there are many duplicates. The general syntax is:

acronym [B|J]RA_pattern[+|-]DEC_pattern

  • acronym is the acronym for the catalog
  • B|J is the equinox for output RA and DEC coordinates
  • RA_pattern and DEC_pattern are the patterns for the RA and DEC coordinates

Here is an example of such a pattern:

B3 JHHMMSS.SS+DDMMSS

The RA and DEC output values are built from the input coordinates values. They are first converted to the correct equinox and then truncated to fit to the pattern.

The previous pattern can for example generate such a value:

B3 J223614.05+684502
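
The following Python sketch shows one way such a value could be built for the specific pattern HHMMSS.SS+DDMMSS. It is an illustration, not the tool's code: the function name make_identifier is hypothetical, the pattern is hard-coded rather than parsed, the coordinates are assumed to be decimal degrees already expressed in the requested equinox, and the input values were chosen only to reproduce the example above.

```python
import math


def make_identifier(acronym: str, equinox: str, ra_deg: float, dec_deg: float) -> str:
    """Build 'acronym [B|J]HHMMSS.SS+DDMMSS' by truncation (no rounding)."""
    # RA: degrees -> hundredths of a second of time (360 deg = 24 h),
    # i.e. 24 * 3600 * 100 / 360 = 24000 per degree.
    ra_cs = math.floor(ra_deg * 24000)
    hours, rest = divmod(ra_cs, 360000)
    minutes, rest = divmod(rest, 6000)
    seconds, hundredths = divmod(rest, 100)

    # Dec: degrees -> whole arcseconds; the sign is handled separately so
    # that truncation goes toward zero.
    sign = "-" if dec_deg < 0 else "+"
    dec_as = math.floor(abs(dec_deg) * 3600)
    degrees, rest = divmod(dec_as, 3600)
    arcmin, arcsec = divmod(rest, 60)

    return (
        f"{acronym} {equinox}"
        f"{hours:02d}{minutes:02d}{seconds:02d}.{hundredths:02d}"
        f"{sign}{degrees:02d}{arcmin:02d}{arcsec:02d}"
    )


identifier = make_identifier("B3", "J", 339.0585417, 68.7506)
```

With these inputs, identifier is "B3 J223614.05+684502". Working in integer hundredths of a second keeps the truncation exact once the initial floor has been taken.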

If the identifiers are already present in the input resource but without the acronym, one can use this syntax:

acronym *

This means that "acronym" and the value of the selected column are concatenated to generate the output identifier values (don't forget to select an identifier column!).

Unit conversion

A unit can be attached to each column of the output schema. The tool will always try to convert input values into that output unit. If the conversion is not possible, an alert is displayed at the end of the processing so that the user can react (by changing the unit, for example). Note that in this case the original values are written as is, without unit conversion.
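
The fallback behaviour described above (convert when possible, otherwise keep the values and record an alert) can be sketched in Python. The conversion table, the function name convert_column and the alert format are assumptions made for the illustration; the tool's real unit handling is certainly richer.

```python
from typing import Dict, List, Tuple

# Multiplicative factors for a few (input unit, output unit) pairs.
# Illustrative subset only.
FACTORS: Dict[Tuple[str, str], float] = {
    ("Jy", "mJy"): 1e3,
    ("mJy", "Jy"): 1e-3,
    ("deg", "arcmin"): 60.0,
}


def convert_column(values: List[float], from_unit: str, to_unit: str,
                   alerts: List[Tuple[str, str]]) -> List[float]:
    """Convert values if a factor is known; otherwise keep them and alert."""
    if from_unit == to_unit:
        return list(values)
    factor = FACTORS.get((from_unit, to_unit))
    if factor is None:
        alerts.append((from_unit, to_unit))  # reported at the end of processing
        return list(values)                  # original values kept as is
    return [v * factor for v in values]


alerts: List[Tuple[str, str]] = []
converted = convert_column([1.5, 0.25], "Jy", "mJy", alerts)
unchanged = convert_column([12.0], "mag", "Jy", alerts)
```

After these two calls, alerts holds a single entry for the mag to Jy conversion that could not be done.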

Coordinate equinox selection

One can choose whether the output coordinates are expressed in the J2000 or the B1950 equinox. The tool will always try to convert the input coordinates into the chosen equinox. This option is found in the preferences window, so it applies to all the resources in the uniformisation form.

Sources filter

DataExtractionToolSnapshotWiki4.jpg

For each resource the user can define a logical condition for filtering the output rows. Each row that satisfies the condition is dropped, that is, it is not written to the output. For example:

${flux}>500 || ${flux}<200

means: "if the flux column value is greater than 500 or lower than 200 for one row in the input table, this row won't be written in the output table". The columns used in the expression are the input columns and so are expressed in their original units.
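
Because the condition describes the rows to drop rather than the rows to keep, it is easy to get the logic inverted. The short Python sketch below mimics the example on a made-up table; the names rows, drop_condition and kept are hypothetical.

```python
# A made-up input table with a single "flux" column.
rows = [{"flux": 150.0}, {"flux": 350.0}, {"flux": 620.0}]


def drop_condition(row: dict) -> bool:
    """Python equivalent of  ${flux}>500 || ${flux}<200"""
    return row["flux"] > 500 or row["flux"] < 200


# Rows that satisfy the condition are NOT written to the output.
kept = [row for row in rows if not drop_condition(row)]
```

Only the row with a flux of 350 remains in kept; to keep exactly the rows matching a criterion, the drop condition must be the negation of that criterion.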

Resource selection

The user can choose which resources to process by selecting or deselecting them with the checkboxes on the left of the form.

Report window

At the end of the processing, a report is shown. It contains three kinds of information:

  • miscellaneous errors
  • unit conversion errors
  • duplicate identifiers

snapshot3.jpg

Miscellaneous errors

Miscellaneous errors, such as a missing column definition or coordinate columns that were not found, are reported in this section. For each error the following information is available:
  • resource identifier
  • a short text describing the error.

Note that if such an error occurs for a resource, that resource has most likely not been processed.

Unit conversion errors

If a unit conversion could not be done it will be written in this part of the report. For each failure the report contains:

  • the catalogue
  • the column
  • the original unit
  • the target unit

Duplicate identifiers

For each processed resource, the report shows information about duplicate identifiers and lets the user act on them. The following information and actions are available:

  • The number of duplicate identifiers found
  • show details button: to see the list of duplicate identifiers
  • resolve button: to resolve the duplicate identifiers (this is done in memory only and is not applied to the output resource)
  • write button: to write the resolved identifiers back to the output resource

To resolve the duplicate identifiers, a letter suffix (A, B, C...) is appended to each occurrence, as in this example:

B3 J223103+120532 -> B3 J223103+120532A
B3 J223103+120532 -> B3 J223103+120532B
B3 J223103+120532 -> B3 J223103+120532C
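
A minimal Python sketch of this suffixing scheme is given below. The function name resolve_duplicates is hypothetical, and the behaviour beyond 26 duplicates of the same identifier is not documented, so the sketch simply raises an error in that case.

```python
from collections import Counter
from string import ascii_uppercase
from typing import List


def resolve_duplicates(identifiers: List[str]) -> List[str]:
    """Append A, B, C... to each occurrence of a duplicated identifier."""
    totals = Counter(identifiers)   # how often each identifier occurs
    seen: Counter = Counter()       # occurrences already suffixed
    resolved = []
    for identifier in identifiers:
        if totals[identifier] > 1:
            index = seen[identifier]
            if index >= len(ascii_uppercase):
                # Assumption: the scheme is undefined past the letter Z.
                raise ValueError("more than 26 duplicates of " + identifier)
            resolved.append(identifier + ascii_uppercase[index])
            seen[identifier] += 1
        else:
            resolved.append(identifier)  # unique identifiers are untouched
    return resolved


resolved = resolve_duplicates([
    "B3 J223103+120532",
    "B3 J223103+120532",
    "B3 J223614+684502",
    "B3 J223103+120532",
])
```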

User preferences

Some preferences for the tool can be set in the preferences window. To open it, click the "Preferences" item of the option menu. The preferences are divided into two parts:

  • Registry: some options about the registry where the tool searches for VO resources metadata can be set here
  • Output data: some options about how the output data is generated can be set here

Registry

optionRegistryScreenshot.png

There are two ways to specify which Registry will be used by the tool:

  • Enter a Registry endpoint URI: if the user knows the endpoint URI of the Registry, it can be entered directly in this text field.
  • Choose from the available Registries: this section lists the Registries available worldwide; the list can be refreshed with the button to its right. The user only has to choose one of these Registries.

Output data

DataPreferencesScreenshot.png

  • Empty values: the tool can automatically replace empty input values by a string or value that can be set here
  • Coordinate system: the equinox of the output coordinates can be set here (B1950 or J2000)
  • Formats: the output formats can be set here. The tool will create one resource per selected format (ASCII and VOTable are supported)
  • ASCII header: it is possible to define a header for each output ASCII table by setting specific parameters to be written to it. Only the number of sources in the output resource is available for the moment.

Technical requirements

  • Java: JRE 1.5+
  • A running Registry that supports the XQuery and getResource requests via SOAP (this is an IVOA standard)

Download

Current release

The current release of the Data Extraction tool is 1.3.2, released on September 24th, 2008.

Package             Release  Date                Release notes  Download
dataExtractionTool  1.3.2    September 24, 2008  release notes  DataExtractionTool-v1.3.2.tar.gz

Older releases

Package             Release  Date             Release notes  Download
dataExtractionTool  1.3      August 27, 2008  release notes  DataExtractionTool-v1.3.tar.gz
dataExtractionTool  1.2      July 17, 2008    release notes  DataExtractionTool-v1.2.tar.gz
dataExtractionTool  1.1      May 13, 2008     release notes  DataExtractionTool-v1.1.tar.gz

Webstart

The Data Extraction tool can also be launched with Java Web Start: Data Extraction Tool webstart

Ready to use workspace

Here is a link to download a ready to use workspace: workspace.xml

Release notes and changelog

1.3.2

  • Changes:
    • Moved the application's files (user preferences and cache of available registries) to a .dataExtractionTool directory in the user's home directory.
    • Added a webstart version of the tool

1.3

  • Changes:
    • New management of the "load resources" Plastic messages:
      • Workspace resources are translated to IVOA resources before a Plastic message is sent
      • IVOA resources received from Plastic messages can be handled even if no table is specified (the first table of the catalog is used by default)

1.2

Compatibility with Registry 1.0
  • Changes:
    • UCD1+ support
    • Registry 1.0 compatibility
    • Can now interact with a VOTable containing several tables and select the correct one according to the workspace resource

1.1

First release of the Data Extraction tool
  • Changes:

-- BriceGassmann - 01 Oct 2008
