r4 - 02 May 2006 - 11:33:32 - PaulHarrison

VOSpace Architecture Evaluation

Abstract

This report is an evaluation of the overall architecture and design aims of the VOSpace effort within the IVOA, in the light of the experience of implementing a VOStore interface for the NGAS system for a demonstration at the last ADASS conference.

I believe that a reconsideration of the thinking behind the current VOSpace design is necessary. In particular, the split between VOSpace and VOStore as it currently stands is unhelpful for integrating with existing software implementations that have similar aims to VOSpace.

This report presents some general recommendations rather than detailed design.

Introduction

There is a strong desire within the Virtual Observatory to have a distributed storage mechanism that will allow users to easily refer to a piece of data without necessarily needing to concern themselves with its physical location. The task of defining this system has been given to the IVOA Grid and Web Services WG. This conceptual storage space is called VOSpace. It is envisaged that the VOSpace will manage references to the physical location of data. A primary design goal of VOSpace should be to allow easy integration of existing systems such as SRB or NGAS, which have similar goals but differing levels of abstraction and implementation.

Use Cases

There are two primary use cases:

  1. A certain dataset is needed for analysis - for efficiency reasons it would be best if the data could reside "close" in network terms to the compute resource on which the analysis will be performed. In the ideal case this will involve the data being located on the analysis computer's hard disks. With potentially many analyses being performed on the same data at many locations, it is more efficient if the data are gradually replicated onto these locations.
  2. A user would like to publish and share their own data easily.

Logical Architecture

The system has three logical domains.

  1. The user view of the data storage. This generally will manifest itself in some form of "file browser" GUI, but the important point is that each user may want to apply their own name and categorization to a particular data item, so different users could potentially have different names for the same logical file identifier.
  2. Logical Namespace. This is the domain in which a particular data item has its global identity. It is the location where any metadata about the data item should be stored.
  3. Physical Storage. This is where the copies of the physical data are actually stored. A particular logical file may be stored in many physical locations in order to provide back-up of the data or to speed up access.

[Figure: VOSpace conceptual]

The diagram above shows the relationships between these domains in the logical architecture.

Functionalities

A VOSpace system needs to provide (at least) the following functionalities

  1. Store and retrieve a file.
  2. Store descriptive metadata about a file to allow easy access.
  3. Provide a logical namespace so that there can be a long-lived "identifier" for a file.
  4. Support some form of access control based on owner identifiers.

Physical Architecture

Under the influence of the topology of the logical architecture, the early VOSpace standardization effort was split into two parts

  1. VOStore - this was intended to address the functionality of the "physical storage" domain of the logical architecture.
  2. VOSpace - the metadata storage and user interface component.

It was decided that the definition of VOStore should be attempted first, leaving the more complex aspects of the VOSpace for later. The latest draft definition document is v0.18.

This document presents a physical architecture similar to that illustrated below.

[Figure: VOSpace client-server]

In this case the VOStores where the physical files are stored are controlled by a VOSpace server in a client-server model. The VOSpace server stores the metadata about the files, and when retrieving a file first the VOSpace server needs to be contacted to determine where the file actually resides, and then the appropriate VOStore server needs to be contacted to retrieve the file.
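The two-step retrieval described above can be sketched as follows. This is only a minimal in-memory illustration under stated assumptions, not the interface defined in the VOStore draft: the class names `VOSpaceServer` and `VOStoreServer`, their methods, and the identifiers used are all hypothetical.

```python
class VOStoreServer:
    """Hypothetical physical store: holds file bytes keyed by identifier."""

    def __init__(self, name):
        self.name = name
        self._files = {}

    def put(self, ident, data):
        self._files[ident] = data

    def get(self, ident):
        return self._files[ident]


class VOSpaceServer:
    """Hypothetical metadata server: records which store holds each file."""

    def __init__(self):
        self._stores = {}    # store name -> VOStoreServer
        self._location = {}  # file identifier -> store name

    def register_store(self, store):
        self._stores[store.name] = store

    def record(self, ident, store_name):
        self._location[ident] = store_name

    def locate(self, ident):
        # The client asks the VOSpace server where the file resides.
        return self._stores[self._location[ident]]


def retrieve(space, ident):
    store = space.locate(ident)  # first contact: metadata lookup
    return store.get(ident)      # second contact: fetch from the VOStore


space = VOSpaceServer()
store_a = VOStoreServer("store-a")
space.register_store(store_a)
store_a.put("image-001", b"FITS data")
space.record("image-001", "store-a")
data = retrieve(space, "image-001")
```

In this sketch the client always needs two contacts per retrieval, and has to understand both interfaces.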

Peer Network

An alternative topology for the architecture is presented below.

[Figure: VOSpace in a peer network]

In this case each server node in the VOSpace where files are stored is a peer, i.e. a file can be requested from any node (irrespective of where it is actually stored) because all nodes have access to a common metadata repository and thus can either redirect the request to an appropriate node, or manage the remote retrieval themselves. Note that this diagram represents two VOSpace server nodes in a single VOSpace namespace that share a metadata repository. It does not imply that all VOSpace server nodes will share a global metadata repository - see Federation Support below.

I believe that this topology more closely matches that of SRB and NGAS, and thus would make it easier to implement a VOSpace based on either of these systems.

Discussion of VOSpace facilities

This section discusses desirable VOSpace functionality. Some of these functions are not necessarily easy to implement, and some of the problems are discussed.

Collections support

Users need to be able to subdivide their data holdings into categories or collections so that they can be more easily managed. Support for such features need not exclusively lie within the VOSpace system.

Some examples of collections are:

  1. users can create personal favourites - effectively "bookmarks" that can point to particular data items in various VOSpaces - the implementation of this would best lie outside the VOSpace
  2. VOSpace can create static collections - i.e. put arbitrary file identifiers into a collection, much like placing files in a folder in a conventional file system.
  3. VOSpace can create virtual collections - i.e. a dynamic collection based on a file metadata selection criterion. It is certainly true that some of the file types (e.g. FITS) that would be stored within a VOSpace are rich in metadata that could potentially be mined.

In addition a user should be able to "search" the VOSpace based on file metadata.
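The difference between static and virtual collections can be illustrated with a small sketch. The classes below are hypothetical and purely in-memory; `TELESCOP` is a standard FITS header keyword, but the values and identifiers are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    ident: str
    metadata: dict = field(default_factory=dict)


class StaticCollection:
    """Explicit list of identifiers, like files placed in a folder."""

    def __init__(self):
        self.idents = []

    def add(self, ident):
        self.idents.append(ident)

    def members(self, catalogue):
        return [e.ident for e in catalogue if e.ident in self.idents]


class VirtualCollection:
    """Membership is recomputed from a metadata predicate on each access."""

    def __init__(self, predicate):
        self.predicate = predicate

    def members(self, catalogue):
        return [e.ident for e in catalogue if self.predicate(e.metadata)]


catalogue = [
    Entry("img-1", {"TELESCOP": "VLT"}),
    Entry("img-2", {"TELESCOP": "HST"}),
]
static = StaticCollection()
static.add("img-2")
virtual = VirtualCollection(lambda m: m.get("TELESCOP") == "VLT")

before = virtual.members(catalogue)
catalogue.append(Entry("img-3", {"TELESCOP": "VLT"}))
after = virtual.members(catalogue)  # picks up img-3 without any explicit add
```

A metadata "search" is then the same operation as evaluating a virtual collection once, without storing the predicate.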

User "home" directories should not be a collection in VOStore, but should be implemented at a higher level.

Authentication and Authorization Support

User authentication will be handled by the single sign-on mechanisms and will ultimately result in the VOStore being presented with an X.509 certificate that identifies a particular user. Authorization support is more complex in that VOSpace will need to manage metadata that identifies the levels of access:

  • owner privileges
  • group access control lists (ACL)
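As a rough illustration of the access metadata a VOSpace would have to hold per file, the following sketch checks a request against an owner and a group ACL. The field names, permission strings and distinguished names are assumptions made for the example, not part of any VOSpace draft.

```python
from dataclasses import dataclass, field


@dataclass
class AccessMetadata:
    owner: str                                     # e.g. subject DN from the user's certificate
    group_acl: dict = field(default_factory=dict)  # group name -> set of permissions


def is_authorized(meta, user, user_groups, permission):
    # The owner holds every permission on the file.
    if user == meta.owner:
        return True
    # Otherwise any one of the user's groups must grant the permission.
    return any(permission in meta.group_acl.get(g, set()) for g in user_groups)


meta = AccessMetadata(
    owner="CN=Alice",
    group_acl={"survey-team": {"read"}},
)
owner_can_write = is_authorized(meta, "CN=Alice", [], "write")
member_can_read = is_authorized(meta, "CN=Bob", ["survey-team"], "read")
member_can_write = is_authorized(meta, "CN=Bob", ["survey-team"], "write")
```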

An alternative is to have the authorization stored in a separate server and exchange SAML messages, but this might have unacceptable performance problems.

Federation Support

There will inevitably be more than one VOSpace. There are three approaches that can be taken:

  1. Attempt to federate between all of the VOSpaces to try to build a global namespace.
  2. Accept that there will be multiple VOSpaces and build user-level tools that make it easier for users to build personal collections (with references in multiple namespaces).
  3. Mandate that each VOSpace be able to store links, similar to Unix file system soft links, that can point to holdings in other VOSpaces.
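The third approach can be sketched as a link-following lookup. This is an assumption-laden illustration: each VOSpace is modelled as a plain dictionary, `Link` is a hypothetical type, and the hop limit mirrors the way Unix guards against circular soft links.

```python
class Link:
    """Hypothetical soft link: points at an identifier in another VOSpace."""

    def __init__(self, space, ident):
        self.space = space
        self.ident = ident


def resolve(spaces, space_name, ident, max_hops=8):
    """Follow links until real data is found; give up on long or circular chains."""
    for _ in range(max_hops):
        item = spaces[space_name][ident]
        if not isinstance(item, Link):
            return space_name, ident, item
        space_name, ident = item.space, item.ident
    raise RuntimeError("too many levels of links")


spaces = {
    "vospace-a": {"my-copy": Link("vospace-b", "survey/field-7")},
    "vospace-b": {"survey/field-7": b"catalogue data"},
}
resolved = resolve(spaces, "vospace-a", "my-copy")
```

Note that the link holder needs no knowledge of the other VOSpace's metadata, which is what makes this much lighter than full federation.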

Local file reference

A key use case for VOSpace is to provide distributed storage for files that act as data input and output within a workflow or processing pipeline (see for example MySpaceNotes). Often the data sets can be large and there is a danger that the processing will ultimately be I/O bound. To minimise this risk it would be highly advantageous if the VOSpace node could return a filesystem reference to a local file where possible. This would allow optimization techniques such as random access and memory mapping to be used, which are simply not possible when the access protocol for a file is stream based.
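To make the benefit concrete, the sketch below reads a slice from the middle of a local file through a memory map, using Python's standard `mmap` module. The temporary file is only a stand-in for a file that a VOSpace node has exposed on the local file system; how such a path would actually be returned to the client is not specified here.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Return `length` bytes starting at `offset`, without reading the file sequentially."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]


# Stand-in for a locally available VOSpace file: 8-byte header, padding, payload.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"HEADER--" + b"x" * 1000 + b"PAYLOAD")

chunk = read_slice(path, 1008, 7)
os.remove(path)
```

With a stream-based protocol the client would have to consume the first 1008 bytes before reaching the payload.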

Key Recommendations

This report can be summarized by the following recommendations, which I believe will speed up future development of a VOSpace standard.

Drop VOStore

There are already enough protocols for physical file storage and retrieval (ftp, gridftp etc.); we do not need another.

I note that very similar recommendations have been made by Regan Moore in "VOStore and VOSpace Design", though he called for a simplification of VOStore rather than its complete removal.

This should make it easier to implement the interface as a thin layer on top of existing implementations such as SRB and NGAS, which functionally and architecturally are much closer to the original VOSpace concept. So a decision to drop VOStore could be viewed as a purely pragmatic one.

The existing implementation of a VOStore interface on NGAS actually hides core NGAS abilities (e.g. file replication in different physical locations). This means that NGAS cannot be used effectively - the VOStore interface implementation should thus be deprecated.

Saying "drop VOStore" is perhaps not an entirely accurate representation of what should happen next in the development process. What I believe is that the current VOStore interface should be transformed to become a VOSpace standard interface (with the associated change of semantics). A VOSpace that consisted of a single node would then effectively be equivalent to what is currently called a VOStore, with the advantage that clients would only have to deal with a single interface.

Adopt VOSpace in a "peer network"

Any node that is part of the same VOSpace should be able to respond to a file request in that space. It should be possible to set up each node in an "auto-replication mode": if the physical file connected with a particular request is not actually local to the VOSpace node when requested, the node should make a local physical replica of the file. The next time the file is requested at that node, no physical transfer would be necessary, and in a processing workflow the VOSpace node would be acting as a cache.
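A minimal sketch of this behaviour, assuming a shared metadata repository that maps each identifier to the nodes holding a replica, might look like the following. `MetadataRepository`, `PeerNode` and the `remote_fetches` counter are hypothetical names used only for illustration; a real node would transfer the file over the network rather than read a peer's dictionary.

```python
class MetadataRepository:
    """Hypothetical shared catalogue: identifier -> names of nodes holding a replica."""

    def __init__(self):
        self.replicas = {}

    def add_replica(self, ident, node_name):
        self.replicas.setdefault(ident, set()).add(node_name)


class PeerNode:
    def __init__(self, name, repo, peers, auto_replicate=True):
        self.name = name
        self.repo = repo
        self.peers = peers            # shared dict: node name -> PeerNode
        self.auto_replicate = auto_replicate
        self.local = {}               # identifier -> file bytes held at this node
        self.remote_fetches = 0
        peers[name] = self

    def store(self, ident, data):
        self.local[ident] = data
        self.repo.add_replica(ident, self.name)

    def get(self, ident):
        if ident in self.local:
            return self.local[ident]
        # Not local: look up a peer holding a replica and fetch from it.
        holder = next(n for n in sorted(self.repo.replicas[ident]) if n != self.name)
        data = self.peers[holder].local[ident]
        self.remote_fetches += 1
        if self.auto_replicate:
            self.store(ident, data)   # keep a local replica and register it
        return data


repo, peers = MetadataRepository(), {}
node_a = PeerNode("node-a", repo, peers)
node_b = PeerNode("node-b", repo, peers)
node_a.store("cube-42", b"spectral cube")

first = node_b.get("cube-42")   # fetched from node-a and replicated locally
second = node_b.get("cube-42")  # served from the local replica
```

Only the first request at `node_b` involves a transfer; afterwards the repository lists both nodes as holders.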

Do not attempt federation

It is simply too difficult to standardize between heterogeneous implementations. Each particular implementation of a VOSpace might have methods of federating between instances of itself, and thus each instance could possibly be registered as belonging to the same VOSpace. However, federation between different implementations would involve detailed specification of protocols for metadata synchronization and file locking. It might be possible in future to have a limited form of cross-linking between VOSpaces by stipulating that a concept similar to a Unix-style soft link should be supported by a compliant VOStore.

-- PaulHarrison - 23 Feb 2006
