CORDRA: Content Object Repository Discovery and Registration/Resolution Architecture

Search Interface Protocols and Specifications

Document Information

Abstract

Several interface specifications, protocols, and APIs are available for repository search and information retrieval. This document compares their key characteristics to inform the selection or profiling of one or more of them for use within a CORDRA™ implementation or within Federated CORDRA.

Contents

Introduction

Different search interface and protocol specifications are based on different models and have different capabilities. Selected major alternatives are compared to identify common features and variations between approaches. They are also compared against core requirements for CORDRA. As applicable, one or more of these specifications might be selected and profiled to provide a search interface to the registry of a CORDRA implementation or to provide a search interface for Federated CORDRA.

This document compares five different specifications used in digital libraries and web search. The comparison is based on public documents included as part of the official document collections for each specification, as listed below. Additional application profiles or third-party descriptions outside of the formal document set are not included in the comparison.

Search Specifications

SQI

Simple Query Interface (SQI) is an API-oriented interface for query of (learning) content repositories [SQI]. It is a session-based protocol for passing queries and results between a requesting client and a data provider. It is designed to be independent of query language and results format, and can support both synchronous and asynchronous return of results. It includes an optional simple authentication specification [SQI Sessions]. The overall model is separated from both (1) the bindings to data representations for query and results (e.g., XML, plain text) and (2) the messaging protocols (e.g., SOAP, RPC, RMI). A profile of SQI with associations for data representations and messaging is required to implement an SQI interface between a client and data provider.

SRW/SRU

Search/Retrieval Webservice (SRW) / Search/Retrieval by URL (SRU) is an XML-based protocol for information retrieval [SRW]. Its development was motivated, in part, to provide a web-oriented protocol similar to Z39.50 [Zing, Z39.50]. It is a message-oriented protocol for passing queries and results between a requesting client and a data provider. It is designed to be used with a specific query language ([CQL]) but can support any results format. It defines synchronous messaging without authentication or session controls. The model is explicitly bound to XML data representations and web services or REST messaging models.
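As a rough illustration of the REST-style binding (SRU), a search request can be carried entirely in a URL. The sketch below assumes a hypothetical endpoint, `http://registry.example.org/sru`, and an example CQL query; the parameter names are those of SRU 1.1 as we understand it, and the helper `build_sru_url` is not part of any specification.

```python
from urllib.parse import urlencode

def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10, version="1.1"):
    """Build an SRU searchRetrieve URL that carries a CQL query."""
    params = {
        "operation": "searchRetrieve",   # the SRU search operation
        "version": version,              # protocol version carried in every request
        "query": cql_query,              # CQL query string
        "startRecord": start_record,     # 1-based position of the first record wanted
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and example query, for illustration only.
url = build_sru_url("http://registry.example.org/sru", 'dc.title = "learning object"')
print(url)
```

Because the whole request is a URL, it is stateless and can be cached by ordinary web infrastructure.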

OpenSearch

OpenSearch is a simple, web-based search mechanism that returns results in RSS format [OpenSearch]. The focus is on using RSS (with extensions) and other existing specifications as a way to "publish" search results such that they can be further syndicated and accessed by commonly available tools. OpenSearch uses its own query format transferred via HTTP. Returned results are rendered in XML in extended RSS 1.0.

Google Web Service API

The Google Web Service API provides a SOAP interface to Google [Google]. It provides an XML format for queries, using Google's query language and a Google-specific XML format for results. It is a synchronous message-oriented protocol, with messages defined in WSDL and communicated with SOAP. While specific to Google, the same API model and service definitions could be used with any data provider.

OAI-PMH

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an XML-based protocol for metadata information retrieval (harvesting) [OAI-PMH]. It is a synchronous messaging protocol, transmitted over HTTP (a REST style interface). It uses a fixed set of protocol requests. Results are returned in XML and conform to OAI-PMH XML Schema XSDs. Individual record results may be in any format. Certain aspects of the protocol are left to application profiles or agreements between clients and data providers established within a community of practice.
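The fixed set of protocol requests can be illustrated with a small request builder. This is a minimal sketch, assuming a hypothetical base URL (`http://repository.example.org/oai`); the verb names and the `metadataPrefix` argument are those of OAI-PMH 2.0, while `build_oai_request` is an illustrative helper only.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def build_oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL, rejecting verbs outside the fixed set."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unsupported OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)

# Hypothetical repository address, for illustration only.
harvest_url = build_oai_request(
    "http://repository.example.org/oai", "ListRecords", metadataPrefix="oai_dc"
)
print(harvest_url)
```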

Search Interface Specification Comparison

The five alternatives are compared on several key characteristics important in the development of the CORDRA model (the order of presentation is not significant).

Throughout most of the remainder of the feature comparison, the analysis will focus on features within individual layers in the abstract model described below. While the features are described individually, there are natural overlaps and connections among the features such that choices for one may influence the design and implementation of others.

Some aspects of the specifications have been simplified for comparison purposes. As noted, the comparisons are based only on public documents describing the specifications. Additional characteristics from actual implementations, practice, application profiles, etc., were not considered in the comparisons.

Abstract Model

An overall abstract model for search and retrieval services can be represented as a set of abstraction layers (patterned after the SQI Learning Object Repository Interoperability Stack [SQI LORI]). Blank entries are not defined within the specification. Items marked "application profile" require additional specifications or a profile of the specification to define a complete interoperable system between the requesting client and the data provider. Such application profiles are not defined as parts of the specifications themselves.

Abstraction Layer | SQI | SRW | OpenSearch | Google API | OAI-PMH
Semantic Model (Data Model, Representation) | Application Profile | XML, CQL | RSS 1.0 / Data provider specific | Google Query, XML | XML, OAI
Services and Protocols | Sync/Async Search & Retrieve | Sync Search & Retrieve, Browse, Discovery | Sync Search & Retrieve, Discovery | Sync Search & Retrieve | Sync Harvest Request & Retrieve, Discovery
Common Services (Sessions Management, Authentication, Authorization) | Sessions, Authentication | | | Google ID |
Messaging (REST, SOAP, RPC, RMI) | Application Profile | REST, SOAP | REST | SOAP | REST
Network Transport | HTTP

The complete CORDRA model for a community of practice or CORDRA implementation requires specifications for all parts of the abstract model. Having separable models and composable layers of specifications is desirable.

Web Orientation and Web Architecture

Core web architecture and successful scalable implementations are based on several characteristics, including an orientation towards large-grained, document-focused transactions implementable over stateless, cacheable messaging, and URI-based identification.

While CORDRA does not have a requirement for explicit web orientation in the specifications selected, a web-oriented approach is preferred for scalability and performance. Services should be focused on document exchange. Messages should be cacheable. URI-based identification throughout is required.

Versioning

We expect that the specifications and repository interfaces will evolve, with new versions incorporating different features and behaviors. Requesting clients and data providers may implement different versions of a protocol.

CORDRA will require versioning of all services at the services or protocol level. Version information will need to be carried forth into the messaging layer.
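One possible way to satisfy this is for each message to state the protocol version in use, with the client choosing the highest version that both parties support. The sketch below is an assumption for illustration, not a mechanism defined by CORDRA or by any of the compared specifications; the function names and the message layout are hypothetical.

```python
def parse_version(text):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))

def negotiate_version(client_versions, provider_versions):
    """Return the highest version supported by both parties, or None if none is shared."""
    common = set(client_versions) & set(provider_versions)
    if not common:
        return None
    return max(common, key=parse_version)

def make_request(query, version):
    """Carry the service-level version into the message itself."""
    return {"version": version, "query": query}

chosen = negotiate_version(["1.0", "1.1", "1.10"], ["1.1", "1.10", "2.0"])
request = make_request("learning object", chosen)
print(request)
```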

Session Management

Session management provides for stateful services where all elements of a request or response do not need to be included within a single message.

There are no current CORDRA requirements for session-oriented search and retrieval services. While a community or implementation may choose to implement session-oriented models within their federation, this is not a requirement. Providing QoS information would be useful in operations.

Authentication and Authorization

Authentication and authorization (combined with separate identity management) may be important for some registries and federations to control access to query services and results.

Individual CORDRA implementations may require authenticated and authorized query services. Models for authentication and authorization must remain uncoupled from core definitions of query services and protocols. It also must be possible to combine authentication and authorization data within the messaging layer independent of the service definitions. CORDRA requires additional specifications and specification profiles for authentication and authorization.
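The separation described above can be sketched by placing credentials in a message header that is checked before the query service is invoked, so that the service definition itself never refers to authentication data. The envelope layout, the token values, and all names below are hypothetical and serve only to illustrate the layering.

```python
RECORDS = ["cordra overview", "sqi binding notes", "cordra federation"]

def search_service(query):
    """Core query service: it knows nothing about credentials."""
    return [record for record in RECORDS if query in record]

def with_authorization(service, authorized_tokens):
    """Wrap a service so that access control lives in the messaging layer."""
    def handler(envelope):
        token = envelope.get("header", {}).get("token")
        if token not in authorized_tokens:
            return {"status": "denied", "body": None}
        return {"status": "ok", "body": service(envelope["body"])}
    return handler

handler = with_authorization(search_service, authorized_tokens={"t1"})
allowed = handler({"header": {"token": "t1"}, "body": "cordra"})
refused = handler({"header": {}, "body": "cordra"})
print(allowed, refused)
```

The same `search_service` could be exposed without the wrapper by an implementation that does not require authenticated queries.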

Stateless Query Transactions

Independent of whether session state is maintained overall, individual query transactions may be stateful or stateless. In a stateful transaction model, when only part of a results set has been returned, a request for the next part draws on the same results set. A stateless query service must instead regenerate the entire results set for each such request.

There are no CORDRA requirements for stateful query transactions. Given the dynamic nature of the registry, an expectation of consistency between subsequent queries is unreasonable. However, implementing and caching results sets may improve performance.

If results sets are implemented, their behavior needs to be fully specified (e.g., what is the TTL, what happens when an element from the results set is deleted).
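The following minimal sketch shows one way a data provider might hold results sets with an explicit time to live. The class, its method names, and the choice to report an expired set by returning `None` are assumptions made for illustration; an actual profile would need to specify this behavior.

```python
import itertools

class ResultSetCache:
    """Keeps results sets for a fixed time to live (TTL), keyed by a results set ID."""

    def __init__(self, ttl_seconds, clock):
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # callable returning the current time in seconds
        self._sets = {}
        self._ids = itertools.count(1)

    def store(self, records):
        """Save a results set and return its ID for use in later paging requests."""
        set_id = f"rs{next(self._ids)}"
        self._sets[set_id] = (self.clock() + self.ttl_seconds, list(records))
        return set_id

    def page(self, set_id, start, count):
        """Return one page, or None if the set is unknown or expired."""
        entry = self._sets.get(set_id)
        if entry is None or self.clock() >= entry[0]:
            self._sets.pop(set_id, None)
            return None                 # caller must re-run the query
        return entry[1][start:start + count]

# A fake clock makes the expiry behavior easy to follow.
now = [0]
cache = ResultSetCache(ttl_seconds=60, clock=lambda: now[0])
set_id = cache.store(["a", "b", "c"])
first = cache.page(set_id, 1, 2)    # served from the stored results set
now[0] = 61
second = cache.page(set_id, 0, 2)   # TTL has passed
print(first, second)
```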

Synchronous and Asynchronous Queries

Synchronous queries provide response messages directly to query requests (one-to-one). Asynchronous queries let the data provider return multiple asynchronous responses that are merged by the requesting client.

There are no CORDRA requirements for asynchronous queries. While a community or implementation may choose to implement asynchronous queries within their federation, this is not a requirement.

Compared with synchronous queries, supporting asynchronous queries appears to place an additional implementation burden on both requesting clients and data providers (and force an implementation to support sessions). A separate specification or application profile to support asynchronous queries should be investigated.
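The difference in client burden can be seen in a small sketch. Here the synchronous call returns its results directly, while the asynchronous call delivers several partial responses that the client must merge by query ID. All names are hypothetical, and for simplicity the partial responses are delivered immediately rather than over a network at a later time.

```python
class ResultsListener:
    """Client-side collector that merges partial responses per query ID."""

    def __init__(self):
        self.merged = {}

    def on_results(self, query_id, records):
        self.merged.setdefault(query_id, []).extend(records)

class DataProvider:
    def __init__(self, records, batch_size=2):
        self.records = records
        self.batch_size = batch_size

    def _match(self, term):
        return [record for record in self.records if term in record]

    def synchronous_query(self, term):
        """One request, one response containing all results."""
        return self._match(term)

    def asynchronous_query(self, query_id, term, listener):
        """One request, several partial responses sent to the client's listener."""
        hits = self._match(term)
        for i in range(0, len(hits), self.batch_size):
            listener.on_results(query_id, hits[i:i + self.batch_size])

provider = DataProvider(["lo-1", "lo-2", "lo-3", "doc-1"])
sync_results = provider.synchronous_query("lo-")

listener = ResultsListener()
provider.asynchronous_query("q1", "lo-", listener)
print(sync_results, listener.merged)
```

The listener and the query ID are the extra state that the asynchronous model requires of the requesting client.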

Query Languages

Query languages are combined with a query request protocol to define a search interface.

A CORDRA implementation must select one or more query language specifications for use within a federation. Defining the query language separately from the service protocol provides flexibility. However, supporting multiple query languages may result in more complex specifications and will likely result in more complex implementations.

The requirements for a CORDRA query language are not yet specified. General query characteristics such as keyword and index matching, Boolean operators, etc., are most likely important, as is ordering of results. Results filtering might be useful, but like ordering, having the data provider perform the operation instead of deferring it to post-query processing by the requesting client is primarily an optimization in search and data transfer. Such optimizations minimize the work and complexity of the requesting client and may improve performance of the search process.

There is no common model for representing all of the aspects of a query (string, operators, filtering, ordering). Having a common model would aid in interoperability across CORDRA implementations that use different query languages and query service protocols.

Query Results

Query results present the set of information returned to the requesting client in response to query processing.

A CORDRA implementation may need to support multiple query result formats (both semantically and structurally different forms). The ability to define the query result format independent of the service protocol provides flexibility at the cost of a more complex set of specifications and a more complex implementation if multiple result formats and data model translations are supported. Defining a results set as both a known set of control information (e.g., results set size) and the collection of results may simplify processing by the requesting client.
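A results set that pairs control information with the collection of results might be modeled as follows. This is a sketch under assumed field names (`total_count`, `start_index`, `records`); none of the compared specifications defines this exact structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ResultsSet:
    total_count: int                    # control information: size of the full results set
    start_index: int                    # control information: 1-based position of the first record
    records: List[str] = field(default_factory=list)   # the results themselves

    def next_start_index(self) -> Optional[int]:
        """Position to request next, or None when nothing further can be fetched."""
        if not self.records:
            return None
        following = self.start_index + len(self.records)
        return following if following <= self.total_count else None

first_page = ResultsSet(total_count=5, start_index=1, records=["a", "b"])
last_page = ResultsSet(total_count=5, start_index=4, records=["d", "e"])
print(first_page.next_start_index(), last_page.next_start_index())
```

With the control information present, the requesting client can decide whether to request more results without inspecting the individual records.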

The requirements for the results set formats supported by an implementation are unique to that implementation and are not yet specified.

There is no common model to represent the different characteristics of a results set (overall format, individual result format, formatting, size and control parameters). Having a common model would aid in interoperability across CORDRA implementations that use different query results formats.

Protocol and Repository Information Discovery

Various defaults and protocol control parameters are defined for both the protocol and specifications overall and as implementation characteristics of individual data providers.

The design of CORDRA favors self-descriptive systems and the ability to automatically discover and configure clients and servers. While this imposes additional implementation overhead, it improves interoperability: these parameters and the processing systems can be adjusted on demand, without changing the underlying specifications to set new values and without negotiating the parameters via separate specifications or out-of-band methods.
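As an illustration of automatic configuration, a client might read a description published by the data provider and derive its own settings from it. The description format and every field name below are hypothetical; the compared specifications each define their own discovery documents, which are not reproduced here.

```python
import json

# Hypothetical self-description published by a data provider.
PROVIDER_DESCRIPTION = json.dumps({
    "protocolVersions": ["1.0", "1.1"],
    "queryLanguages": ["CQL", "keyword"],
    "defaultPageSize": 10,
    "maximumPageSize": 50,
})

def configure_client(description_json, preferred_languages, wanted_page_size=None):
    """Derive client settings from the provider's description."""
    description = json.loads(description_json)
    # Pick the first of the client's preferred languages that the provider supports.
    language = next(
        (lang for lang in preferred_languages if lang in description["queryLanguages"]),
        None,
    )
    if language is None:
        raise ValueError("no query language in common with the data provider")
    if wanted_page_size is None:
        page_size = description["defaultPageSize"]
    else:
        # Never ask for more than the provider says it will return.
        page_size = min(wanted_page_size, description["maximumPageSize"])
    return {"queryLanguage": language, "pageSize": page_size}

settings = configure_client(PROVIDER_DESCRIPTION, ["XQuery", "CQL"], wanted_page_size=200)
print(settings)
```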

Extension Mechanisms

A query specification cannot be expected to be complete and meet the needs of all communities. Extension mechanisms provide a way to adapt and extend the specification to meet additional or unforeseen requirements without developing a new version of the specification.

CORDRA has no explicit requirements for an extension mechanism, but given the open nature of the problem and that the available specifications are initial versions, extensions and adaptations should be anticipated.

CORDRA Requirements Summary

The following is an overview of the CORDRA requirements and features for a query protocol and specification. It includes general features plus a summary of critical items outlined above. Since no candidate is complete, selection will balance which features are available via the core specification and which have to be added through CORDRA profiles.

References

Version | ID | Date | Change Summary
1.00 | H | 20050530 | Initial release
 | | 20050604 | Editorial changes
 | | 20050622 | Editorial changes