- From: Jim Davis <jdavis@parc.xerox.com>
- Date: Thu, 17 Sep 1998 18:03:15 PDT
- To: internet-drafts@ietf.org
- Cc: www-webdav-dasl@w3.org
INTERNET-DRAFT Jim Davis
draft-davis-dasl-requirements-00.html Xerox Corporation
Sept 17, 1998 Saveen Reddy
Expires Mar 17, 1999 Microsoft Corporation
Judith Slein
Xerox Corporation
Requirements for DAV Searching and Locating
Status of this Memo
This document is an Internet draft. Internet drafts are working
documents of the Internet Engineering Task Force (IETF), its areas and
its working groups. Note that other groups may also distribute working
information as Internet drafts.
Internet Drafts are draft documents valid for a maximum of six months
and can be updated, replaced or obsoleted by other documents at any
time. It is inappropriate to use Internet drafts as reference material
or to cite them as other than as "work in progress".
To learn the current status of any Internet draft please check the
"lid-abstracts.txt" listing contained in the Internet drafts shadow
directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East coast) or
ftp.isi.edu (US West coast). Further information about the IETF can be
found at URL: http://www.ietf.org/ .
Distribution of this document is unlimited. Please send comments to
the mailing list at www-webdav-dasl@w3.org, which may be joined by
sending a message with subject "subscribe" to
www-webdav-dasl-request@w3.org.
Discussions of the list are archived at
http://www.w3.org/pub/WWW/Archives/Public/www-webdav-dasl .
Abstract
The Distributed Authoring and Versioning protocol [WEBDAV] defines
simple mechanisms to assign and retrieve values for properties. This
document presents requirements for a WebDAV extension to support
efficient searching for resources based on WEBDAV properties and
content. These requirements are intended to be the basis for the DAV
Searching and Location (DASL) protocol.
Davis et al. [Page 1]
Internet Draft DASL Req
1. Introduction
Motivation for DASL
WEBDAV and HTTP provide support for client-side search, but not
server-side search. The GET method defined in [HTTP] allows clients to
retrieve a resource's content; the PROPFIND method defined in [WEBDAV]
allows clients to retrieve a resource's properties. Having retrieved a
resource's properties and / or content, the client can compare them to
its search criteria to determine whether the resource is of interest.
Although this client-side searching is logically sufficient, and
requires no modifications to the server, it comes at a significant
cost, because it makes inefficient use of network resources. A client
must retrieve properties and content for each resource under
consideration. Furthermore, it does not take advantage of server
intelligence. Servers capable of searching can use sophisticated
mechanisms to generate results: internal caching of intermediate
search results, content-indexing, etc.
Even simple, common queries may expose these limitations. Consider the
query "find all text files modified during the last week." When such a
query is extended to a large number of clients searching against a
single server, the limitations become more apparent. Client-side
searching has difficulties scaling in these cases.
DASL allows for server-side searching. Server-side searching allows
the client to formulate a query and have the server perform task of
selecting the resources that fit the criteria. This overcomes both of
the limitations of client-side searching described above. The benefit
is a searching solution that scales; the cost is that the server
software becomes more complex.
This document presents requirements for any possible protocol that
might be proposed for DASL. In this document, the word "MUST" refers
to features or behavior which, if absent from a proposed protocol,
would make it unusable for DASL. The word "SHOULD" refers to features
that are desirable, but not necessary. These requirements come from
considerations of the scenarios presented in [SCENARIOS], from the
need to support the WebDAV object model, the use of HTTP, and general
IETF rules. We have provided rationale for those requirements whose
justification is not obvious.
2. Terminology
scope
a set of resources to be searched.
criteria
an expression against which each resource in the search scope is
evaluated.
result set
a set of records, one for each resource for which the search
criteria evaluated to True.
Davis et al. [Page 2]
Internet Draft DASL Req
record
a description of a resource. A result record is a set of
properties, and possibly other descriptive information
result
A result is a result set, optionally augmented with other
information describing the search as a whole.
result record definition
a specification of the set of properties to be returned in the
result record
sort specification
a specification of an ordering on the result records in the
result set.
search modifier
an instruction that governs the execution of the query but is not
part of the search scope, result record definition, the search
criteria, or the sort specification. An example of a search
modifier is one that controls how much time the server can spend
on the query before giving a response.
query
A query is a combination of a search scope, search criteria,
result record definition, sort specification, and a search
modifier.
query grammar
a set of definitions of XML elements, attributes, and constraints
on their relations and values that defines a set of queries and
the intended semantics.
schema
a listing, for any given grammar and scope, of the properties and
operators that may be used in a query with that grammar and
scope.
Hit highlighting
is a specification of the location(s) within a resource
containing text that matched a content-query. It allows clients
to provide visual cues to a user to identify segments in a text
resource that cause them to match content-based queries.
paged results
allows a client to request that the server return a subset of the
result set rather than the entire set. In subsequent calls to the
server, additional results from the same query can be requested.
Paged results are intended to improve the performance and
manageability of search results.
In addition to the terms defined above, this document uses terminology
consistent with [HTTP] and [WEBDAV].
Requirements are divided into five categories, and numbered within
each category. The categories are Scope, Criteria, Record Definition,
Other and Discovery.
Davis et al. [Page 3]
Internet Draft DASL Req
3. Requirements: Scope
S1: It MUST be possible to specify at least one resource in the
scope. It SHOULD be possible to specify a set of distinct, unrelated
resources in the scope.
S2 It MUST be possible to specify a WebDAV collection as a scope.
S3: It SHOULD be possible to specify other types of resources in a
scope.
Rationale: A client might wish to determine whether a given
resource was of interest without transfering it.
S4: When the scope is a collection, it MUST be possible to specify
the depth.
Users often intend to scope their searches either to the
immediate children of a container or to extend the search
recursively to the container's children. Furthermore, depth
control is needed to prevent servers from performing unnecessary
work.
4. Requirements: Criteria
Criteria generalities
C1: It MUST be possible to search properties in a query. It MUST be
possible to search both DAV-defined and application-defined properties
in a query.
Further requirements for properties are below.
C2: It MUST be possible to search content in a query.
Note that at this writing, unlike property searches, there is no
single widely accepted semantics for content-based queries.
Further requirements for content criteria are below.
C3: It MUST be possible to search both properties and content in a
single query
C4: It MUST be possible to combine criteria with Boolean operators
Davis et al. [Page 4]
Internet Draft DASL Req
(i.e. and, or, not)
Criteria for properties
C5: It MUST be possible to include undefined properties in a query
without error.
Rationale: . This arises from the property model of DAV. Unlike
the more familiar relational model, DAV does not define tables or
schema for resources, hence there is no guarentee that all
properties will be defined for all resources. Moreover, DAV
allows an client to store arbitrary properties on arbitrary
resources. Therefore DASL must support queries that use
properties that are not defined on all resources in the scope. If
such a query failed, there would be no way to locate the desired
resources.
C5.1: It MUST be possible to test whether a property is defined.
C6.1: It MUST be possible to compare property values to constant
values.
C6.2.1: It SHOULD be possible to compare property values to other
properties of the same resource.
C6.2.2: It SHOULD be possible to compare property values to other
properties of other resources.
Note that this may involve a "join". We do not expect the first
version of the DASL protocol to meet this requirements.
C6.3: It SHOULD be possible to compare property values to results of
expressions.
C6.4: It MUST be possible to match property values with pattern
matching, at a minimum, string-ending wildcards. More powerful
patterns, such as defined by the SQL "like" operator or Unix regular
expressions, are desirable.
The minimum is necessary to enable DASL to locate resources by
content type, e.g. to locate all image files by comparison with
"image/*". More powerful comparisons are useful when strings
encode structured data such as times or lists.
C7.1: It MUST be possible to use equality operators.
Davis et al. [Page 5]
Internet Draft DASL Req
C7.2: It MUST be possible to use relative operators.
C8: It MUST be possible to specify case sensitivity
Note this does not say that all DASL servers must support both
case-sensitive and case-insensitive comparisons, but only that
the protocol must be able to express a client's preference, and
define behavior in the case where the server cannot support that
preference.
C9: It MUST be possible to specify language-specific definitions for
string comparison and sorting.
Different cultures define different rules for string comparison,
e.g. for collating sequence and for significance of diacritics.
Cross-language comparison is out of scope for DASL, but
comparisons within the same language must be done with the
appropriate semantics.
Requirements: Criteria for content searches
C10: It MUST be possible to search content of any text media type.
The definition of "searching content" for DASL means locating
sequences of character in the contents of the resource.
DASL defines no specific requirements for searching for structure
within text media types (e.g. for finding character strings
within HTML tags) DASL defines no requirements for searching
other media types that might contain text (e.g. subtypes of
application). Searching non-text media types (e.g.images, audio)
is out of scope for DASL.
C11.1: It SHOULD be possible to search for words that are within a
specified number of words of each other.
This is often called 'near' search.
C11.2: It SHOULD be possible to search for words that occur within
the same grammatical context, e.g. same phrase, sentence, or
paragraph.
This is sometimes called 'in' search.
C12.1: It SHOULD be possible for a client to control whether content
searches does or does not use a stemming comparison.
Davis et al. [Page 6]
Internet Draft DASL Req
C12.2: It SHOULD be possible for a client to request comparisons
using phonetic similarity (e.g. soundex)
C12.3: It SHOULD be possible for the client to request keyword
expansion (thesaurus expansion).
C13: It SHOULD be possible for a client to conduct a relevance
search. In such a search, the query consists of a set of words
(perhaps an entire resource), and the result is a list of resources
whose contents most closely resemble the query, sorted in decreasing
order of resemblance.
5. Requirements: Results
R1: It MUST be possible to specify a sorting for the result set
R2: It MUST be possible to specify a set of properties to be returned
in the result records, distinct from the properties in criteria
For example, a query might ask for "the authors of those
documents under 10K in size". In this case, the criterion relates
only to the size, but the desired result record contains only the
author.
R3: It MUST be possible for a client to request limits on the
resources consumed in creating of transmitting in the result set.
Some queries can potentially return very large result sets.
Clients that are good citizens will voluntarily limit the size of
such results. In addition, some servers may charge money for
queries.
R3.1: It MUST be possible for a client to limit the number of records
in the result set.
This is the most meaningful unit of resource consumption to the
client.
R4: It MUST be possible for the server to return fewer result records
than match the criteria.
"Client proposes, server disposes".
R5: It MUST be possible to a client to request paged results.
Davis et al. [Page 7]
Internet Draft DASL Req
Paged retrieval is necessary if result sets are very large and if
clients must also present a responsive interface to a user. Note
that this requirement is silent about whether a server implements
paged results by storing results from a query or recalculating
them as needed.
6. Requirements: Other
O1: It MUST be possible to support multiple query grammars.
Rationale: A particular query grammar may not expose all the
useful searching functionality of a server. Clients should be
allowed to query a server using any grammar that takes advantage
of those special server capabilities. This requirement also
allows DASL to define an initial limited query grammar which
meets all the mandatory requirements without needing to address
all the desirable, but non-mandatory requirements.
O2: It MUST be possible to extend the basic grammar defined by DASL.
03: It MUST be possible for the server to redirect a query.
This is useful when a server is not able to search a given scope,
but can refer the client to another server which is able to
search the scope.
O4: It SHOULD be possible for the client to request hit highlighting.
7. Requirements: Discovery
D1: It MUST be possible for a client to discover the set of query
grammars supported by a server.
D2: It MUST be possible for a client to discover the schema supported
by a server for a particular grammar with a particular scope.
Note that the schema may differ depending on the scope. Query
schema discovery allows a client to use optional operators
supported by a server.
D3: It SHOULD be possible for a client to determine information about
the properties within a scope.
This information can enable a user interface to help a user to
construct a valid query, for example by providing meaningful
Davis et al. [Page 8]
Internet Draft DASL Req
names for properties, constraints on values, hints about data
type, and so on, or information about expected performance, for
example whether a property is indexed (and hence more quickly
searched).
8. External Requirements
DASL MUST describe how to perform searches on internationalized
content and properties. This is in keeping with IETF policy.
Information intended for user comprehension MUST conform to the IETF
Character Set Policy [CHAR].
The WebDAV working group is currently addressing the standardization
of mechanisms for authors to submit variants and version of resources,
or for means of exposing access control. DASL should provide
mechanisms that can query for variants, versions, and access control
but of course can not do so until they are defined. Likewise, DASL may
contribute requirements to access control (e.g. control over
querying).
9. Related Work
Z39.50: "Information Retrieval (Z39.50): Application Service
Definition and Protocol Specification".
http://lcweb.loc.gov/z3950/agency/
Z39.50 Profile for Simple Distributed Search and Ranked Retrieval
http://lcweb.loc.gov/z3950/agency/profiles/zdsr.html
The STARTS Protocol
http://www-db.stanford.edu/~gravano/starts.html
The Harvest Information Discovery and Access System
http://mordor.transarc.com/afs/transarc.com/public/trg/Harvest/
10. References
[CHAR] H.T. Alvestrand, "IETF Policy on Character Sets and Languages",
June 1997, internet-draft, work-in-progress,
draft-alvestrand-charset-policy-02.txt.
[HTTP] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T.
Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C.
Irvine, DEC, MIT/LCS, January 1997.
Davis et al. [Page 9]
Internet Draft DASL Req
[SCENARIOS] Henderson, R. et al Scenarios for DAV Searching and Locating
not yet even an ID.
[WEBDAV] Y. Y. Goland, E. J. Whitehead, Jr., A. Faizi, S. R. Carter,
D. Jensen, "Extensions for Distributed Authoring and Versioning on the
World Wide Web", April, 1998. internet-draft, work-in-progress,
draft-ietf-webdav-protocol-08.txt.
11. Authors' Addresses
Jim Davis
Xerox Corporation
3333 Coyote Hill Road
Palo Alto, CA 94304
Email: jdavis@parc.xerox.com
Saveen Reddy
Microsoft Corporation
One Microsoft Way
Redmond WA, 9085-6933
email: saveenr@microsoft.com
Judith Slein
Xerox Corporation
800 Phillips Road 105-50C
Webster, NY 14580
Email: slein@wrc.xerox.com
Davis et al. [Page 10]
Received on Thursday, 17 September 1998 21:03:54 UTC