Must scopes be collections?

Most of the discussion about scope has assumed that a scope is the URL of a
collection, or at least that it is the URL of a WebDAV resource.  I want to
examine this assumption, and suggest broadening it.

Although I expect scopes to be URLs of collections in the typical case, I
claim it also makes sense for them to be general URIs in some other cases.

Consider a Web crawler such as AltaVista or Lycos.  Such crawlers have
metadata from millions of Web resources, none of which reside on the
crawler.  (The metadata is on the crawler, but not the resources
themselves.)  Now suppose you wanted to implement DASL on such a crawler,
and you wanted the ability to limit a search to a certain subset of the full
Web (which has the topology of a tree), e.g. to search only documents from
the US Department of Justice.

One answer might be: Don't use scope to do this.  Instead, the scope is the
whole crawler, and the query should use a pattern match on a property that
holds the URL of the resource, e.g.
  <where>
    <and>
      <like><prop><theurl/></prop>
            <pattern>//*.doj.gov</pattern></like>
      <eq><prop><author/></prop>
          <literal>Sculley</literal></eq>
    </and>
  </where>

But I think it's also reasonable to want to use scope to do this.  Why?
One reason is that the crawler might have different access to different
scopes, depending on institutional relations between the crawler site and
the remote site, and hence have different or better metadata for some
scopes.  Another reason is that, at least to me, it just seems natural.

So what would such scopes look like?  The notion you want to capture is a
pattern on the domain names of hosts, e.g. *.doj.gov, so you might express
this as x-scope://*.doj.gov.  Even if you expressed it as http://doj.gov
(to make it *look like* a URL), there's no reason to assume that there's a
web server at http://doj.gov running WebDAV whose root collection contains
all documents on *any* machine under doj.gov.
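
To make the host-pattern idea concrete, here is a minimal sketch of how a
crawler might test whether an indexed resource falls inside such a scope.
The x-scope scheme is only the illustration used above, and the function
names (scope_host_pattern, in_scope) and the choice of shell-style wildcard
matching are my assumptions, not anything DASL defines.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def scope_host_pattern(scope_uri):
    """Return the host pattern carried by a scope URI (hypothetical helper).

    For "x-scope://*.doj.gov" this yields "*.doj.gov".
    """
    return (urlsplit(scope_uri).hostname or "").lower()


def in_scope(resource_url, scope_uri):
    """True if the resource's host matches the scope's host pattern.

    Assumption: "*" is a shell-style wildcard, so it may span several
    DNS labels ("a.b.doj.gov" matches "*.doj.gov").
    """
    host = (urlsplit(resource_url).hostname or "").lower()
    return fnmatchcase(host, scope_host_pattern(scope_uri))


if __name__ == "__main__":
    scope = "x-scope://*.doj.gov"
    for url in ("http://www.doj.gov/index.html",
                "http://www.example.com/",
                "http://doj.gov/"):
        print(url, in_scope(url, scope))
```

Note that under this reading the bare host doj.gov is not matched by
*.doj.gov; a server that wanted to include it would have to say so in its
own definition of the scope.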

So I propose:

1) A scope is named by a URI.  A scope consists of a set of Web resources.
2) If the scope name is the URI of a WebDAV collection, then every resource
in that collection (depending on the value of depth) is in the scope.
3) If the scope name is the URI of a different kind of Web resource, the
scope is just that resource.
4) Otherwise, the set of resources is defined by the server.
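
The four rules above can be sketched as a small resolution function.  This
is only an illustration under stated assumptions: the Resource class, the
in-memory table of known resources, and the server_defined callback are
hypothetical, and I have assumed that the collection itself counts as part
of its own scope and that depth takes the WebDAV values "0", "1", and
"infinity".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Resource:
    """Hypothetical model of a resource known to the server."""
    uri: str
    # URIs of direct members; None means "not a collection".
    members: Optional[Tuple[str, ...]] = None


def resolve_scope(
    scope_uri: str,
    resources: Dict[str, Resource],
    depth: str = "infinity",
    server_defined: Callable[[str], Set[str]] = lambda uri: set(),
) -> Set[str]:
    """Return the set of resource URIs named by scope_uri (rule 1)."""
    res = resources.get(scope_uri)
    if res is None:
        # Rule 4: not a resource we know; the server decides.
        return set(server_defined(scope_uri))
    if res.members is None:
        # Rule 3: an ordinary resource is a scope of one.
        return {res.uri}
    # Rule 2: a collection; how far we descend depends on depth.
    found = {res.uri}  # assumption: the collection itself is in scope
    if depth == "0":
        return found
    frontier = list(res.members)
    while frontier:
        uri = frontier.pop()
        if uri in found:
            continue
        found.add(uri)
        child = resources.get(uri)
        if depth == "infinity" and child is not None and child.members is not None:
            frontier.extend(child.members)
    return found
```

A crawler would presumably supply a server_defined callback that knows
about names such as x-scope://*.doj.gov, while a plain WebDAV server could
leave the default in place and return an empty set for names it does not
recognize.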

We don't need to provide (in DASL 1.0) a means to discover the scopes
supported by an arbiter.  Crawlers will find some other way to express
this.  (Besides, given the current business model of crawlers, which
is based on selling eyeballs to advertisers, it's not clear that any
commercial crawler will support DASL.)

See also the next message, "broaden 'scope' to be any kind of WebDAV resource".

Received on Tuesday, 7 July 1998 17:50:02 UTC