- From: Daniel W. Connolly <connolly@hal.com>
- Date: Mon, 23 Jan 1995 11:54:55 -0600
- To: Terry Allen <terry@ora.com>
- Cc: davenport@ora.com, uri@bunyip.com, hackers@ora.com
In message <199501230109.RAA07091@rock>, Terry Allen writes:
>
>A feature I need not presently available in applications
>is the ability to resolve abstract names to addresses automatically
>by means of very constrained queries against a bibliographic database
>*and* initiate appropriate browser activity without human intervention.
>Where is that deployed now?

Whoa... too much abstraction in there for me; too many undefined terms.
What's the (testable) distinction between an "abstract name" and an
"address"? These sound like implementation details. I'm asking about
user-visible (and/or author-visible) features of this hypothetical
system.

I like the concrete example...

>| Just write:
>|
>| <ulink uri="http://www.microsoft.com/windows3.1/userguide.html">
>| Windows 3.1 User's Guide</ulink>
>| and run a caching HTTP server. The client will consult the cache, and
>
>Then caching has to be reinterpreted as "knowing about all the SGML
>entities for which files are installed on the local system," not just
>"knowing about recently fetched and still valid entities."

OK. Sounds good to me. To accomplish this, you just initialize the
cache to think that all the "SGML entities installed on the local
system" are actually "recently fetched and still valid entities." You
might be able to do this with a clever combination of CGI scripts. Or
you might have to hack the server code.

Another option would be to take advantage of the fact that many
browsers support per-uri-scheme proxies, and create a new URI scheme,
say sgml: or davenport:. The link would be:

    <ulink uri="davenport://www.microsoft.com/win31/userguide">...

and the browser would be configured to use an HTTP proxy for davenport:
URIs. This leaves you the option to re-define the behaviour of these
links in the future, without going back and revising the documents.
(This is perhaps one way to distinguish "abstract names" from
"locations": the question is who owns the semantics of the names.) For
example, the proxy server could implement replication (see below...).

>| and links to the document in, say, postscript, PDF, and
>| RTF format, along with ordering information for hardcopy (or a link to
>| such ordering info). The user chooses which URL to follow manually --
>| I haven't seen any successful techniques for automating this step.
>
>Automation is exactly what I want to achieve. I think the required
>technique is to query an appropriately constructed bibliographic
>database. For the purpose at hand, I assume all the docs are in
>SGML, not some choice of formats.

I think I'm missing something. You want it so:

  * the user selects a ULINK
  * the browser uses info from the ULINK to find another document
  * that document is displayed

Where do the URC and the bibliographic database come in? As far as I
can tell, the browser just takes a URL from the ULINK, decodes it into
a server and path as usual, and sends the path to the server. The
server looks up the path (ok... this could be some sort of database
lookup: is this the piece you want to explore?) and returns an SGML
entity.

>[...] And I haven't
>yet heard of a browser that will make a choice among a list of URLs
>returned from a query.

Where does this "list of URLs" come in? Is it a list of various
formats? If so, what rules would you use to automatically choose from
them? Or is it a list of replicas? Replication would be a handy
feature. If that's what you're after, this could be an interesting
project. How would replicas be created/maintained/administered?
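To make the per-scheme proxy idea a bit more concrete, here's a rough
sketch (in Python, purely as an illustration; the directory layout, the
port number, the davenport: prefix handling, and the fall-back-to-HTTP
behaviour are all assumptions on my part, not a spec):

    # Hypothetical local proxy: serve locally installed SGML entities when
    # we have them, otherwise fall back to an ordinary HTTP fetch.
    import os
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LOCAL_ROOT = "/usr/local/sgml"   # assumed home of installed SGML entities

    class DavenportProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # A browser configured to proxy davenport: URIs here sends
            # something like: GET davenport://www.microsoft.com/win31/userguide
            path = self.path
            if path.startswith("davenport:"):
                path = path[len("davenport:"):]
            path = path.lstrip("/")          # www.microsoft.com/win31/userguide
            local = os.path.join(LOCAL_ROOT, path)
            if os.path.isfile(local):
                # the "cache" is pre-seeded with everything installed locally
                data = open(local, "rb").read()
            else:
                # anything else degrades to an ordinary fetch from that server
                data = urllib.request.urlopen("http://" + path).read()
            self.send_response(200)
            self.send_header("Content-Type", "text/sgml")
            self.end_headers()
            self.wfile.write(data)

    HTTPServer(("localhost", 8010), DavenportProxy).serve_forever()

The browser then only needs its davenport: proxy pointed at
localhost:8010; the semantics of the names stay with whoever maintains
the proxy.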
>I agree that the technology exists to achieve what I want; this
>project is meant to explore what technology would work well and
>what behavior is desired in the system overall. It may be that the
>only piece that needs specifying except in prose is the URC format.

I fail to see why a URC format is needed at all here. All I see is a
need for servers to do identifier->entity mappings, for which URLs
provide sufficient information.

I don't think I understand your notion of "constrained queries against
a bibliographic database." The database I see has two fields:
identifier, and SGML entity. The query is "select entity where
identifier='...'". The '...' can be the path from the URL; and for a
quick-and-dirty implementation, the unix filesystem can represent this
database.

Another interesting application would be to put the URC in the _source_
document (kinda like a HyTime bibloc), and use that as the key for the
database lookup. You'd trade precise-but-fragile URL paths for more
traditional fuzzy-but-robust citations. For example:

    <ulink linkend="MS94">Windows 3.1 User's Guide</ulink>
    ...
    <!-- BibTeX/SGML adaptation used here. MARC/TEI/etc. influences
         should also be considered -->
    <book id=MS94>
    <editor>Fred Jones</>
    <title>Microsoft Windows 3.1 User's Guide</>
    <publisher>Microsoft Press
    <year>1993
    <edition>2nd
    </book>

So the client uses the whole <book> element as the query key. The first
question is: what network entity does it consult to resolve the query?
The "local" SGML cache server? Some "well known" server like
http://www.davenport.org/ (a configuration option on the browser)? The
publisher's server? How do you find the publisher's server? Perhaps the
<publisher> element would be expanded to specify:

    <publisher server="http://sgml.microsoft.com">Microsoft Press

or, in the case where a publisher leases space through an access
provider:

    <publisher server="http://www.dec.com">O'Reilly & Associates</>

Ah! Bright idea: the ideal way to implement this would be a big
distributed citation index, right? Here are some implementation
possibilities:

I. Quick and dirty hack:

  1. URCs are encoded as great big URLs, using cgi-bin/forms style
     syntax:

	<ulink uri="davenport:/citation-index?editor=Fred Jones;title=...">...

  2. Existing browsers are used. They are locally configured, e.g.

	setenv davenport_proxy="http://localhost:8010/"

  3. A proxy runs on the local (or nearby) host. It maps
     /citation-index to a cgi-bin program which does a WAIS query
     against a list of sources. The list of sources is a local
     configuration issue; there might be some "well known" or
     "bootstrap" WAIS sources, say wais.davenport.org or wais.oclc.org
     or whatever; as time goes by, a few dominant servers emerge. Each
     site could supplement the database with a locally maintained WAIS
     database.

  4. Each of these WAIS servers contains a large collection of URCs.
     The WAIS servers are configured to know, for example, that exact
     matches on titles are a lot more important than exact matches on
     years. A site like OCLC might even license the commercial WAIS
     implementation.

  5. The WAIS query comes back with the WAIS doc-ids of the best items
     from all the servers, including scores. The proxy selects the
     "best" item (highest scoring? closest? some combination of them?),
     fetches the source of that item, and passes it back to the
     browser.
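For step 1, by the way, the encode/decode part is trivial. A sketch
(Python again, illustrative only; the exact field names and the quoting
of spaces are assumptions):

    # Flatten a URC (here just a dict of citation fields) into a
    # cgi-bin/forms style URL, and recover it on the proxy side.
    from urllib.parse import quote, unquote

    def urc_to_url(urc):
        query = ";".join("%s=%s" % (k, quote(v)) for k, v in sorted(urc.items()))
        return "davenport:/citation-index?" + query

    def url_to_urc(url):
        query = url.split("?", 1)[1]
        return dict((k, unquote(v))
                    for k, v in (field.split("=", 1) for field in query.split(";")))

    urc = {"editor": "Fred Jones",
           "title": "Microsoft Windows 3.1 User's Guide",
           "publisher": "Microsoft Press",
           "year": "1993"}
    link = urc_to_url(urc)   # davenport:/citation-index?editor=Fred%20Jones;...
    assert url_to_urc(link) == urc

All the interesting work is in steps 3-5; this just shows that the URL
can carry the whole URC.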
II. Harvest-based version

  1. User selects link; link points via IDREF (or some other HyTime
     locladder) to a URC (e.g. the <book> element above).

  2. Browser translates the URC to Harvest SOIF format, and conducts a
     Harvest broker query against a list of brokers. (This task could
     be conducted by a proxy, as above.)

  3. Each publisher maintains a Harvest gatherer for the information it
     publishes. Large sites like OCLC run caching brokers that do
     periodic bulk updates from the publishers, and allow high-speed
     searching of the aggregate database. Client sites will typically
     search the OCLC broker, plus a handful of individual publishers
     that OCLC doesn't carry, plus their own local broker.

  4. Client (browser/proxy) retrieves the "best" item from the query
     results and displays it.

The choice of query protocols (WAIS, Harvest, or an HTTP/cgi/forms
application like lycos/archiplex/aliweb) isn't critical: the critical
thing is the ability to do distributed queries; i.e. the client (the
browser/proxy) consults _multiple_ servers, which share an
understanding of the precision/recall issues for the overall database;
i.e. they share a scoring system. At a minimum, this implies that they
share an abstract understanding of the schema of the distributed
database; in practice, it will be convenient if this schema has a
concrete representation, i.e. a URC syntax.

I still have serious doubts about the ability to automate choosing the
"best" item from the query results. Perhaps these databases will be
very carefully maintained and controlled, and the precision of the
queries will be sufficiently high that the client can just pick the
top-scoring item.
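Just to pin down what "share a scoring system" might mean, here's a toy
sketch (Python, purely illustrative; the weights, the record layout,
and the in-memory "broker" are made-up stand-ins for whatever WAIS or
Harvest would actually provide):

    # Fan a URC query out to several servers and pick the single
    # highest-scoring hit. Exact matches on titles count far more than
    # exact matches on years, per the weighting the servers would share.
    FIELD_WEIGHTS = {"title": 10, "editor": 5, "publisher": 3, "year": 1}

    def score(urc, record):
        return sum(w for field, w in FIELD_WEIGHTS.items()
                   if urc.get(field) and urc.get(field) == record.get(field))

    def best_item(urc, servers):
        hits = []
        for server in servers:                 # each returns (record, url) pairs
            for record, url in server(urc):
                hits.append((score(urc, record), url))
        if not hits:
            return None
        top_score, url = max(hits)
        return url

    # A stand-in "broker": a local list of URC records searched in memory.
    def local_broker(urc):
        catalog = [({"title": "Microsoft Windows 3.1 User's Guide",
                     "publisher": "Microsoft Press", "year": "1993"},
                    "http://sgml.microsoft.com/win31/userguide")]
        return [(rec, url) for rec, url in catalog if score(urc, rec) > 0]

    print(best_item({"title": "Microsoft Windows 3.1 User's Guide",
                     "year": "1993"}, [local_broker]))

The weighting table is the piece all the servers would have to agree
on.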
Dan

Received on Monday, 23 January 1995 13:02:48 UTC