- From: Daniel W. Connolly <connolly@hal.com>
- Date: Fri, 27 Jan 1995 17:43:00 -0600
- To: Terry Allen <terry@ora.com>
- Cc: Multiple recipients of list <html-wg@oclc.org>
- Cc: www-talk@info.cern.ch, uri@bunyip.com
- Cc: Robert Boyer <boyer@cs.utexas.edu>
- Cc: "Benjamin J. Kuipers" <kuipers@cs.utexas.edu>
- Cc: jcma@reagan.ai.mit.edu
- Cc: davenport@ora.com
- Cc: cso@hal.com
[Look out folks. This could be a long one. We finally finished and released OLIAS 1.1, so I have a little time, plus I've been thinking a lot about Terry's proposal, how the Harvest technology applies, and how we increase quality on the web in general, in preparation for my "Formalizing Web Technology" presentation for next week's WebWorld conference. I copied all these lists because I think there may be interested folks on each of them. I suggest follow-ups be sent only to uri@bunyip.com and davenport@ora.com.]

In message <199501271917.LAA24883@rock>, Terry Allen writes:
>Dan says
>>For example, if there's a postscript file on an FTP server out there
>>called "report_127," you effectively can't link to it given today's
>>web.
>
>But doesn't that mean simply that not enough info is being sent
>about the file by the server, or that the client isn't smart enough?
>Putting a content-type att on <A> seems like a fragile solution
>to the problem, as it shifts responsibility to the author of
>the doc, who is in most cases just a poor dumb human.

Yes, it's fragile, but it's better than completely broken. This is _distributed_ hypertext. It spans domains of authority. As an author, I have authority over the info I put in the link, but I may not have the authority to change the filename on the server. So I'm stuck. This situation will only get more complex: as a value-added proxy server, I can add annotations, show references to related documents, etc., but I can't change the original. I think this is directly relevant to your URN/davenport application[1].

From the evidence that I have studied, the way to make links more reliable is not to deploy some new centralized namespace (a la URNs with publisher ids), but to put more redundant info in links. Rather than looking at the web as documents addressed by an identifier, I think we should look at it as a great big content-addressable memory: "Give me the document written by Fred in 1992 whose title is 'Authentication in Distributed Systems'." I think the same sort of thing that makes for a high-quality citation in written materials will make for a reliable link in a distributed hypermedia system. A robust _link_ should look like a BibTeX entry (MARC record, etc.).

Given a system like harvest[2], it makes sense to handle queries like "find me the document whose publisher is O'Reilly and Associates, published in 1994 under the title 'DNS and Bind'." Their model for distributed indexing, brokers, replication, and caching (with taxonomies and query routing in the works) has me convinced that it's the right way to go.

One party actually develops the document (or program, or database...). Another publishes it. Some folks referee it. Another party advertises/markets it. Another party provides shared disk space and bandwidth for a fee. Another party is an expert librarian for some field. All of these parties are humans or groups of humans, but they are all aided more or less by the machines that participate in this distributed hypermedia system. All these folks share resources. Each of them has different policies and procedures, different expertise, different goals. The way to make the whole thing work is to (1) let the computer do the work wherever we can, (2) keep the simple things simple, and (3) make the complex things possible.

So if I as the link author know more than the reader's client can get from the FTP server, I should be _able_ to contribute the knowledge that I have.
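To make that concrete, here's a rough sketch in Python (purely illustrative -- the field names, the hypothetical server, and the fallback rule are my own inventions, not any existing format) of a link that carries redundant info the reader's client can fall back on when the server tells it nothing useful:

    # Illustrative only: a "robust link" carrying redundant metadata,
    # more like a BibTeX entry than a bare address.
    robust_link = {
        "url": "ftp://ftp.example.com/pub/report_127",  # hypothetical server
        "content-type": "application/postscript",       # the author's hint
        "title": "Authentication in Distributed Systems",
        "author": "Fred",
        "date": "1992",
    }

    def content_type_for(link, server_supplied=None):
        # Prefer what the server says; fall back on the link author's
        # hint; use the catch-all type only as a last resort.
        return (server_supplied
                or link.get("content-type")
                or "application/octet-stream")

    print(content_type_for(robust_link))                 # author's hint fills the gap
    print(content_type_for(robust_link, "text/plain"))   # the server still wins

The point is only that the author's knowledge rides along with the link instead of being lost.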
Making all the authors put content type info in their links is the wrong answer; the optimal solution is for the provider to adapt to the .ps convention. But the link author should be able to add value and quality despite the poor efforts of the FTP server maintainer.

"But the link author could just copy that file to his own machine and put a .ps extension on it," you might reply. This doesn't allow for the case when the document in question changes daily, it doesn't provide an audit trail, and it violates my #1 engineering principle: never maintain the same information in more than one place.

The whole point is that as long as links give just one little point of information, they're going to be fragile. In effect, URLs give several pieces of information. They usually give a DNS domain name, so you can deploy conventions like having webmaster@host be the point of contact for a given server. From a typical "home page" address http://host/~user.html I can infer that user@host is the associated mailbox. It's not 100%, but it usually works.

That brings me to another point: the sharing of information can only be automated to the point that it can be formalized. I've been trying to find some formalism for the way the web works. I've decided that this is a useful exercise for areas like security, where you have to be 100% sure of your conclusions relative to your premises. But for the web in general, 100% accuracy and authenticity is not necessary. The web is a model for human knowledge, and human knowledge is generally not clean and precise -- it's not even 100% consistent. So I think that instead of modelling the web with formal systems like Larch[3], a more "fuzzy" AI knowledge-representation sort of approach like Algernon[4] is the way to go. Traditional formal systems like Larch rely on consistency, which is not a good model for the knowledge base deployed on the web.

The URN model of publisher ID/local-identifier may be sufficient for the application of moving the traditional publishing model onto the web. But that is only one application of the technology that it takes to achieve high-quality links. Another application may have some other idea of what the "critical meta-information" is. For example, for bulk file distribution (a la archie/ftp), the MD5 checksum is critical.
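For instance, here's a sketch of that checksum idea in Python (illustrative only; the file name and expected value are made up, and a real client would retrieve the file over FTP first):

    import hashlib

    def md5_of(path):
        # Compute the MD5 checksum of a file, reading in chunks so
        # large distributions don't have to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    # The link carries the expected checksum along with the address;
    # the client verifies after retrieval.
    expected = "d41d8cd98f00b204e9800998ecf8427e"   # hypothetical value
    if md5_of("report_127") != expected:
        print("warning: retrieved file does not match the link's MD5")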
OK... so... now that I've done a brain dump, how about a specific answer to the "Davenport proposal":

Problem Statement
=================

The Davenport Group is a group of experts in technical documentation, mostly representing Unix system vendors. They have developed DocBook, a shared SGML-based representation for technical documentation. They will probably be using a combination of CD-ROM distribution and the Internet to deliver their technical documentation.

They are developing hypertext documentation; they each have solutions for CD-ROM distribution, but while the World-Wide Web is the most widely deployed technology for internet distribution, it does not meet their needs for response time or reliability of links over time. As publishers, they are willing to invest resources to increase the quality of service for the information they provide over the web. Moreover, the solution for increased reliability must be shared among the vendors and publishers, as the links will cross company boundaries. Ideally, the solution will be part of an Internet-wide strategy to increase the quality of service in information retrieval.

Theory of Operation
===================

The body of information offered by these vendors can be regarded as a sort of distributed relational database, the rows being individual documents (retrievable entities, to be precise) and the columns being attributes of those documents, such as content, publisher, author, title, date of publication, etc. The pattern of access on this database is much like that of many other databases: some columns are searched, and then the relevant row is selected. This motivates keeping a certain portion of this data, sometimes referred to as "meta-data" or indexing information, highly available.

The harvest system is a natural match. Each vendor or publisher would operate a gatherer, which culls the indexing information from the rows of the database that it maintains. A harvest broker would collect the indexing information into an aggregate index. This gatherer/broker interaction is very efficient, and the load on a publisher's server would be minimal. The broker can be replicated to provide sufficiently high availability.

Typically, a harvest broker exports a forms-based HTTP searching interface. But locating documents in the davenport database is a non-interactive process in this system. Ultimately, smart browsers can be deployed to conduct the search of the nearest broker and select the appropriate document automatically. But the system should interoperate with existing web clients. Hence the typical HTTP/harvest proxy will have to be modified not only to search the index, but also to select the appropriate document and retrieve it. To decrease latency, a harvest cache should be collocated with each such proxy.

Ideally, links would be represented in the harvest query syntax, or a simple s-expression syntax. (Wow! In surfing around for references, I just found an example of how these links could be implemented. See the PRDM project[2].) But since the only information passed from contemporary browsers to proxy servers is a URL, the query syntax will have to be embedded in the URL syntax. I'll leave the details aside for now, but for example, the query:

    (Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
        AND (Edition: Second)

might be encoded as:

    harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second

Each client browser is configured with the host and port of the nearest davenport broker/HTTP proxy. The reason for the "//davenport" in the above URL is that such a proxy could serve other application indices as well. Ultimately, browsers might implement the harvest: semantics natively, and the browser could use the Harvest Server Registry to resolve the "davenport" keyword to the address of a suitable broker.

To resolve the above link, the browser client contacts the proxy and sends the full URL. The proxy contacts a nearby davenport broker, which processes the query and returns results. The proxy then selects any match from those results. Through careful administration of the links and the index, all the matches should identify replicas of the same entity, possibly on different ftp/http/gopher servers. An alternative to manually replicating the data on these various servers would be to let the harvest cache collocated with the broker provide high availability of the document content.
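For what it's worth, here's a sketch of that encoding in Python (the scheme and attribute names are the ad-hoc ones above, not anything registered):

    from urllib.parse import quote

    def harvest_url(index, **attrs):
        # Encode attribute/value pairs in the ad-hoc harvest: URL
        # syntax sketched above; quote() handles the %-escaping.
        query = ";".join(
            "%s=%s" % (k.replace("_", "-"), quote(v, safe=""))
            for k, v in attrs.items())
        return "harvest://%s?%s" % (index, query)

    print(harvest_url("davenport",
                      publisher_isbn="1232",
                      title="Microsoft Windows User Guide",
                      edition="Second"))
    # -> harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second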
Security Considerations
=======================

The main considerations are authenticity and access control for the distributed database.

Securely obtained links (from a CD-ROM, for example) could include the MD5 checksum of the target document. If the target document changes, a digital signature providing a secure override of the MD5 could be transmitted in the HTTP header. Assuming the publishers' public keys are made available to the cache/proxies in a secure fashion, this would allow the cache/proxy to detect a forgery. But the link from the cache/proxy to the client is insecure until clients are enhanced to implement more of this functionality natively. At that point, the problem of key distribution becomes more complex.

This proposal does not address access control. As long as all information distributed over the web is public, this solution is complete. But over time, the publishers will expect to be able to control access to their information. If the publishers were willing to trust the cache/proxy servers to implement access control, I expect an access control mechanism could be added to this system. If the publishers are willing to allow the indexing information to remain public, I believe that performance would not suffer tremendously. The primary difficulty would be distributing a copy of the access control database among the proxies in a secure fashion.
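Here's a sketch of that cache/proxy decision (Python again, purely illustrative; `verify` stands in for whatever public-key machinery actually gets deployed, and `override` would be the signed checksum from the HTTP header):

    import hashlib

    def check_document(body, link_md5, override=None, verify=None):
        # Sketch of the cache/proxy's decision, not a protocol.
        actual = hashlib.md5(body).hexdigest()
        if actual == link_md5:
            return "ok: document matches the MD5 in the link"
        if override and verify and verify(override):
            # The publisher signed an updated checksum; trust it only
            # if it matches what we actually retrieved.
            if actual == override["md5"]:
                return "ok: publisher signed an updated MD5"
        return "reject: possible forgery or stale link"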
Conclusions
===========

I believe this solution scales well in many ways. It allows the publishers to be responsible for the quality of the index and the links, while delegating the responsibility for high availability to broker and cache/proxy servers. The publishers could reach agreements with network providers to distribute those brokers among the client population (much as the GNN is available through various sites). It allows those cache/proxy servers to provide high availability to other applications as well as the davenport community. (The Linux community and the Computer Science technical reports community already operate harvest brokers.) The impact on clients is minimal -- a one-time configuration of the address of the nearest proxy.

I believe that the benefits to the respective parties outweigh the cost of deployment, and that this solution is very feasible.

[1] http://www.acl.lanl.gov/URI/archive/uri-95q1.messages/0080.html
    Sun, 22 Jan 1995 12:41:10 PST
[2] PRDM http://www-pcd.stanford.edu/ANNOT_DOC/annotations.html
[3] Larch http://www.research.digital.com/SRC/larch/larch-home.html
[4] Algernon http://www.cs.utexas.edu/~qr/algernon.html

Received on Friday, 27 January 1995 19:02:10 UTC