- From: Terry Allen <terry@ora.com>
- Date: Sun, 29 Jan 1995 11:37:45 PST
- To: uri@bunyip.com, davenport@ora.com
In this post I consider aspects of Dan's proposed solution (which is
pretty good).

| Date: Fri, 27 Jan 95 19:04:04 EST
| Message-Id: <9501272343.AA22920@ulua.hal.com>
| From: "Daniel W. Connolly" <connolly@hal.com>
| To: Multiple recipients of list <html-wg@oclc.org>
| Subject: Redundancy in links, Davenport Proposal [long]

...

| Given a system like harvest[2], it makes sense to handle queries like
| "find me the document whose publisher is O'Reilly and Associates,
| published in 1994 under the title 'DNS and Bind'." Their model for
| distributed indexing, brokers, replication, and caching (with
| taxonomies and query routing in the works) has me convinced that it's
| the right way to go.

Harvest, as described (I haven't actually used it), looks like a good
solution for gathering and collating metainformation.  (That's a piece
I hadn't wanted to worry about for an initial trial, but it's a
necessary piece in a full working system.)

...

| That brings me to another point: The sharing of information can only
| be automated to the point that it can be formalized. I've been trying
| to find some formalism for the way the web works. I've decided that
| this is a useful exercise for areas like security, where you have to
| be 100% sure of your conclusions relative to your premises.
|
| But for the web in general, 100% accuracy and authenticity is not
| necessary. The web is a model for human knowledge, and human knowledge
| is generally not clean and precise -- it's not even 100% consistent.
| So I think that instead of modelling the web with formal systems like
| Larch[3], a more "fuzzy" AI knowledge-representation sort of approach
| like Algernon[4] is the way to go. Traditional formal systems like
| Larch rely on consistency, which is not a good model for the knowledge
| base deployed on the web.

The fuzzy system I have in mind for the long run is what I do when I
go to the library catalogue with a citation along the lines of "Title
is `DNS and the Blind,' by some guy named Crikey, and it was published
by, I think, Addison-Wesley."  Every piece of information there is
wrong, but I can still find the book (DNS and BIND, by Paul Albitz &
Cricket Liu, published by ORA).

Others want their writers to be able to walk over to a bookshelf, pull
a book off the shelf, open it, and construct a link to "the
illustration of network topology in Flatland, in the book `Moebius
Network'" and have it resolve to an appropriate Net object, a list of
possible appropriate Net objects, or a message stating that the book
exists only in print, etc.

| The URN model of publisher ID/local-identifier may be sufficient for
| the applications of moving the traditional publishing model onto the
| web. But that is only one application of the technology that it takes
| to achieve high quality links. Another application may have some other
| idea of what the "critical meta-information" is. For example, for bulk
| file distribution (ala archie/ftp), the MD5 is critical.

Thinking over my first proposal, in the early stages of football
deprivation, I see that URNs can be DNSish up to a point, and a DNSish
approach to resolving them will probably work.  Whether we want to set
up all the pieces yet I don't know, but URN:ISBN:56592:something could
point to URN servers that know about ISBN servers that can locate
servers with records for publisher 56592 (ORA), and it's up to those
last servers to resolve [something].
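To make that chain of hand-offs concrete, here is a toy sketch (modern
Python; the tables and the final URL are entirely invented for
illustration, nothing here is a real registry or protocol) of how
URN:ISBN:56592:something might be resolved by successively more
specialized servers:

    # Toy delegation chain for URN:ISBN:56592:something -- every table invented.

    ISBN_SERVERS = {"56592": "urc.ora.com"}      # publisher prefix -> its URC server
    ORA_RECORDS = {"something": ["http://www.ora.com/example/"]}   # placeholder entry

    def resolve(urn):
        scheme, namespace, publisher, item = urn.split(":", 3)
        if (scheme.upper(), namespace.upper()) != ("URN", "ISBN"):
            raise ValueError("this sketch only handles URN:ISBN")
        server = ISBN_SERVERS.get(publisher)     # the "ISBN server" hop
        if server is None:
            return []
        # A real resolver would now ask `server`; here we fake that last hop locally.
        return ORA_RECORDS.get(item, [])

    print(resolve("URN:ISBN:56592:something"))   # -> list of URLs (or empty)

The point is only that each level needs to know how to find the next,
more specialized, level; nobody needs the whole map.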
A complexity I didn't address in my first proposal is how to specify
what I want returned from a URN query (what I want is a list of URLs,
but someone else might want the whole URC set, the URC set of the
parent URN, if any, or an alternate URN, etc., etc.).

| OK... so... now that I've done a brain dump, how about a specific
| answer to the "Davenport proposal":
|
| Problem Statement
| =================
|
| The Davenport Group is a group of experts in technical documentation,
| mostly representing Unix system vendors. They have developed DocBook,
| a shared SGML-based representation for technical documentation. They
| will probably be using a combination of CD-ROM distribution and the
| Internet to deliver their technical documentation.

Yes.

| They are developing hypertext documentation; they each have solutions
| for CD-ROM distribution, but while the World-Wide Web is the most
| widely-deployed technology for internet distribution, it does not meet
| their needs for response time nor reliability of links over time. As

nor complexity of link resolution services.

| publishers, they are willing to invest resources to increase the
| quality of service for the information they provide over the web.

Yes.

| Moreover, the solution for increased reliability must be shared among
| the vendors and publishers, as the links will cross company
| boundaries. Ideally, the solution will be part of an Internet-wide
| strategy to increase the quality of service in information retrieval.

Yes, although smaller successes will be instructive, too.

| Theory of Operation
| ===================
|
| The body of information offered by these vendors can be regarded as a
| sort of distributed relational database, the rows being individual
| documents (retrievable entities, to be precise), and the columns being
| attributes of those documents, such as content, publisher, author,
| title, date of publication, etc.

Yes, probably that model is sufficient for our docs, but I am well
aware that it isn't sufficient for publishing in general.  The
retrievable entities have interrelations that may need to be modelled,
and which can be both complex and nonhierarchical.  One can quickly
develop situations in which the info can't be presented in a table (at
least so simple a table), nor in a tree.  (I'd be happy to dig out my
example about Sir John Chardin's Journey to Persia, in which the
branches of the tree grow together.)  Considering that publishing
happens in time, this may only be a matter of multiple inheritance
(and I don't know how that is best to be represented in an URC set or
even in a TEI header).  My working visualization is of a
multidimensional infospace in which many values (author, title, etc)
intersect in some published works, sometimes in quirky ways.

| The pattern of access on this database is much like many databases:
| some columns are searched, and then the relevant row is selected. This
| motivates keeping a certain portion of this data, sometimes referred
| to as "meta-data," or indexing information, highly available.

I would call all that data metadata.  The search may be more complex.

| The harvest system is a natural match. Each vendor or publisher would
| operate a gatherer, which culls the indexing information from the rows
| of the database that it maintains. A harvest broker would collect the
| indexing information into an aggregate index. This gatherer/broker
| collection interaction is very efficient, and the load on a
| publisher's server would be minimal. The broker can be replicated to
| provide sufficiently high availability.
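My reading of that division of labor, sketched very loosely (the
record format and the calls below are my own invention, not Harvest's
actual gatherer/broker interfaces):

    # Sketch only: invented record format, not Harvest's real interfaces.

    def gather(documents):
        # Publisher-side gatherer: cull just the indexing attributes from each row.
        return [{"urn": d["urn"], "publisher": d["publisher"],
                 "title": d["title"], "date": d["date"]}
                for d in documents]

    def broker_merge(*gathered_sets):
        # Broker: aggregate index records from many gatherers into one index.
        index = []
        for records in gathered_sets:
            index.extend(records)
        return index

    def search(index, **attrs):
        # Search the columns; return matching metadata records, not the documents.
        return [r for r in index if all(r.get(k) == v for k, v in attrs.items())]

    ora_docs = [{"urn": "URN:ISBN:56592:example",    # invented identifier
                 "publisher": "ORA", "title": "DNS and BIND", "date": "1994",
                 "content": "...full text stays on the publisher's server..."}]
    index = broker_merge(gather(ora_docs))
    print(search(index, publisher="ORA", date="1994"))

If that is roughly right, the publisher's server only ever hands over
the small index records, which is why the load on it stays minimal.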
| Typically, a harvest broker exports a forms-based HTTP searching
| interface. But locating documents in the davenport database is a
| non-interactive process in this system. Ultimately, smart browsers
| can be deployed to conduct the search of the nearest broker and
| select the appropriate document automatically. But the system should
| interoperate with existing web clients.
|
| Hence the typical HTTP/harvest proxy will have to be modified to not
| only search the index, but also select the appropriate document and
| retrieve it. To decrease latency, a harvest cache should be collocated
| with each such proxy.

Would it not suffice for the proxy (the URC resolver) to return not
the document but a pointer to it (URN or set of URLs)?  In earlier
correspondence you seemed to be thinking of the metadata being at the
same site as the document, but that need not be the case.

| Ideally, links would be represented in the harvest query syntax, or a
| simple s-expression syntax. (Wow! In surfing around for references, I
| just found an example of how these links could be implemented. See the
| PRDM project[2].) But since the only information passed from

(parenthetically, PRDM is a reasonable representation of the info I'd
want to input and validate as a teiheader)

| contemporary browsers to proxy servers is a URL, the query syntax will
| have to be embedded in the URL syntax.
|
| I'll leave the details aside for now, but for example, the query:
|
|     (Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
|         AND (Edition: Second)
|
| might be encoded as:
|
|     harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second
|
| Each client browser is configured with the host and port of the
| nearest davenport broker/HTTP proxy. The reason for the "//davenport"
| in the above URL is that such a proxy could serve other application
| indices as well. Ultimately, browsers might implement the harvest:
| semantics natively, and the browser could use the Harvest Server
| Registry to resolve the "davenport" keyword to the address of a
| suitable broker.
|
| To resolve the above link, the browser client contacts the proxy and
| sends the full URL. The proxy contacts a nearby davenport broker,
| which processes the query and returns results. The broker then selects
| any match from those results.

I kinda want to separate the matter of the query syntax from the means
of transport of the query.  Leaving aside the issue of how the author
knows what syntax to use for a given target (you are assuming Harvest,
I'm punting), I have thought of another URLchitecture.  Tell me what
you think of this (please excuse errors in syntax):

    http://hostname/urc2url/urcteidavenport?querygoeshere

Somehow I or my system has to determine an appropriate hostname,
perhaps by pointing to a host that "ought to know":

    http://semi.omniscient.com/urc2url/...

and semi.omniscient.com hands off the rest of the URL more or less in
the manner you propose, but needs to know only where it can find a
server that knows about urc2url services, which in turn need to locate
databases for the urcteidavenport information.
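Just so we're comparing like with like, here is a throwaway sketch
that packs the same attribute/value query into both shapes -- Dan's
harvest: form and my urc2url form.  The escaping, the separator, and
the hostnames are all assumptions on my part:

    import urllib.parse   # present-day Python; the encoding rules are my guess

    def encode(attrs):
        # Join attribute=value pairs with ';', percent-escaping the values.
        return ";".join("%s=%s" % (k, urllib.parse.quote(v)) for k, v in attrs)

    query = [("publisher-isbn", "1232"),
             ("title", "Microsoft Windows User Guide"),
             ("edition", "Second")]

    # Dan's shape: the "davenport" keyword names the broker; the proxy searches.
    print("harvest://davenport?" + encode(query))

    # My shape: a host that "ought to know" hands the rest to a urc2url service.
    print("http://semi.omniscient.com/urc2url/urcteidavenport?" + encode(query))

Either way the query itself is the same string of attribute/value
pairs; only the addressing in front of the "?" differs.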
| Through careful administration of the links and the index, all the
| matches should identify replicas of the same entity, possibly on
| different ftp/http/gopher servers. An alternative to manually
| replicating the data on these various servers would be to let the
| harvest cache collocated with the broker provide high availability of
| the document content.
|
| Security Considerations
| =======================

[sensible remarks deleted]

| Conclusions
| ===========
|
| I believe this solution scales well in many ways. It allows the
| publishers to be responsible for the quality of the index and the
| links, while delegating the responsibility of high-availability to
| broker and cache/proxy servers. The publishers could reach agreements
| with network providers to distribute those brokers among the client
| population (much like the GNN is available through various sites.)

I am not sure that, for Net publishing in general, the distributed
model of DNS will work for the bibliographic databases (as opposed to
URN resolution).  One could imagine distributing the URC resolution
load DNSishly by establishing URC resolvers that are specialized along
the vectors of the metadata: the Shakespeare URC resolution service,
or the Ancient History service.  (Becoming more highly specialized, we
would have the Bronte service, which would eventually split into the
Charlotte Bronte service, the Emily Bronte service, and the Ann Bronte
service ... nahh, let's not.)  But otherwise there's going to be some
efficiency of scale to be realized by collocating great gobs of
metadata in relatively few places.

When I want to look up information about a print book on the Net, I
go to one of the big catalogues (I use Melvyl a lot, and if I don't
find what I want I try Olorin, etc.) rather than hope the item is in
the Sonoma County Public Library catalogue.  If I had to check
hundreds of catalogues to resolve a URC query it would use significant
resources.  I think the librarians need to weigh in on this issue, as
I'm sure they've thought about it more deeply.

| It allows those cache/proxy servers to provide high-availability to
| other applications as well as the davenport community. (The Linux
| community and the Computer Science Technical reports community already
| operate harvest brokers.)

Can you supply pointers? sample URLs? do they use the harvest://
method?

| The impact on clients is minimal -- a one-time configuration of the
| address of the nearest proxy. I believe that the benefits to the
| respective parties outweigh the cost of deployment, and that this
| solution is very feasible.

Yep, for Davenport it would probably work just fine, at least until we
got into multiple editions in various languages that have complex
interrelations.

So the two issues on which I remain unresolved are: will this approach
really scale for complex bibliographic queries (do we need some
equivalents to the big library catalogues), and how complex/simple
should/can the bibliographic model (URC set, your table example) be,
both for Davenport purposes and for Net publishing in general?

--
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
                               Sebastopol, Calif., 95472
monthly column at:  http://www.ora.com/gnn/meta/imedia/webworks/allen/

A Davenport Group sponsor.  For information on the Davenport Group see
ftp://ftp.ora.com/pub/davenport/README.html
or  http://www.ora.com/davenport/README.html
Received on Sunday, 29 January 1995 14:37:59 UTC