Date: Mon, 2 Dec 91 10:08:04 -2300 From: jfg@bernd.cern.ch (Jean-Francois Groff) Message-Id: <9112030908.AA14185@bernd.cern.ch> To: www-talk@nxoc01.cern.ch Subject: forwarded message from connolly@pixel.convex.com WWW folks may like to comment on this, posted to wais-talk and cni-arch... Sorry if you've already read it there ! -- Jean-Francois ------- Start of forwarded message ------- From: connolly@pixel.convex.com To: wais-talk@Think.COM Cc: cni-arch@uccvma.BITNET Subject: Re: Document identifiers Date: Mon, 02 Dec 91 01:32:36 CST >The Coalition for Networked Information >Architectures & Standards Working Group > I don't like the direction this technology is headed. What is the desired functionality of these identifiers? If you want an identifier that uniquely identifies a file, why not use a checksum, such as returned by the unix sum command? Let's see how a checksum solves these issues, and then see what functionality I'd like to see in stead. >1. The need for identifiers, as distinct from location >information. This is best handled by a number (much like an >ISSN or ISBN), but the system must accomodate multiple >number-assigning agencies. Thus, the identifier is proposed >as <numbering-authority>,<identifier> where numbering >authorities are registered. > There's no location info in a checksum. Done deal. >2. The pointers must be representable as an ASCII string to >facilitate inclusion in a wide range of material, including >documents and electronic mail. > Check. >3. Location information must support multiple Locations for >the document, including the "location of record" and one or >more redistribution centers, local caches, etc. The means of >specifying a location should be sufficiently general to span >at least the set of networks covered under the Internet >Domain Naming system (DNS). > Ah! Now we want to be able to get location info out of the identifier. Checksums don't help. Well, in fact, they help no more or less than <numbering authority>-<id> helps, unless a numbering authority implies a location. I'm not clear on this at all. >4. Objects may be retrieved by a variety of access >mechanisms from servers, including FTP, LISTSERV, Z39.50, >and perhaps FTAM and SQL-based database access, as well as >requests for paper copies. The location information should >be sufficiently general to include information about these >different types of access techniques, and extensible to >include new access methods that may develop in future. > Hmmm... now it looks like the doc id should tell how to get the document... but not exactly. What we're relly looking for is some client software that interprets these numbers and queries servers. Checksums look as good as anything again. >5. Perhaps the location identifier should include some >information about the format and size of the object; on the >other hand, perhaps it should not. Discussion? > Checksums do not contain type/size info. If that's what we want, the checksum idea is no good. >6. It should be possible to further qualify a reference to a >"sublocation" within an object (which would have meaning >only to the server that houses it). This is needed, for >example, for hypertext-type links. Such a sublocation might >be the 25th paragraph of a text, for a hypertext-type >pointer. > Now we raise the question: just what does a document identifier identify? Until this item, it appeared that a document was a file. Now it's not so clear. Perhaps a document should be anything from a single character to a paragraph to a file to a chapter to a book to an encyclopedia to a library. That would be a good trick. Is that what we're after? >7. Indirection should be supported. In other words, one >should be able to format the location as the name of a >server that can be passed the identifier and which would >return location information. The protocol mechanism(s) for >doing this need to be specified as well. > Ah. Now the objectives of the location info become more clear. Sounds to me like the location is a TCP connection, or enough information on how to establish one. >8. While full rights and permissions data would seem to be >outside the scope of such a pointer, it might be useful to >include at least some basic information. This might be an >indication that the object is not copyrighted and can be >freely distributed, that it is copyrighted but can be freely >distributed, that it can be redistributed for noncommercial >use, or that restrictions apply to redistribution. Also, it >might make sense to include a pointer of some sort (an >e-mail address? a host address?) for further information >about rights. > Ack! This stuff seems totally orthogonal to the rest of the stuff, but in practice, this looks like a crucial issue. I don't have any good ideas here. >9. Perhaps there might be some type of checksum that can be >calculated on the retrieved object to ensure that the >pointer and the object have not gotten out of synch? > This is what sparked the checksum idea. My response to all this: I don't think we need [yet another] document identifier format. If you want location info, use an internet address; if you want data integrity, use a checksum; if you want format, we are lacking a standard here; if you want copyright info, ditto; What we need is some nifty client software to glue all the parts together. I guess there is some room for standardization, but please: LET'S LEVERAGE EXISTING SYSTEMS! Where these systems are robust, I think we should support them. I'd also like to see support for ad-hoc document identifiers. Here's an example to clarify: I'm browsing some email, netnews, or a README file from somewhere. I see a reference to more info: A full discussion of the BLURF protocol is available via anonymous FTP from frob.mit.edu as blurf-proto.tex in the directory /pub/protos. I select some or all of that text, and I click one of the buttons in my document retrieval tool: make ftp id -- extract the relevant information and display a well-formed identifier acceptable to some existing FTP client (I've heard of something called ange FTP. Another idea is to make a shell script that would do the retrieval: ftp frob.mit.edu cd /pub/protos get blurf-proto.tex ) make wais id -- get enough info to make a WAIS doc ID [scrap this unless it stabilizes] make WWW id -- same thing for World Wide Web HTTP addresses. make NNTP id -- same thing for USENET news message id's. make LISTSERV id -- you get the idea Rather than making up a new format, these id's are instructions to EXISTING clients to retrieve a document. verify id -- connect to the necessary server(s) and verify that the id references an existing document. Append to the id a "verification date," which is the last time a server acknowledged the existence of the document. get id info -- connect to the necessary server(s) and get about 1K of miscellaneous info: document size in bytes, date of last modification, available formats, short summary, etc. retrieve raw -- connect and retrieve the document in whatever format is convenient to the server, e.g. a compressed tar archive of C and troff sources. retrieve text -- connect and retrieve the document as plain text [defined, e.g. as the body of an RFC-822 mail message] retrieve... -- the user or the supporting client software specifies the supported information formats, (compression schemes, archiving formats, image file formats, typesetting languages) the client and the server hash over their options, [perhaps with user intervention] and the server sends the most desireable version of the document it has available. If we add a few buttons, we begin to encompass the scope of many existing systems: expand -- change the doc id to reference the "document" containing it. In the ftp example, rather than "get blurf.tex," it would have "ls." Click again and get "cd ..; ls." Obviously, this operation depends on the access mechanism. For WAIS documents, the expansion of a document is the source that contains it. select -- narrow the document to some of its parts. For a text file, select some of the characters/paragraphs for a WAIS source, select some of the documents. For a WWW node, select a neighboring node. For a directory, select some files. I guess my point is, let's think about how folks are going to use this document referencing technology, and let's see how well existing systems meet these needs. I guess some groups have come to the conclusion that the existing systems don't cut it. I'm beginning to agree. I guess we'd all agree that we should decide how we're going to use these doc id's and let that drive the design of the format. i.e. Let's decide on the methods of this object before we decide on its representation. [an idea: for syntax, the WAIS folks chose LISP. What about using something akin to RFC-822 syntax? I think it works well: define a bunch of standard headers; require some, allow some, disregard others; allow free-form text in the body. examples: ISBN: 0-13-590126-X or MESSAGE-ID: usenet-thing or FTP-HOST: frob.mit.edu USER: anonymous or WAIS-PORT: 8001@think.com This would allow us to leverage all the email technology out there, plus the emerging multi-part mail format. (and it would allow me to use PERL on these beasties! :-) ] Another thing I hope folks are keeping in mind: I don't think any one client can meet the information-retrieval needs of everybody. We need to support multiple platforms, for one thing. But I hope other folks are considering using mulitple clients at the same time! I'd like to use one slick X-windows front end to the whole ball of wax, in some ways like emacs does for programming, and in some ways like the mac GUI does for office-productivity applications. But I'm going to be using POST mail servers, NNTP servers, WAIS servers, FTP servers, etc, and I don't expect one client to do it all. The crucial trick is to make all this intuitive and interactive, i.e. to support hypertext browsing, fulltext retrieval, USENET news reading, and maybe email correspondence, all in one environment. Let's get started! Dan ------- End of forwarded message -------