Date: Thu, 11 Jun 92 12:22:56 -0400
From: timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
Message-Id: <9206111622.AA03819@zippy.lcs.mit.edu>
To: connolly@pixel.convex.com, enag@ifi.uio.no, www-talk@nxoc01.cern.ch
Subject: MIME, SGML, UDIs, HTML and W3
Cc: timbl@zippy.lcs.mit.edu

I have printed off the recent discussion on the new HTTP, HTML, MIME and UDIs and done what I can to disentangle it all in my mind. I will reply in one message, because many of the points are linked. I know this should be hypertext, with references, but (a) I am away from home and (b) we don't yet have a universal mail/news archive server running to link to.

HTTP and HTML

First of all, Jean-Francois <jfg@dxcern.cern.ch> points out very properly that the enhanced HTTP protocol and the enhanced HTML spec are quite separate things, and should be specified separately. I agree wholeheartedly about all this, and I apologize for muddling the levels up till now.

(As a small aside, I would point out that whereas an HTERR file is not very useful, an HTFWD file IS. It is like a hypertext soft link. But I am happy to leave that as a separate type of file. It should certainly get a different extension so that it gets a different icon.)

HTTP: SGML vs ASN.1

Let's look at the HTTP protocol first. Carl <barker@cernnext.cern.ch> is mapping out the requirements for this, and assuming that SGML would be a reasonable representation for it in practice. And so it is. When the requirements are clear, it would certainly be interesting to look at mapping them onto a z39.50-style ASN.1 implementation. This would be useful for two reasons. First, the comparison would point out to us things in z39.50 which we might not have thought of and which would be useful for HTTP. Second, the comparison might give a nice short, or at least well-defined, list of things which the WAIS guys might like to take into account in the next version of their protocol.

(I demoed W3 to Brewster, who hadn't seen it live before, and was very keen that WAIS and W3 should merge, changing the WAIS protocol if necessary.)

There is no reason why we shouldn't try both protocols. If they map well onto each other, it's just a question of having two separate parsers at the low level, building the same internal structures.

When we're talking about an SGML representation, and describe a file to come later down the link, I don't think we have to use the NOTATION= attribute with a notation type, because we won't in fact be talking about the notation of an SGML element. The format in this case is not something which the SGML parser is aware of.

I must admit I was disappointed to learn that SGML didn't allow for any way of including 8-bit data. Thanks, Eric <enag@ifi.uio.no>, for your explanations.

MIME and SGML

Dan <connolly@pixel.convex.com> rightly points out the relevance of the coming MIME standards. There are several things which we must separate here, though:

   1. The MIME classification of data formats
   2. The MIME format for multi-part messages
   3. The MIME format for rich text
   4. The MIME format for external document addresses (MIME UDIs)

1. MIME classification of data formats

We must do to MIME the same disentangling job which JF did on HTML. First of all, the MIME job of classifying data formats is a useful job which is ideally done by just one bunch of people. There has been some suggestion that the MIME classifications are not well enough defined, but they seem to be the best effort yet and one can only assume they will evolve in the right direction. So I'd back the use of these for W3.

2. The MIME format for multi-part messages

This is necessary for sending a multi-part document over a mail link. We have to ask ourselves whether it is reasonable to use over a binary link.
Personally, my initial impression is that the MIME stuff, using as it does terminators such as --xxx-- separated by blank lines, looks more horrible to work with in this respect than SGML! Still we have the problem of restrictions on the content: it must not contain delimiters, it is limited to a 7-bit character set, it is line oriented; in fact, all the things which email carries as a restriction. This is really taking on board a legacy of all the mail which has evolved over the years. Do we need that for our new ultra-fast hypertext access protocol?

[Compare the MIME format with the rather cleaner NeXT Mail format, which is, as far as I understand, simply a uuencoded compressed tar file of all the bits, where uuencoding is designed as an optimal way of getting over mail transport restrictions, compress does what it says, and tar is a multipart wrapper designed for that only. Not standard outside unix, perhaps, but cleaner in that the mail formatting is done at the last minute and doesn't affect the other operations.]

Of course, with HTTP2, multipart/alternative shouldn't be needed.

Multipart for hypertext?

Now, Dan not only suggests the use of this for multipart messages, but also suggests that a hypertext document should necessarily contain many parts, one in SGML and one for each link as a MIME external document. This means that an SGML hypertext document can never stand on its own! An SGML parser will always need to have a MIME parser sitting just outside. I don't like this: I feel we have to separate these two things.

Suppose that an SGML document does want to be sent in a MIME message and does want to refer to other parts of that MIME message. In that case, it seems reasonable to have a format for that. However, when an SGML document is seen by itself, and refers to a news message for example, then there is no reason for it not to be able to contain a complete reference within itself. When SGML documents include other files, then the SYSTEM value is typically a file name.
It is a reference to something outside. The precedent is set that SGML documents are allowed to refer to things outside.

I think part of your objection, Dan, is based on a dislike of the UDI syntax -- which I'll come to later.

3. The MIME format for rich text

Here, I am not so impressed. Basically, the MIME people are at the same level that we were before we started this cleanup, in that they have SGML-LIKE stuff which isn't SGML. As it's not difficult to make it SGML, they should do that.

Comparing MIME's rich text and HTML, I see that we lack the character formatting attributes BOLD and ITALIC, but on the other hand I feel that our treatment of logical heading levels and other structures is much more powerful and has turned out to provide more flexible formatting on different platforms than explicit semi-references to font sizes. This is borne out by all the systems which use named styles in preference to explicit formatting, LaTeX or other macros instead of TeX, etc etc.

So technically, HTML has some things to give MIME's rich text. Are the MIME people still open to additions? If not, I would suggest we add BOLD and ITALIC (or two emphasis styles for characters), and keep HTML separate from MIME's rich text, proposing it as a MIME text standard. (HP0 and HP1 were in the HTML spec but unimplemented.)

4. The MIME format for external document addresses (MIME UDIs)

As Ed <emv@msen.com> says, this is a bit of a non-issue, as MIME addresses and current style UDIs map onto each other. However, we have to agree on a "concrete syntax" (or two... :-) in the end. It's like the difference between an x400 style mail address generated from an internet address, and that internet address. Which do you prefer:

        timbl@zippy.lcs.mit.edu

where the sections of the domain name are defined to have no semantics at all, or

        S=timbl; HO=zippy; OU=lcs; O=MIT; SECTOR=edu

(this is not real x400 - don't use it!)
or

        user=timbl host=zippy group=lcs organization=mit sector=education

You say, Dan, that you "don't think [UDIs] work". Do you mean people don't use them in all correspondence? Well, what DO they use? They use ange-ftp addresses for FTP (like info.cern.ch:/pub/www/doc/*.ps), which are even more terse than UDIs! They use news message-ids, which are UDIs.

Let me say that I personally don't much care about the arbitrary punctuation. There are a few things, though, which are important:

 - The thing should be printable 7-bit ASCII. Unlike arbitrary document formats, UDIs must be sendable in the mail.

 - White space should not be significant. I would accept the presence of some arbitrary white space as a delimiter, but one cannot distinguish between different forms and quantities of white space. This is because things get wrapped and unwrapped.

Dan, you object to UDIs because they don't contain white space. But that is purely so that you CAN wrap them onto several lines and still recover them. You can put white space in but it shouldn't mean anything. (This is not possible in W3 as is, but it is in the UDI document.)

I don't see why you say they can't be put as an SGML attribute. They are just text strings. They will be quoted, of course. (Yes, I know the old NeXT browser doesn't quote them.) Is that not allowed? What are the problem characters? If there are SGML problem characters in the UDI spec, they are probably ruled out of SGML for a reason.

(I recently saw a galley proof of an article in which our mail address had been hyphenated! UDIs must be squeezable into 2 inch columns.)

There is a semantic difference between a tagged list and a punctuation-divided set, and that is that the former has defined semantics but the latter doesn't and can therefore be extended more easily. I suggest that tagging could be used for the four bits of an address that must be separable by all sides, which are limited in number (4).
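
The white-space rule and the four separable bits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the punctuation grammar used here (access://host[:port]/path[#anchor]) is assumed for the example and is not taken from the UDI document, the function names unwrap_udi and split_udi are made up, and the address in the usage lines is hypothetical.

```python
import re


def unwrap_udi(text: str) -> str:
    """Remove every run of white space, so that a UDI which was wrapped
    onto several lines reads the same as the unwrapped one."""
    return re.sub(r"\s+", "", text)


def split_udi(udi: str) -> dict:
    """Split a punctuation-style UDI into its separable bits.

    Assumed grammar (illustration only): access://host[:port]/path[#anchor]
    The innards of each bit are left untouched.
    """
    udi = unwrap_udi(udi)
    access, sep, rest = udi.partition("://")
    if not sep:
        raise ValueError("no access part found in %r" % udi)
    rest, _, anchor = rest.partition("#")      # anchor id, if any
    server, slash, path = rest.partition("/")  # server details vs local doc id
    host, _, port = server.partition(":")
    return {
        "access": access,
        "host": host,
        "port": port or None,
        "path": slash + path,
        "anchor": anchor or None,
    }


# Hypothetical address, wrapped the way a mailer might wrap it.
wrapped = "http://info.cern.ch:80/pub/www/\n    doc/example.html #part2"
parts = split_udi(wrapped)
print(parts["host"])  # info.cern.ch
print(parts["path"])  # /pub/www/doc/example.html
```

Because the white space is thrown away before anything else happens, it does not matter where or how often the address was wrapped in transit.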
Within those bits, the string should be transparent, as the protocol does not require every party to understand the innards. The bits are:

                        MIME            Used by
   name space:          ACCESS          client
   server details:      HOST, PORT      client, protocol-dependent
   local doc id:        PATH            server only
   anchor id:           (none)          presentation application only

It seems useful to maintain the ability to work out which bits are seen by whom.

I only used punctuation to separate these parts in the W3 UDI because people like internet addresses and mail addresses and filenames and telephone numbers and message-ids and room numbers and zip codes, which don't have tags and do make do with punctuation. If the groundswell of opinion on this list is that tags are better, then let's use tags! Whatever we use, it should be as quotable in an SGML attribute as in a MIME external reference as in a scribbled note or a link-pasteboard or whatever. (The U is for Universal, NOT Unique!)

PHILOSOPHY

In the W3 world, the model is of a dynamic world of documents which generally have some "home" (or several), which can be found using sufficient intelligence and the help of one's friends, given the UDI. A mail message has no home, and so in principle the parts of it have no home. When a hypertext multipart message (really consisting of multiple hypertext documents) has links between its parts, they refer to each other within a completely isolated context.

There are now two possibilities when the message is in fact archived and made readable. One is to say that the parts are then addressed as parts of the message, wherever it may be. The other is to say that the parts of the message are very likely things which had some original home. In that case, the message is just giving the receiver a copy to save him the (perhaps insurmountable) trouble of retrieving it.
In this case the parts should be identified with their original UDIs so that the receiver is not confused with multiple documents which are in fact the same thing.

I think that's all the comments I have on what I've read so far.

Tim
________________________________________________________________
Tim Berners-Lee                         World-Wide Web initiative
CERN, 1211 Geneva 23, Switzerland       timbl@info.cern.ch
Visiting MIT: NE43-513, (617)234 6016   timbl@zippy.lcs.mit.edu