- From: Albert Lunde <Albert-Lunde@nwu.edu>
- Date: Fri, 28 Feb 1997 12:04:15 -0600
- To: www-international@w3.org, unicode@unicode.org
[multi-reply]

>Indexing engines are downloading the whole page. They have to make summary
>and extract links to other documents. Spiders are doing HEAD requests
>to find out if the documents were recently changed or deleted. They don't
>need language information.

This may be true at present, but may not always be the case.

= = =

>I do not believe that it is necessary to label content with
>a content-language header if the content self-identifies its
>language with <HTML LANG=xx> or <BODY LANG=xx>.
>
>The "charset" information is NECESSARY to interpret the
>body (of text/* types) and thus MUST appear in the header.

It seems to me that if one were doing content-negotiation on language,
one would want language to appear in the HTTP headers. If we didn't want
it in the headers, why did we define a header for it?

= =

>There must be possible to transport (floppies, attached, etc) a document
>with the lang and charset *inside*. Having it as a file name extension
>could be consider *inside* (not a very good solution), but a doubt about a
>directory.

This was discussed in the past on IETF lists (html-wg and/or http-wg).
It was suggested that there were various applications for a format for
meta-information within a file system as well as "over-the-wire".
This may not just be an issue for HTML.

I thought one of the most pointed cases for this could be made for
storing data on a CD-ROM. In the alternate case of an HTTP server, it
can be argued that the form of the meta-information is an implementation
detail that doesn't affect what is sent over-the-wire (since the IETF is
first in the "protocol" standardization business, this makes it less
important). But in cases like a CD-ROM, one is dealing with a
data-interchange format in practice.

Two related suggestions were to store a complete HTTP or MIME message
with headers, or to store an associated file of meta-information as
headers, and I think Apache has implemented both of these options.
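To make the two suggestions concrete, here is a minimal sketch in Python.
It assumes the stored meta-information is plain RFC 822 style header lines
ending in a blank line; the function name read_stored_meta and the file
name "page.html.meta" are hypothetical, and this is not meant to reproduce
the format of Apache or any other server.

```python
from email import message_from_string


def read_stored_meta(text):
    """Parse RFC 822 style headers stored with (or next to) a document.

    Handles both suggestions: a complete message (headers, blank line,
    body) and a headers-only sidecar file (the body is then empty).
    """
    msg = message_from_string(text)
    return {
        "type": msg.get_content_type(),        # e.g. "text/html"
        "charset": msg.get_content_charset(),  # lowercased, or None if absent
        "language": msg.get("Content-Language"),
        "body": msg.get_payload(),
    }


# Hypothetical sidecar contents, e.g. "page.html.meta" beside "page.html".
sidecar = (
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "Content-Language: fr\n"
    "\n"
)

# Hypothetical complete stored message: the same headers plus the document.
stored_message = sidecar + "<HTML><BODY>Bonjour</BODY></HTML>\n"

meta = read_stored_meta(sidecar)
full = read_stored_meta(stored_message)
```

Either way the language and charset are recovered without parsing the
HTML itself, and the same code would work for a non-HTML body.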
But I don't _think_ either option has been documented and standardized
enough for people to reasonably be able to expect to use it for file
interchange, except when moving between controlled server environments.

Either could have the advantage that it doesn't require parsing an HTML
document and it can apply to other file types besides HTML.

Maybe this is an area that needs a spec of its own?

= =

Storing charset information in the document body is a hack that is
useful in particular circumstances, but it is a hack nevertheless.
We should document it, but deprecate it, to prevent people from needing
to resort to even less reliable hacks to determine the charset of
unknown documents. It is easy to use only for those character-encodings
that have US-ASCII as a clear sub-mapping (consider EBCDIC as a case
that needs more heuristics), and it fails for documents that have had
their character-encoding translated.

Also, use of META to store the charset is really only defined for
text/html.

= =

I haven't heard of anyone actually parsing META tags to generate
additional headers, and one could argue there are performance reasons
why this has not been done.

One can make a case that META with HTTP-EQUIV should mean the same thing
as the corresponding HTTP headers; this gives some semantics to META,
which from a standards view is a lot like "pragma" in C: a dumping
ground for implementation-specific stuff and local experiments.

The same performance reasons would suggest that the majority of servers
won't parse HTML for language tags on-the-fly either.

Performance _is_ an argument in favor of preferring <HTML LANG=...>;
if present, the HTML tag would always appear first.

---
Albert Lunde    Albert-Lunde@nwu.edu
Received on Friday, 28 February 1997 13:10:15 UTC