Re: Language label from Albert Lunde on 1997-02-28 (www-international@w3.org from January to March 1997)

From: Albert Lunde <Albert-Lunde@nwu.edu>
Date: Fri, 28 Feb 1997 12:04:15 -0600
To: www-international@w3.org, unicode@unicode.org
Message-Id: <v03007800af3cc52fd9cd@[129.105.110.129]>
[multi-reply]
>Indexing engines are downloading the whole page. They have to make summary
>and extract links to other documents. Spiders are doing HEAD requests
>to find out if the documents were recently changed or deleted. They don't
>need language information.

This may be true at present, but may not always be the case.
= = =
>I do not believe that it is necessary to label content with
>a content-language header if the content self-identifies its
>language with <HTML LANG=xx> or <BODY LANG=xx>.
>
>The "charset" information is NECESSARY to interpret the
>body (of text/* types) and thus MUST appear in the header.

It seems to me that if one were doing content negotiation on language, one
would want the language to appear in the HTTP headers. If we didn't want it in
the headers, why did we define a header for it?
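
To illustrate why negotiation wants the language in the headers, here is a minimal sketch in Python of a server picking a variant from an Accept-Language request header. The function names and the mapping of language tags to file names are hypothetical, and the matching is a simple prefix match, not a full implementation of any HTTP specification.

```python
def parse_accept_language(header):
    """Return [(language_range, q)] sorted by descending q value."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(";")
        lang = parts[0].strip().lower()
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((lang, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_variant(accept_language, available):
    """available maps language tag -> file name (hypothetical layout).

    Returns (tag, file_name) for the best match, or None.
    """
    if not available:
        return None
    for lang, q in parse_accept_language(accept_language):
        if q <= 0:
            continue
        if lang == "*":
            tag = next(iter(available))
            return tag, available[tag]
        for tag in available:
            low = tag.lower()
            # "en" matches a variant labelled "en" or "en-us"
            if low == lang or low.startswith(lang + "-"):
                return tag, available[tag]
    return None
```

The tag returned by choose_variant() is what the server would then send back as the Content-Language header of the response, without the client or any proxy having to look inside the body.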

= =
>There must be possible to transport (floppies, attached, etc) a document
>with the lang and charset *inside*.  Having it as a file name extension
>could be consider *inside* (not a very good solution), but a doubt about a
>directory.

This was discussed in the past on IETF lists (html-wg and/or http-wg). It was
suggested that there were various applications for a format for
meta-information within a file system as well as "over-the-wire". This may
not just be an issue for HTML. I thought one of the most pointed cases for
this could be made for storing data on a CD-ROM.

In the alternate case of an HTTP server, it can be argued that the form of
the meta-information is an implementation detail that doesn't affect what
is sent over-the-wire (since the IETF is primarily in the "protocol"
standardization business, this makes the stored form less important to it).

But in cases like a CD-ROM one is dealing with a data-interchange format in
practice. Two related suggestions were to store a complete HTTP or MIME
message with headers, or to store an associated file of meta-information as
headers, and I think Apache has implemented both of these options. But I
don't _think_ either option has been documented and standardized enough for
people to reasonably expect to use it for file interchange,
except when moving between controlled server environments. Either could
have the advantage that it doesn't require parsing an HTML document and that
it can apply to file types other than HTML.
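
The second suggestion (an associated file of meta-information stored as headers) can be sketched in Python as follows. This is only an illustration: the ".meta" suffix and the function name are assumptions made for the example, not a documented convention of Apache or any other server.

```python
from email.parser import HeaderParser
from pathlib import Path


def load_sidecar_headers(document_path, suffix=".meta"):
    """Read RFC 822-style headers from a file stored next to the document.

    For "index.html" this looks for "index.html.meta" (hypothetical naming).
    Returns a dict of header name -> value, or {} if there is no such file.
    """
    meta_path = Path(str(document_path) + suffix)
    if not meta_path.exists():
        return {}
    message = HeaderParser().parsestr(meta_path.read_text(encoding="ascii"))
    return dict(message.items())
```

A sidecar file containing the two lines "Content-Language: fr" and "Content-Type: text/html; charset=iso-8859-1" would yield both values without opening the document itself, and the same reader works for a GIF or a PostScript file as well as for HTML.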

Maybe this is an area that needs a spec of its own?
= =

Storing charset information in the document body is a hack that is useful
in particular circumstances, but it is a hack nevertheless. We should
document it, but deprecate it, to prevent people from needing to resort to
even less reliable hacks to determine the charset of unknown documents.

It is easy to use only for those character-encodings that have US-ASCII as
a clear sub-mapping (consider EBCDIC as a case that needs more heuristics),
and it fails for documents that have had their character-encoding
translated.
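
A minimal Python sketch of the hack shows both limits. It scans the raw bytes for a META charset label on the assumption that the markup itself is readable as US-ASCII; the function name, the regular expression, and the 1024-byte limit are choices made for this illustration only.

```python
import re

# Matches <META HTTP-EQUIV="Content-Type" ... charset=xxx> in raw bytes.
# This only works if "<", "=", and the letters are encoded as in US-ASCII.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*"""
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset named in a META tag near the start of raw, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None
```

For a Latin-1 document the label is found, but the same text encoded in an EBCDIC code page (for example Python's "cp037" codec) yields None, since the bytes for the tag no longer look like ASCII. If a gateway translates the document to another encoding without rewriting the tag, the function returns the old, now wrong, label.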

Also, use of META to store the charset is really only defined for text/html.
= =

I haven't heard of anyone actually parsing META tags to generate additional
headers, and one could argue there are performance reasons why this has not
been done. One can make a case that META with HTTP-EQUIV should mean the
same thing as the corresponding HTTP headers; this gives some semantics to
META, which from a standards view is a lot like "pragma" in C: a dumping
ground for implementation-specific stuff and local experiments.

The same performance reasons would suggest that the majority of servers
won't parse HTML for language tags on-the-fly either. Performance _is_ an
argument in favor of preferring <HTML LANG=...>; if present, the HTML tag
would always appear first.
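
As a rough illustration of the parsing involved, the Python sketch below collects both the LANG attribute of the HTML tag and any META HTTP-EQUIV pairs, and records how many start tags had to be read before the language was known. The class and attribute names are hypothetical; this is not how any existing server is known to work.

```python
from html.parser import HTMLParser


class HeadScanner(HTMLParser):
    """Collect <HTML LANG=...> and <META HTTP-EQUIV=... CONTENT=...> values."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.lang_found_at = None   # index of the start tag that gave the language
        self.equiv_headers = []     # (header name, value) pairs from META
        self.tags_seen = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        self.tags_seen += 1
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.lang = attrs["lang"]
            self.lang_found_at = self.tags_seen
        elif tag == "meta" and attrs.get("http-equiv"):
            self.equiv_headers.append(
                (attrs["http-equiv"], attrs.get("content") or "")
            )
```

Feeding it a document that starts with <HTML LANG=fr> gives the language at the first start tag, whereas a META element can only be found after reading some way into HEAD.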


---
    Albert Lunde                      Albert-Lunde@nwu.edu
Received on Friday, 28 February 1997 13:10:15 UTC