Re: DTD & Lang attribute (Checkpoint 3.2 & 4.3) from David Woolley on 2002-09-09 (w3c-wai-ig@w3.org from July to September 2002)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Mon, 9 Sep 2002 21:56:03 +0100 (BST)
To: w3c-wai-ig@w3.org
Message-Id: <200209092056.g89Ku4O02358@djwhome.demon.co.uk>
> From what I understand (please correct me if I'm wrong) the DTD and HTML
> Lang attribute are used by browsers/ATs to determine the code and
> language used in order to best represent a site.  My question is this:

DOCTYPE is used by validators to enable them to check the syntax.
In principle it could be used by a pure CSS browser to tell it something
about HTML - you would need a full style sheet as well.  It is used
by some recent browsers to indicate that the author probably really
wrote in HTML, rather than a random sequence of tags, and therefore
wants HTML obeyed properly, rather than heuristics to make it behave
more like broken earlier browsers.  This is typically triggered by a
Strict variant of the DTD.

LANG is used to set the default natural language for the document, and in
theory is used to select layout rules appropriate for the language (e.g.
there are no word spaces in Chinese).  It can be used by style sheets to
implement some of this.  Some search engines will use it to allow search
by language.
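
As a rough illustration of how an indexing tool might read the document
default, here is a minimal Python sketch using the standard library's
html.parser.  The names LangFinder and document_lang are made up for this
example, and I am assuming the tool only looks at LANG on the HTML element;
real search engines may well do something different.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Remember the LANG attribute of the first HTML start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us;
        # attribute values keep their original case.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def document_lang(html):
    """Return the document's default language, or None if not declared."""
    finder = LangFinder()
    finder.feed(html)
    finder.close()
    return finder.lang
```

A document with no LANG on the HTML element simply yields None, which is
the case where a tool would have to guess the language.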

> 
> Can these browsers/ATs still read (and interpret) the DTD and HTML Lang
> attribute if it is not at the very beginning of the document? for

Only if they provide appropriate error recovery - however, in principle,
the nature of that error recovery depends on the document type, so not
starting with the DOCTYPE is likely to force legacy broken HTML mode
and cause the subsequent DOCTYPE to be ignored.
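
The ordering point can be sketched in Python.  This is a deliberately
simplified heuristic of my own, not any real browser's algorithm: it
assumes only whitespace and comments may come before the DOCTYPE, and it
ignores which DTD is actually named.  The function name guess_mode is
hypothetical.

```python
import re

# Assumption of this sketch: only whitespace and comments are tolerated
# ahead of the DOCTYPE declaration.
_LEADING = re.compile(r"(?:\s+|<!--.*?-->)*", re.DOTALL)


def guess_mode(document):
    """Return 'standards' if the document starts with a DOCTYPE, else 'quirks'."""
    rest = document[_LEADING.match(document).end():]
    if rest[:9].upper() == "<!DOCTYPE":
        return "standards"
    # Anything else first (a META tag, stray text) means a later
    # DOCTYPE is never consulted.
    return "quirks"
```

With this rule, a document that opens with a META element falls into the
legacy mode even if a DOCTYPE follows on the next line.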

There are some artefacts in the following due to not decoding the
quoted-printable encoding, but they actually demonstrate some points.

> <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html;
> charset=3DUS-ASCII">=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=20

No content is allowed here, so the non-breaking space is invalid.
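
For anyone wanting to see what the undecoded sequences stand for, a small
Python sketch using the standard quopri module follows.  The helper name
decode_qp is made up, and treating the decoded bytes as ISO-8859-1 is my
assumption, since the charset of the original mail is not shown here.

```python
import quopri


def decode_qp(raw, charset="iso-8859-1"):
    """Decode quoted-printable bytes, then interpret them in `charset`.

    The default charset is an assumption made for this example.
    """
    return quopri.decodestring(raw).decode(charset)


# "=3D" is an encoded "=", and "=A0" is byte 0xA0, which is the
# non-breaking space in ISO-8859-1.
decoded = decode_qp(b'charset=3DUS-ASCII">=A0')
```

The decoded text ends in U+00A0, which is character content sitting
outside any element.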

> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"

Too late.

> <title>Amputations</title><meta http-equiv=3D"content-Type"
> content=3D"text/html; CHARSET=3D"iso-8859-1">

I doubt that any browser will find this, even if it does very crude
parsing.  This document is restricted to ASCII, although you can use
entities, as the browser will have selected the character set from the
first META declaration (chances are it doesn't enforce the restriction).
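
A very crude scan of the sort I have in mind might look like the Python
below.  It is only a sketch: the function name first_declared_charset is
hypothetical, and I am assuming the scanner stops at the first "charset="
it meets, which is not a claim about how any particular browser works.

```python
import re

# Matches charset=VALUE with an optional opening quote before VALUE.
_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def first_declared_charset(html):
    """Return the first declared charset (lower-cased), or None."""
    match = _CHARSET.search(html)
    return match.group(1).lower() if match else None
```

Fed the decoded form of the two META elements quoted above, this returns
"us-ascii"; the later iso-8859-1 declaration is never reached.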

> and cannot be tinkered with in any way (and if they can I would be more
> than happy to admit my error and learn how!)

It is always going to generate technically invalid documents, but you
can avoid compounding the issue by not including DOCTYPE, a HEAD element,
or BODY tag.  You ought to be able to specify LANG on the outermost elements
in the BODY, but you will have to do it for each one, and keep all text
inside block elements - I'm not sure what rules search engines
use, but I doubt they look for LANG in depth.  You will need to ensure
that you use entities for any non-ASCII characters and should not assume
anything beyond HTML 4.01 Transitional (technically the last version of
HTML that allowed DOCTYPE to be left out was HTML 2.0, but that's not
a problem with real browsers).
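
If the fragment is generated by a program, keeping it to ASCII is easy to
automate.  The Python sketch below (the function name to_ascii_html is
made up) uses the standard "xmlcharrefreplace" error handler, so it emits
numeric character references such as &#233; rather than named entities;
both forms are valid in HTML 4.01.

```python
def to_ascii_html(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Note that this does not escape "&" or "<", so it should only be applied
to text that is already otherwise valid HTML.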

> This message contains privileged and confidential information intended

Bogus confidentiality notice deleted.

> Content-Type: application/rtf

Redundant, proprietary, huge (25K), word processor format deleted.
Received on Monday, 9 September 2002 17:01:36 UTC