HTML 4.01 analysis for conformance

Hi,

As an exercise for the Spec Guidelines, and as preparation for a 
future evaluation of XHTML 2.0, I have re-read the full HTML 4.01 
Specification and its errata.
http://www.w3.org/TR/1999/REC-html401-19991224/

In this exercise, I have tried to identify only the MUST, MUST NOT 
and REQUIRED keywords as defined in the HTML 4.01 spec. They are not 
in uppercase there.


	Conformance: requirements and recommendations
	http://www.w3.org/TR/html4.01/conform.html

	"""The key words "MUST", "MUST NOT", "REQUIRED",
	"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
	"RECOMMENDED", "MAY", and "OPTIONAL" in this
	document are to be interpreted as described in
	[RFC2119]. However, for readability, these words
	do not appear in all uppercase letters in this
	specification.

	At times, the authors of this specification
	recommend good practice for authors and user
	agents. These recommendations are not normative
	and conformance with this specification does not
	depend on their realization. These recommendations
	contain the expression "We recommend ...",
	"This specification recommends ...", or some
	similar wording."""


* My analysis:



	The HTML 4.01 Spec as currently written is very difficult to 
test on many points, and some points contradict each other. The 
semantics of elements are not defined as requirements. For example, 
there is no sentence such as "The address element MUST contain 
address information", which means that someone could put a shopping 
list in it and still be conformant from a semantic point of view. The 
use of MUST is sometimes inconsistent and sometimes unclear.

	Another difficulty I had was identifying which text is part 
of a MUST. Is it the English sentence, which starts at the end of 
the previous one and ends with a full stop? Or is it the whole 
paragraph? While reviewing, I came to think that markup for testable 
assertions is needed, and/or that once a MUST has been extracted, the 
sentence has to be meaningful by itself.

	Another point: how do you distinguish a MUST requirement from 
an ordinary "must" in English prose?
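
	To make these questions concrete, here is a minimal sketch of a 
naive extraction, assuming that a sentence ends at ".", "!" or "?" 
followed by white space. The names KEYWORDS and 
candidate_requirements are hypothetical. The sketch cannot tell a 
normative must from a must of plain English prose; it only lists 
candidates for a human to review.

```python
import re

# Keywords searched for; the HTML 4.01 spec writes them in lowercase.
# "must not" is listed first so it wins over the shorter "must".
KEYWORDS = re.compile(r"\b(must not|must|required)\b", re.IGNORECASE)


def candidate_requirements(text):
    """Return (keyword, sentence) pairs for sentences containing a keyword.

    Sentences are split naively after '.', '!' or '?' followed by white
    space, so a requirement that spans several sentences is cut apart.
    """
    normalized = " ".join(text.split())  # collapse line breaks and tabs
    sentences = re.split(r"(?<=[.!?])\s+", normalized)
    found = []
    for sentence in sentences:
        match = KEYWORDS.search(sentence)
        if match:
            found.append((match.group(1).lower(), sentence))
    return found
```

Only the first keyword of each sentence is reported, so a sentence 
with two requirements still needs to be split by hand.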


	It was a really interesting exercise that I will push further 
by reviewing the XHTML 2.0 spec.



* List of MUST, MUST NOT, REQUIRED



4.01 User agents must not render SGML processing instructions (e.g., 
<?full volume>) or comments.

5.01 User agents must also know the specific character encoding that 
was used to transform the document character stream into a byte 
stream.

5.02 This specification does not mandate which character encodings a 
user agent must support.

5.03 Conforming user agents must correctly map to ISO 10646 all 
characters in any character encodings that they recognize (or they 
must behave as if they did).

5.04 User agents must not assume any default value for the "charset" parameter.

5.05 The META declaration must only be used when the character 
encoding is organized such that ASCII-valued bytes stand for ASCII 
characters (at least until the META element is parsed).

5.06 To sum up, conforming user agents must observe the following 
priorities when determining a document's character encoding (from 
highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type" and 
a value set for "charset".
    3. The charset attribute set on an element that designates an 
external resource.
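
The priority order of 5.06 can be sketched as follows. The function 
and parameter names are hypothetical; each argument stands for a 
charset value that the user agent has already extracted, or None when 
that source is absent.

```python
def document_encoding(http_charset=None, meta_charset=None,
                      link_charset=None):
    """Pick a character encoding using the priority order of item 5.06.

    Returns a (source, charset) pair, or (None, None) when no source
    supplies a value.
    """
    candidates = (
        ("HTTP Content-Type", http_charset),   # highest priority
        ("META http-equiv", meta_charset),
        ("charset attribute", link_charset),   # lowest priority
    )
    for source, value in candidates:
        if value:
            return source, value
    # Item 5.04: no default value may be assumed, so report "unknown".
    return None, None
```

For example, document_encoding(None, "euc-jp", "utf-8") selects the 
META declaration, because no HTTP charset parameter was given.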

6.01 Although the STYLE and SCRIPT elements use CDATA for their data 
model, for these elements, CDATA must be handled differently by user 
agents. Markup and entities must be treated as raw text and passed to 
the application as is.

6.02 ID and NAME tokens must begin with a letter ([A-Za-z])

6.03 NUMBER tokens must contain at least one digit ([0-9]).
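
Items 6.02 and 6.03 are among the easiest to test. A minimal sketch 
follows, with two assumptions that are not in the text quoted above: 
the characters allowed after the first letter (letters, digits, "-", 
"_", ":" and ".") come from the rest of the spec sentence, and NUMBER 
is taken to consist of digits only. The function names are 
hypothetical.

```python
import re

# First character must be a letter (item 6.02); the set of following
# characters is taken from the rest of the spec sentence (assumption).
ID_OR_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")

# At least one digit (item 6.03); digits only is an assumption.
NUMBER = re.compile(r"[0-9]+")


def is_id_or_name(token):
    """True if the whole token is a valid ID or NAME token."""
    return ID_OR_NAME.fullmatch(token) is not None


def is_number(token):
    """True if the whole token is a valid NUMBER token."""
    return NUMBER.fullmatch(token) is not None
```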

6.04 The "charset" attributes (%Charset in the DTD) refer to a 
character encoding as described in the section on character 
encodings. Values must be strings (e.g., "euc-jp") from the IANA 
registry (see [CHARSETS] for a complete list).

6.05 User agents must follow the steps set out in the section on 
specifying character encodings in order to determine the character 
encoding of an external resource. (Cf. 5.06)

6.06 (Dates and times YYYY-MM-DDThh:mm:ssTZD) Z  indicates UTC 
(Coordinated Universal Time). The "Z" must be uppercase.

6.07 (Dates and times) Exactly the components shown here must be 
present, with exactly this punctuation. Note that the "T" appears 
literally in the string (it must be uppercase), to indicate the 
beginning of the time element, as specified in [ISO8601]
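
A syntax-only check for 6.06 and 6.07 might look like the sketch 
below. It assumes that TZD is either "Z" or an offset written +hh:mm 
or -hh:mm, which is how the spec defines it but is not quoted above. 
It does not check value ranges such as month 13, and the function 
name is hypothetical.

```python
import re

DATETIME = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}"   # YYYY-MM-DD
    r"T"                            # literal "T", uppercase only
    r"[0-9]{2}:[0-9]{2}:[0-9]{2}"   # hh:mm:ss
    r"(?:Z|[+-][0-9]{2}:[0-9]{2})"  # TZD: uppercase "Z" or +hh:mm / -hh:mm
)


def is_html4_datetime(value):
    """True if value has exactly the components and punctuation required."""
    return DATETIME.fullmatch(value) is not None
```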

6.08 (Media Descriptors) To facilitate the introduction of these 
extensions, conforming user agents must be able to parse the media 
attribute value as follows:
	1. The value is a comma-separated list of entries.
	2. Each entry is truncated just before the first character 
that isn't a US ASCII letter [a-zA-Z] (ISO 10646 hex 41-5a, 61-7a), 
digit [0-9] (hex 30-39), or hyphen (hex 2d).
	3. A case-sensitive match is then made with the set of media 
types defined above. User agents may ignore entries that don't match.
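
The three parsing steps of 6.08 can be sketched as below. The set 
MEDIA_TYPES lists the media types that the spec defines just before 
this passage (they are not quoted in this mail), the function name is 
hypothetical, and stripping white space around each entry is an 
assumption that the quoted steps do not state.

```python
import re

# Media types defined by HTML 4.01 ahead of the parsing rules.
MEDIA_TYPES = {
    "screen", "tty", "tv", "projection", "handheld",
    "print", "braille", "aural", "all",
}


def parse_media(value):
    """Return the recognized media types in a media attribute value."""
    recognized = []
    # Step 1: the value is a comma-separated list of entries.
    for entry in value.split(","):
        entry = entry.strip()  # assumption: surrounding space is ignored
        # Step 2: truncate before the first character that is not an
        # ASCII letter, digit or hyphen.
        name = re.match(r"[A-Za-z0-9-]*", entry).group(0)
        # Step 3: case-sensitive match; unknown entries are ignored.
        if name in MEDIA_TYPES:
            recognized.append(name)
    return recognized
```

With the value "screen, 3d-glasses, print and resolution > 90dpi", 
the entries truncate to "screen", "3d-glasses" and "print", and the 
sketch keeps "screen" and "print".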

6.09 (Scripts) User agents must not evaluate script data as HTML 
markup but instead must pass it on as data to a script engine.

6.10 (Stylesheet) User agents must not evaluate style data as HTML markup.

6.11 (Frames) Except for the reserved names listed below, frame 
target names (%FrameTarget; in the DTD) must begin with an alphabetic 
character (a-zA-Z).  -> reserved list (_blank, _self, _parent, _top)

7.01 HTML 4.01 specifies three DTDs, so authors must include one of 
the following document type declarations in their documents.

7.02 Every HTML document must have a TITLE element in the HEAD section.

7.03 For reasons of accessibility, user agents must always make the 
content of the TITLE element available to users (including TITLE 
elements that occur in frames).

7.04 (about id="name") This name must be unique in a document.

7.05 (about class="cdata-list") Multiple class names must be 
separated by white space characters.

7.06 Authors may also choose to use a system identifier that refers 
to a specific (dated) version of an HTML 4 DTD when validation to 
that particular DTD is required.

7.07 Exactly one title is required per document.

7.08 User agents are not required to support meta data mechanisms. 
(META element)
	[Karl: Contradicts the charset rules, cf. 5.06]

8.01 user agents must make a best attempt to render all characters, 
regardless of the value specified by lang. [8.1]

8.02 User agents must make a best attempt to render [gamma character] 
even though it is not an English character. [8.1]
	[Karl: Use of a must in an example]

8.03 [RFC1766] defines and explains the language codes that must be 
used in HTML documents. [8.1.1]

8.04  If a document contains right-to-left characters, and if the 
user agent displays these characters, the user agent must use the 
bidirectional algorithm. [8.2]

8.05 User agents must not use the lang attribute to determine text 
directionality. [8.2]

8.06 To achieve additional levels of embedded direction changes, you 
must make use of the dir attribute on an inline element. [8.2.3]

8.07 To achieve two embedded direction changes, we must supply 
additional information, which we do by delimiting the second 
embedding explicitly. [8.2.3]

8.08 Because HTML uses the Unicode bidirectionality algorithm, 
conforming documents encoded using ISO 8859-8 must be labeled as 
"ISO-8859-8-i". [8.2.4]

8.09 However, because the bidirectional algorithm relies on the 
inline/block-level distinction, special care must be taken during the 
transformation. [8.2.6]

8.10 The BDO element should be used in scenarios where absolute 
control over sequence order is required (e.g., multi-language part 
numbers). [8.2.4]

8.11 If a document does not contain a displayable right-to-left 
character, a conforming user agent is not required to apply the 
[UNICODE] bidirectional algorithm. [8.2]

9.01 Visual user agents must ensure that the content of the Q element 
is rendered with delimiting quotation marks. [9.2.2]

9.02 A number of issues, both stylistic and technical, must be addressed:

     * Treatment of white space
     * Line breaking and word wrapping
     * Justification
     * Hyphenation
     * Written language conventions and text directionality
     * Formatting of paragraphs with respect to surrounding content [9.3]

9.03 Those browsers that interpret soft hyphens must observe the 
following semantics: If a line is broken at a soft hyphen, a hyphen 
character must be displayed at the end of the first line. If a line 
is not broken at a soft hyphen, the user agent must not display a 
hyphen character. [9.3.3]

9.04 When handling preformatted text, visual user agents: Must not 
disable bidirectional processing. [9.3.4]

9.05 The INS and DEL elements must not contain block-level content 
when these elements behave as inline elements. [9.4]

9.06 Non-visual user agents are not required to respect extra white 
space in the content of a PRE element. [9.3.4]

10.01 All lists must contain one or more list elements. [10.1]

11.01 User agents must know where to render the header and footer. [11.2.1]

11.02 In order for a user agent to format a table in one pass, 
authors must tell the user agent:

     * The number of columns in the table. Please consult the section 
on calculating the number of columns in a table for details on how to 
supply this information.
     * The widths of these columns. Please consult the section on 
calculating the width of columns for details on how to supply this 
information. [11.2.1]

11.03 If any of the columns are specified in relative or percentage 
terms (see the section on calculating the width of columns), authors 
must also specify the width of the table itself. [11.2.1]

11.04 Each row group must contain at least one row, defined by the TR 
element. [11.2.3]

11.05 TFOOT must appear before TBODY within a TABLE definition so 
that user agents can render the foot before receiving all of the 
(potentially numerous) rows of data. [11.2.3]

11.06 The following summarizes which tags are required and [...]:

     * The TBODY start tag is always required except when the table 
contains only one table body and no table head or foot sections. [...]
     * The start tags for THEAD and TFOOT are required when the table 
head and foot sections are present respectively, [...]

Conforming user agent parsers must obey these rules for reasons of 
backward compatibility. [11.2.3]

11.07 The THEAD, TFOOT, and TBODY sections must contain the same 
number of columns. [11.2.3]

11.08 (about span="number" in the colgroup element) This attribute, 
which must be an integer > 0, specifies the number of columns in a 
column group.  [11.2.4]

11.09 (about span="number" in the colgroup element) User agents must 
ignore this attribute if the COLGROUP element contains one or more 
COL elements. [11.2.4]

11.10 (about width="multi-length" in the colgroup element) This 
implies that a column's entire contents must be known before its 
width may be correctly computed.

11.11 When it is necessary to single out a column (e.g., for style 
information, to specify width information, etc.) within a group, 
authors must identify that column with a COL element. [11.2.4]

11.12 (about span="number" in the col element) This attribute, whose 
value must be an integer > 0, specifies the number of columns 
"spanned" by the COL element; the COL element shares its attributes 
with all the columns it spans.   [11.2.4]

11.13 However, if the table does not have a fixed width, user agents 
must receive all table data before they can determine the horizontal 
space required by the table. [11.2.4]

11.14 (about the attribute headers="idrefs" in th and td elements) 
The value of this attribute is a space-separated list of cell names; 
those cells must be named by setting their id attribute. [11.2.6]

11.15 (about the attribute scope="scope-name" in th and td elements) 
When specified, this attribute must have one of the following values:

     * row: The current cell provides header information for the rest 
of the row that contains it (see also the section on table 
directionality).
     * col: The current cell provides header information for the rest 
of the column that contains it.
     * rowgroup: The header cell provides header information for the 
rest of the row group that contains it.
     * colgroup: The header cell provides header information for the 
rest of the column group that contains it. [11.2.6]

11.16 User agents must render either the contents of the cell or the 
value of the abbr attribute. [11.2.6]

11.17 For a given data cell, the headers attribute lists which cells 
provide pertinent header information. For this purpose, each header 
cell must be named using the id attribute. [11.4.1]

11.18 (about char="character" attribute about horizontal alignment) 
User agents are not required to support this attribute. [11.3.2]

11.19 (about charoff="length" attribute about horizontal alignment) 
User agents are not required to support this attribute. [11.3.2]

11.20 If a table or given column has a fixed width, cellspacing and 
cellpadding may demand more space than assigned. User agents may give 
these attributes precedence over the width attribute when a conflict 
occurs, but are not required to. [11.3.3]

12.01 The destination anchor must be given an anchor name and any URI 
addressing this anchor must include the name as its fragment 
identifier. [12.1.1]

12.02 (about name="cdata" in A element) The value of this attribute 
must be a unique anchor name. [12.2]

12.03 Anchor names must observe the following rules:

     * Uniqueness: Anchor names must be unique within a document. 
Anchor names that differ only in case may not appear in the same 
document.
     * String matching: Comparisons between fragment identifiers and 
anchor names must be done by exact (case-sensitive) match. [12.2.1]
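
The two rules of 12.03 pull in different directions: uniqueness 
ignores case, while matching respects it. A minimal sketch with 
hypothetical function names:

```python
def clashing_anchor_names(names):
    """Return (first, later) pairs of names that differ at most in case."""
    seen = {}
    clashes = []
    for name in names:
        key = name.lower()  # uniqueness rule: case is ignored
        if key in seen:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes


def fragment_matches(fragment, anchor_name):
    """String matching rule: exact, case-sensitive comparison."""
    return fragment == anchor_name
```

So "intro" and "Intro" may not both appear in one document, yet the 
fragment identifier "#Intro" does not match an anchor named "intro".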

12.04 Links and anchors defined by the A element must not be nested; 
an A element must not contain any other A elements. [12.2.2]

12.05 (about id and name attributes) When both attributes are used on 
a single element, their values must be identical. [12.2.3]

12.06 When present, the BASE element must appear in the HEAD section 
of an HTML document, before any element that refers to an external 
source.  [12.4]

12.07 User agents must calculate the base URI for resolving relative 
URIs according to [RFC1808], section 3. [12.4.1]

12.08 User agents must calculate the base URI according to the 
following precedences (highest priority to lowest):

    1. The base URI is set by the BASE element.
    2. The base URI is given by meta data discovered during a protocol 
interaction, such as an HTTP header (see [RFC2616]).
    3. By default, the base URI is that of the current document. Not 
all HTML documents have a base URI (e.g., a valid HTML document may 
appear in an email and may not be designated by a URI). Such HTML 
documents are considered erroneous if they contain relative URIs and 
rely on a default base URI. [12.4.1]
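
The precedence of 12.08 can be sketched as follows. The function and 
parameter names are hypothetical, and Python's urljoin is used as a 
stand-in for the [RFC1808] algorithm; it follows later URI 
specifications, so results may differ in edge cases.

```python
from urllib.parse import urljoin


def resolve(relative, base_element=None, protocol_base=None,
            document_uri=None):
    """Resolve a relative URI using the base URI precedence of 12.08."""
    # Highest priority first: BASE element, then protocol meta data,
    # then the URI of the current document.
    base = base_element or protocol_base or document_uri
    if base is None:
        # A document without any base URI that contains relative URIs
        # is considered erroneous.
        raise ValueError("relative URI but no base URI is available")
    return urljoin(base, relative)
```

For instance, "../cages/birds.gif" resolved against a BASE of 
"http://www.aviary.com/products/intro.html" gives 
"http://www.aviary.com/cages/birds.gif".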

13.01 (about longdesc="uri" in img element) Since an IMG element may 
be within the content of an A element, the user agent's mechanism in 
the user interface for accessing the "longdesc" resource of the 
former must be different than the mechanism for accessing the href 
resource of the latter. [13.2]

13.02 User agents must render alternate text when they cannot support 
images, they cannot support a certain image type or when they are 
configured not to display images. [13.2]

13.03 (about object element) declare [CI]
     When present, this boolean attribute makes the current OBJECT 
definition a declaration only. The object must be instantiated by a 
subsequent OBJECT definition referring to this declaration.  [13.3]

13.04 (about object element) In the most general case, an author may 
need to specify three types of information:

     * The implementation of the included object. For instance, if the 
included object is a clock applet, the author must indicate the 
location of the applet's executable code.
     * The data to be rendered. For instance, if the included object 
is a program that renders font data, the author must indicate the 
location of that data. [13.3]

13.05 A user agent must interpret an OBJECT element according to the 
following precedence rules:

    1. The user agent must first try to render the object. It should 
not render the element's contents, but it must examine them in case 
the element contains any direct children that are PARAM elements (see 
object initialization) or MAP elements (see client-side image maps).
    2. If the user agent is not able to render the object for whatever 
reason (configured not to, lack of resources, wrong architecture, 
etc.), it must try to render its contents. [13.3.1]
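
The fallback behaviour of 13.05 is recursive when OBJECT elements are 
nested. The sketch below assumes a hypothetical dictionary 
representation of an OBJECT (keys "type", "data" and "content") and 
leaves out the examination of PARAM and MAP children.

```python
def render_object(obj, supported_types):
    """Return what a user agent would show for a (possibly nested) OBJECT.

    obj is a hypothetical dict {"type": str, "data": str, "content": list}
    whose content items are nested OBJECT dicts or plain text strings.
    """
    if obj["type"] in supported_types:
        # Step 1: the object itself is rendered, its contents are not.
        return "<" + obj["data"] + ">"
    # Step 2: the object cannot be rendered, so try its contents,
    # applying the same rule to any nested OBJECT.
    parts = []
    for child in obj["content"]:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(render_object(child, supported_types))
    return "".join(parts)
```

A user agent that supports none of the nested types ends up with the 
innermost text, which acts as the final alternate content.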

13.06 (about attribute valuetype=data|ref|object in param element) 
ref: The value specified by value is a URI that designates a resource 
where run-time values are stored. This allows support tools to 
identify URIs given as parameters. The URI must be passed to the 
object as is, i.e., unresolved.
object: The value specified by value is an identifier that refers to 
an OBJECT declaration in the same document. The identifier must be 
the value of the id attribute set for the declared OBJECT element. 
[13.3.2]

13.07 Any number of PARAM elements may appear in the content of an 
OBJECT or APPLET element, in any order, but must be placed at the 
start of the content of the enclosing OBJECT or APPLET element. 
[13.3.2]

13.08 When an OBJECT element is rendered, user agents must search the 
content for only those PARAM elements that are direct children and 
"feed" them to the OBJECT. [13.3.2]

13.09 To declare an object so that it is not executed when read by 
the user agent, set the boolean declare attribute in the OBJECT 
element. At the same time, authors must identify the declaration by 
setting the id attribute in the OBJECT element to a unique value. 
Later instantiations of the object will refer to this identifier. 
[13.3.4]

13.10 A declared OBJECT must appear in a document before the first 
instance of that OBJECT. [13.3.4]

13.11 User agents that don't support the declare attribute must 
render the contents of the OBJECT declaration. [13.3.4]

13.12 (about attributes code or object in APPLET element) Either code 
or object must be present. If both code and object are given, it is 
an error if they provide different class names. [13.4]

13.13 The content of the APPLET acts as alternate information for 
user agents that don't support this element or are currently 
configured not to support applets. User agents must ignore the 
content otherwise. [13.4]

13.14 Recall that the contents of OBJECT must only be rendered if the 
file specified by the data attribute cannot be loaded. [13.5]

13.15 (about attribute usemap) The value of usemap must match the 
value of the name attribute of the associated MAP element. [13.6.1]

13.16 Therefore, authors must provide alternate text for each AREA 
with the alt attribute (see below for information on how to specify 
alternate text). [13.6.1]

13.17 When a MAP element contains mixed content (both AREA elements 
and block-level content), user agents must ignore the AREA elements. 
[13.6.1]

13.18 It is only possible to define a server-side image map for the 
IMG and INPUT elements. In the case of IMG, the IMG must be inside an 
A element and the boolean attribute ismap ([CI]) must be set. In the 
case of INPUT, the INPUT must be of type "image". [13.6.2]

13.19 The alt attribute must be specified for the IMG and AREA elements. [13.8]

13.20 While alternate text may be very helpful, it must be handled 
with care.  [13.8]

14.01 Authors must specify the style sheet language of style 
information associated with an HTML document. [14.2.1]

14.02 (about type="content-type" attribute in style element) Authors 
must supply a value for this attribute; there is no default value for 
this attribute. [14.2.3]

14.03 User agents that don't support style sheets, or don't support 
the specific style sheet language used by a STYLE element, must hide 
the contents of the STYLE element. [14.2.3]

14.04 When a user selects a named style, the user agent must apply 
all style sheets with that name. [14.3.1]

14.05 User agents must not apply alternate style sheets with a 
different style name. [14.3.1]

14.06 Authors may also specify persistent style sheets that user 
agents must apply in addition to any alternate style sheet. [14.3.1]

14.07 User agents must respect media descriptors when applying any 
style sheet. [14.3.1]

14.08 User agents should also allow users to disable the author's 
style sheets entirely, in which case the user agent must not apply 
any persistent or alternate style sheets. [14.3.1]

15.01 Font style elements must be properly nested. [15.2.1]

16.01 Elements that might normally be placed in the BODY element must 
not appear before the first FRAMESET element or the FRAMESET will be 
ignored.  [16.2]

16.02 (about noresize attribute in frame element)  When present, this 
boolean attribute tells the user agent that the frame window must not 
be resizeable. [16.2.2]

16.03 (about marginwidth="pixels" attribute in frame element)  The 
value must be an integer greater than or equal to zero. (pixels). 
[16.2.2]

16.04 (about marginheight="pixels" attribute in frame element)  The 
value must be an integer greater than or equal to zero. (pixels). 
[16.2.2]

16.05 The contents of a frame must not be in the same document as the 
frame's definition. [16.2.2]

16.06 User agents that support frames must only display the contents 
of a NOFRAMES declaration when configured not to display frames. 
[16.4.1]

16.07 User agents that do not support frames must display the 
contents of NOFRAMES in any case. [16.4.1]

17.01 (about accept-charset="charset list" in form element) The value 
is a space- and/or comma-delimited list of charset values. The client 
must interpret this list as an exclusive-or list, i.e., the server is 
able to accept any single character encoding per entity received. 
[17.3]

17.02 (The FORM element acts as a container for controls.) The 
receiving program must be able to parse name/value pairs in order to 
make use of them. [17.3]

17.03 (The FORM element acts as a container for controls.) A 
character encoding that must be accepted by the server in order to 
handle this form (the accept-charset attribute). [17.3]

17.04 Please consult the section on form submission for information 
about how user agents must prepare form data for servers [17.3]

17.05 (about checked attribute in input element) User agents must 
ignore this attribute for other control types. [17.4]

17.06 Recall that authors must provide alternate text for an IMG 
element. [17.5]

17.07 A SELECT element must contain at least one OPTION element. [17.6]

17.08 The OPTGROUP element allows authors to group choices logically. 
This is particularly helpful when the user must choose from a long 
list of options; groups of related choices are easier to grasp and 
remember than a single long list of options. In HTML 4, all OPTGROUP 
elements must be specified directly within a SELECT element (i.e., 
groups may not be nested). [17.6]

17.09 (about for="idref" attribute in label element) When present, 
the value of this attribute must be the same as the value of the id 
attribute of some other control in the same document. [17.9.1]

17.10 The for attribute associates a label with another control 
explicitly: the value of the for attribute must be the same as the 
value of the id attribute of the associated control element. [17.9.1]

17.11 To associate a label with another control implicitly, the 
control element must be within the contents of the LABEL element. 
[17.9.1]

17.12 In an HTML document, an element must receive focus from the 
user in order to become active and perform its tasks. For example, 
users must activate a link specified by the A element in order to 
follow the specified link. Similarly, users must give a TEXTAREA 
focus in order to enter text into it. [17.11]
	[Karl: Are these examples or mandatory requirements?]

17.13 (about tabindex="number" attribute) This value must be a number 
between 0 and 32767. [17.11.1]

17.14 (about tabindex="number" attribute) Values need not be 
sequential nor must they begin with any particular value. [17.11.1]

17.15 Similarly, an author may want to include a piece of read-only 
text that must be submitted as a value along with the form. [17.12]

17.16 A successful control must be defined within a FORM element and 
must have a control name. [17.13.2]

17.17 HTML 4 user agents must support the established conventions in 
the following cases:

     * If the method is "get" and the action is an HTTP URI, the user 
agent takes the value of action, appends a `?' to it, then appends 
the form data set, encoded using the 
"application/x-www-form-urlencoded" content type. The user agent then 
traverses the link to this URI. In this scenario, form data are 
restricted to ASCII codes.
     * If the method is "post" and the action is an HTTP URI, the user 
agent conducts an HTTP "post" transaction using the value of the 
action attribute and a message created according to the content type 
specified by the enctype attribute. [17.13.3]

17.18 User agents must support the content types listed below. 
(application/x-www-form-urlencoded, multipart/form-data) [17.13.4]

17.19 Forms submitted with this content type 
(application/x-www-form-urlencoded) must be encoded as follows:

    1. Control names and values are escaped. Space characters are 
replaced by `+', and then reserved characters are escaped as 
described in [RFC1738], section 2.2: Non-alphanumeric characters are 
replaced by `%HH', a percent sign and two hexadecimal digits 
representing the ASCII code of the character. Line breaks are 
represented as "CR LF" pairs (i.e., `%0D%0A').
    2. The control names/values are listed in the order they appear in 
the document. The name is separated from the value by `=' and 
name/value pairs are separated from each other by `&'. [17.13.4]
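
The encoding steps of 17.19, and the "get" case of 17.17, can be 
approximated with Python's standard library. This is a sketch under 
assumptions: urlencode leaves "_", ".", "-" and "~" unescaped and 
encodes non-ASCII characters as UTF-8, neither of which is stated in 
the text quoted above, and the function names are hypothetical.

```python
import re
from urllib.parse import urlencode


def encode_form(pairs):
    """Encode (name, value) pairs as application/x-www-form-urlencoded.

    pairs must already be in document order; the order is preserved.
    """
    # Line breaks are represented as CR LF pairs before escaping.
    normalized = [
        (name, re.sub(r"\r\n|\r|\n", "\r\n", value))
        for name, value in pairs
    ]
    # urlencode replaces spaces by '+', escapes reserved characters as
    # %HH, and joins the pairs with '=' and '&'.
    return urlencode(normalized)


def get_submission_uri(action, pairs):
    """Build the URI for method "get": action, then '?', then the data."""
    return action + "?" + encode_form(pairs)
```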

17.20 (about INPUT element) attribute name required for all but 
submit and reset [17.4]

17.21 Visual user agents are not required to present a SELECT element 
as a list box; they may use any other mechanism, such as a drop-down 
menu. [17.6]

17.22 If a control doesn't have a current value when the form is 
submitted, user agents are not required to treat it as a successful 
control. [17.13.2]

18.01 (about type="content-type" in script element) Authors must 
supply a value for this attribute. [18.2.1]

18.02 If the src attribute is not set, user agents must interpret the 
contents of the element as the script. [18.2.1]

18.03 If the src has a URI value, user agents must ignore the 
element's contents and retrieve the script via the URI. [18.2.1]

18.04 Scripts are evaluated by script engines that must be known to a 
user agent. [18.2.1]

18.05 As HTML does not rely on a specific scripting language, 
document authors must explicitly tell user agents the language of 
each script. [18.2.2]

18.06 The type attribute must be specified for each SCRIPT element 
instance in a document. [18.2.2]

18.07 (about NOSCRIPT element) User agents that do not support 
client-side scripts must render this element's contents. [18.3.1]

18.08 User agents may still attempt to interpret incorrectly 
specified scripts but are not required to. [18.2.2]


-- 
Karl Dubost / W3C - Conformance Manager
           http://www.w3.org/QA/

      --- Be Strict To Be Cool! ---

Received on Saturday, 22 March 2003 10:56:45 UTC