markup spec [was: Re: Should we Publish a Language Specification?] from Jim Jewett on 2008-11-20 (public-html@w3.org from November 2008)

From: Jim Jewett <jimjjewett@gmail.com>
Date: Thu, 20 Nov 2008 12:09:21 -0500
To: "HTML WG" <public-html@w3.org>
Message-ID: <fb6fbf560811200909h5469a5ccoa7fb54bebeb154b7@mail.gmail.com>
(Sorry if you saw an earlier version already; I accidentally sent it
to the wrong list)

Julian wrote:

> I agree that the language HTML5 should have a singular normative
> definition. I'd prefer it not to be the same document that describes all
> the rest.

I'll go farther and say that http://www.w3.org/html/wg/markup-spec/ is
such a good start that I'm ready to start commenting on it.  Many of
these comments would apply to the original spec as well, but I kept
getting sort of lost there, because of the size.

And that size (plus the more general audience for HTML) is the reason
it is so important to separate out the various parts -- more important
than for some other specifications.


Section 2, Terminology:

Should "case-insensitive" be "ASCII-case-insensitive", which is the
term used in Section 3.6, Attributes?
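
To show why the two terms are not interchangeable, here is a minimal
Python sketch.  It assumes that "ASCII-case-insensitive" means folding
only A-Z onto a-z; the helper names are hypothetical and come from
neither document.

```python
import string

# Folds only ASCII A-Z to a-z. This is an assumed reading of
# "ASCII-case-insensitive"; the helper names are hypothetical.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only, leaving every other character alone."""
    return text.translate(_ASCII_FOLD)


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings after ASCII-only case folding."""
    return ascii_lower(a) == ascii_lower(b)


def unicode_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings using Python's full Unicode lowercasing."""
    return a.lower() == b.lower()


KELVIN_SIGN = "\u212a"  # looks like "K" but is not an ASCII character

print(ascii_case_insensitive_equal("CHARSET", "charset"))
print(ascii_case_insensitive_equal(KELVIN_SIGN, "k"))
print(unicode_case_insensitive_equal(KELVIN_SIGN, "k"))
```

The KELVIN SIGN matches "k" under a Unicode-aware comparison but not
under the ASCII-only one, so the bare term "case-insensitive" leaves
room for two different answers.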

Should "space characters" be called "spacing characters", to
distinguish them from the specific character named SPACE?  Should it
be called out explicitly that these are only a subset of the Unicode
characters having the White_Space property?
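
A small Python sketch of the subset relationship follows.  The set of
five characters below is an assumption taken from HTML5 drafts (the
markup-spec draft may list a slightly different set), and Python's
str.isspace() is used only as a rough stand-in for the Unicode
White_Space property, which it does not match exactly.

```python
# Assumed HTML "space characters": SPACE, TAB, LF, FF, CR.
# The exact list in the draft under review may differ.
HTML_SPACE_CHARACTERS = frozenset("\u0020\u0009\u000a\u000c\u000d")


def is_html_space_character(ch: str) -> bool:
    """True if ch is one of the assumed HTML space characters."""
    return ch in HTML_SPACE_CHARACTERS


# A few characters Unicode treats as white space that are outside the set:
# NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE.
OTHER_WHITE_SPACE = ["\u00a0", "\u2003", "\u3000"]

for ch in OTHER_WHITE_SPACE:
    # str.isspace() approximates White_Space; it is not an exact match.
    print(f"U+{ord(ch):04X}", ch.isspace(), is_html_space_character(ch))
```

Each of the three sample characters counts as white space to Python
but is not an HTML space character under the assumed list.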


Section 3, Syntax:

"an XML parser" should probably be in a dfn tag, if "an HTML parser"
is, unless you are intentionally delegating that definition ... but
then be explicit.  The HTML parser definition should probably also be
delegated to the parsing (or at least processing/error-correction)
document.


General, but first noticed in Section 3.1:

Should "MUST", "MUST NOT", "SHOULD", etc. be capitalized, as in other
recent specs?


Section 3.4, Character Encoding

This should not always assume HTTP, so

       "... and if its encoding is not explicitly given by
Content-Type metadata,"
=>
       "... and if its encoding is not explicitly known from external
information, such as the HTTP Content-Type header,"


I couldn't quite make sense of the "or ... " clause for the meta
element.  My suggestion is

       then the encoding must be specified using a meta element with a
charset attribute or a meta element in the Encoding declaration state.
=>
       then the encoding must be specified using a meta element with a
charset attribute.


Section 3.5, Elements

       Attributes may be separated from each other
=>
       Attributes MUST be separated from each other


Section 3.5, Rule 6 implies that elements which *could* have content
cannot be self-closing.  Therefore, <div /> is illegal.  That is OK
with me, but it is worth being explicit, because this is arguably a change.


Section 3.6, Attributes

Is the Attribute Names rule correct?  It seems to imply that each of
the following single characters is a legitimate attribute name:  ";"
"("  "<" "\"

If so, should there at least be a SHOULD on using XML-compatible names?
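
Here is a rough Python sketch of how I read the rule.  The exclusion
list is an assumption based on the HTML5 draft wording as I recall it
(space characters, NULL, the two quote characters, ">", "/", "=",
control characters, and unassigned code points); the markup-spec text
may differ, and the function name is hypothetical.

```python
import unicodedata

# Assumed exclusion list; the draft's exact wording may differ.
SPACE_CHARACTERS = "\u0020\u0009\u000a\u000c\u000d"
EXCLUDED = set(SPACE_CHARACTERS) | set("\u0000\"'>/=")


def is_allowed_attribute_name(name: str) -> bool:
    """Check a name against the assumed Attribute Names rule."""
    if not name:
        return False  # the rule requires one or more characters
    for ch in name:
        if ch in EXCLUDED:
            return False
        # Cc = control characters, Cn = unassigned code points.
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True


for candidate in [";", "(", "<", "\\", "=", "href"]:
    print(repr(candidate), is_allowed_attribute_name(candidate))
```

Under that reading, all four of the odd single-character names pass,
and only "=" is rejected.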



Section 3.7, Text.

Why the extra work to ensure that <!--> is a valid escaping text span?
(Similar question on comments.)  I understand that it is an edge case
which the parser needs to handle, but is there a reason to have such
an empty text span be valid?


Section 3.8, character references.

Why are non-ambiguous ampersands allowed?  Are they useful enough to
justify the extra complexity?  (Maybe... but I'm not sure.  To me, the
fact that &< is OK just makes the rules seem arbitrary.)


Section 4, the HTML elements

The assertions sections are very useful.

It is probably worth adding a classification subsection to each
element.  For example,
    a is interactive, and can be either phrasing or block.
    b is inline, is not interactive, and is a formatting element.
There should probably then be an informative paragraph to explain that
these element classifications are used by other standards, such as
extra error recovery for formatting elements in the processing
standard.


Element a:

I think a.elem.phrase is a strict subset of a.elem.prose, so it might
be worth adding a short note explaining the difference.  (Even if that
does violate the separation of concerns... but I think it doesn't.  I
think the difference is that using prose makes the tag itself a
block-level element instead of a phrase-level element.)


Element abbr:

There is a stray ` character after the name Philip in the example --
this seems to be copied straight from a similar typo in the full spec.


Element acronym:

Should this just be dropped from the valid markup spec, and included
only in the parsing-and-error-correction spec?  At the very least, it
should say "Use the abbr element instead."


Element address:

Should there be an invalid example that is still an address, but just
not a contact address, such as

My Dad lives at <address>123 Memory Lane</address>?


Element area:

Needs some cleanup from the conversion, about the various states that
coords would represent.


Element canvas:

I think most of this could be left in the processing document.  Just
list the two attributes, and their default values.  Maybe specify that
the coordinate space is in abstract units, which may not correspond to
pixels or pica or ex.  Say it is typically used with scripting, but
maybe specify the default/initial appearance when no script is run.


Element col:

"If a col element has a parent ..."

What does it represent otherwise?  (The current spec doesn't say either.)

Maybe for the valid markup, just reword it to show proper usage.

"A col element represents one (or more) columns within its parent
colgroup element."


Element colgroup:

Similar issue to col.  Just drop the ", if it has a parent and that is
a table element."


(And no, I didn't finish reviewing all elements yet...)


-jJ
Received on Thursday, 20 November 2008 17:09:57 UTC