Re: XML and HTML differences (Re: XML namespaces on the Web) from Michael(tm) Smith on 2009-11-18 (public-html@w3.org from November 2009)

From: Michael(tm) Smith <mike@w3.org>
Date: Wed, 18 Nov 2009 13:15:06 +0900
To: Maciej Stachowiak <mjs@apple.com>
Cc: public-html@w3.org
Message-ID: <20091118041505.GA18553@sideshowbarker>

Maciej Stachowiak <mjs@apple.com>, 2009-11-17 15:40 -0800:

[...]
>  (1) XML has draconian error handling, while text/html has tolerant (and with 
>  HTML5 fully specified) error handling.
>  (2) XML supports arbitrary XML-style namespaces in the syntax, text/html 
>  supports only a short list of predefined namespaces.
>  (3) XML has a fairly strict conforming syntax, while even the conforming 
>  text/html syntax allows many shortcuts (even setting aside the 
>  error-tolerance).
>  (4) XML parsing is completely independent of the vocabulary, text/html 
>  parsing has many behaviors that are specific to the HTML vocabulary.
>  (5) XML has only a very small list of predefined entities with optional 
>  addition of more via DTD processing, text/html has a fairly extensive list 
>  of named entities.
> 
>  Are there more important high-level differences that I'm forgetting?

Not sure if the following is high-level on the same order as the
above, but I think it's an important difference that many people
are not aware of. That difference is: HTML has a few elements
within whose contents particular characters and sequences are
handled differently than they are in most other elements. What I
mean are the <title>, <textarea>, <script>, and <style> elements.

My attempt at trying to write up a concise description of that
feature of HTML (one that's also aligned with what's defined in the
HTML5 spec but that takes a slightly different approach) is here:

  http://dev.w3.org/html5/markup/syntax.html#text-syntax

In summary:

  - within <title>, <textarea>, <script>, and <style> elements, a
    "<" character does not mark the start of a tag -- instead,
    it's simply text like any other character

  - within <title>, <textarea>, <script>, and <style> elements,
    the character sequences "<!--" and "-->" are not comment
    delimiters -- instead they're simply text strings

  - within <script> and <style> elements, a character sequence
    like "foo=bar&hoge=moge" is just a text string and does not
    cause a parse error (the "&" character does not mark the start
    of a character reference)

To put it in more general, high-level terms: there are three
different classes of character data in HTML, and HTML elements fall
into three different types based on the particular class of
character data they are allowed to contain.
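
As a rough illustration of the summary above, here is a minimal sketch
using the html.parser module from the Python standard library. The
names EventCollector and tokenize are made up for this example, and
html.parser is only an approximation of the parsing behavior the HTML5
spec defines. The sketch covers only <script> and <style>; <title> and
<textarea> are left out because html.parser's treatment of them may
vary between Python versions.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parse events as (kind, value) tuples."""

    def __init__(self):
        # convert_charrefs=True decodes character references in ordinary text.
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        # Merge adjacent text chunks so the result does not depend on
        # how the parser happens to split character data.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def tokenize(markup):
    """Parse a markup string and return the list of recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Ordinary element: "<" starts a tag, "<!--" starts a comment,
# and "&amp;" is decoded as a character reference.
normal = tokenize("<p>a &amp; b <b>c</b><!-- note --></p>")

# <script> and <style>: the same sequences are just text.
script = tokenize("<script>if (a < b && c) x = '<b>'; <!-- text --></script>")
style = tokenize('<style>a::after { content: "&amp;" }</style>')
```

In the first call the parser reports a nested <b> element, a comment,
and a decoded "&". In the other two calls everything between the start
tag and the end tag comes back as a single run of text, with "<b>",
"<!--", "-->", and "&amp;" left exactly as written.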

  --Mike

P.S. Before somebody brings it up, yeah, I realize SGML had RCDATA
and CDATA elements, and that the behavior around those was similar
to what's described above. (Though I don't think SGML had a
way to enable the character sequences "<!--" and "-->" to be handled
as text instead of as comments.) And yeah, I know XML has CDATA
sections, but that's something different from the case of having
particular elements whose contents are processed differently even
though those contents are not otherwise marked up in any special way.
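
To make the contrast with XML concrete, here is a small sketch using
xml.etree.ElementTree from the Python standard library. The helper
name script_text is made up for this example. It assumes nothing
about the vocabulary: an element named "script" gets no special
treatment from an XML parser, so a bare "<" in its content is a
well-formedness error unless the content is explicitly wrapped in a
CDATA section.

```python
import xml.etree.ElementTree as ET


def script_text(xml_source):
    """Return the root element's text, or None if the input is not well-formed."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


# A bare "<" inside the element is rejected by the XML parser.
bare = script_text("<script>if (a < b) run();</script>")

# The same content is accepted once it is marked up as a CDATA section.
wrapped = script_text("<script><![CDATA[if (a < b) run();]]></script>")
```

Here bare ends up as None because parsing fails, while wrapped holds
the script text unchanged. The special handling comes from the CDATA
markup in the content, not from the element's name.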

-- 
Michael(tm) Smith
http://people.w3.org/mike/

Received on Wednesday, 18 November 2009 04:51:26 UTC