- From: Michael(tm) Smith <mike@w3.org>
- Date: Wed, 18 Nov 2009 13:15:06 +0900
- To: Maciej Stachowiak <mjs@apple.com>
- Cc: public-html@w3.org
Maciej Stachowiak <mjs@apple.com>, 2009-11-17 15:40 -0800:

[...]
> (1) XML has draconian error handling, while text/html has tolerant (and with
> HTML5 fully specified) error handling.
> (2) XML supports arbitrary XML-style namespaces in the syntax, text/html
> supports only a short list of predefined namespaces.
> (3) XML has a fairly strict conforming syntax, while even the conforming
> text/html syntax allows many shortcuts (even setting aside the
> error-tolerance).
> (4) XML parsing is completely independent of the vocabulary, text/html
> parsing has many behaviors that are specific to the HTML vocabulary.
> (5) XML has only a very small list of predefined entities with optional
> addition of more via DTD processing, text/html has a fairly extensive list
> of named entities.
>
> Are there more important high-level differences that I'm forgetting?

Not sure if the following is high-level on the same order as the above, but I think it's an important difference that many people are not aware of.

That difference is: HTML has a few elements within whose contents particular characters and sequences are handled differently than they are in most other elements. Specifically, I mean the <title>, <textarea>, <script>, and <style> elements.
My attempt at writing up a concise description of that feature of HTML (one that's aligned with what's defined in the HTML5 spec but that takes a slightly different approach) is here:

  http://dev.w3.org/html5/markup/syntax.html#text-syntax

In summary:

  - within <title>, <textarea>, <script>, and <style> elements, a "<" character does not mark the start of a tag; instead, it's simply text like any other character

  - within <title>, <textarea>, <script>, and <style> elements, the character sequences "<!--" and "-->" are not comment delimiters; instead, they're simply text strings

  - within <script> and <style> elements, a character sequence like "foo=bar&hoge=moge" is just a text string and does not cause a parse error (the "&" character does not mark the start of a character reference)

To put it in more general, high-level terms: There are three different classes of character data in HTML, and HTML elements fall into three different types based on which class of character data they are allowed to contain.

  --Mike

P.S. Before somebody brings it up: yeah, I realize SGML had RCDATA and CDATA elements, and that the behavior around those was similar to what's described above. (Though I don't think SGML had a way to enable the character sequences "<!--" and "-->" to be handled as text instead of as comments.) And yeah, I know XML has CDATA sections, but that's something different from the case of having particular elements whose contents are processed differently even though those contents are not otherwise marked up in any special way.

-- 
Michael(tm) Smith
http://people.w3.org/mike/
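As a rough illustration of the three summary points above, here is a minimal sketch using Python's standard `html.parser` module. It is only an approximation: that module is not a full HTML5 parser, and the `Recorder` class, the `events_for` helper, and the `SAMPLE` string are hypothetical names made up for this example. The sketch sticks to <script> and <style>, since the module's treatment of <title> and <textarea> is not something to rely on across Python versions.

```python
from html.parser import HTMLParser

# Hypothetical sample content: a tag-like "<b>", a comment-like "<!-- z -->",
# and a character reference "&amp;".
SAMPLE = "x <b> y <!-- z --> foo=bar&amp;hoge=moge"


class Recorder(HTMLParser):
    """Record the parser events (tags, comments, text) in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        # The parser may deliver text in several chunks; merge adjacent ones.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def events_for(markup):
    """Parse markup and return the list of recorded events."""
    recorder = Recorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for name in ("p", "script", "style"):
        print(name, events_for(f"<{name}>{SAMPLE}</{name}>"))
```

Inside <p>, the same characters produce a start tag, a comment, and a decoded "&"; inside <script> and <style>, the whole string comes back as a single run of unmodified text.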
Received on Wednesday, 18 November 2009 04:51:26 UTC