Re: tidy, indentation and XML/SGML rules from Richard A. O'Keefe on 2002-10-13 (html-tidy@w3.org from October to December 2002)

From: Richard A. O'Keefe <ok@cs.otago.ac.nz>
Date: Mon, 14 Oct 2002 11:23:04 +1300 (NZDT)
To: Fred.Bone@dial.pipex.com, html-tidy@w3.org
Message-Id: <200210132223.g9DMN4ug114947@atlas.otago.ac.nz>

"Fred Bone" <Fred.Bone@dial.pipex.com> wrote (about element breaking):
	I'm no expert on XML, but as far as I can tell this would only be a 
	change of content if you have elem1 defined with 
	xml:space='preserve', and then only the blanks would be significant 
	(and not the newline).

No, the newline would be significant too.

	The HTML4.1 spec says (appendix B.3.1):
	> SGML (see [ISO8879], section 7.6.1) specifies that a line break
	> immediately following a start tag must be ignored
	and this should, AIUI, also be true in XML (though I can't find 
	anything in the XML spec corresponding to what I've quoted from the 
	HTML one).
	
That's because XML is different here.

Basically, SGML is set up on the assumption that you will *ALWAYS*
have the document type definition available, whereas XML is set up on
the assumption that you will _seldom_ have a DTD or look at it if you
have one.

In fact, the XML specification is confused and confusing.  It's a textbook
example of a Bad Specification.  (I'm planning to use it in a Formal
Methods paper as a horrible example of what happens when you don't use
formal methods.)  Amongst other things, the semantics of spaces depends
on whether you have a DTD (and your parser can handle it).

Suppose you have
    <!ELEMENT elem1 (elem2+)>
    <!ELEMENT elem2 (#PCDATA)>
and
    <elem1><elem2>foo</elem2><elem2>bar</elem2></elem1>
Then an XML parser is going to come up with the events
    START "elem1"
    START "elem2"
    DATA "foo"
    END "elem2"
    START "elem2"
    DATA "bar"
    END "elem2"
    END "elem1".
Now suppose you have
    <elem1>
      <elem2>foo</elem2>
      <elem2>bar</elem2>
    </elem1>
IF your parser knows about the dtd AND bothers to process it, you get
    START "elem1"
    SPACE "\n  "
    START "elem2"
    DATA "foo"
    END "elem2"
    SPACE "\n  "
    START "elem2"
    DATA "bar"
    END "elem2"
    SPACE "\n"
    END "elem1".
where a SPACE event says "here is a string of white space characters
occurring where element content is required; feel free to ignore it."
But if there isn't a dtd or if your parser can't handle it or doesn't
feel like handling it because it's Tuesday, you'll get
    START "elem1"
    DATA "\n  "
    START "elem2"
    DATA "foo"
    END "elem2"
    DATA "\n  "
    START "elem2"
    DATA "bar"
    END "elem2"
    DATA "\n"
    END "elem1".
Various XML specifications call for various ways around this, involving
more or less successful heuristics for deleting white space DATA.

I have often wondered why XML textbooks are so fond of indenting XML
examples when it is such a bad thing for the semantics.

Received on Sunday, 13 October 2002 19:00:37 UTC