[Parsing] When/how flagged as being HTML from Karl Dubost on 2007-07-10 (public-html@w3.org from July 2007)

From: Karl Dubost <karl@w3.org>
Date: Tue, 10 Jul 2007 10:47:29 +0900
To: HTMLWG WG <public-html@w3.org>
Message-Id: <1C0C09B8-90F8-470F-BFCE-9E657737002A@w3.org>
In a message in another thread, Henri said:
http://www.w3.org/mid/05FFFAC3-F914-451A-B2A7-BBEAC81A2537@iki.fi

On 9 Jul 2007, at 17:04, Henri Sivonen wrote:
> An HTML5 parser is a piece of software that implements the section  
> of the spec titled "Parsing HTML documents".
> http://www.w3.org/html/wg/html5/#parsing

Then, following the links through the spec, it is not obvious where to
find the right information.

"HTML Document" points to the following definition:

     Document objects are assumed to be XML documents
     unless they are flagged as being HTML documents when
     they are created. Whether a document is an HTML
     document or an XML document affects the behaviour of
     certain APIs, as well as a few CSS rendering rules.
     [CSS21]
     -- http://www.w3.org/html/wg/html5/#html-
     Thu, 28 Jun 2007 21:11:41 GMT

The first thing which might lead to confusion is "flagged as
being HTML documents". It's kind of circular. I looked for the
definition of an HTML document, and what I got is that an HTML
document is an XML document unless it is an HTML document. I see at
least 47 occurrences of "HTML Document" in the spec.

Maybe we should define what we mean by "flagged … when they are
created".

1. Created in the DOM?
2. Created on the filesystem?
3. Created in the Browser memory?

I have the feeling that most people will read it as 2.
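
To make reading 1 concrete, here is a minimal sketch of a flag that is
fixed when the object is created in the DOM. The names Document,
is_html and create_document are hypothetical and are not taken from
the spec; the only thing borrowed from the quoted definition is the
default (XML unless flagged as HTML at creation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Assumption: the flag is a simple boolean fixed at creation time.
    # Per the quoted definition, a document is XML unless flagged as HTML.
    is_html: bool = False


def create_document(flag_as_html: bool = False) -> Document:
    """Hypothetical factory: the caller decides the flag at creation."""
    return Document(is_html=flag_as_html)


xml_doc = create_document()        # not flagged, so treated as XML
html_doc = create_document(True)   # flagged as HTML when created
```

Under this reading the question simply moves to the caller: something
still has to decide when to pass the flag.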


But then there is an issue. What do we do with files which are
accessed through the local filesystem? Usually ".html" or ".htm" tells
the browser to use the HTML parser, though there are many cases
where people might open a file with a PHP extension, for example.

     The input to the HTML parsing process consists of
     a stream of Unicode characters, which is passed
     through a tokenisation stage (lexical analysis)
     followed by a tree construction stage (semantic
     analysis). The output is a Document object.
     […]
     In the common case, the data handled by the
     tokenisation stage comes from the network, but it
     can also come from script, e.g. using the
     document.write() API.
     -- http://www.w3.org/html/wg/html5/#parsing
     Thu, 28 Jun 2007 21:11:41 GMT

The data can come from the local filesystem as well. There is  
something which is called the "content model flag" related to the  
input stream.

     The exact behaviour of certain states depends on
     a content model flag that is set after certain
     tokens are emitted. The flag has several states:
     PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially
     it must be in the PCDATA state. In the RCDATA and
     CDATA states, a further escape flag is used to
     control the behaviour of the tokeniser. It is
     either true or false, and initially must be set
     to the false state.
     -- http://www.w3.org/html/wg/html5/#content2
     Thu, 28 Jun 2007 21:11:41 GMT
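
As I read the quoted paragraph, the content model flag is tokeniser
state, roughly as sketched below. The class and method names are
hypothetical, and the tag-to-state table is only illustrative: the
quoted text does not say which tokens switch the flag.

```python
from enum import Enum, auto


class ContentModel(Enum):
    PCDATA = auto()
    RCDATA = auto()
    CDATA = auto()
    PLAINTEXT = auto()


# Assumption: illustrative mapping only, not copied from the quoted text.
ILLUSTRATIVE_SWITCHES = {
    "textarea": ContentModel.RCDATA,
    "style": ContentModel.CDATA,
    "plaintext": ContentModel.PLAINTEXT,
}


class TokeniserState:
    def __init__(self) -> None:
        # "Initially it must be in the PCDATA state."
        self.content_model = ContentModel.PCDATA
        # The escape flag "initially must be set to the false state."
        self.escape = False

    def after_start_tag(self, tag_name: str) -> None:
        """Update the flag after a start tag token has been emitted."""
        new_state = ILLUSTRATIVE_SWITCHES.get(tag_name.lower())
        if new_state is not None:
            self.content_model = new_state
```

Nothing in this state records whether the input is HTML in the first
place; it only changes how an HTML tokeniser reads what follows.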

But it isn't related to when the document or the input is flagged as  
being HTML. So back to the sentence

     Document objects are assumed to be XML documents
     unless they are flagged as being HTML documents when
     they are created.

* When is an input stream actually flagged as being HTML?
* How do we flag an input stream as being an HTML document?
     * HTTP text/html
     * local filesystem?
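
One possible policy for the two cases above, as a rough sketch. The
function name, the choice to trust only "text/html", and the fallback
to ".html"/".htm" extensions are my assumptions, not something the
spec text quoted here defines.

```python
import os
from typing import Optional

HTML_MEDIA_TYPES = {"text/html"}
HTML_EXTENSIONS = {".html", ".htm"}


def is_flagged_html(content_type: Optional[str] = None,
                    path: Optional[str] = None) -> bool:
    """Decide whether an input should be flagged as HTML.

    Assumption: an HTTP Content-Type wins when present; otherwise the
    file extension is used. Anything else is not flagged, which under
    the quoted definition means it is treated as XML.
    """
    if content_type is not None:
        # Drop parameters such as "; charset=utf-8".
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type in HTML_MEDIA_TYPES
    if path is not None:
        extension = os.path.splitext(path)[1].lower()
        return extension in HTML_EXTENSIONS
    return False


is_flagged_html(content_type="text/html; charset=utf-8")  # flagged as HTML
is_flagged_html(path="page.php")                          # not flagged
```

The last call shows the gap: a local PHP file that contains HTML is not
flagged under this policy, so it would be assumed to be XML.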

Related question:
A document sent with application/xhtml+xml must be treated by an XML  
parser.
What does an HTML parser do when receiving such a document? Ignore it?
(For example, in the case where I have built an application which has
only an HTML parser and no XML parser.)



-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
Received on Tuesday, 10 July 2007 01:47:51 UTC