- From: Karl Dubost <karl@w3.org>
- Date: Tue, 10 Jul 2007 10:47:29 +0900
- To: HTMLWG WG <public-html@w3.org>
In a message in another thread, Henri said:
http://www.w3.org/mid/05FFFAC3-F914-451A-B2A7-BBEAC81A2537@iki.fi

On 9 Jul 2007, at 17:04, Henri Sivonen wrote:
> An HTML5 parser is a piece of software that implements the section
> of the spec titled "Parsing HTML documents".
> http://www.w3.org/html/wg/html5/#parsing

Then, following links through the spec, it is not obvious where to find the right information. "HTML Document" points to the following definition:

    Document objects are assumed to be XML documents unless
    they are flagged as being HTML documents when they are
    created. Whether a document is an HTML document or an
    XML document affects the behaviour of certain APIs, as
    well as a few CSS rendering rules. [CSS21]
    -- http://www.w3.org/html/wg/html5/#html-
       Thu, 28 Jun 2007 21:11:41 GMT

The first thing which might lead to confusion is "flagged as being HTML documents". The definition is circular: I looked for what an HTML document is, and what I found is that a document is an XML document unless it is an HTML document.

I see at least 47 occurrences of HTML Document in the document.

Maybe we should define what we mean by "flagged … when they are created".

1. Created in the DOM?
2. Created on the filesystem?
3. Created in the Browser memory?

I have the feeling that most people will read it as 2. But then there is an issue: what do we do with files which are accessed through the local filesystem? Usually ".html" or ".htm" tells the browser to use the HTML parser. But there are many cases where people might open a file with a different extension, PHP for example.

    The input to the HTML parsing process consists of a
    stream of Unicode characters, which is passed through a
    tokenisation stage (lexical analysis) followed by a tree
    construction stage (semantic analysis). The output is a
    Document object.
    […]
    In the common case, the data handled by the tokenisation
    stage comes from the network, but it can also come from
    script, e.g. using the document.write() API.
    -- http://www.w3.org/html/wg/html5/#parsing
       Thu, 28 Jun 2007 21:11:41 GMT

The data can come from the local filesystem as well.

There is something called the "content model flag" related to the input stream.

    The exact behaviour of certain states depends on a
    content model flag that is set after certain tokens are
    emitted. The flag has several states: PCDATA, RCDATA,
    CDATA, and PLAINTEXT. Initially it must be in the PCDATA
    state. In the RCDATA and CDATA states, a further escape
    flag is used to control the behaviour of the tokeniser.
    It is either true or false, and initially must be set to
    the false state.
    -- http://www.w3.org/html/wg/html5/#content2
       Thu, 28 Jun 2007 21:11:41 GMT

But this flag has nothing to do with when the document or the input is flagged as being HTML.

So back to the sentence

    Document objects are assumed to be XML documents unless
    they are flagged as being HTML documents when they are
    created.

* When is an input stream actually flagged as being HTML?
* How do we flag an input stream as being an HTML document?
    * HTTP text/html
    * local filesystem?

Related question: a document sent with application/xhtml+xml must be processed by an XML parser. What does an HTML parser do when it receives such a document? Does it ignore it? (Consider the case where I have built an application which has only an HTML parser and no XML parser.)

-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
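To make the flagging questions above concrete, here is a minimal sketch in Python of how an application might decide, at creation time, whether a document is flagged as HTML. None of it comes from the spec: `flag_document`, `DocumentKind`, and the two lookup sets are hypothetical names, and the rules (trust the HTTP Content-Type first, then fall back to the file extension, otherwise assume XML) are assumptions chosen only to mirror the cases listed in the message.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class DocumentKind(Enum):
    HTML = "html"
    XML = "xml"


# Hypothetical lookup tables; a real user agent may use different rules.
HTML_MIME_TYPES = {"text/html"}
HTML_EXTENSIONS = {".html", ".htm"}


def flag_document(content_type: Optional[str] = None,
                  filename: Optional[str] = None) -> DocumentKind:
    """Decide how a new document is flagged (illustrative assumption only)."""
    if content_type is not None:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";", 1)[0].strip().lower()
        return DocumentKind.HTML if mime in HTML_MIME_TYPES else DocumentKind.XML
    if filename is not None:
        # Local filesystem case: only the extension is available as a hint.
        if PurePosixPath(filename).suffix.lower() in HTML_EXTENSIONS:
            return DocumentKind.HTML
    # Default from the quoted definition: XML unless flagged as HTML.
    return DocumentKind.XML
```

Under these assumed rules, `flag_document(filename="page.php")` falls through to the XML default, which is exactly the local-file gap described above, and `flag_document(content_type="application/xhtml+xml")` is never flagged as HTML, leaving open what an HTML-only application should do with it.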
Received on Tuesday, 10 July 2007 01:47:51 UTC