- From: Karl Dubost <karl@w3.org>
- Date: Tue, 10 Jul 2007 10:47:29 +0900
- To: HTMLWG WG <public-html@w3.org>
In a message in another thread, Henri said:
http://www.w3.org/mid/05FFFAC3-F914-451A-B2A7-BBEAC81A2537@iki.fi
On 9 Jul 2007, at 17:04, Henri Sivonen wrote:
> An HTML5 parser is a piece of software that implements the section
> of the spec titled "Parsing HTML documents".
> http://www.w3.org/html/wg/html5/#parsing
Then following links through the spec, it is not obvious where to
find the right information.
"HTML Document" points to the following definition:
Document objects are assumed to be XML documents
unless they are flagged as being HTML documents when
they are created. Whether a document is an HTML
document or an XML document affects the behaviour of
certain APIs, as well as a few CSS rendering rules.
[CSS21]
-- http://www.w3.org/html/wg/html5/#html-
Thu, 28 Jun 2007 21:11:41 GMT
The first thing which might lead to confusion is the phrase "flagged as
being HTML documents". It is somewhat circular. I looked for the
definition of an HTML document, and what I found is that a document is
an XML document unless it is an HTML document. I see at least 47
occurrences of "HTML Document" in the document.
Maybe we should define what we mean by "flagged … when they are
created".
1. Created in the DOM?
2. Created on the filesystem?
3. Created in the Browser memory?
I have the feeling that most people will read it as 2.
But then there is an issue. What do we do with files which are
accessed through the local filesystem? Usually ".html" or ".htm" tells
the browser to use the HTML parser. However, there are many cases
where people might open a file with a PHP extension, for example.
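To make the local filesystem case concrete, here is a minimal sketch of
an extension-based choice. The table and the function name are
hypothetical, and the ".xhtml" and ".xml" entries are my own assumption;
real browsers apply their own rules.

```python
import os

# Hypothetical extension table, for illustration only.
EXTENSION_TO_PARSER = {
    ".html": "html",
    ".htm": "html",
    ".xhtml": "xml",   # assumption, not taken from the spec
    ".xml": "xml",     # assumption, not taken from the spec
}


def choose_parser_for_local_file(path):
    """Return "html", "xml", or None when the extension gives no answer."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_PARSER.get(ext.lower())
```

With such a table, "index.html" maps to the HTML parser, but "page.php"
maps to nothing at all, which is exactly the case where it is unclear
who flags the document as HTML.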
The input to the HTML parsing process consists of
a stream of Unicode characters, which is passed
through a tokenisation stage (lexical analysis)
followed by a tree construction stage (semantic
analysis). The output is a Document object.
[…]
In the common case, the data handled by the
tokenisation stage comes from the network, but it
can also come from script, e.g. using the
document.write() API.
-- http://www.w3.org/html/wg/html5/#parsing
Thu, 28 Jun 2007 21:11:41 GMT
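The two stages quoted above can be sketched with a toy example. This is
not the algorithm from the spec: it only handles bare start tags, end
tags and text, and every name in it is hypothetical. It is here to show
the shape of the pipeline, characters in and a document tree out.

```python
import re

# Toy grammar: <name>, </name>, or a run of text. No attributes,
# comments, entities or error recovery.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def tokenise(chars):
    """Tokenisation stage: turn a string of characters into tokens."""
    for match in TOKEN_RE.finditer(chars):
        slash, name, text = match.groups()
        if text is not None:
            yield ("characters", text)
        elif slash:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower())


def construct_tree(tokens):
    """Tree construction stage: build nested nodes from tokens."""
    document = {"name": "#document", "children": []}
    open_elements = [document]
    for kind, value in tokens:
        if kind == "start":
            node = {"name": value, "children": []}
            open_elements[-1]["children"].append(node)
            open_elements.append(node)
        elif kind == "end":
            # Toy rule: close the current element only if the names match.
            if len(open_elements) > 1 and open_elements[-1]["name"] == value:
                open_elements.pop()
        else:
            open_elements[-1]["children"].append(value)
    return document


def parse(chars):
    """Run both stages over a string of characters."""
    return construct_tree(tokenise(chars))
```

Note that parse() takes a string and does not care where it came from:
the network, a script, or a file read from disk.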
The data can come from the local filesystem as well. There is also
something called the "content model flag", which is related to the
input stream.
The exact behaviour of certain states depends on
a content model flag that is set after certain
tokens are emitted. The flag has several states:
PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially
it must be in the PCDATA state. In the RCDATA and
CDATA states, a further escape flag is used to
control the behaviour of the tokeniser. It is
either true or false, and initially must be set
to the false state.
-- http://www.w3.org/html/wg/html5/#content2
Thu, 28 Jun 2007 21:11:41 GMT
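As a reading aid, the state described in this paragraph could be
modelled as below. The class names are hypothetical; only the four flag
values and the two initial values come from the quoted text.

```python
from dataclasses import dataclass
from enum import Enum


class ContentModel(Enum):
    """The four states of the content model flag named in the spec text."""
    PCDATA = "PCDATA"
    RCDATA = "RCDATA"
    CDATA = "CDATA"
    PLAINTEXT = "PLAINTEXT"


@dataclass
class TokeniserState:
    # Initially the flag must be in the PCDATA state.
    content_model: ContentModel = ContentModel.PCDATA
    # The escape flag is initially false.
    escape: bool = False
```

A freshly created TokeniserState() therefore starts in PCDATA with the
escape flag off.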
But the content model flag isn't related to when the document or the
input is flagged as being HTML. So back to the sentence:
Document objects are assumed to be XML documents
unless they are flagged as being HTML documents when
they are created.
* When is an input stream actually flagged as being HTML?
* How do we flag an input stream as being an HTML document?
  * HTTP text/html
  * local filesystem?
Related question:
A document sent with application/xhtml+xml must be processed by an XML
parser.
What does an HTML parser do when it receives such a document? Does it
ignore it? (I am thinking of the case where I have built an application
which has only an HTML parser and no XML parser.)
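One possible reading, sketched under my own assumptions: the document is
flagged when it is created, based on the HTTP media type, and an
HTML-only application refuses application/xhtml+xml rather than feeding
it to the HTML parser. All names here are hypothetical, and refusing is
only one of several possible answers to the question, not something the
spec states.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Assumed to be XML unless flagged as HTML at creation time.
    is_html: bool = False


class UnsupportedMediaType(Exception):
    """Raised when no suitable parser is available for the media type."""


def media_type(content_type):
    """Strip parameters such as charset from a Content-Type value."""
    return content_type.split(";", 1)[0].strip().lower()


def create_document(content_type, have_xml_parser=False):
    """Create a Document, flagging it as HTML only for text/html."""
    mt = media_type(content_type)
    if mt == "text/html":
        return Document(is_html=True)
    if mt == "application/xhtml+xml":
        if not have_xml_parser:
            # Assumption: an HTML-only application refuses the document.
            raise UnsupportedMediaType(mt)
        return Document()
    raise UnsupportedMediaType(mt)
```

Here create_document("text/html; charset=utf-8") gives a document
flagged as HTML, while create_document("application/xhtml+xml") raises
UnsupportedMediaType unless have_xml_parser is true.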
--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
QA Weblog - http://www.w3.org/QA/
*** Be Strict To Be Cool ***
Received on Tuesday, 10 July 2007 01:47:51 UTC