Re: [Parsing] When/how flagged as being HTML from Henri Sivonen on 2007-07-10 (public-html@w3.org from July 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 10 Jul 2007 08:49:46 +0300
To: Karl Dubost <karl@w3.org>
Cc: HTMLWG WG <public-html@w3.org>
Message-Id: <8F000A59-EE67-41A1-BD98-867C00D414C1@iki.fi>
On Jul 10, 2007, at 04:47, Karl Dubost wrote:

> In a message in another thread, Henri said:
> http://www.w3.org/mid/05FFFAC3-F914-451A-B2A7-BBEAC81A2537@iki.fi
>
> On 9 Jul 2007, at 17:04, Henri Sivonen wrote:
>> An HTML5 parser is a piece of software that implements the section  
>> of the spec titled "Parsing HTML documents".
>> http://www.w3.org/html/wg/html5/#parsing
>
> Then following links through the spec, it is not obvious where to  
> find the right information.

What do you mean when you say "right information"? The section I  
linked to defines the parsing algorithm for HTML5 (i.e. text/html as  
opposed to XHTML5 / application/xhtml+xml) documents.

> "HTML Document" points to the following definition:
>
>     Document objects are assumed to be XML documents
>     unless they are flagged as being HTML documents when
>     they are created. Whether a document is an HTML
>     document or an XML document affects the behaviour of
>     certain APIs, as well as a few CSS rendering rules.
>     [CSS21]
>     -- http://www.w3.org/html/wg/html5/#html-
>     Thu, 28 Jun 2007 21:11:41 GMT
>
> The first thing which might lead to confusion is the "flagged as  
> being HTML documents".

Note that the part you quoted talks about Document objects--that is,  
objects that implement the Document DOM interface. It is not talking  
about documents in general (e.g. as byte streams).

Is this not clear from the styling of the <code> element and from the  
context?

> I have looked for what is an HTML document, and then I got an HTML  
> document is an XML document except if it is an HTML document.

In this context, it only talks about telling apart two types of  
objects that implement the Document interface. Moreover, it is  
assumed that there are only two kinds of objects that implement the  
Document interface.

> Maybe we should define what we mean by flagged … when they are
> created.

The note following the paragraph you quoted makes it even clearer
what flagged means:
  | A Document object created by the createDocument() API on the
  | DOMImplementation object is initially an XML document, but
  | can be made into an HTML document by calling document.open() on it.

The flagging happens in the first sentence under
http://www.w3.org/html/wg/html5/#page-load

See also step #8 under http://www.w3.org/html/wg/html5/#controlling

> 1. Created in the DOM?
> 2. Created on the filesystem?
> 3. Created in the Browser memory?
>
> I have the feeling that most people will read 2.

Which isn't what it means because you don't create objects that  
implement the Document interface in the file system.

The spec makes it obvious that it is talking about objects in the  
memory of the browser and that a flag is a bit in the memory.
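
As a rough illustration of "a flag is a bit in the memory", here is a toy model in Python. It is not a DOM implementation and nothing in it comes from the draft verbatim: the names Document, create_document, load_page and document_open are hypothetical stand-ins, and the rule used in load_page (flag set when the type is text/html) is my simplification of what the page-load section says.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Toy stand-in for an object implementing the Document interface."""
    is_html: bool = False  # the "flag": an XML document unless set


def create_document() -> Document:
    # Models createDocument(): the result starts out as an XML document.
    return Document()


def load_page(content_type: str) -> Document:
    # Simplified model of page load: the flag is decided when the
    # object is created, based on the type label of the byte stream.
    return Document(is_html=(content_type == "text/html"))


def document_open(doc: Document) -> None:
    # Models the quoted note: calling document.open() makes the
    # object an HTML document.
    doc.is_html = True
```

The point of the sketch is only that the flag lives on the in-memory object, not on a file or a byte stream.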

> But then there is an issue. What do we do with files which are
> accessed through the local filesystem? Usually ".html", ".htm"
> means for the browser, use the HTML parser. Though there are many
> cases where people might open a file with a PHP extension for example.

The draft is optimized for the Web and, hence, specifies things in
detail for HTTP. If the bytes don't arrive via HTTP, there needs to
be a mapping of some kind that tells the UA whether the
document arrived as if with Content-Type text/html or as if with an
XML Content-Type. On common systems, it is reasonable to map .html to
text/html, .xhtml to application/xhtml+xml and .xml to
application/xml when reading local files.
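
A minimal sketch of such a mapping, in Python. The function name content_type_for_local_file and the table name are hypothetical, and the case-insensitive comparison and the "return None for anything unknown" behaviour are assumptions of this sketch; only the three extension-to-type pairs come from the paragraph above.

```python
import os

# The three pairs suggested above; other extensions are left undecided.
EXTENSION_TO_TYPE = {
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".xml": "application/xml",
}


def content_type_for_local_file(path):
    """Return the HTTP-equivalent type for a local file, or None if unknown."""
    _, ext = os.path.splitext(path)
    # Lower-casing the extension is an assumption, not something the draft says.
    return EXTENSION_TO_TYPE.get(ext.lower())
```

What to do with an unknown extension such as .php is a system-specific decision that this sketch deliberately leaves to the caller.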

> The data can come from the local filesystem as well. There is  
> something which is called the "content model flag" related to the  
> input stream.

The content model flag is internal to the parsing algorithm for
text/html. I'm curious: why do you mention it?

> * When is an input stream actually flagged as being HTML?

The kind of flagging you quoted is about objects that implement the  
Document interface--not about streams.

However, an input stream is treated as HTML if it is labeled with the  
text/html Content-Type.

> * How do we flag an input stream as being an HTML document?
>     * HTTP text/html

Yes, but in that case "HTML Document" doesn't mean what it means in  
the part of the spec you quoted.

>     * local filesystem?

System-specific mapping to type information equivalent to HTTP.

> Related question:
> A document sent with application/xhtml+xml must be treated by an  
> XML parser.
> What does an HTML parser do when receiving such a document? Ignore
> it? (In the case where I have built an application which has only an
> HTML parser and not an XML parser.)

It is the responsibility of the application not to pass a byte stream  
that is not labeled as text/html to an HTML parser. If the  
application doesn't have an XML parser (very unlikely), it should say  
it isn't prepared to process XML. (Just like it would do with a PDF  
or Word file if it didn't have components for reading those.)
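
The dispatch described above could be sketched as follows, in Python. The names choose_parser, UnsupportedContentType and the has_xml_parser parameter are hypothetical, and the set of XML types is limited to the two mentioned in this message; a real application would likely recognize more.

```python
class UnsupportedContentType(Exception):
    """Raised when the application has no component for the labeled type."""


# Only the XML types named in this message; an assumption of the sketch.
XML_TYPES = {"application/xhtml+xml", "application/xml"}


def choose_parser(content_type, has_xml_parser=True):
    """Pick a parser by type label.

    Only streams labeled text/html are ever routed to the HTML parser.
    """
    if content_type == "text/html":
        return "html"
    if content_type in XML_TYPES and has_xml_parser:
        return "xml"
    # No suitable component: report it instead of guessing.
    raise UnsupportedContentType(content_type)
```

With has_xml_parser=False, an application/xhtml+xml stream raises UnsupportedContentType rather than falling through to the HTML parser, which is the behaviour argued for above.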

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 10 July 2007 05:49:58 UTC