[whatwg] Valid Unicode

On Dec 2, 2006, at 18:24, Sam Ruby wrote:

> It would not be wise for HTML5 to limit itself to the more constrained
> character set of XML.  In particular, the form feed character is
> pretty popular,
>
> This is yet another case where "take HTML5, read it into a DOM, and
> serialize it as XML, and voilà: you have valid XHTML" doesn't work.

What I am advocating is making sure that *conforming* HTML5 documents
can be serialized as XHTML5 without data loss. This is important in
order to be able to promise that an "XML tool chain" can be used for
processing *conforming* HTML5 by sticking an HTML5 parser in front of
the processing pipeline (for *non-browser* use cases like data
mining, content management or conformance checking, where scripts
are not executed and CSS is not rendered). The motivation is to
make processing HTML5 in non-browser apps less expensive without
giving implementors an incentive to violate the spec in their own
ad hoc ways.

For example, an "XML tool chain" is important enough for my  
conformance checking service that if at this point the assumption of  
*conforming* HTML5 being convertible to XHTML5 were broken in corner  
cases, I'd probably come up with ad hoc trickery for masking it  
instead of throwing away the tool chain. I'd prefer not having to do  
that and not having to explain to everyone else who finds an "XML  
tool chain" to be of value what tricks I needed to pull off to fake it.

I am not suggesting that HTML5 browsers halt and catch fire upon  
finding a form feed. And it is obvious that lossless conversion of  
all possible non-conforming HTML5 documents to XML is impossible  
anyway, so making that a goal would not be worthwhile.

But what legitimate and popular use would a form feed have in HTML5?  
Why can't we call it non-conforming? Are there use cases other than  
converting .txt RFCs to HTML with regexps without bothering to get  
rid of the form feeds?
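
To make the constraint concrete, here is a minimal sketch (in Python, which
is an assumption of this example, not something the thread uses) that
reports the characters XML 1.0 cannot carry. The character class follows
the Char production of XML 1.0; the function name non_xml_chars is
hypothetical. A form feed (U+000C) is reported, while tab, LF and CR are not.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def non_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 disallows.

    Hypothetical helper for illustration only; a real converter would
    also have to decide what to do with each offending character.
    """
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in _NOT_XML_CHAR.finditer(text)
    ]


if __name__ == "__main__":
    # A form feed between two paragraphs, as left over from a .txt RFC.
    print(non_xml_chars("page one\n\x0cpage two\n"))
```

A document for which this returns an empty list has no character-level
obstacle to being serialized as XML 1.0; other obstacles (such as names
that are not valid XML names) are outside the scope of this sketch.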

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Saturday, 2 December 2006 15:42:11 UTC