Re: tag name state from Rand McRanderson on 2012-04-05 (public-xml-er@w3.org from April 2012)

From: Rand McRanderson <therandshow@gmail.com>
Date: Wed, 04 Apr 2012 20:39:49 -0400
To: public-xml-er@w3.org
Message-ID: <4F7CE9D5.9060306@gmail.com>

There's a good market for non-browsers wanting an algorithm to parse 
HTML into an XML-compatible form (PHP has a function for this in its 
DOM extension, "loadHTML", although that may come from the underlying 
libxml infrastructure). I guess the use case can be distilled to this: 
you have a workflow/tool-chain that uses XML and XML-based technologies, 
and you want to be able to pull in arbitrary HTML.
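
To make that use case concrete, here is a minimal sketch in Python 
(standard library only) that reads tag-soup HTML and emits well-formed 
XML. It is not PHP's loadHTML and not the HTML5 algorithm; the class 
name HtmlToXml, the synthetic "root" wrapper, and the short VOID list 
are assumptions made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements with no end tag in HTML (assumed subset, for illustration only).
VOID = {"br", "hr", "img", "input", "link", "meta"}

class HtmlToXml(HTMLParser):
    """Feed tag-soup HTML in, get a balanced ElementTree out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.builder.start("root", {})  # hypothetical wrapper element

    def handle_starttag(self, tag, attrs):
        # Bare attributes (value None) get their own name as the value.
        self.builder.start(tag, {k: (k if v is None else v) for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)  # void elements are closed immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)  # auto-close anything left open inside
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()

def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    return ET.tostring(parser.result(), encoding="unicode")
```

Calling html_to_xml("<p>one<br><b>two</p>") yields a document in which 
the unclosed b element is closed before its parent p. Attribute names 
that are not legal XML names are passed through unchecked, so a real 
tool would need more care there.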

Of course, if you are pulling in arbitrary HTML documents and you're 
not a browser, then displaying the content as it was displayed in the 
past is not important, so parsing priorities change a little. For 
example, instead of worrying about correctly implementing the noscript 
handling in the parser, you could simply treat the content of every 
noscript element as a CDATA section. On the other hand, you may want to 
preserve processing instructions in case someone embedded useful 
information there.
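
The two choices in the paragraph above could be sketched like this, 
again with Python's standard html.parser. The class name 
PreservingParser and its attributes are hypothetical; the sketch keeps 
each noscript body as opaque text and records every processing 
instruction rather than discarding it.

```python
from html.parser import HTMLParser

class PreservingParser(HTMLParser):
    """Collects processing instructions and opaque <noscript> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pis = []        # processing instructions, in document order
        self.noscripts = []  # raw text of each <noscript> body
        self._raw = None     # list buffer while inside <noscript>, else None

    def handle_pi(self, data):
        # html.parser ends a PI at '>', so an XML-style trailing '?'
        # arrives as part of data; strip it.
        self.pis.append(data.rstrip("?").strip())

    def handle_starttag(self, tag, attrs):
        if self._raw is not None:
            self._raw.append(self.get_starttag_text())  # keep markup as text
        elif tag == "noscript":
            self._raw = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: record the source text once, inside noscript only.
        if self._raw is not None:
            self._raw.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._raw is None:
            return
        if tag == "noscript":
            self.noscripts.append("".join(self._raw))
            self._raw = None
        else:
            self._raw.append(f"</{tag}>")

    def handle_data(self, data):
        if self._raw is not None:
            self._raw.append(data)
```

A consumer could then write each entry of noscripts into a CDATA 
section and re-emit each entry of pis as a processing instruction in 
the XML output. Character references inside the noscript body are 
decoded by the parser here, so the text is not byte-for-byte identical 
to the source.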

I guess what I am saying is that non-browsers (generally) have a 
priority of preserving the information from the document while making 
it XML-compatible. Browsers have a priority that the information be 
presented in a way compatible with how it looked in the past. Those two 
priorities may not clash, but they might, and it would be nice to 
decide earlier rather than later how you want to handle this.

All that being said, I think a simple browser that could handle 75% of 
the web would be easier to implement on top of a forgiving XML parser 
than on top of the HTML5 parser algorithm.

On a slightly different note, is there any non-HTML use-case for this? 
Are there large amounts of non-HTML XML-like documents that are badly 
formed? Do you want to stretch the reach of this parser to documents 
that are vaguely XML-like such as Apache configuration files?

On a third note, one possibility for xml-er is, instead of aiming for 
a forgiving XML parser, to aim for a parser framework that could 
encompass both the HTML5 parser and the XML parser, so that the two 
differ as a matter of configuration but not of concept. I guess xml-er 
could work on ways of specifying, as parser configuration, how to 
handle errors or how not to handle them.
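
As a rough sketch of what "configuration but not concept" might look 
like, the Python below runs one tag-balancing routine under two 
policies. Every name here (OnError, Policy, XML_POLICY, HTML_POLICY, 
balance) is hypothetical, and the token format is a simplification; 
nothing in it is taken from an xml-er draft or from the HTML5 
specification.

```python
from dataclasses import dataclass
from enum import Enum

class OnError(Enum):
    FAIL = "fail"        # XML-style: stop at the first error
    RECOVER = "recover"  # HTML-style: repair and continue

@dataclass(frozen=True)
class Policy:
    mismatched_end: OnError
    unclosed_at_eof: OnError

XML_POLICY = Policy(OnError.FAIL, OnError.FAIL)
HTML_POLICY = Policy(OnError.RECOVER, OnError.RECOVER)

def balance(tokens, policy):
    """tokens: iterable of ("start" | "end", name) pairs.

    Returns a balanced token list, or raises ValueError if the policy
    says the error must not be handled.
    """
    out, stack = [], []
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
            out.append((kind, name))
        elif stack and stack[-1] == name:
            stack.pop()
            out.append((kind, name))
        elif policy.mismatched_end is OnError.FAIL:
            raise ValueError(f"unexpected </{name}>")
        elif name in stack:
            # Close the intervening elements, then the requested one.
            while True:
                top = stack.pop()
                out.append(("end", top))
                if top == name:
                    break
        # Otherwise: stray end tag with no open match; drop it.
    if stack and policy.unclosed_at_eof is OnError.FAIL:
        raise ValueError(f"unclosed <{stack[-1]}>")
    while stack:
        out.append(("end", stack.pop()))
    return out
```

With the input start a, start b, end a, the HTML_POLICY call inserts 
the missing end b, while the XML_POLICY call raises ValueError. The 
algorithm is the same in both cases; only the policy object differs.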

Received on Thursday, 5 April 2012 00:40:47 UTC