- From: Kurt Cagle <kurt.cagle@gmail.com>
- Date: Wed, 22 Dec 2010 15:11:17 -0500
- To: Noah Mendelsohn <nrm@arcanedomain.com>
- Cc: John Cowan <cowan@mercury.ccil.org>, David Carlisle <davidc@nag.co.uk>, Henri Sivonen <hsivonen@iki.fi>, public-html-xml@w3.org
- Message-ID: <AANLkTikSFn79Zpre=HBF30J-kAMki1cPDgihwrJjv5f8@mail.gmail.com>
Not on the TC, but a thought here about well-formedness vs. validation:

The challenge that I see XML5 introducing is that it requires a change not only in validation behavior, but also in what is considered well-formedness, and I would argue that it is the latter issue that needs to be of bigger concern to both the HTML and XML groups.

At heart is this fundamental conflict: HTML's mandate is to provide a markup language that is fault tolerant, based on at least the assumption that the authors of such HTML are likely not programmers, and as such may introduce code that would break in a stricter environment. XML's mandate is to provide a markup language that is fault intolerant, because fault tolerance may end up introducing ambiguous or even erroneous assertions that can prove difficult to resolve, especially when you are processing thousands or even millions of such documents.

Perhaps one solution to this particular dilemma is to ask whether such tolerance should reside not within the language itself but within the parser and serializer. Establish a parseLevel of #strict or #lax as a property on the relevant parsers, which would interpret the content strictly as XML 1.0 when set to #strict, or as HTXML when set to #lax. Serialization would similarly follow an XML or HTXML model. This is a pre-validation step; it only handles the parsing.

I think this would resolve a lot of things. Because of the conflicting well-formedness mandates, I don't necessarily see any resolution on the XML/HTXML issue any time soon, and I'm increasingly wondering whether that's all that good an idea anyway. It means that XML parsers can in fact consume HTML content that is ill-formed from their perspective and not choke, while at the same time working consistently with well-formed XML. This then becomes a case of caveat emptor from the developer's perspective: if you use lax parsing, expect the unexpected. As an added benefit, it resolves the reams upon reams of bad RSS2 content.

It would require reworking the parsers, of course, but I see this in many ways as an easier step than dealing with billions of files of legacy XML and HTML.

Kurt Cagle
XML Architect
*Lockheed / US National Archives ERA Project*

On Wed, Dec 22, 2010 at 1:06 PM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:

> On 12/20/2010 4:25 PM, John Cowan wrote:
>
>> Noah Mendelsohn scripsit:
>>
>>> * Being liberal in what you accept has arguably proven useful on the
>>> Web, but we may offer better value in helping users to be conservative
>>> in what they send. FWIW: I find that XML validation of my (X)HTML
>>> sometimes trips on errors I wouldn't need to fix in practice, but
>>> often it catches errors that would cause a browser to skip significant
>>> content when rendering. So, I find XML validation to be valuable;
>>> maybe or maybe not a good HTML5 validator would meet the need instead.
>>> Anyway, I think we need to think about the right mix of XML and HTML
>>> validation, in cases where users wish to ensure that generated or
>>> hand-authored content is correct.
>>
>> Validation is important, and I'm not arguing against it. What I don't
>> think matters is XML *validity*. There are now many other useful ways
>> to validate documents that are not XML-valid.
>
> Good catch. I said XML validation. I mostly meant well-formedness
> checking. I didn't mean to suggest one way or the other whether
> schema-level validation might also be useful, and if so, using what schema
> languages.
>
> Noah
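A minimal sketch of the strict/lax switch Cagle describes, assuming Python and lxml's recovering parser as a stand-in for a lax (HTXML-style) mode; the parseLevel name and the #strict/#lax values come from the proposal above, while the mapping onto lxml and everything else here is illustrative rather than any actual parser API:

```python
# Sketch only: parseLevel/#strict/#lax are from the proposal above;
# lxml's recover flag is used here as a stand-in for a lax HTXML-style parse.
from lxml import etree

def parse(source, parse_level="#strict"):
    """Parse `source` strictly as XML 1.0 (#strict) or tolerantly (#lax)."""
    parser = etree.XMLParser(recover=(parse_level == "#lax"))
    return etree.fromstring(source, parser=parser)

well_formed = b"<p>hello <b>world</b></p>"
ill_formed = b"<p>hello <b>world</p>"   # unclosed <b>: not well-formed XML

print(etree.tostring(parse(well_formed)))          # parses under either mode
print(etree.tostring(parse(ill_formed, "#lax")))   # lax mode repairs the markup

try:
    parse(ill_formed)                              # strict mode rejects it
except etree.XMLSyntaxError as err:
    print("strict parse rejected it:", err)
```

Under this reading, the tolerance lives entirely in the parser configuration: the same document model comes out either way, and validation remains a separate, later step, which matches the "pre-validation step, it only handles the parsing" framing above.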
Received on Wednesday, 22 December 2010 20:12:21 UTC