- From: Karl Dubost <karl+w3c@la-grange.net>
- Date: Tue, 17 Nov 2009 22:18:10 -0500
- To: www-archive <www-archive@w3.org>
These are random notes about XML from another time and space (an original mail from 2008-07-12, slightly modified).

The XML specification says, in http://www.w3.org/TR/REC-xml/#sec-terminology :

fatal error
[Definition: An error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way).]

Could we interpret this set of rules in the following way? Context: a non-well-formed document is sent to an application containing an XML processor.

1. The XML processor detects that the document is not well-formed and reports it to the application.
2. The XML processor continues processing the data and reports data and errors to the application.
3. The XML processor delivers a character stream, with the broken information identified, to the application.
4. The application applies an XML recovery mechanism to the stream sent by the XML processor and does what it wants with it, such as displaying the document if necessary.

Some preliminary observations:

* XML on the Web (the HTTP environment) is very, very small.
* XML on the desktop, mainframes, and back-ends is common.
* XML vocabularies are powerful in a controlled environment (e.g. DocBook, data transfer in banking, etc.).
* XML used on the Web is often tortured, broken.
* Many Web developers do not understand XML beyond the notion of well-formedness.

The point of understanding XML conformance and processing is to find strategies for:

1. fixing broken XML on the Web
2. improving the ecosystem

The Web is a highly distributed environment with loose joints. *Socially*, that has a lot of consequences.

A good example of XML used on the Web is Atom. The language was designed from scratch, with strong XML advocates as chairs (Tim Bray and Sam Ruby). It was clean, without broken content, at the start. It is used by a very large community of people and tools (consumers AND producers). The language was developed in a test-driven way, and most of the implementers who matter in the area were inside the group, implementing and testing while the language was being developed.

# PRODUCING BROKEN XML

The fact is that many Atom feeds are broken, for many reasons:

* edited by hand
* created by templating tools which are not XML producers
* mixing content from different sources (HTML, DB, XML) with different encodings

It means that when designing an Atom feed consumer, implementers are forced to recover the broken content to make it usable by the crowd (social impact). This is the second part of Postel's law: "Be liberal in what you accept." Integrity of the data is lost, but in the Atom case the cost/benefit balance between integrity loss and usability tips toward usability.

Does it show that *authoring rules* are usually poorly defined? We define what a "conformant document" must be, and then we assume a "conformant producer" is simply a tool which produces conformant documents. But in the process we forget about authoring usability.
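To make "conformant producer" a little more concrete, here is a rough Python sketch (element names and content are only illustrative): build the tree with an XML library and let its serializer write the document out, instead of pasting strings into a template. The characters which break naive templates, like "&" and "<", are escaped automatically and the output is well-formed by construction.

    import xml.etree.ElementTree as ET

    ATOM = "http://www.w3.org/2005/Atom"
    ET.register_namespace("", ATOM)   # serialize Atom as the default namespace

    def make_entry(title_text, body_text):
        entry = ET.Element("{%s}entry" % ATOM)
        title = ET.SubElement(entry, "{%s}title" % ATOM)
        title.text = title_text       # "&" and "<" are escaped on output
        content = ET.SubElement(entry, "{%s}content" % ATOM, {"type": "text"})
        content.text = body_text
        return ET.tostring(entry, encoding="unicode")

    # Content which would break a hand-written template comes out well-formed.
    print(make_entry("Fish & Chips <on sale>", "5 < 6 & 7 > 2"))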
Example 1: With an *XML* authoring tool, I create a document and type the markup by hand. The tool has an auto-save mode. I type "<foo><bar", auto-save kicks in, and the document on disk is already not well-formed. It should not be an issue as long as the final document is well-formed, but how do we define the "final save"? There is an issue: we very often have to modify a document, or to keep a temporarily non-well-formed document (not even talking about validity).

Example 1 used an XML authoring tool, which is already a big step for writing a document. Many XML documents are produced by templating languages, sometimes in the code itself, sometimes in a file with variable substitution. Some of these languages have not been designed so that the templates themselves are well-formed (they contain non-XML constructs which will be substituted). These are possible sources of broken XML, either through the template being wrong and/or through the variable substitution.

What are the requirements for creating better tools able to output good XML content? Something easy to integrate in a workflow, authoring libraries, etc.

# CONSUMING BROKEN XML

Then there is broken XML on the Web, a lot of it. How do we improve the ecosystem? How do we repair it?

Being too strict usually has two *social* effects:

* people avoid using it at all and go to another language: JSON, HTML, etc.
* people find non-standard recipes to recover the content: non-interoperable recovering parsers.

If the recovery mechanism were well defined, it would help:

1. to create more well-formed (sometimes valid) XML content.
2. to develop applications with strict parsers (some applications would be more willing to go XML because less content would be broken).

The overall effect would make XML easier to use for people (good karma) and would create more XML documents on the Web.

# INTEGRITY OF XML DOCUMENTS

A recovered document MIGHT have lost its intended data integrity. Why not have a mechanism to flag content which has been recovered, such as:

* an XML attribute on the root element, e.g. xml:check="recovered" or something similar,
* or an XML PI.

It warns people and processors that the information may contain poor data. It helps to design grass-roots quality control mechanisms. The information is visible *in* the document, not outside.
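A rough sketch of how a consumer could combine recovery with such a flag (Python again; lxml's recover mode stands in for whichever non-standard recovery recipe an application actually uses, and the function name is only illustrative):

    import xml.etree.ElementTree as ET
    from lxml import etree

    def parse_and_flag(raw_bytes):
        # Strict, conforming parse first; remember whether it failed.
        try:
            ET.fromstring(raw_bytes)
            recovered = False
        except ET.ParseError:
            recovered = True
        # Recovering parse (one possible recipe among others).
        root = etree.fromstring(raw_bytes, etree.XMLParser(recover=True))
        if recovered:
            # Make the loss of integrity visible *in* the document,
            # here with xml:check="recovered" on the root element.
            root.set("{http://www.w3.org/XML/1998/namespace}check", "recovered")
        return root

    broken = b"<feed><title>random notes</title><entry><title>oops"
    print(etree.tostring(parse_and_flag(broken)))

The same information could be carried by an XML PI instead of an attribute; the point is only that downstream consumers can tell repaired content from content which was well-formed in the first place.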
--
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/

Received on Wednesday, 18 November 2009 03:18:13 UTC