Some random ideas around (broken) XML

These are random notes about XML from another time and space.
Original mail (modified) from 2008-07-12.

The XML specification says
(http://www.w3.org/TR/REC-xml/#sec-terminology):

	fatal error

	[Definition: An error which a conforming XML processor
	MUST detect and report to the application. After
	encountering a fatal error, the processor MAY continue
	processing the data to search for further errors and
	MAY report such errors to the application. In order
	to support correction of errors, the processor MAY make
	unprocessed data from the document (with intermingled
	character data and markup) available to the application.
	Once a fatal error is detected, however, the processor
	MUST NOT continue normal processing (i.e., it MUST NOT
	continue to pass character data and information about
	the document's logical structure to the application in
	the normal way).]


Could we interpret this set of rules in this way?

Context: a non-well-formed document is sent to an application  
that contains an XML processor.

1. The XML processor detects that the document is not well-formed and  
reports it to the application.
2. The XML processor continues processing the data and reports data  
and errors to the application.
3. The XML processor delivers a character stream, with the broken  
information identified, to the application.
4. The application applies an XML recovery mechanism to the stream  
sent by the XML processor and does what it wants with it, such as  
displaying the document if necessary.
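
A rough illustration of steps 1 to 4, as a minimal sketch with Python's
standard expat parser (the document and the recovery step are invented;
the spec leaves recovery entirely to the application):

	import xml.parsers.expat

	# A non-well-formed document: the end tag does not match the start tag.
	broken = b"<feed><entry><title>Hello</titl></entry></feed>"

	parser = xml.parsers.expat.ParserCreate()
	try:
	    parser.Parse(broken, True)        # steps 1 and 2: detect and report
	except xml.parsers.expat.ExpatError as err:
	    # The fatal error is reported to the application with its position.
	    print("fatal error:", err)
	    # Steps 3 and 4: the unprocessed bytes stay available, and the
	    # application may apply whatever recovery mechanism it wants to them.
	    unprocessed = broken[parser.ErrorByteIndex:]
	    print("unprocessed data:", unprocessed)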


Some preliminary observations:

* XML on the Web (HTTP environment) is a very, very small part of overall XML usage.
* XML on the desktop, mainframes and back-ends is common.
* XML vocabularies are powerful in a controlled environment
   (e.g. DocBook, data transfer in banking, etc.)
* XML used on the Web is often tortured, broken.
* Many Web developers do not understand XML beyond the notion of  
well-formedness.


The goal is to understand XML conformance and processing in order to find
strategies for:

1. fixing broken XML on the Web;
2. improving the ecosystem.

The Web is a highly distributed environment with loose joints.  
*Socially*, this has a lot of consequences. A good example of XML used on  
the Web is Atom. The language was designed from scratch, with strong  
XML advocates as chairs (Tim Bray and Sam Ruby). It was clean, without  
broken content, at the start. It is used by a very large community of  
people and tools (consumers AND producers). The language was  
developed in a test-driven way. Most of the implementers who mattered in  
the area were inside the group, implementing and testing at the same  
time it was being developed.

# PRODUCING BROKEN XML

The fact is that many Atom feeds are broken, for many reasons:

* edited by hand
* created by templating tools which are not XML producers
* mixing content from different sources (HTML, databases, XML) with  
different encodings

This means that when designing an Atom feed consumer, implementers are  
forced to recover the broken content to make it usable by the  
crowd (social impact). This is the second part of Postel's law: "Be  
liberal in what you accept."

Integrity of the data is lost. But in the Atom case, the cost/benefit  
trade-off between integrity loss and usability falls on the usability side.
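
The Universal Feed Parser is a good illustration of that trade-off: it
accepts non-well-formed feeds and merely flags them. A minimal sketch
(assuming the third-party feedparser package; the feed URL is made up):

	import feedparser  # third-party Universal Feed Parser

	d = feedparser.parse("http://example.org/broken-feed.atom")

	if d.bozo:
	    # The feed was not well-formed; feedparser recovered anyway and
	    # keeps the original error so the application can decide what to do.
	    print("recovered from:", d.bozo_exception)

	# Usability wins: the entries are available even if integrity was lost.
	for entry in d.entries:
	    print(entry.get("title", "(no title)"))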

Does it show that *authoring rules* are usually poorly defined?
We define what a "conformant document" must be,
then we decide that a "conformant producer" is a tool which produces  
"conformant documents".
But in the process we forget about authoring usability.

	Example 1:
	With an *XML* authoring tool, I create a document in which I type
	markup by hand.
	The tool has an auto-save mode.
	I type "<foo><bar", the tool auto-saves, and the document is already
	non-well-formed on the drive.
	It should not be an issue as long as the final document is well-formed.
	But how do we define the "final save"?
	There is an issue: we very often have to modify documents,
	or to keep temporarily non-well-formed documents around.
	(Not even talking about validity.)
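
A minimal sketch of the well-formedness check such an authoring tool could
run at auto-save time (the draft-file policy is purely invented):

	import xml.parsers.expat

	def is_well_formed(text):
	    """Return True if the buffer parses as a well-formed XML document."""
	    parser = xml.parsers.expat.ParserCreate()
	    try:
	        parser.Parse(text, True)
	        return True
	    except xml.parsers.expat.ExpatError:
	        return False

	buffer = "<foo><bar"        # the half-typed document from Example 1
	# Invented policy: park non-well-formed auto-saves in a draft file.
	target = "document.xml" if is_well_formed(buffer) else "document.xml.draft"
	with open(target, "w", encoding="utf-8") as f:
	    f.write(buffer)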


Example 1 assumed an XML authoring tool, which is already a big  
step for writing a document. Many XML documents are produced by  
templating languages, sometimes in the code itself, sometimes from a file  
with variable substitution. Some of these languages have not been  
designed to be well-formed themselves (they contain non-XML constructs  
which will be substituted).
These are possible sources of broken XML, either through the  
template being wrong and/or through the variable substitution.

What are the requirements for creating better tools able to output  
good XML content?
Something easy to integrate into a workflow, authoring libraries, etc.
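
One answer is to build the document through an XML library and let the
serializer guarantee well-formedness, instead of pasting strings into a
template. A minimal sketch with Python's standard xml.etree.ElementTree
(the element names and text are only illustrative):

	import xml.etree.ElementTree as ET

	feed = ET.Element("feed", xmlns="http://www.w3.org/2005/Atom")
	entry = ET.SubElement(feed, "entry")
	title = ET.SubElement(entry, "title")
	# Markup characters in the data are escaped by the serializer,
	# so the output stays well-formed whatever the text contains.
	title.text = "Tom & Jerry <would break a naive string template>"

	ET.ElementTree(feed).write("feed.xml", encoding="utf-8",
	                           xml_declaration=True)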



# CONSUMING BROKEN XML

Then there is broken XML on the Web, a lot of it.
How do we improve the ecosystem? How do we repair?

Being too strict usually has two *social* effects:

* People avoid using it at all and go to another language: JSON, HTML,  
etc.
* People find non-standard recipes to recover the content:  
non-interoperable recovering parsers.
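
For instance, libxml2 (through lxml) ships such a recovery mode; what
exactly it repairs is defined by the library, not by any specification,
which is precisely the interoperability problem. A minimal sketch
(assuming the third-party lxml package):

	from lxml import etree

	broken = b"<feed><entry><title>Hello</titl></entry></feed>"

	# recover=True turns on libxml2's own, non-standardized repair heuristics.
	parser = etree.XMLParser(recover=True)
	root = etree.fromstring(broken, parser)

	print(etree.tostring(root))      # what comes back is parser-specific
	for error in parser.error_log:   # the problems that were papered over
	    print(error)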

If the recovery mechanism were well defined, it would help:

1. to create more well-formed (sometimes valid) XML content;
2. to develop applications with strict parsers (some applications would  
be more willing to go XML because less content would be broken).

The overall effect would make XML easier for people to use (good  
karma) and would create more XML documents on the Web.



# INTEGRITY OF XML DOCUMENTS

A recovered document MIGHT have lost its intended data integrity.
Why not have a mechanism to flag content which has been recovered,  
such as:

* an XML attribute on the root element, e.g. xml:check="recovered" or  
something similar,
* or an XML processing instruction (PI).

It warns people and processors that the information may contain poor  
data. It helps to design grass-roots quality-control mechanisms. The  
information is visible *in* the document, not outside of it.
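
A minimal sketch of how a consumer could flag what it recovered, re-using
the lxml recovery above (the xml:check attribute is only the proposal made
here, not an existing standard; a processing instruction would work the
same way):

	from lxml import etree

	XML_NS = "http://www.w3.org/XML/1998/namespace"

	def recover_and_flag(data):
	    parser = etree.XMLParser(recover=True)
	    root = etree.fromstring(data, parser)
	    if len(parser.error_log):
	        # Proposed marker: warn later processors that this data went
	        # through a recovery step and may have lost its integrity.
	        root.set("{%s}check" % XML_NS, "recovered")
	    return etree.tostring(root, xml_declaration=True, encoding="UTF-8")

	print(recover_and_flag(b"<feed><entry><title>Hello</titl></entry></feed>"))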


-- 
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/
