Parsing and extracting information (valid or not)

Following up on the discussion about whether validity is an aspect of
accessibility, I would like to share my thoughts on the topic:

Parsing documents

* In XML, the concept of well-formedness allows a parser to process a
   document that uses an unknown vocabulary (elements, attributes).
* In non-XML SGML there is no such concept. Parsers should ignore
   elements they do not know and parse the rest of the document as best
   they can. A valid non-XML SGML document can be parsed reliably. An
   invalid one may not be parsed the way the author intended; the
   outcome of parsing it is undefined.

So, to keep parsers from being confused by sloppy markup nesting and
similar errors, an XML document has to be at least well-formed, and a
non-XML SGML document should be valid.
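
To illustrate the XML half of this point, here is a minimal sketch in
Python using the standard library's xml.etree.ElementTree parser. The
element names (recipe, step, b) and the helper is_well_formed are made
up for the example; the parser has no knowledge of that vocabulary and
checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Unknown vocabulary, but properly nested: the parser accepts it.
unknown_vocabulary = "<recipe><step n='1'>Mix <b>well</b>.</step></recipe>"

# Same vocabulary with sloppy nesting (</step> closes before </b>):
# the parser rejects it instead of guessing what the author meant.
sloppy_nesting = "<recipe><step>Mix <b>well</step></b></recipe>"

print(is_well_formed(unknown_vocabulary))
print(is_well_formed(sloppy_nesting))
```

An XML parser has to stop at the nesting error, so a document either
parses the same way everywhere or not at all. The sketch does not cover
SGML, where no comparable check is available without the DTD.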


Extracting information

When an application wants to extract information from a markup document
(XML or non-XML) and present it to the user, it must know the vocabulary
used. This requires the document to be valid, not just against some
home-brewed grammar but against a published and accepted one. This
grammar is the interface between the information provider and the
information extractor.
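
A rough sketch of why the shared grammar matters, again in Python with
xml.etree.ElementTree. The element names (doc, title, heading) and the
function extract_title are hypothetical, and the standard library does
not validate against a DTD, so this only shows the effect of an
extractor and a provider agreeing (or not) on the vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the published grammar says the title lives in <title>.
KNOWN_TITLE_ELEMENT = "title"


def extract_title(text):
    """Return the text of the agreed-upon title element, or None."""
    root = ET.fromstring(text)
    node = root.find(KNOWN_TITLE_ELEMENT)
    return node.text if node is not None else None


# Follows the (assumed) published grammar.
published = "<doc><title>Annual report</title><body>...</body></doc>"

# Well-formed, but uses a home-brewed element the extractor does not know.
homebrew = "<doc><heading>Annual report</heading><body>...</body></doc>"

print(extract_title(published))
print(extract_title(homebrew))
```

Both documents are well-formed, yet only the first one yields the title:
the extractor cannot guess that heading was meant to play the same role.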

-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628

Received on Thursday, 23 June 2005 15:28:17 UTC