XML 1.0 - reading confusion - parsed vs unparsed from Kent M Pitman on 1998-04-24 (xml-editor@w3.org from April to June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Fri, 24 Apr 98 11:43:23 EDT
To: xml-editor@w3.org
Cc: kmp@harlequin.com
Message-Id: <9804241543.AA04367@excel.harlequin.com>

The introductory text in section 4, Physical Structures, is very
confusing.  It uses a meaning for "parsed" which is alien to any
meaning of "parsed" that I am familiar with.

If I understand at all, after many readings, the word "parsed" could
usefully be replaced by the word "XML" (or "XML entity" or "XML document"),
and "unparsed" by "non-XML" (or "non-XML entity" or "non-XML document").

As nearly as I can tell from your use of "parsed",

 (a) it has nothing to do with the issue of whether the text has 
     been changed from XML source characters to a structural
     representation of XML [the thing I normally associate with parsing].

and

 (b) it is both insulting to implementors of other systems, not to mention
     wholly confusing, to suggest that [for example] a database is not
     parsed.  The whole point of a database is that it IS parsed--it is NOT
     source representation [unparsed], but a highly structured 
     representation.

- - - - -

Here are some examples of confusions I had while reading this text, to help
you understand why the chosen text is not good:

(1) I was imagining that '<!ENTITY FOO "BAR">' was unparsed if
    represented as the string [character vector]:

       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |<|!|E|N|T|I|T|Y| |F|O|O| |"|B|A|R|"|>|
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    and that it was parsed if it was represented as some structured object:

       +-------+----------------+
       | Class | XML Markup     |
       +-------+----------------+
       | Kind  | General Entity |
       +-------+----------------+      +-+-+-+
       | NAME  | +-------------------> |F|O|O|
       +-------+----------------+      +-+-+-+      +-+-+-+
       | VAL   | +--------------------------------->|B|A|R|
       +-------+----------------+                   +-+-+-+

(2) Then I worried that maybe the "parsed" part was "BAR": that
    instead of substituting the text vector "BAR", I was supposed to have
    pre-parsed it.  For example, if I'd seen

           <DEFINE % ZAP '<!ENTITY FOO "BAR">'>

    I worried that I wasn't supposed to substitute 

       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |<|!|E|N|T|I|T|Y| |F|O|O| |"|B|A|R|"|>|
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    for %ZAP; where it occurs but I was instead supposed to substitute

       +-------+----------------+
       | Class | XML Markup     |
       +-------+----------------+
       | Kind  | General Entity |
       +-------+----------------+      +-+-+-+
       | NAME  | +-------------------> |F|O|O|
       +-------+----------------+      +-+-+-+      +-+-+-+
       | VAL   | +--------------------------------->|B|A|R|
       +-------+----------------+                   +-+-+-+

    But that didn't make sense because some objects can't be parsed without
    knowledge of their context and parameter entity definitions contain no
    notion of the content of their expansion.

(3) For a while, I also worried that "PEReference" meant "Parsed Entity 
    Reference" until I (fortunately) found mention of a "Parameter Entity
    Reference".  I *really* do not like cute little two-letter unintelligible
    abbreviations, like PE, and would prefer definition [69] (and its callers)
    refer to ParamEntityReference, not PEReference.   ("cp" is another 
    two-letter abbrev that annoyed me; my memory of SGML says it should be
    "content particle" but I use other systems where it means other things
    like "command processor" and using a short name encourages that confusion).
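
The two readings in (1) and (2) can be made concrete with a small sketch.
This is only an illustration of the distinction drawn above, not anything
the XML specification defines: the `EntityDecl` class and the
`parse_entity_decl` function are hypothetical names, and the regular
expression handles only simple internal entity declarations of the kind
shown in the diagrams.

```python
import re
from dataclasses import dataclass


@dataclass
class EntityDecl:
    """Structured form of a declaration, like the boxes in the diagrams."""
    kind: str    # "General Entity" or "Parameter Entity"
    name: str    # e.g. "FOO"
    value: str   # the literal value, kept as plain text


# Hypothetical, deliberately simplified pattern: internal entities only.
_DECL = re.compile(r'<!ENTITY\s+(%\s+)?(\S+)\s+(["\'])(.*)\3\s*>', re.S)


def parse_entity_decl(source: str) -> EntityDecl:
    """Turn the character vector into a structured object."""
    match = _DECL.fullmatch(source)
    if match is None:
        raise ValueError("not a simple internal entity declaration")
    kind = "Parameter Entity" if match.group(1) else "General Entity"
    return EntityDecl(kind, match.group(2), match.group(4))


# Reading (1): the same declaration as a character vector and as a structure.
source = '<!ENTITY FOO "BAR">'
as_characters = list(source)
as_structure = parse_entity_decl(source)

# Reading (2): the value of a parameter entity stays a character vector;
# it is not pre-parsed, because its meaning depends on where it is used.
zap = parse_entity_decl("<!ENTITY % ZAP '<!ENTITY FOO \"BAR\">'>")
```

In this sketch `zap.value` is still the string `<!ENTITY FOO "BAR">`; only
once it has been substituted at the point of reference would it make sense
to call `parse_entity_decl` on it.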

- - - - -

Here is what I *think* the section in 4. Physical Structures is trying to say:

[By the way, I find the remark in the first paragraph about how the 
 external dtd subset is not identified by name to be confusing.  If 
 it's external and it has no name, how can it not be identified by name??]

==============================================================================
 4. Physical Structures

 ...
 Entities may be either XML documents themselves, or documents of
 other kinds not intended to be parsed by XML.  An XML document's
 contents are referred to as the `replacement text' for the `entity
 name' that names the XML document.

 A non-XML entity is a resource whose contents are either not text or,
 if text, are not to be interpreted as XML.  Each non-XML entity has
 an associated notation, identified by name.  Beyond a requirement
 that an XML processor make the identifiers for the entity and
 notation available to the application, XML places no constraints on
 the contents of non-XML entities.

 XML entities are invoked by name using entity references; non-XML
 entities are invoked by name, given the value of ENTITY or ENTITIES
 attributes.
 ...

==============================================================================
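
As a rough illustration of the distinction proposed in the text above, here
is a minimal sketch using Python's standard `xml.parsers.expat` module.
The sample document, the entity names `greeting` and `logo`, the notation
`gif`, and the helper `classify_entities` are all made up for the example.
It assumes that the parser reports an entity declared with NDATA together
with its notation name, and reports an internal entity together with its
replacement text.

```python
import xml.parsers.expat

# Hypothetical sample: one XML ("parsed") entity, one non-XML ("unparsed")
# entity with an associated notation, and an ENTITY attribute naming it.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY greeting "hello">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ATTLIST doc pic ENTITY #IMPLIED>
]>
<doc pic="logo">&greeting;</doc>
"""


def classify_entities(data):
    """Collect entity and notation declarations as the parser reports them."""
    xml_entities = {}       # name -> replacement text
    non_xml_entities = {}   # name -> (system identifier, notation name)
    notations = {}          # notation name -> system identifier
    parser = xml.parsers.expat.ParserCreate()

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation):
        if notation is not None:
            # Declared with NDATA: only the identifiers are passed along.
            non_xml_entities[name] = (system_id, notation)
        elif value is not None:
            # Internal entity: the processor holds its replacement text.
            xml_entities[name] = value

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(data, True)
    return xml_entities, non_xml_entities, notations


xml_entities, non_xml_entities, notations = classify_entities(DOC)
```

Here `greeting` ends up in `xml_entities` with its replacement text, while
`logo` ends up in `non_xml_entities` with nothing but its system identifier
and notation name; the parser never looks at the contents of `logo.gif`.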

By the way, I find the ", see below," in paragraph 1 of Physical Structures
to be visually confusing and not helpful.  Also, immediately following, I
don't understand why an "external DTD subset" is not referred to by name.  How
can anything external ever be addressed if not by name?  I tried to find a
definition of "external DTD subset" which answered this question usefully, but
found nothing really helpful.
Received on Friday, 24 April 1998 11:40:07 UTC