Re: Whitespace from Bert Bos on 1997-05-11 (w3c-sgml-wg@w3.org from May 1997)

From: Bert Bos <bbos@mygale.inria.fr>
Date: Sun, 11 May 1997 22:11:45 +0200 (MET DST)
To: Peter@ursus.demon.co.uk
Cc: w3c-sgml-wg@w3.org
Message-Id: <199705112011.WAA11268@mygale.inria.fr>

Peter Murray-Rust writes:
 > In message <199705111729.TAA10393@mygale.inria.fr> Bert Bos writes:
 > [...]
 > >  > 
 > >  >  1. White space in element content
 > > 
 > > That is easy to fix by selecting a single whitespace handling method
 > > in the XML profile for SGML. `Keep-all-whitespace' is ugly, but
 >          ^^^^^^^^^^^
 > Please excuse my ignorance :-), but what is this and where does it get
 > implemented? 

It is the collection of additional constraints on the syntax of a
class of documents, beyond what can be expressed in the DTD. E.g., the
profile for HTML (all versions) includes constraints such as: PI's are
not allowed, marked sections are not allowed, document subsets are not
allowed, a document must be in a single entity, etc. For XML there is
a similar set of constraints.

I understand that there are people working on a formal,
machine-readable syntax for those profiles. But even without such a
formal syntax, you can create a profile that is written in English.
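
To make the idea concrete, here is a rough sketch (in Python) of how
a couple of the HTML constraints listed above might be checked by a
program. The names HTML_PROFILE and profile_violations are
hypothetical, and the checks are deliberately naive substring tests
rather than a real SGML parser; this is not the formal profile syntax
that people are working on.

```python
# Hypothetical, deliberately naive checks: plain substring tests, not a parser.
# Each entry maps a constraint written in English to a predicate on the text.
HTML_PROFILE = {
    "no processing instructions": lambda text: "<?" not in text,
    "no marked sections": lambda text: "<![" not in text,
}


def profile_violations(text, profile):
    """Return the names of the constraints that the text fails."""
    return [name for name, check in profile.items() if not check(text)]


if __name__ == "__main__":
    print(profile_violations("<p>plain</p>", HTML_PROFILE))
    print(profile_violations("<?pi data?><p>text</p>", HTML_PROFILE))
```

The point is only that the profile lives outside the document: the
same table of constraints applies to every document of the class.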

 > 
 > > workable; a better rule is to simply ignore any newline directly
 > > after a '>' or directly before a '<'. The important thing is that this
 > > rule becomes part of the XML profile, and does not depend on the XML
 > > document itself.
 > 
 > Is it intended that the profile is unique and unchanging for all XML 
 > documents?  If not, where does it get altered?

A profile should define a class of documents. I think XML is such a
class of documents. Individual documents implicitly refer to that
profile, when they declare themselves to be XML.

 > 
 > This then means that the content depends on the combination of the
 > document and the profile. 

Yes, it depends on the document and the XML specification, and the XML
specification *is* the profile.
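
The newline rule quoted above (ignore any newline directly after a
'>' or directly before a '<') could be sketched roughly like this in
Python. The function name strip_markup_newlines is hypothetical, and
the sketch assumes every '>' and '<' in the input is a markup
delimiter; it does not treat comments, quoted attribute values or
CDATA specially.

```python
import re

# A newline may be CR LF, a bare CR, or a bare LF.
_NEWLINE = r"(?:\r\n|\r|\n)"


def strip_markup_newlines(text: str) -> str:
    """Drop one newline directly after '>' and one directly before '<'."""
    # Remove a single newline that immediately follows '>'.
    text = re.sub(">" + _NEWLINE, ">", text)
    # Remove a single newline that immediately precedes '<'.
    text = re.sub(_NEWLINE + "<", "<", text)
    return text


if __name__ == "__main__":
    print(repr(strip_markup_newlines("<p>\nHello\nworld\n</p>")))
```

Note that the function needs nothing but the text: no DTD and no
per-document switch, which is the property the rule is after.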

 > 
 > > 
 > >  >  2. Default attributes
 > > 
 > > The previous XML-lang draft had a handy macro <?xml default...?> that
 > 
 > I liked this as well, and after its disappearance have vowed not to use
 > defaults in my own DTDs :-).

That's one solution. :-)

Although it is explicitly stated that avoiding verbose markup is *not*
a goal of XML, you can drive things too far. A simple macro like the
one above is very easy to implement and will make writing XML by hand
much more enjoyable.

 > > 
 > >  >  3. Attribute values that are space/case normalized only if you
 > >  >     read the DTD and know they are NMTOKEN or ID or something.
 > > 
 > > This is another thing that will have to be added to the XML profile
 > > for SGML: all attributes are always treated as CDATA and never
 > > normalized. NMTOKEN, NUMBER, etc. can still be used for validation,
 > > but do not influence the parsing. I.e., in the XML datamodel the
 > > attributes foo="7" and foo="07" are different, even though some
 > > application may treat them the same.
 > 
 > I would be grateful (perhaps on xml-dev) for some explanation of NMTOKEN and
 > why it is useful.

In SGML, NMTOKEN/NMTOKENS usually indicated that the attribute value
was not case-sensitive, as opposed to CDATA.

In most cases, the type-checking offered by SGML was much too limited
anyway (no booleans, no negative numbers, no dates and times, no URLs,
etc.), so declaring everything as CDATA and letting the application do
the checking was usually a better option.
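
As an illustration of the foo="7" versus foo="07" case quoted
earlier, here is a small Python sketch using the standard
xml.etree.ElementTree module, which hands attribute values to the
application as plain strings. The element name "item" and the helper
as_number are made up for the example; the helper stands in for
whatever checking an application chooses to do.

```python
import xml.etree.ElementTree as ET

# The parser delivers both attribute values as uninterpreted strings.
doc = ET.fromstring('<doc><item foo="7"/><item foo="07"/></doc>')
raw = [item.get("foo") for item in doc.iter("item")]


def as_number(value: str) -> int:
    """Application-level check: accept only ASCII decimal digits, then convert."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"not a number: {value!r}")
    return int(value)


if __name__ == "__main__":
    print(raw[0] == raw[1])                        # compared as strings
    print(as_number(raw[0]) == as_number(raw[1]))  # compared as numbers
```

In the data model the two values stay distinct; only the application
that decides to read them as numbers treats them as equal.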

 > 
 > A point here is that most *generic* applications do not need to know
 > what attribute type is used.  Obviously ID matters, because it's used
 > in TEIXptrs, and that isn't a parser matter.  Are there any other 
 > attribute types that applications need to know about?  Or can they assume
 > that any CDATA produced from the parser is typeless?  I can see that some
 > applications *might* be concerned as to whether something was a string or a 
 > number, but it's not easy to see how a generic application would react to
 > this.

I adopted the rule that an attribute *called* ID *is* an ID. That is
consistent with the XML-link spec, which also uses fixed names. It is
easy to understand and easy to implement. Only drawback: somebody
might want to use the name ID for something else ("Internet Draft"?);
well, tough luck.
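
A minimal sketch of that rule in Python, again with the standard
xml.etree.ElementTree module: every attribute named exactly "ID" is
entered in an index, without consulting a DTD. The function name
build_id_index is hypothetical, and rejecting duplicate values is an
assumption carried over from the usual uniqueness requirement for
IDs.

```python
import xml.etree.ElementTree as ET


def build_id_index(root):
    """Map the value of each attribute named exactly "ID" to its element."""
    index = {}
    for element in root.iter():
        value = element.get("ID")
        if value is None:
            continue  # this element has no attribute called ID
        if value in index:
            # Assumption: ID values must be unique within a document.
            raise ValueError(f"duplicate ID: {value!r}")
        index[value] = element
    return index


if __name__ == "__main__":
    root = ET.fromstring('<doc><sec ID="intro"/><sec ID="body"/></doc>')
    print(sorted(build_id_index(root)))
```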


Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/pub/WWW/People/Bos/                      INRIA/W3C
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 4 93 65 77 71               06902 Sophia Antipolis Cedex, France
Received on Sunday, 11 May 1997 16:12:24 UTC