Re: [DM] white space from Michael Brundage on 2003-12-08 (public-qt-comments@w3.org from December 2003)

From: Michael Brundage <xquery@comcast.net>
Date: Mon, 08 Dec 2003 11:13:08 -0800
To: Michael Rys <mrys@microsoft.com>, David Carlisle <davidc@nag.co.uk>
Cc: XQuery Public Comments <public-qt-comments@w3.org>
Message-ID: <BBFA0D44.9AF%xquery@comcast.net>
I would add that it's naïve to anthropomorphize companies or committees as
if they had a single opinion on technical subjects, when in reality these
groups are made up of many different individuals with many different
viewpoints and voices.  Committees may sometimes decide to speak with a
unified voice (companies rarely do) but even then they are rarely unified
internally.

To suggest that Microsoft or the W3C or any other large organization has
some kind of detailed unified agenda is just dumb.


As for whitespace handling in particular implementations like MSXML, this is
clearly a topic for another forum (like the microsoft.public.xml newsgroup).
[Every MSXML whitespace issue I know of has to do with the separation
between the XSLT implementation and the XML parser implementation -- to
support XSLT processing requirements, the XML parser exposes whitespace
handling options not found in (or required by) XML itself.]

But understand that over most relational databases whitespace becomes
problematic, because fixed-width columns are padded with spaces and
collations sometimes render whitespace insignificant.
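
As a rough illustration of the relational point (a sketch only, not a
description of any particular product): SQLite does not pad CHAR(n) columns,
so the padding is simulated here with Python's str.ljust, which is an
assumption made for the example. SQLite's built-in RTRIM collation then shows
how a collation can make trailing spaces insignificant in a comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simulated fixed-width storage: "abc" padded to 8 characters.
# (Assumption for illustration; SQLite itself would not add this padding.)
stored = "abc".ljust(8)

# Default BINARY collation: the trailing spaces make the values differ.
binary_equal = conn.execute(
    "SELECT ? = ?", (stored, "abc")
).fetchone()[0]

# RTRIM collation: trailing spaces are ignored in the comparison.
rtrim_equal = conn.execute(
    "SELECT ? = ? COLLATE RTRIM", (stored, "abc")
).fetchone()[0]

print(binary_equal, rtrim_equal)
```

Whether the trailing spaces "exist" therefore depends on the column type and
collation, which is the sort of constraint an XQuery-over-relational
implementation inherits.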

So within the W3C Working Group, when it comes to whitespace handling, you
clearly have at least three different factions with different requirements:
  1. Those who are using an existing XML parser and are limited by their
implementation in what options they can support
  2. Those who are writing a new XML parser/loader and find one particular
implementation choice easier than another
  3. Those who are implementing XQuery over relational databases and are
limited in what they can support

Meanwhile, you've got at least two different camps of XML users, some
(document-centric) who care deeply about whitespace and some (data-centric)
who couldn't care less; if anything, whitespace gets in their way.

You've got the XML 1.0 spec which clearly preserves some whitespace, loses
some whitespace (like between attributes), and normalizes some whitespace
(like line endings).  You've got the XML 1.1 spec which does things
differently (including adding new whitespace characters).  You've got other
XML specs (like XSLT) which merge adjacent text nodes, drop empty text
nodes, and allow whitespace nodes in the query syntax to be dropped. You've
got characters like &#160; that are not whitespace in the XML standards, but
are whitespace in most XML applications (such as browsers).
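
A small sketch of some of the XML 1.0 behaviours listed above, using the
expat-based parser in Python's standard library as one example of a
conforming parser. The sample document and the helper is_xml_whitespace are
made up for illustration.

```python
import xml.etree.ElementTree as ET

# The only characters XML 1.0 treats as white space.
XML_WS = " \t\r\n"

def is_xml_whitespace(s):
    """True if s is non-empty and made up only of XML 1.0 white space."""
    return s != "" and all(ch in XML_WS for ch in s)

# CRLF in the content, extra spacing between attributes, and a
# non-breaking space written as a character reference.
doc = b'<a   x="1"\n   y="2">line one\r\nline two<b>&#160;</b></a>'
root = ET.fromstring(doc)

text = root.text             # line ending normalized: CRLF becomes LF
attrs = dict(root.attrib)    # spacing between attributes is not reported
nbsp = root.find("b").text   # U+00A0 survives as ordinary character data

print(repr(text), attrs, repr(nbsp))
print(is_xml_whitespace(nbsp), nbsp.isspace())
```

The last line shows the mismatch: U+00A0 is not white space by the XML
definition, but Python's own str.isspace() (like many applications) treats it
as such.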

Whitespace handling is just not as cut-and-dried as some people make it out
to be.
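
To make the disagreement in the quoted thread concrete, here is a minimal
sketch of two possible readings of "insignificant white space". The function
names and the element_only parameter are hypothetical; element_only merely
stands in for what a DTD or schema would declare as element (not mixed)
content. This is not how any spec or product defines it.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"

def strip_whitespace_only_text(node, element_only=None):
    """Remove whitespace-only text nodes below node.

    If element_only is given, only children of elements whose name is in
    that set are stripped (a stand-in for a DTD/schema declaration).
    """
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            ws_only = all(ch in XML_WS for ch in child.data)
            allowed = element_only is None or node.nodeName in element_only
            if ws_only and allowed:
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_only_text(child, element_only)

def text_of(node):
    """Concatenate all text below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

SRC = "<doc>\n  <p><b>Hello</b> <i>world</i></p>\n</doc>"

# Reading 1: any whitespace-only text node between tags is dropped.
naive = minidom.parseString(SRC)
strip_whitespace_only_text(naive.documentElement)
print(text_of(naive.documentElement))     # Helloworld

# Reading 2: only white space in declared element content is dropped.
declared = minidom.parseString(SRC)
strip_whitespace_only_text(declared.documentElement, element_only={"doc"})
print(text_of(declared.documentElement))  # Hello world
```

The first reading loses the inter-word space inside the mixed-content <p>;
the second keeps it while still dropping the indentation inside <doc>.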


Cheers,
michael

On 12/8/03 10:17 AM, "Michael Rys" <mrys@microsoft.com> wrote:
> See below.
> 
> And to the person that claims I am trying to pass Microsoft's position
> as the WG position I can only say "hogwash".
> 
> While I am certainly representing MS inside the WG, I am trying to
> explain the spec and its motivation on this list (unless I indicate
> otherwise, eg, when I submit our own comments).
> 
> Regards
> Michael
>> -----Original Message-----
>> From: David Carlisle [mailto:davidc@nag.co.uk]
>> Sent: Monday, December 08, 2003 2:26 AM
>> To: Michael Rys
>> Cc: public-qt-comments@w3.org
>> Subject: Re: [DM] white space
>> 
>> 
>>> For the data model: the WG, otherwise the data model spec would be
>>> different.
>> 
>> Not necessarily, some things just slip through by accident, that's the
>> point of a public review isn't it?
>> 
>> Ideally XQuery would adopt some version of xsl:strip-space into its
>> prologue and then the XSLT and XQuery commands would be specified as
>> passing a specified flag to the data model building which would cause
>> white space text nodes to be dropped. Note that this only needs to apply
>> to building a data model instance by parsing an XML file (the point of
>> the section commented on in this thread). If the data model instance is
>> coming from some other source (e.g. straight from a database or whatever),
>> then its white space behaviour is out of scope for this spec, and I have
>> no objection to that.
> 
> [Michael Rys] I don't think adding such a flag to the prolog is a good
> idea. I think you probably would like the fn:doc() function to get an
> additional argument to indicate whitespace handling. In that way, we
> would have the semantics and the flag closer. However even then, you
> have the problem of fn:doc() implementations just referring to a cached
> document that already has either preserved or stripped the whitespace
> when being loaded. What should the flag do in that case?
> 
> I agree with you that the process of generating the data model should
> give the user the choice (and I try to get that into some of our
> products, currently with little success due to schedule issues), but
> given that this really affects a stage that is often outside of the data
> model specification's realm, I think all we can do is call out this
> dependency and let the users demand support for either.
> 
>> The text clearly can not stand as it is.  It is defined in terms of
>> "insignificant white space"
>> but this term is not defined in any spec that I have looked at (DM, XML
>> rec, infoset). Although the XML spec says
>> 
>>   On the other hand, "significant" white space that should be preserved
>>   in the delivered version is common, for example in poetry and source
>>   code.
>> 
>> this is just an aside, and not part of any definition that can be
>> referenced.
> 
> [Michael Rys] I agree that we need to base the spec on defined terms.
> 
>> It is not acceptable to leave open the interpretation of this definition
>> to the implementor, especially as this thread has shown there are wide
>> differences in interpretation. I for example believe that inter-word
>> spaces in English language sentences are significant, but apparently
>> Michael Rhys does not.
> 
> [Michael Rys] I find inter-word spaces significant (as I do the absence
> of the h in my last name :-)), if there is either explicit indication
> that it should be preserved or it occurs with other words inside the
> same text node. They are not significant if they occur between markup
> tags without anything else.
> 
>> If for some reason the working groups do want to define "insignificant
>> white space" and allow implementations freedom to silently drop such
>> spaces (sacrificing interoperability for some unspecified gain) then any
>> definition will break the spirit of the XML recommendation which clearly
>> states:
>> 
>>   An XML processor must always pass all characters in a document that
>>   are not markup through to the application. A validating XML processor
>>   must also inform the application which of these characters constitute
>>   white space appearing in element content.
> 
> [Michael Rys] This goes into the definition of an XML processor. In our
> interpretation the XML processor is the process that generated the
> information set. The data model generation is an application...
> 
>> (Which I believe was chosen as the XML processing model to avoid the
>> problems shown up after many years of sgml experience of problems with
>> parsers trying to decide automatically which spaces to drop.)
>> As Michael Rys indicated you may claim that you are following the letter
>> of the specification if you claim that the parser is preserving the
>> spaces (but not showing them to anybody or anything) but they are being
>> dropped while building the datamodel instance. However this is clearly
>> just a legalistic fudge that does not help the end user, and any
>> browsing of xsl-list will quickly show that failure to achieve
>> interoperability in this area does seriously inconvenience the end
>> user. However if you really want to define this term I believe that the
>> only workable definition would be the definition alluded to in the
>> quotation above from the XML rec,
>> 
>>  white space appearing in element content.
>> 
>> i.e. white space nodes appearing in elements _declared_ (in DTD, or now,
>> schema) to take element (not mixed) content. Allowing processors to
>> silently drop such spaces would still harm interoperability but at least
>> it is unlikely to produce results that are simply wrong, such as losing
>> inter-word spaces in English.
>> 
>> David
Received on Monday, 8 December 2003 14:13:15 UTC