RE: [DM] white space from Michael Rys on 2003-12-08 (public-qt-comments@w3.org from December 2003)

From: Michael Rys <mrys@microsoft.com>
Date: Mon, 8 Dec 2003 10:17:33 -0800
To: "David Carlisle" <davidc@nag.co.uk>
Cc: <public-qt-comments@w3.org>
Message-ID: <EB0A327048144442AFB15FCE18DC96C70179E6BF@RED-MSG-31.redmond.corp.microsoft.com>
See below.

And to the person that claims I am trying to pass Microsoft's position
as the WG position I can only say "hogwash".

While I am certainly representing MS inside the WG, I am trying to
explain the spec and its motivation on this list (unless I indicate
otherwise, eg, when I submit our own comments).

Regards
Michael
> -----Original Message-----
> From: David Carlisle [mailto:davidc@nag.co.uk]
> Sent: Monday, December 08, 2003 2:26 AM
> To: Michael Rys
> Cc: public-qt-comments@w3.org
> Subject: Re: [DM] white space
> 
> 
> > For the data model: the WG, otherwise the data model spec would be
> > different.
> 
> Not necessarily, some things just slip through by accident, that's the
> point of a public review isn't it?
> 
> Ideally XQuery would adopt some version of xsl:strip-space into its
> prologue and then the XSLT and XQuery commands would be specified as
> passing a specified flag to the data model building which would cause
> white space text nodes to be dropped. Note that this only needs to apply
> to building a data model instance by parsing an XML file (the point of
> the section commented on in this thread). If the data model instance is
> coming from some other source (e.g. straight from a database or whatever),
> then its white space behaviour is out of scope for this spec, and I have
> no objection to that.

[Michael Rys] I don't think adding such a flag to the prolog is a good
idea. I think you probably would like the fn:doc() function to get an
additional argument to indicate whitespace handling. In that way, we
would have the semantics and the flag closer. However even then, you
have the problem of fn:doc() implementations just referring to a cached
document that already has either preserved or stripped the whitespace
when being loaded. What should the flag do in that case?
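
A minimal sketch of the caching problem described above, in Python rather than XQuery. It is not fn:doc() and not taken from any spec or product; `DocCache`, `doc` and the `strip_space` flag are hypothetical names. It shows one possible answer, assuming the cache key includes the flag, so that the preserved and the stripped variants are held separately. A cache keyed on the URI alone would hand back whichever variant was loaded first, which is the situation the question is about.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML white space characters


def strip_whitespace_nodes(node):
    """Remove every whitespace-only text node below `node`, in place."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(XML_WS) == "":
                node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


class DocCache:
    """Hypothetical document loader with a whitespace flag (an assumption)."""

    def __init__(self, sources):
        self.sources = sources  # uri -> XML text; stands in for real retrieval
        self.cache = {}

    def doc(self, uri, strip_space=False):
        key = (uri, strip_space)  # the flag is part of the cache key
        if key not in self.cache:
            dom = minidom.parseString(self.sources[uri])
            if strip_space:
                strip_whitespace_nodes(dom)
            self.cache[key] = dom
        return self.cache[key]


cache = DocCache({"a.xml": "<r> <x/> </r>"})
preserved = cache.doc("a.xml")
stripped = cache.doc("a.xml", strip_space=True)
print(len(preserved.documentElement.childNodes))
print(len(stripped.documentElement.childNodes))
```

The cost of this choice is that the same source may be parsed and held twice, once per flag value.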

I agree with you that the process of generating the data model should
give the user the choice (and I try to get that into some of our
products, currently with little success due to schedule issues), but
given that this really affects a stage that is often outside of the data
model specification's realm, I think all we can do is call out this
dependency and let the users demand support for either.

> The text clearly cannot stand as it is. It is defined in terms of
> "insignificant white space"
> but this term is not defined in any spec that I have looked at (DM, XML
> rec, infoset), although the XML spec says
> 
>   On the other hand, "significant" white space that should be preserved
>   in the delivered version is common, for example in poetry and source
>   code.
> 
> This is just an aside, and not part of any definition that can be
> referenced.

[Michael Rys] I agree that we need to base the spec on defined terms.

> It is not acceptable to leave the interpretation of this definition
> open to the implementor, especially as this thread has shown there are
> wide differences in interpretation. I for example believe that inter-word
> spaces in English language sentences are significant, but apparently
> Michael Rhys does not.

[Michael Rys] I find inter-word spaces significant (as I do the absence
of the h in my last name :-)), if there is either explicit indication
that it should be preserved or it occurs with other words inside the
same text node. They are not significant if they occur between markup
tags without anything else.
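
A minimal sketch of the distinction drawn above, using Python's standard xml.dom.minidom. It is an illustration only, not taken from either author or from any spec; the function name `classify_text_nodes` and the sample document are assumptions, and the "explicit indication" case (for example xml:space) is not handled. It labels each text node as whitespace-only (white space between markup tags with nothing else) or mixed (white space occurring with other characters in the same text node).

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML white space characters


def classify_text_nodes(xml_text):
    """Return (data, label) pairs for every text node, in document order."""
    results = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                if child.data.strip(XML_WS) == "":
                    label = "whitespace-only"
                else:
                    label = "mixed"
                results.append((child.data, label))
            else:
                walk(child)

    walk(minidom.parseString(xml_text).documentElement)
    return results


sample = "<p><b>inter</b> <i>word</i> spaces here</p>"
for data, label in classify_text_nodes(sample):
    print(repr(data), label)
```

In this sample the single space between `</b>` and `<i>` is a whitespace-only text node, yet it is also the only thing separating two English words, which is the case under dispute in this thread.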
 
> If for some reason the working groups do want to define "insignificant
> white space" and allow implementations freedom to silently drop such
> spaces (sacrificing interoperability for some unspecified gain) then any
> definition will break the spirit of the XML recommendation which clearly
> states:
> 
>   An XML processor must always pass all characters in a document that
>   are not markup through to the application. A validating XML processor
>   must also inform the application which of these characters constitute
>   white space appearing in element content.

[Michael Rys] This goes into the definition of an XML processor. In our
interpretation the XML processor is the process that generated the
information set. The data model generation is an application...
 
> (Which I believe was chosen as the XML processing model to avoid the
> problems shown up after many years of SGML experience of problems with
> parsers trying to decide automatically which spaces to drop.)
> As Michael Rhys indicated you may claim that you are following the letter
> of the specification if you claim that the parser is preserving the
> spaces (but not showing them to anybody or anything) but they are being
> dropped while building the data model instance. However this is clearly
> just a legalistic fudge that does not help the end user, and any
> browsing of xsl-list will quickly show that failure to achieve
> interoperability in this area does seriously inconvenience the end
> user. However if you really want to define this term I believe that the
> only workable definition would be the definition alluded to in the
> quotation above from the XML rec,
> 
>  white space appearing in element content.
> 
> i.e. white space nodes appearing in elements _declared_ (in DTD, or now,
> schema) to take element (not mixed) content. Allowing processors to
> silently drop such spaces would still harm interoperability but at least
> it is unlikely to produce results that are simply wrong, such as losing
> inter-word spaces in English.
> 
> David
Received on Monday, 8 December 2003 13:17:37 UTC