xf:normalizedString (Was: Re: [xsl] comments on December F&O draft) from Jeni Tennison on 2002-01-04 (www-xml-query-comments@w3.org from January 2002)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Fri, 4 Jan 2002 21:33:11 +0000
To: www-xml-query-comments@w3.org
CC: David Carlisle <davidc@nag.co.uk>
Message-ID: <5525406522.20020104213311@jenitennison.com>

David C. wrote:
> 4.2.2 xf:normalisedString
> Is there any use case for this? It seems to be rather a bizarre
> thing. The normalisation could be done by the user using translate()
> if desired.

I believe that the xf:normalizedString constructor is there because
the xs:normalizedString data type exists, which is in turn because XML
makes a distinction between replacing whitespace (which is done for
attributes with a type of CDATA) and collapsing whitespace (which is
done for attributes with other types).

In other words, if you have a document that adheres to a DTD, and the
type of the bar attribute is CDATA, then the type of that attribute in
XPath 2.0 should, I think, be an xs:normalizedString.

I think that you therefore need an xs:normalizedString constructor to
create the value with which you're comparing it, so that if you have:

<foo bar="a
          b
          c" />

then you can do something along the lines of:

@bar eq xf:normalizedString('a
          b
          c')

and get the answer true.
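
To illustrate the parser side of this, here is a minimal Python
sketch, assuming the standard library's xml.etree.ElementTree (a
non-validating parser, so with no DTD the attribute is treated as
CDATA). It only shows XML 1.0 attribute-value normalization, not the
XPath 2.0 typing discussed above; the variable names are my own.

```python
import xml.etree.ElementTree as ET

# The example element from above, with literal line feeds inside the
# attribute value (10 spaces of indentation before "b" and "c").
doc = '<foo bar="a\n' + " " * 10 + 'b\n' + " " * 10 + 'c" />'

bar = ET.fromstring(doc).get("bar")

# Each line feed is replaced by a space; nothing is collapsed, so the
# indentation survives as part of a run of 11 spaces.
has_newline = "\n" in bar
expected = "a" + " " * 11 + "b" + " " * 11 + "c"
print(repr(bar))
```

The value the parser reports has had its whitespace replaced, not
collapsed, which is why a string built with the same replacement is
needed on the other side of the comparison.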

I admit, though, that it isn't clear to me why certain of the built-in
derived data types from XML Schema get their own constructors while
others (e.g. xs:positiveInteger) don't.

> The restriction on not having #xD in the argument will be almost
> impossible to maintain in non XML uses of Xpath. XML normalises all
> line ends to #xA but in a non XML setting line ends may well be #xD
> or #xD#xA pairs, in which case normalising just #xA and declaring
> #xD an error will mean that an Xquery breaks just by moving the text
> file containing it from one place to another (unless every host
> language for xpath does a similar line end normalisation)

I agree that the definition given within the F&O WD is off the mark.
Partly, I think, this is because the definition of xs:normalizedString
in XML Schema is slightly strange, but partly it's to do with how
white space is handled.

The only difference between an xs:string and an xs:normalizedString in
XML Schema is the whiteSpace facet, which has a value of "preserve"
for xs:string and "replace" for xs:normalizedString.

According to the definition of the whiteSpace facet in XML Schema
(http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace), replacing
whitespace involves replacing all whitespace characters (tab, line
feed, and carriage return) with spaces. This differs markedly from
deleting all newline characters, which is what is described for
xf:normalizedString().

The XML Schema Datatypes Recommendation further says that the lexical
space of xs:normalizedString cannot contain the carriage return or tab
characters. However, this is guaranteed by the fact that
normalizedString values have white space replaced -- given the value
of an attribute or element, XML Schema will first replace all the
whitespace characters with a space character, and then check to see
whether the result is a valid normalizedString (one containing no
carriage returns or tab characters), which logically it has to be anyway.
Therefore the extra assertion that normalizedStrings must not contain
carriage returns or tab characters is superfluous.

In short, the xf:normalizedString() constructor should not limit what
characters are allowed in the argument, and should permit both
carriage returns and tab characters. To create the normalizedString,
it should replace all whitespace characters in the argument string with
space characters.
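
As a rough sketch of the difference, assuming Python and two
hypothetical helper names of my own (replace_whitespace for the XML
Schema "replace" behaviour, delete_newlines for the behaviour the WD
describes as summarised above):

```python
import re

# Characters affected by the whiteSpace facet value "replace":
# tab (#x9), line feed (#xA) and carriage return (#xD).
_XSD_WHITESPACE = re.compile(r"[\t\n\r]")


def replace_whitespace(value: str) -> str:
    """Replace each tab, line feed and carriage return with one space."""
    return _XSD_WHITESPACE.sub(" ", value)


def delete_newlines(value: str) -> str:
    """Hypothetical reading of the draft: simply drop line feeds."""
    return value.replace("\n", "")


print(repr(replace_whitespace("a\tb\r\nc")))
print(repr(delete_newlines("a\nb")))
```

Replacement keeps the length of the string and accepts any input,
including carriage returns and tabs, so moving a query between
platforms with different line-end conventions would not cause an
error.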

---

Whitespace in the XPath constructors is now handled properly (in my
opinion) for numeric values, where it is
collapsed (leading and trailing stripped, sequences of whitespace
replaced by a single space) prior to the value being assessed to see
if it fulfils the lexical requirements of the data type.

However, it is treated incorrectly for most other values. Aside from
xs:string and xs:normalizedString, all data types in XML Schema have a
'collapse' value for their whiteSpace facet. As with numbers,
whitespace collapsing should occur prior to the format of the value
being assessed. The reason this is important is that the following is
valid:

  <date xsi:type="xs:date"> 2002-01-04 </date>

since the leading and trailing whitespace is stripped prior to
checking. It will be incredibly confusing if:

  cast as xs:date(date)

raises an error because of the whitespace in the date element, despite
the fact that it validates fine using a schema validator.

This applies to all the constructors in the F&O document aside from
xf:string() (which should not undergo any changes to its whitespace)
and xf:normalizedString() (as above).

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
Received on Friday, 4 January 2002 16:33:14 UTC