
Re: Whitespace handling in TTFMS

From: Peintner, Daniel (ext) <daniel.peintner.ext@siemens.com>
Date: Wed, 14 Oct 2015 14:18:10 +0000
To: Takuki Kamiya <tkamiya@us.fujitsu.com>, "public-exi@w3.org" <public-exi@w3.org>
Message-ID: <D94F68A44EB1954A91DE4AE9659C5A980FD82E74@DEFTHW99EH1MSX.ww902.siemens.net>
Hi Taki,

Thank you very much for sharing the TTFMS rules.

I wonder whether we ought to add these rules to the Canonical EXI specification or whether we think this is specific to the application that generates the XML Infoset we use.

Further, to avoid any confusion, I think we ought to be even more specific w.r.t. the rules you provided, for example on the following points.

1. What does "schema-less" mean?

Does it mean that the current context has no schema information, or that the entire stream is schema-less?
What about the following cases:
a) the stream is schema-informed and we deal with a deviation
b) the stream is schema-less but the current context has previously learned a CH event

2. Complex data behaviour

The complex data rule says "For complex data (data between s+s, e+s, e+e), whitespaces nodes (i.e. strings that consist solely of whitespaces) are removed. "

I assume this means that one can or should trim other strings with leading and trailing whitespace?

3. Effect of preserve.LexicalValue feature

I am not sure how the preserve.LexicalValue feature relates to these rules. The main idea of this feature is to support preserving the lexical form of typed data, such as representing a float value like "1E2" as is and not as "100.0".
However, one could also think it applies to characters, and maybe also to whitespace characters.

4. What does "simple data" really mean?

Are b) and c) also considered to be simple data? I would think so, correct?

a) SE(foo) <simpleData> EE
b) SE(foo) AT(bla) <simpleData> EE
c) SE(foo) NS(uri:foo) <simpleData> EE
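
To make the question concrete, here is a minimal Python sketch of one possible reading, assuming that AT and NS events do not stop the following CH from being "simple data". The function name is_simple_data and the flat list of event-kind strings are hypothetical, introduced only for illustration; they are not taken from TTFMS or any EXI implementation.

```python
# Event kinds that, under the assumed reading, may sit between SE and CH
# without changing the classification of the character data.
IGNORED_BEFORE_CONTENT = {"AT", "NS"}


def is_simple_data(events, ch_index):
    """Return True if the CH event at ch_index counts as simple data.

    events is a list of event-kind strings such as "SE", "AT", "NS",
    "CH" and "EE". Assumption: CH is simple data when the nearest
    preceding event other than AT/NS is SE and the next event is EE.
    """
    if events[ch_index] != "CH":
        raise ValueError("event at ch_index is not CH")
    # Walk backwards over attribute and namespace declaration events.
    i = ch_index - 1
    while i >= 0 and events[i] in IGNORED_BEFORE_CONTENT:
        i -= 1
    preceded_by_start = i >= 0 and events[i] == "SE"
    followed_by_end = ch_index + 1 < len(events) and events[ch_index + 1] == "EE"
    return preceded_by_start and followed_by_end
```

Under this assumption, cases a), b) and c) are all classified as simple data, while a CH that follows an EE or precedes an SE is not.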

5. Requirement to use undeclared production

Let's pick the example you provided with a schema-informed grammar, CH typed as xsd:int, and xml:space set to "preserve". Does an EXI processor really need to fall back to undeclared productions to represent the value " 123 "?
This seems somewhat contradictory to me, given that XML Schema [1] defines that "For all ·atomic· datatypes other than string the value of whiteSpace is collapse ..."


-- Daniel

[1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#dt-whiteSpace

From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Wednesday, 14 October 2015 02:40
To: public-exi@w3.org
Subject: Whitespace handling in TTFMS


A picture depicting the whitespace preservation rules currently implemented
in TTFMS, comparing the original document with the EXI-encoded document,
can be seen at [1].

First of all, xml:space="preserve" is respected whenever it is in effect
in the document, whether the stream is schema-informed or schema-less.
This means that all whitespace is preserved.

When the current xml:space is *not* "preserve", the following rules apply.

If it is schema-informed:

 - For simple data (data between s+e, i.e. a start-tag followed by an end-tag),
   apply the lexical rules. We should use the whiteSpace facet for this purpose.

 - For complex data (data between s+s, e+s, e+e), whitespace nodes (i.e.
   strings that consist solely of whitespace) are removed.

If it is schema-less:

 - Simple data (data between s+e) are all preserved.

 - For complex data, the rule is the same as in the schema-informed case.
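
The rules above can be sketched as a single decision function. This is a minimal Python illustration, not the TTFMS code: the names Context, apply_white_space_facet and process_characters are hypothetical, and the facet argument stands in for whatever whiteSpace facet value the schema assigns to the element's type.

```python
from enum import Enum

XML_WHITESPACE = " \t\r\n"


class Context(Enum):
    SIMPLE = "simple"    # character data between a start-tag and an end-tag (s+e)
    COMPLEX = "complex"  # character data between s+s, e+s or e+e


def apply_white_space_facet(text, facet):
    """Apply an XML Schema whiteSpace facet value to text."""
    if facet == "preserve":
        return text
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = text.replace("\t", " ").replace("\n", " ").replace("\r", " ")
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # Collapse runs of spaces and drop leading and trailing spaces.
        return " ".join(part for part in replaced.split(" ") if part)
    raise ValueError("unknown whiteSpace facet: " + facet)


def process_characters(text, context, xml_space_preserve, schema_informed,
                       facet="collapse"):
    """Return the text to encode, or None if the character data is dropped."""
    if xml_space_preserve:
        return text  # xml:space="preserve" wins in both modes
    if context is Context.COMPLEX:
        # Whitespace-only nodes are removed, schema-informed or not.
        if text.strip(XML_WHITESPACE) == "":
            return None
        return text
    if schema_informed:
        return apply_white_space_facet(text, facet)
    return text  # schema-less simple data is preserved as is
```

For example, process_characters(" 123 ", Context.SIMPLE, False, True) yields "123", whereas the same call with schema_informed set to False leaves " 123 " untouched.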

We could use similar rules to define how whitespace in the input infoset
is treated.

There is an issue when the encoder uses a schema-informed strict grammar
and xml:space is "preserve". For example, " 123 " typed as xsd:int cannot
keep its leading and trailing whitespace when the typed datatype
representation is used.
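
The loss can be illustrated with a small Python sketch. It only mimics the effect of a typed integer representation, where the numeric value is all that is carried; encode_typed_int and decode_typed_int are hypothetical helpers, not part of any EXI library, and no actual EXI bitstream is produced.

```python
def encode_typed_int(lexical):
    """Mimic a typed encoder: only the integer value is kept."""
    return int(lexical.strip(" \t\r\n"))


def decode_typed_int(value):
    """Mimic a typed decoder: emit a lexical form for the integer value."""
    return str(value)


original = " 123 "
round_tripped = decode_typed_int(encode_typed_int(original))
# round_tripped is "123": the surrounding spaces cannot be recovered
```

Since only the integer survives encoding, the decoder has nothing from which to rebuild the original spaces, whatever xml:space says.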

[1] https://www.w3.org/XML/EXI/wiki/File:WhiteSpace_handling_in_TTFMS.jpeg

Takuki Kamiya
Fujitsu Laboratories of America
Received on Wednesday, 14 October 2015 14:18:46 UTC
