Whitespace handling in TTFMS from Takuki Kamiya on 2015-10-14 (public-exi@w3.org from October 2015)

From: Takuki Kamiya <tkamiya@us.fujitsu.com>
Date: Tue, 13 Oct 2015 17:40:47 -0700
To: "public-exi@w3.org" <public-exi@w3.org>
Message-ID: <23204FACB677D84EBD57175AB7B5A71C02FE37FD48AF@FMSAMAIL.fmsa.local>

Hi,

A picture depicting the whitespace preservation rules that TTFMS currently
applies when comparing the original document with the EXI-encoded document
can be seen at [1].

First of all, xml:space="preserve" is respected whenever it is in effect
in the document, whether processing is schema-informed or schema-less.
This means all whitespace is preserved.

When the current xml:space is *not* "preserve", the following rules apply.

If it is schema-informed:

 - For simple data (data between s+e, i.e. a start-tag followed by an
   end-tag), apply the lexical rule. We should use the whiteSpace facet
   for this purpose.

 - For complex data (data between s+s, e+s, e+e), whitespace nodes (i.e.
   strings that consist solely of whitespace) are removed.

If it is schema-less:

 - Simple data (data between s+e) are all preserved.

 - For complex data, the rule is the same as in the schema-informed case.
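
As a rough sketch of the rules above (this is not TTFMS code; the function
names, the "simple"/"complex" labels and the default facet value are all
illustrative assumptions, and non-whitespace text in complex content is
assumed to be kept unchanged, which the rules above do not state):

```python
import re

XML_WS = " \t\r\n"  # the four XML whitespace characters


def apply_whitespace_facet(value, facet):
    """Normalize a string according to an XML Schema whiteSpace facet value."""
    if facet == "preserve":
        return value
    if facet not in ("replace", "collapse"):
        raise ValueError("unknown whiteSpace facet: %r" % facet)
    # "replace": every tab, LF and CR becomes a single space
    value = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return value
    # "collapse": additionally squeeze runs of spaces and trim both ends
    return re.sub(r" +", " ", value).strip(" ")


def expected_text(text, position, xml_space_preserve, schema_informed,
                  facet="collapse"):
    """Return the text expected after the round trip, or None if it is removed.

    position is "simple" (text between s+e) or "complex" (s+s, e+s, e+e).
    facet is the whiteSpace facet of the simple type (hypothetical parameter;
    only consulted in the schema-informed, simple-data case).
    """
    if xml_space_preserve:
        return text  # xml:space="preserve" wins in every mode
    if position == "complex":
        # whitespace-only nodes are removed, schema-informed or not
        if text.strip(XML_WS) == "":
            return None
        return text  # assumption: other mixed-content text is left untouched
    if position == "simple":
        if schema_informed:
            return apply_whitespace_facet(text, facet)
        return text  # schema-less: simple data is preserved as-is
    raise ValueError("unknown position: %r" % position)
```

For instance, expected_text("  a  b ", "simple", False, True) yields "a b",
while the same call with schema_informed=False returns the input unchanged.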

We could use similar rules to define how whitespace in the input infoset
is treated.

There is an issue when the encoder uses a schema-informed strict grammar
and xml:space is "preserve". For example, " 123 " typed as xsd:int cannot
preserve its leading and trailing whitespace when the typed datatype
representation is used.
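
To illustrate why the whitespace is lost, here is a minimal sketch. It does
not perform real EXI integer encoding; the function names are hypothetical,
and a Python int merely stands in for the typed value carried in the stream:

```python
XML_WS = " \t\r\n"  # the four XML whitespace characters


def encode_int_typed(lexical):
    """Stand-in for a typed xsd:int encoding: only the numeric value is kept."""
    return int(lexical.strip(XML_WS))


def decode_int_typed(value):
    """Stand-in for decoding: a lexical form is regenerated from the value."""
    return str(value)


def round_trip_int(lexical):
    """Encode then decode; any surrounding whitespace cannot come back."""
    return decode_int_typed(encode_int_typed(lexical))
```

Here round_trip_int(" 123 ") gives "123": the typed value has no place to
record the surrounding spaces, so xml:space="preserve" cannot be honored.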

[1] https://www.w3.org/XML/EXI/wiki/File:WhiteSpace_handling_in_TTFMS.jpeg

Takuki Kamiya
Fujitsu Laboratories of America

Received on Wednesday, 14 October 2015 00:41:28 UTC