RE: Whitespace handling in TTFMS from Takuki Kamiya on 2015-10-14 (public-exi@w3.org from October 2015)

From: Takuki Kamiya <tkamiya@us.fujitsu.com>
Date: Wed, 14 Oct 2015 15:57:43 -0700
To: "Peintner, Daniel (ext)" <daniel.peintner.ext@siemens.com>, "public-exi@w3.org" <public-exi@w3.org>
Message-ID: <23204FACB677D84EBD57175AB7B5A71C02FE37FD49C6@FMSAMAIL.fmsa.local>

Hi Daniel,

Regardless of what we define, I think we would probably need to define
common rules so that we will be able to resolve the encoding differences
that we already observed.

Please see below my responses to each of the points that you raised.

Thank you,

taki


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com] 
Sent: Wednesday, October 14, 2015 7:18 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

> Hi Taki,
> 
> Thank you very much for sharing the TTFMS rules.
> 
> I wonder whether we ought to add these rules to the Canonical EXI specification 
> or whether we think this is specific to the application that generates the 
> XML Infoset we use.
> 
> Further, to avoid any confusion I think we ought to be even more specific 
> w.r.t. the rules you provided, such as:
> 
> 1. What does "schema-less" mean?
> 
> The current context does not have schema information or the entire stream is schema-less?
> What about
> a) the stream is schema-informed and we deal with a deviation
> b) the stream is schema-less but the current context has previously learned CH event
> 

In EXI, each occurrence of simple data (data between s+e) in the infoset is 
either typed or untyped. For simple data, I meant that typed text is
schema-informed and untyped text is schema-less. This differs slightly from
the distinction between a schema-informed stream and a schema-less stream,
in that it is context-based.
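
As a rough illustration of the simple data rule (not TTFMS code), the
sketch below treats text as typed when a whiteSpace facet is known for the
current context and as untyped otherwise. The function names and the
facet/xml_space_preserve parameters are made up for this example; the
replace and collapse behaviour follows the XML Schema whiteSpace facet.

```python
import re


def apply_whitespace_facet(text, facet):
    """Normalize text per the XML Schema whiteSpace facet value."""
    if facet == "preserve":
        return text
    # "replace": each tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", text)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # "collapse": after replace, squeeze runs of spaces and trim the ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError("unknown whiteSpace facet: " + repr(facet))


def encode_simple_data(text, facet=None, xml_space_preserve=False):
    """Hypothetical helper: facet=None models untyped (schema-less) text."""
    if xml_space_preserve or facet is None:
        # xml:space="preserve" or untyped text: keep every character.
        return text
    return apply_whitespace_facet(text, facet)
```

With this model, the same string is kept as is in an untyped context and
normalized in a typed one, independent of whether the stream as a whole is
schema-informed.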

> 2. Complex data behaviour
> 
> The complex data rule says "For complex data (data between s+s, e+s, e+e), 
> whitespaces nodes (i.e. strings that consist solely of whitespaces) are 
> removed. "
> 
> I assume this means that one can or should trim other strings with leading 
> and trailing whitespaces?
> 

According to the rule implemented in TTFMS, only those strings that solely 
consist of whitespaces are removed. When there are any non-whitespace 
characters in the string, it is kept intact.
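
A minimal sketch of that complex data rule, assuming the character chunks
found between s+s, e+s and e+e are available as a list of strings. The
helper names are hypothetical, and treating the empty string as "not
whitespace-only" is an assumption of this example.

```python
XML_WHITESPACE = " \t\n\r"


def is_whitespace_only(text):
    """True for a non-empty string made up solely of XML whitespace."""
    return bool(text) and all(ch in XML_WHITESPACE for ch in text)


def filter_complex_content(chunks, xml_space_preserve=False):
    """Drop whitespace-only chunks; keep every other chunk untouched."""
    if xml_space_preserve:
        # xml:space="preserve" in effect: nothing is removed.
        return list(chunks)
    return [chunk for chunk in chunks if not is_whitespace_only(chunk)]
```

Note that " text " keeps its leading and trailing spaces here; no trimming
is applied to strings that contain non-whitespace characters.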

> 3. Effect of preserve.LexicalValue feature
> 
> I am not sure about the correlation of preserve.LexicalValue feature. 
> The main idea of this feature is to support typed data preservation such as 
> transforming a float value like "1E2" as is and not as "100.0".
> However, one could also think this applies to characters and maybe also to 
> whitespace characters.
> 

preserve.LexicalValue and xml:space are independent mechanisms, though they
are related.

The xml:space attribute is a way for parts of a document to request XML tool
chains to preserve all whitespace in those parts. It concerns only whitespace.

One way to correlate the two is that we could advise users to use 
preserve.LexicalValue option when the document being canonicalized contains 
xml:space.

> 4. What does "simple data" really mean?
> 
> Are b) and c) also considered to be simple data? I would think so, correct?
> 
> a) SE(foo) <simpleData> EE
> b) SE(foo) AT(bla) <simpleData> EE
> c) SE(foo) NS(uri:foo) <simpleData> EE
> 

Yes, that's the way I used the term "simple data".

> 5. Requirement to use undeclared production
> 
> Let's pick the example you provided with schema-informed grammar, CH typed 
> as xsd:int,  and xml:space is "preserve". Does an EXI processor really need 
> to fall back to undeclared productions for representing the value  " 123 "?
> This seems to be somehow contradictory to me given that XML Schema [1] 
> defines that "For all *atomic* datatypes other than string the value of 
> whiteSpace is collapse ..."
> 
> Thanks,
> 
> -- Daniel
> 
> [1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#dt-whiteSpace
> 

XML Schema associates data within a document with semantics (i.e. datatypes),
and that act of associating semantics is itself just one application in the
XML processing chain. From this point of view, all whitespace needs to be
preserved when the xml:space value is "preserve".

As noted above, we could request users to turn on preserve.LexicalValue so
that we don't need to fall back to undeclared productions.
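
To make the " 123 " case concrete, here is a toy model (not an EXI codec)
of the two representations. It assumes a typed xsd:int round trip keeps
only the integer value, while a lexical round trip keeps the characters as
written; both function names are invented for this example.

```python
def typed_int_round_trip(lexical):
    """Model a typed xsd:int encoding: only the integer value survives."""
    value = int(lexical.strip(" \t\n\r"))
    return str(value)


def lexical_round_trip(lexical):
    """Model an encoding that keeps the lexical form character for character."""
    return lexical


original = " 123 "
print(repr(typed_int_round_trip(original)))  # whitespace is lost
print(repr(lexical_round_trip(original)))    # whitespace is kept
```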

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


> ________________________________
> Von: Takuki Kamiya [tkamiya@us.fujitsu.com]
> Gesendet: Mittwoch, 14. Oktober 2015 02:40
> An: public-exi@w3.org
> Betreff: Whitespace handling in TTFMS
> 
> Hi,
> 
> A picture depicting the whitespace preservation rule currently implemented
> in TTFMS, comparing the original document with the EXI-encoded document,
> can be seen at [1].
> 
> First of all, xml:space="preserve" is respected when it is in effect
> in the document whether it is schema-informed or schema-less.
> This means, all whitespaces are preserved.
> 
> When the current xml:space is *not* "preserve", the following rules apply.
> 
> If it is schema-informed:
> 
>  - For simple data (data between s+e i.e. start-tag followed by end-tag),
>    apply lexical rule. We should use whiteSpace facet for this purpose.
> 
>  - For complex data (data between s+s, e+s, e+e), whitespaces nodes (i.e.
>    strings that consist solely of whitespaces) are removed.
> 
> If it is schema-less:
> 
>  - Simple data (data between s+e) are all preserved.
> 
>  - For complex data, it is same as schema-informed case.
> 
> We could use similar rules for defining how whitespace in the input infoset
> is treated.
> 
> There is an issue when the encoder uses schema-informed strict-grammar
> and xml:space is "preserve". For example, " 123 " typed as xsd:int cannot
> preserve the leading and trailing whitespace when typed datatype
> representation is used.
> 
> [1] https://www.w3.org/XML/EXI/wiki/File:WhiteSpace_handling_in_TTFMS.jpeg
> 
> Takuki Kamiya
> Fujitsu Laboratories of America
> 
>
Received on Wednesday, 14 October 2015 22:58:28 UTC