AW: Whitespace handling in TTFMS

Hi Taki,

I updated the EXIficient library to follow the behavior defined in TTFMS:
"only those strings that solely consist of whitespaces are removed. When there are any non-whitespace characters in the string, it is kept intact."

However, this does not reduce the number of diffs.
I looked into SVG test-cases and noticed some differences in the header/options part.

e.g., the mermaid_kurt_cagle_.svg sample

OpenEXI:
 header
  common
   schemaId
  strict

while EXIficient's header looks as follows:
 header
  lesscommon
   preserve
    pis
  common
   schemaId

Somehow the implementations seem to be triggered differently (strict vs. pis)...

Thanks,

-- Daniel




________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Thursday, 15 October 2015 00:57
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

Regardless of what we define, I think we would probably need to define
common rules so that we will be able to resolve the encoding differences
that we already observed.

Please see below my responses to each of the points that you raised.

Thank you,

taki


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Wednesday, October 14, 2015 7:18 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

> Hi Taki,
>
> Thank you very much for sharing the TTFMS rules.
>
> I wonder whether we ought to add these rules to the Canonical EXI specification
> or whether we think this is specific to the application that generates the
> XML Infoset we use.
>
> Further, to avoid any confusion, I think we ought to be even more specific
> w.r.t. the rules you provided, for example:
>
> 1. What does "schema-less" mean?
>
> The current context does not have schema information or the entire stream is schema-less?
> What about
> a) the stream is schema-informed and we deal with a deviation
> b) the stream is schema-less but the current context has previously learned CH event
>

In EXI, each occurrence of simple data (data between s+e) in the infoset is
either typed or untyped. For simple data, I meant that typed text is
schema-informed and untyped text is schema-less. This is a bit different from
the distinction between a schema-informed stream and a schema-less stream, in
the sense that it is context-based.

> 2. Complex data behaviour
>
> The complex data rule says "For complex data (data between s+s, e+s, e+e),
> whitespaces nodes (i.e. strings that consist solely of whitespaces) are
> removed. "
>
> I assume this means that one can or should trim other strings with leading
> and trailing whitespaces?
>

According to the rule implemented in TTFMS, only those strings that solely
consist of whitespaces are removed. When there are any non-whitespace
characters in the string, it is kept intact.
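
A minimal sketch of that rule in Python, assuming "whitespace" means the four
XML whitespace characters (space, tab, CR, LF). The function name is
hypothetical and is not taken from TTFMS, OpenEXI, or EXIficient:

```python
XML_WHITESPACE = " \t\r\n"

def filter_complex_text(text):
    """Return None for a whitespace-only string, else the string unchanged."""
    if text.strip(XML_WHITESPACE) == "":
        return None  # whitespace-only node: removed
    return text      # kept intact, leading/trailing whitespace included
```

Note that a string such as "  a " is returned as is; nothing is trimmed.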

> 3. Effect of preserve.LexicalValue feature
>
> I am not sure about the correlation of preserve.LexicalValue feature.
> The main idea of this feature is to support typed data preservation such as
> transforming a float value like "1E2" as is and not as "100.0".
> However, one could also think this applies to characters and maybe also to
> whitespace characters.
>

The use of preserve.LexicalValue and xml:space are independent but related.

The xml:space attribute is a way for parts of a document to request that XML
tool chains preserve all whitespace in those parts. It is concerned only with
whitespace.

One way to correlate the two is that we could advise users to use the
preserve.LexicalValue option when the document being canonicalized contains
xml:space.

> 4. What does "simple data" really mean?
>
> Are b) and c) also considered to be simple data? I would think so, correct?
>
> a) SE(foo) <simpleData> EE
> b) SE(foo) AT(bla) <simpleData> EE
> c) SE(foo) NS(uri:foo) <simpleData> EE
>

Yes, that's the way I used the term "simple data".

> 5. Requirement to use undeclared production
>
> Let's pick the example you provided with schema-informed grammar, CH typed
> as xsd:int, and xml:space is "preserve". Does an EXI processor really need
> to fall back to undeclared productions to represent the value " 123 "?
> This seems somewhat contradictory to me given that XML Schema [1]
> defines that "For all *atomic* datatypes other than string the value of
> whiteSpace is collapse ..."
>
> Thanks,
>
> -- Daniel
>
> [1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#dt-whiteSpace
>

XML Schema associates data within a document with semantics (i.e. datatypes),
and this act of associating semantics is itself just one application in the
XML processing chain. From this point of view, all whitespace needs to be
preserved when the xml:space value is "preserve".

As noted above, we could request users to turn on preserve.LexicalValue so
that we don't need to fall back to undeclared productions.
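
To illustrate the tension, here is a rough Python sketch. It uses Python's
int as a stand-in for a typed xsd:int representation (int tolerates
surrounding whitespace, much like the "collapse" facet would strip it), and a
plain string as a stand-in for a preserved lexical value. Both function names
are hypothetical and are not part of any EXI library:

```python
def roundtrip_typed_int(lexical):
    """Encode as an integer value and decode again (stand-in for typed xsd:int)."""
    value = int(lexical)  # surrounding whitespace is discarded here
    return str(value)

def roundtrip_lexical(lexical):
    """With the lexical form preserved, the characters come back unchanged."""
    return lexical

original = " 123 "
print(repr(roundtrip_typed_int(original)))  # '123'
print(repr(roundtrip_lexical(original)))    # ' 123 '
```

The typed path cannot reproduce the leading and trailing space, which is why
xml:space="preserve" conflicts with a typed representation unless the lexical
form is kept.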

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


> ________________________________
> From: Takuki Kamiya [tkamiya@us.fujitsu.com]
> Sent: Wednesday, 14 October 2015 02:40
> To: public-exi@w3.org
> Subject: Whitespace handling in TTFMS
>
> Hi,
>
> A picture depicting the whitespace preservation rule currently implemented
> in TTFMS, comparing the original document with the EXI-encoded document,
> can be seen at [1].
>
> First of all, xml:space="preserve" is respected when it is in effect
> in the document whether it is schema-informed or schema-less.
> This means, all whitespaces are preserved.
>
> When the current xml:space is *not* "preserve", the following rules apply.
>
> If it is schema-informed:
>
>  - For simple data (data between s+e i.e. start-tag followed by end-tag),
>    apply lexical rule. We should use whiteSpace facet for this purpose.
>
>  - For complex data (data between s+s, e+s, e+e), whitespaces nodes (i.e.
>    strings that consist solely of whitespaces) are removed.
>
> If it is schema-less:
>
>  - Simple data (data between s+e) are all preserved.
>
>  - For complex data, it is the same as the schema-informed case.
>
> We could use similar rules for defining how whitespaces in the input infoset
> are treated.
>
> There is an issue when the encoder uses schema-informed strict-grammar
> and xml:space is "preserve". For example, " 123 " typed as xsd:int cannot
> preserve the leading and trailing whitespace when typed datatype
> representation is used.
>
> [1] https://www.w3.org/XML/EXI/wiki/File:WhiteSpace_handling_in_TTFMS.jpeg
>
> Takuki Kamiya
> Fujitsu Laboratories of America
>
>
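
The rules in the original message quoted above can be sketched as a single
decision function. This is only an illustration under stated assumptions: the
function and parameter names are hypothetical, the whiteSpace facet defaults
to "collapse" (the actual facet depends on the datatype), and "schema-informed"
is taken per text occurrence (typed vs. untyped), as discussed earlier in this
thread. It is not the TTFMS, OpenEXI, or EXIficient implementation.

```python
import re

XML_WS = " \t\r\n"

def apply_white_space_facet(text, facet):
    """Normalize text per the XML Schema whiteSpace facet value."""
    if facet == "preserve":
        return text
    replaced = re.sub(r"[\t\r\n]", " ", text)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet: {facet!r}")

def handle_text(text, *, between, xml_space_preserve=False,
                schema_informed=False, facet="collapse"):
    """Return the text to encode, or None if the text node is removed.

    between: "s+e" for simple data; "s+s", "e+s" or "e+e" for complex data.
    """
    if xml_space_preserve:
        return text  # xml:space="preserve": all whitespace is kept
    if between == "s+e":  # simple data
        if schema_informed:
            return apply_white_space_facet(text, facet)
        return text  # schema-less simple data is preserved
    # Complex data: same rule with or without a schema.
    if text.strip(XML_WS) == "":
        return None  # whitespace-only node is removed
    return text      # anything else is kept intact
```

For example, handle_text(" 123 ", between="s+e", schema_informed=True) yields
"123", while the same call without schema_informed returns " 123 " unchanged.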

Received on Wednesday, 21 October 2015 16:24:30 UTC