AW: Whitespace handling in TTFMS

Hi Taki,

It seems I mixed up the diff files. Sorry for the confusion!
The latest check-in creates a clean run. Great!

We now ought to
* create more dedicated test-cases
* agree on whitespace handling rules in Canonical EXI

Thanks,

-- Daniel


________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Wednesday, November 25, 2015 12:03 AM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

I just checked the event sequences of the encoded OpenOffice document.

I wonder if it is EXIficient that outputs the CH(" ") between two EE events.

The file sizes are consistent with my suspicion. The EXIficient result is
42,273 bytes, and the OpenEXI result is 42,267 bytes.

Can you check?

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Tuesday, November 24, 2015 9:08 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

Hi Taki,

Thank you for finding this issue. This problem has been fixed in my local repository.

With regard to the other issue:

# data\OpenOffice\Exclusive XML Canonicalization Version 1.fodt.exi%px%%%%%%
related to whitespace between EE and EE

see XML snippet

<text:p><text:a ...>http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/</text:a> </text:p>

According to my understanding the event sequence should be as follows.

SE(p) SE(a) CH("http...") EE EE

while to me it looks like OpenEXI creates the following sequence

SE(p) SE(a) CH("http...") EE CH(" ") EE

My reasoning: The whitespace after EE belongs to a complex type (i.e. element p) and the TTFMS rule for complex data says "whitespaces nodes (i.e. strings that consist solely of whitespaces) are removed"
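
A minimal sketch of this reasoning, assuming a deliberately simplified event model. The `Event` class and the `dropWhitespaceOnlyCh` helper are hypothetical names used for illustration only; they are not part of EXIficient, OpenEXI, or TTFMS. A CH event is treated as simple data only when it sits directly between an SE and an EE; any other whitespace-only CH is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ComplexWhitespaceFilter {
    enum Kind { SE, CH, EE }

    /** Hypothetical, simplified event: a name for SE, text for CH, nothing for EE. */
    static final class Event {
        final Kind kind;
        final String value;
        Event(Kind kind, String value) { this.kind = kind; this.value = value; }
        @Override public String toString() {
            return kind == Kind.EE ? "EE" : kind + "(" + value + ")";
        }
    }

    /** True if the string consists solely of XML whitespace (space, tab, CR, LF). */
    static boolean isWhitespaceOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') return false;
        }
        return true;
    }

    /** Drops whitespace-only CH events unless they sit directly between SE and EE. */
    static List<Event> dropWhitespaceOnlyCh(List<Event> in) {
        List<Event> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Event e = in.get(i);
            if (e.kind == Kind.CH && i > 0 && i + 1 < in.size()) {
                boolean simple = in.get(i - 1).kind == Kind.SE
                        && in.get(i + 1).kind == Kind.EE;
                if (!simple && isWhitespaceOnly(e.value)) continue; // complex data
            }
            out.add(e);
        }
        return out;
    }

    /** Runs the filter on the sequence attributed to OpenEXI above. */
    static String demo() {
        List<Event> parsed = Arrays.asList(
                new Event(Kind.SE, "p"), new Event(Kind.SE, "a"),
                new Event(Kind.CH, "http..."), new Event(Kind.EE, null),
                new Event(Kind.CH, " "), new Event(Kind.EE, null));
        StringBuilder sb = new StringBuilder();
        for (Event e : dropWhitespaceOnlyCh(parsed)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // SE(p) SE(a) CH(http...) EE EE
    }
}
```

The sketch ignores xml:space, comments, and processing instructions, which would need their own rules.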

Further, I wonder whether we need to specify the whitespace rules in Canonical EXI. Without them, I think we can hardly provide interoperability.

Thanks,

-- Daniel





________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Tuesday, November 24, 2015 12:17 AM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

For the issue of XML comment, say, we have the following XML.

<None>  <!-- abc -->   </None>

EXIficient seems to generate EXI as a sequence of (SE,CM,CH,EE)
whereas OpenEXI's result is (SE,CH,CM,CH,EE).
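
To see which text nodes an XML parser reports for this input before any whitespace rule is applied, one can dump the event sequence with the JDK's StAX reader. This is only a sketch for inspecting the parsed input; the `eventSequence` helper and the SE/CH/CM/EE labels are illustrative names, and it does not reflect how either EXI implementation reads its input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CommentWhitespaceEvents {
    /** Maps StAX events to EXI-style labels and joins them with commas. */
    static String eventSequence(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: events.add("SE"); break;
                    case XMLStreamConstants.END_ELEMENT:   events.add("EE"); break;
                    case XMLStreamConstants.COMMENT:       events.add("CM"); break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.SPACE:
                    case XMLStreamConstants.CDATA:         events.add("CH"); break;
                    default: break; // document start/end and others are ignored
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return String.join(",", events);
    }

    public static void main(String[] args) {
        System.out.println(eventSequence("<None>  <!-- abc -->   </None>"));
    }
}
```

Both whitespace runs are separate text nodes in the parsed input, so the difference between the two results comes down to which of them each encoder decides to drop.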

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America



-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Friday, November 20, 2015 7:09 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

Hi Taki,

Unfortunately, I uploaded the wrong EXIficient library. Now it should be fine.

When running c14n for config/testCases-restricted/all-v1.xml, I now see two remaining issues.

# data\Miscellaneous\periodic.exi%c%%%%%%

This seems to be related to whitespace before comments.

# data\OpenOffice\Exclusive XML Canonicalization Version 1.fodt.exi%px%%%%%%
issue relates to whitespace between EE and EE

see snippet

 <text:a ...>http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/</text:a> </text:p>


All other differences went away.

I will check for the right behavior over the next few days.

-- Daniel







________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Thursday, November 19, 2015 8:16 PM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

EXIficient seems to add a timezone with value 0 to date/dateTime values
that do not have a timezone specified in the original XML.

For example, in the file "FixML-4.4/AllocationInstructionAck.xml",
"2003-10-30" gets encoded as "2003-10-30Z", etc.

Can you check this?
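
One way to check whether a lexical date carries a timezone is the standard javax.xml.datatype API, as in the sketch below. The `hasTimezone` helper is a hypothetical name; the sketch only shows that "2003-10-30" and "2003-10-30Z" differ at the value level, and it says nothing about how EXIficient handles dates internally.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeConstants;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

public class TimezonePresence {
    /** Returns true if the xsd:date / xsd:dateTime lexical form specifies a timezone. */
    static boolean hasTimezone(String lexical) {
        try {
            XMLGregorianCalendar cal =
                    DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);
            // FIELD_UNDEFINED means the lexical form carried no timezone at all.
            return cal.getTimezone() != DatatypeConstants.FIELD_UNDEFINED;
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasTimezone("2003-10-30"));  // false
        System.out.println(hasTimezone("2003-10-30Z")); // true
    }
}
```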

taki


-----Original Message-----
From: Takuki Kamiya [mailto:tkamiya@us.fujitsu.com]
Sent: Monday, November 16, 2015 11:40 PM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

OpenEXI now supports proper whitespace preservation according to
xml:space settings. With this update, architetto_francesco_ro_01.svg
encoding differences went away.

I still need to do NS sorting...

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Thursday, November 12, 2015 4:42 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

Hi Taki,

Thank you for looking into the issue.
You are right, one of the last check-ins must have resolved this issue. Thanks!

After rebuilding the framework and re-running the tests, I still encounter 21 differences for config/testCases-restricted/all-v1.xml.

I looked into some of the failures and would like to work through the issues step by step.

e.g.,

# data\SVGTinyCleaned\animals\architetto_francesco_ro_01.svg.exi%p%%%%%%

The issue seems to be related to whitespace. The original XML document contains the attribute xml:space="preserve" in its root element <svg>.

Hence, I think all sub-elements must preserve any whitespace. This seems not to be the case for OpenEXI. Can you check that behavior?
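
The inheritance of xml:space can be sketched with a stack that follows element nesting. This is an illustration using the JDK's StAX reader, not OpenEXI's or EXIficient's implementation; `preservedWhitespace` is a hypothetical helper that returns the whitespace-only text nodes falling inside a "preserve" scope.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlSpaceScope {
    /** Collects whitespace-only text nodes that are in an xml:space="preserve" scope. */
    static List<String> preservedWhitespace(String xml) {
        List<String> kept = new ArrayList<>();
        Deque<Boolean> preserve = new ArrayDeque<>(); // one entry per open element
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int ev = r.next();
                if (ev == XMLStreamConstants.START_ELEMENT) {
                    // Inherit the parent's setting unless xml:space is given here.
                    boolean current = !preserve.isEmpty() && preserve.peek();
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        if ("xml".equals(r.getAttributePrefix(i))
                                && "space".equals(r.getAttributeLocalName(i))) {
                            current = "preserve".equals(r.getAttributeValue(i));
                        }
                    }
                    preserve.push(current);
                } else if (ev == XMLStreamConstants.END_ELEMENT) {
                    preserve.pop();
                } else if ((ev == XMLStreamConstants.CHARACTERS
                        || ev == XMLStreamConstants.SPACE) && r.isWhiteSpace()) {
                    if (!preserve.isEmpty() && preserve.peek()) kept.add(r.getText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        String svg = "<svg xml:space=\"preserve\"><g> <text>a</text> </g></svg>";
        System.out.println(preservedWhitespace(svg).size());
    }
}
```

A descendant can switch back with xml:space="default", which the sketch handles by pushing false for that subtree.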

# data\OpenOffice\XML Schema Part 1 Structures Second Edition.fodt.exi%px%%%%%%
# data\OpenOffice\XML Schema Part 2 Datatypes Second Edition.fodt.exi%px%%%%%%

There the differences seem to stem from namespace declarations.

Can you check whether OpenEXI sorts NS declarations according to NS prefix?
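
As a sketch of what sorting by prefix could look like, the following orders a set of declarations with a TreeMap. The `sortedDeclarations` helper, the NS(prefix=uri) rendering, and the urn:example URIs are made up for illustration. The comparison used here is Java's String ordering (by UTF-16 code unit), which is an assumption; the exact ordering and the position of the default namespace would have to follow whatever Canonical EXI specifies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class NamespaceSorting {
    /** Renders namespace declarations sorted by prefix (String natural order). */
    static String sortedDeclarations(Map<String, String> prefixToUri) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(prefixToUri).entrySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append("NS(").append(e.getKey()).append('=')
              .append(e.getValue()).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> decls = new HashMap<>();
        decls.put("text", "urn:example:text");     // placeholder URIs
        decls.put("office", "urn:example:office");
        decls.put("style", "urn:example:style");
        System.out.println(sortedDeclarations(decls));
    }
}
```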

Let's see whether fixing those two issues resolves some more of the differences.

Thanks,

-- Daniel









________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Wednesday, November 11, 2015 11:32 PM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

In my latest run, I did not see differences between EXIficient and
OpenEXI wrt. mermaid_kurt_cagle_.svg test case.

In both cases, its encoding seems to have resulted in the following
file name.

mermaid_kurt_cagle_.svg.exi%p%%%%%%

This means, in both cases, the same set of options were used.

I also took a look at TTFMS framework code to see what it currently does.

In TestCaseParameters.java, at line 276, strict mode setting is being
revoked when PI preservation is on. Here is a snippet.

if (isIot) {
  // The "strict" element MUST NOT appear in an EXI options document when one of
  // "dtd", "prefixes", "comments", "pis" or "selfContained" element is present.
  if (preserves.contains(PreserveParam.comments) ||
      preserves.contains(PreserveParam.dtds) ||
      preserves.contains(PreserveParam.prefixes) ||
      preserves.contains(PreserveParam.pis) ||
      selfContainedQNames.length > 0) {
    schemaDeviations = true; // revoke "strict" mode
  }
}

Can you try to refresh your CVS working copy, and rebuild the framework?

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Wednesday, October 21, 2015 9:24 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

Hi Taki,

I updated the EXIficient library to change the behavior as defined in TTFMS:
"only those strings that solely consist of whitespaces are removed. When there are any non-whitespace characters in the string, it is kept intact."

However, this does not reduce the number of diffs.
I looked into SVG test-cases and noticed some differences in the header/options part.

e.g., the mermaid_kurt_cagle_.svg sample

OpenEXI:
 header
  common
   schemaId
  strict

while EXIficient's header looks as follows:
 header
  lesscommon
   preserve
    pis
  common
   schemaId

Somehow the implementations seem to be triggered differently (strict vs. pis)...

Thanks,

-- Daniel




________________________________
From: Takuki Kamiya [tkamiya@us.fujitsu.com]
Sent: Thursday, October 15, 2015 12:57 AM
To: Peintner, Daniel (ext); public-exi@w3.org
Subject: RE: Whitespace handling in TTFMS

Hi Daniel,

Regardless of what we define, I think we would probably need to define
common rules so that we will be able to resolve the encoding differences
that we already observed.

Please see below my responses to each of the points that you raised.

Thank you,

taki


-----Original Message-----
From: Peintner, Daniel (ext) [mailto:daniel.peintner.ext@siemens.com]
Sent: Wednesday, October 14, 2015 7:18 AM
To: Takuki Kamiya; public-exi@w3.org
Subject: AW: Whitespace handling in TTFMS

> Hi Taki,
>
> Thank you very much for sharing the TTFMS rules.
>
> I wonder whether we ought to add these rules to the Canonical EXI specification
> or whether we think this is specific to the application that generates the
> XML Infoset we use.
>
> Further, to avoid any confusion I think we ought to be even more specific
> w.r.t. the rules you provided, for example:
>
> 1. What does "schema-less" mean?
>
> The current context does not have schema information or the entire stream is schema-less?
> What about
> a) the stream is schema-informed and we deal with a deviation
> b) the stream is schema-less but the current context has previously learned CH event
>

In EXI, each occurrence of simple data (data between s+e) in the infoset is
either typed or untyped. For simple data, I meant that typed text is schema-informed
and untyped text is schema-less. This is a bit different from the distinction
between a schema-informed stream and a schema-less stream, in the sense that it is
context-based.

> 2. Complex data behaviour
>
> The complex data rule says "For complex data (data between s+s, e+s, e+e),
> whitespaces nodes (i.e. strings that consist solely of whitespaces) are
> removed. "
>
> I assume this means that one can or should trim other strings with leading
> and trailing whitespaces?
>

According to the rule implemented in TTFMS, only those strings that solely
consist of whitespaces are removed. When there are any non-whitespace
characters in the string, it is kept intact.
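
The TTFMS rules as stated in this thread can be summarized as a small decision function. This is a sketch under the assumptions spelled out in the thread (typed means schema-informed in the current context, simple data means text between s+e); `decide`, `Action`, and `applyComplexRule` are hypothetical names, and applying the whiteSpace facet itself is not implemented here.

```java
import java.util.Optional;

public class TtfmsWhitespaceRules {
    enum Action { PRESERVE, APPLY_WHITESPACE_FACET, REMOVE_IF_WHITESPACE_ONLY }

    /**
     * @param xmlSpacePreserve whether xml:space="preserve" is in effect
     * @param typed            whether the text is typed (schema-informed context)
     * @param simpleData       whether the text sits between s+e (else s+s, e+s, e+e)
     */
    static Action decide(boolean xmlSpacePreserve, boolean typed, boolean simpleData) {
        if (xmlSpacePreserve) return Action.PRESERVE;
        if (!simpleData) return Action.REMOVE_IF_WHITESPACE_ONLY;
        return typed ? Action.APPLY_WHITESPACE_FACET : Action.PRESERVE;
    }

    /** Complex-data rule: drop whitespace-only strings, keep everything else intact. */
    static Optional<String> applyComplexRule(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return Optional.of(text); // not trimmed, returned unchanged
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(decide(false, true, false));  // REMOVE_IF_WHITESPACE_ONLY
        System.out.println(applyComplexRule(" abc "));   // Optional[ abc ]
        System.out.println(applyComplexRule("  \n"));    // Optional.empty
    }
}
```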

> 3. Effect of preserve.LexicalValue feature
>
> I am not sure about the correlation of preserve.LexicalValue feature.
> The main idea of this feature is to support typed data preservation such as
> transforming a float value like "1E2" as is and not as "100.0".
> However, one could also think this applies to characters and maybe also to
> whitespace characters.
>

preserve.LexicalValue and xml:space are independent features, but they are related.

The xml:space attribute is a way for parts of a document to request that XML tool
chains preserve all whitespace in those parts. It concerns only whitespace.

One way to correlate the two is that we could advise users to use
preserve.LexicalValue option when the document being canonicalized contains
xml:space.

> 4. What does "simple data" really mean?
>
> Are b) and c) also considered to be simple data? I would think so, correct?
>
> a) SE(foo) <simpleData> EE
> b) SE(foo) AT(bla) <simpleData> EE
> c) SE(foo) NS(uri:foo) <simpleData> EE
>

Yes, that's the way I used the term "simple data".

> 5. Requirement to use undeclared production
>
> Let's pick the example you provided with schema-informed grammar, CH typed
> as xsd:int, and xml:space is "preserve". Does an EXI processor really need
> to fall back to undeclared productions to represent the value " 123 "?
> This seems to be somehow contradictory to me given that XML Schema [1]
> defines that "For all *atomic* datatypes other than string the value of
> whiteSpace is collapse ..."
>
> Thanks,
>
> -- Daniel
>
> [1] http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#dt-whiteSpace
>

XML schema associates data within a document with semantics (i.e. datatypes),
and its action of associating semantics is also just an application in the processing
chains of XML. From this point of view, all whitespaces need to be preserved
when xml:space value is "preserve".

As noted above, we could request users to turn on preserve.LexicalValue so
that we don't need to fall back to undeclared productions.
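
The reason a typed representation cannot carry the original characters can be illustrated with plain Java number types. This is only an analogy: Java's int and float stand in for a typed value, and the `viaTypedInt` and `viaTypedFloat` helpers are hypothetical; the actual EXI datatype encodings are different. The point is that once the text is turned into a value, the surrounding whitespace and the original lexical form are gone.

```java
public class LexicalRoundTrip {
    /** Parses the text as an integer value and writes the value back as text. */
    static String viaTypedInt(String lexical) {
        int value = Integer.parseInt(lexical.trim()); // whitespace is not part of the value
        return Integer.toString(value);
    }

    /** Parses the text as a float value and writes the value back as text. */
    static String viaTypedFloat(String lexical) {
        float value = Float.parseFloat(lexical.trim());
        return Float.toString(value);
    }

    public static void main(String[] args) {
        System.out.println("[" + viaTypedInt(" 123 ") + "]"); // [123]
        System.out.println(viaTypedFloat("1E2"));             // 100.0
    }
}
```

Keeping " 123 " or "1E2" character for character therefore requires a string-based representation, which is what preserve.LexicalValue is meant to provide.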

Thank you,

Takuki Kamiya
Fujitsu Laboratories of America


> ________________________________
> From: Takuki Kamiya [tkamiya@us.fujitsu.com]
> Sent: Wednesday, October 14, 2015 2:40 AM
> To: public-exi@w3.org
> Subject: Whitespace handling in TTFMS
>
> Hi,
>
> A picture depicting the whitespace preservation rule currently implemented
> in TTFMS for comparing the original document with the EXI-encoded document
> can be seen at [1].
>
> First of all, xml:space="preserve" is respected when it is in effect
> in the document whether it is schema-informed or schema-less.
> This means, all whitespaces are preserved.
>
> When the current xml:space is *not* "preserve", the following rules apply.
>
> If it is schema-informed:
>
>  - For simple data (data between s+e i.e. start-tag followed by end-tag),
>    apply lexical rule. We should use whiteSpace facet for this purpose.
>
>  - For complex data (data between s+s, e+s, e+e), whitespaces nodes (i.e.
>    strings that consist solely of whitespaces) are removed.
>
> If it is schema-less:
>
>  - Simple data (data between s+e) are all preserved.
>
>  - For complex data, it is same as schema-informed case.
>
> We could use similar rules for defining how whitespace in the input infoset
> is treated.
>
> There is an issue when the encoder uses a schema-informed strict grammar
> and xml:space is "preserve". For example, " 123 " typed as xsd:int cannot
> preserve the leading and trailing whitespace when the typed datatype
> representation is used.
>
> [1] https://www.w3.org/XML/EXI/wiki/File:WhiteSpace_handling_in_TTFMS.jpeg
>
> Takuki Kamiya
> Fujitsu Laboratories of America
>
>

Received on Wednesday, 25 November 2015 12:36:29 UTC