RE: EXI WG's inquiry about ISSUE-2050 from Takuki Kamiya on 2012-05-23 (www-svg@w3.org from May 2012)

From: Takuki Kamiya <tkamiya@us.fujitsu.com>
Date: Tue, 22 May 2012 17:21:30 -0700
To: Robin Berjon <robin@berjon.com>
CC: SVG public list <www-svg@w3.org>, "member-exi-wg@w3.org" <member-exi-wg@w3.org>
Message-ID: <23204FACB677D84EBD57175AB7B5A71C011860A641DA@FMSAMAIL.fmsa.local>

Hi Robin,

Rigor does not end with the content model alone, as you indicated
in your message. A good example is EXI's header options schema. It is
deliberately designed to take the frequency of elements into account, which
permits an encoding that can compete well with hand-optimized formats.

On the other hand, too much rigor often leads to a huge grammar set that
would not fit into small devices, especially when there are myriad elements
and attributes involved.

EXI is equipped with "built-in element grammars" that recalibrate themselves
as they encounter new elements and attributes. The use of built-in element
grammars instead of schema-informed grammars, combined with string tables
prepopulated with known SVG element and attribute names, may turn out to work
reasonably well. Seldom-used elements/attributes would not cost any bits this
way unless they actually occur in the document. We can incrementally add
more rigorous definitions as we find empirically that doing so would greatly
improve compactness without sacrificing memory space.
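
A toy sketch of the "recalibrating" behaviour described above, assuming a
much simplified model: a name seen for the first time goes through a
wildcard production, and is then given the shortest code for later
occurrences. The class and method names are hypothetical and this is not
the EXI algorithm itself, only an illustration of the idea.

```python
class BuiltInGrammar:
    """Toy learning grammar: names cost nothing until they actually occur."""

    def __init__(self):
        # Learned names, most recently learned first (index 0 = shortest code).
        self.learned = []

    def encode(self, name):
        if name in self.learned:
            # Already learned: emit only its small index.
            return ("learned", self.learned.index(name))
        # First occurrence: fall back to the wildcard and remember the name.
        self.learned.insert(0, name)
        return ("wildcard", name)


g = BuiltInGrammar()
for n in ["rect", "title", "rect", "title"]:
    print(n, g.encode(n))
```

In this model an element such as <title> that never appears never takes a
slot in the grammar, which is the property that matters for small devices.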

taki


-----Original Message-----
From: Robin Berjon [mailto:robin@berjon.com] 
Sent: Monday, May 21, 2012 5:04 AM
To: Takuki Kamiya
Cc: SVG public list; member-exi-wg@w3.org
Subject: Re: EXI WG's inquiry about ISSUE-2050

Hi Takuki,

On May 16, 2012, at 01:44 , Takuki Kamiya wrote:
> The rule of thumb in the better design of schemas for EXI is 
> that the more rigorous the schema is the more compactness 
> you can achieve out of EXI.

True, but "rigorous" can be hard to define :)

> Also, I am interested in knowing the aspects of the
> RELAX NG schema SVG is exercising that are not supported
> by XSD 1.0, and the rationale that led SVG 1.x to depend 
> on them. This is to see if it is totally out of whack to
> apply XML Schema, or is manageable.

It's been a loooooong time, so I do not claim that this description here is correct - it is based largely on old memories.

One thing that XML Schema could not capture but RNG could was the <a> element's content model. Essentially, wherever <a> is allowed, it is allowed to contain anything that is allowable at the same time as itself, minus itself (at any level down). To give an example, assume that <foo> can contain <a>, <foo>, and <bar>. An <a> inside a <foo> can therefore contain <foo> and <bar>, and if there is a <foo> inside an <a> it can only contain <foo> and <bar> as well (recursively). That's impossible to express in XSD 1.0 (to be fair, I'm not sure I understand how we captured that in RNG - it broke some brains).
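
The "minus itself (at any level down)" part of that rule is easy to check
procedurally even though it is hard to state in a grammar. A minimal
sketch, assuming the hypothetical <foo>/<bar>/<a> vocabulary from the
example above and ignoring namespaces (real SVG tag names would carry the
SVG namespace); the function name is made up for illustration:

```python
import xml.etree.ElementTree as ET


def nested_anchor_paths(elem, inside_a=False, path=""):
    """Yield the path of every <a> that has an <a> ancestor."""
    here = f"{path}/{elem.tag}"
    if elem.tag == "a" and inside_a:
        yield here
    for child in elem:
        # Once we are below an <a>, that fact holds for the whole subtree.
        yield from nested_anchor_paths(child, inside_a or elem.tag == "a", here)


ok = ET.fromstring("<foo><a><foo><bar/></foo></a></foo>")
bad = ET.fromstring("<foo><a><foo><a/></foo></a></foo>")
print(list(nested_anchor_paths(ok)))
print(list(nested_anchor_paths(bad)))
```

The check has to carry an "inside <a>" flag down through every element,
which is exactly the kind of context a per-element XSD 1.0 type cannot see.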

XSD couldn't capture context-dependent constraints. For instance, at least in Tiny, the root <svg> element was allowed some attributes that were forbidden if <svg> appeared inside the document.

I believe that co-occurrence constraints were also part of the picture, with some content models depending on attributes present elsewhere.

I also recall having UPA problems all over when building an XML Schema for SVG. The only solution was to make the schema more permissive than it needed to be.

Note that this is experience from a while back. It's not impossible that in the meantime XML Schema 1.1 may have addressed a number of these issues. Also, the lack of interoperability in XML Schema processors did exclude some more creative constructs that we looked at (which ones, I don't recall). That's something that ought to be a lot better today.

Overall, I don't think that you will have major problems producing an XML Schema for SVG, but the result will be rather loose. SVG is an authoring syntax and a lot of stuff is optional in a lot of places. My experience with binarising SVG is that you gain most from custom codecs (or by changing the syntax, which is essentially the same) and less than you'd hope from the structural redundancy.

One thing that's worth testing on real-world data is to exclude rare elements and attributes completely from your schema. For instance, <title> and <metadata> are acceptable in lots of places, but rarely used; same for attributes like contentScriptType and a *lot* of properties. You quickly end up in a situation where you have to encode a lot of 0 bits with each element. If you ditch those elements completely and encode in fault-tolerant mode (so that they aren't lost), the odds are you win overall (possibly by a lot).
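
A back-of-the-envelope sketch of that trade-off, assuming that an event
code choosing among m alternatives costs ceil(log2(m)) bits, and that a
rare item encoded through the fallback path costs a flat extra penalty.
Every number below is hypothetical and the function names are made up;
only measurements on real documents can say where the break-even lies.

```python
def event_code_bits(alternatives: int) -> int:
    """Bits for an event code when `alternatives` productions compete."""
    return (alternatives - 1).bit_length()  # equals ceil(log2(alternatives))


def total_bits(events: int, alternatives: int,
               rare_events: int = 0, fallback_penalty: int = 0) -> int:
    """Event-code bits for a document, plus a flat penalty per rare item."""
    return events * event_code_bits(alternatives) + rare_events * fallback_penalty


# All numbers below are hypothetical, not measured from any SVG schema.
EVENTS = 10_000   # element/attribute events in the document
RARE = 100        # events that use a rarely seen name
full = total_bits(EVENTS, alternatives=40)
pruned = total_bits(EVENTS, alternatives=6, rare_events=RARE, fallback_penalty=64)
print(full, pruned)
```

With these made-up figures the pruned schema wins even though each rare
item is far more expensive, because the saving applies to every event.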

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
Received on Wednesday, 23 May 2012 00:22:09 UTC