Re: Request for Comments: Last Call WD of Widgets 1.0: Packaging & Configuration spec; deadline 31 Jan 2009

Hi Boris, 
On 1/23/09 2:25 PM, "Boris Zbarsky" <bzbarsky@MIT.EDU> wrote:
> Marcos Caceres wrote:
>> Ok. I'll need to run this by the working group as I had something like
>> this in very early drafts of the spec and received criticism for being
>> overly prescriptive (It could have been that I wrote the text
>> incorrectly). Can you please suggest some text that we could use?
> 
>    "There may be implementation-specific limits on the range of
>     integers allowed, and behavior outside such limits is undefined."
> 
> is one option.

Sounds good. Added that to the spec as a note in the relevant section.

You should probably tell the HTML5 guys about this too, as it may be
relevant there. 
 
>>> That really depends on what the goal is.  What _is_ the goal?
>> 
>> The goals are as follows:
>>   1. Widget engines optionally support SVG Tiny for the icon format
>> (though they can have the capability to render full SVG).
>>   2. For the purpose of widgets, icons are written by authors to
>> conform to SVG Tiny (not full)
>>   3. Widget engines that support full, can render icons in SVG Tiny...
>> but, for interop, widget engines should not render icons written in
>> SVG Full 1.1 (unless the icon also conforms to SVG Tiny).
> 
> Goal 3 is what my original comment was about, basically.  It means that
> a widget engine cannot make use of various existing SVG implementations
> that just happen to support more than just SVG Tiny.  In particular, it
> means that Gecko, say, would not be able to implement this specification
> without sprinkling code all over to validate SVG files against SVG Tiny
> (something we don't plan to do, since that's not the profile we're
> implementing).
> 
> I understand where you're coming from with this goal, but I'm not sure
> it's worth the restriction it imposes.
> 
> Things will get even worse once SVG Tiny 1.2 is a REC, since at that
> point I fully expect pretty much all SVG engines supporting SVG Tiny to
> implement that specification, and at that point there will be no SVG
> engines that can be used for Widgets at all (since all of them will
> render things that are not valid SVG Tiny 1.1).
> 
> So unless you really mean to exclude SVG engines that happen to
> implement SVG Full 1.1, SVG Tiny 1.2, SVG Basic 1.1 from being used in
> widget implementations (possibly forcing the widget UA to ship two
> separate SVG engines, one for widgets and one for everything else it's
> doing), I think you should drop goal 3 and leave the authoring requirement.
> 
> That is, just have image/svg+xml work the same way in Widgets as it does
> over HTTP, with the authoring requirement, presumably enforced by
> validators of widgets but not widget UAs, that the images conform to SVG
> Tiny (1.1 or any version; up to you).

Ok, that sounds like a completely reasonable proposal. And you are right, I
had thought about this in totally the wrong way. I did as you suggested:
  * widget engines may now support SVG 1.1.
  * authors, however, should try to conform to SVG Tiny 1.2.
  * conformance checkers should warn authors when their icons don't conform
to SVG Tiny 1.2.

>>> Sure.  I have no real opinions on the form this would take, to be honest.
>> 
>> Just to be clear, do you feel strongly that this should be a feature
>> in Widgets 1.0?
> 
> I'd think so, yes.  That would make it much easier to migrate existing
> web content into widgets as needed...

Ok. I have asked the chair to add this to this week's teleconf agenda so the
group can make a final decision about this. I'm unsure whether I _personally_
want to support this for version 1.0, but will certainly want to see this in
version 2.0. Other members of the working group might want this, but none
who are actually implementing have asked for it so far. My concern is that
we will enter a world of hurt if we introduce this feature because we will
have to deal with all sorts of additional MIME related issues. However, if
it turns out that just defining a manifest format as I proposed in the
previous email is all that is needed, then this could be doable for 1.0.
 
>>> A number of them presumably do sniffing by extension.  Gecko certainly does
>>> for its jar: handling.  This specification explicitly prohibits that,
>>> though.
>> 
>> Sorry, I don't understand - we make file extension to MIME mapping a
>> priority over sniffing: Step 1  of section "Rules for Identifying the
>> MIME Type of a file" reads as follows:
>> 
>> "1. If the file entry has a file extension, attempt to match the file
>> extension to one in the first column in the file identification table.
>> If there is a match, then return the MIME Type value. "
> 
> When I say "sniffing by extension" I don't mean that table.  I mean
> looking for an extension-to-type mapping anywhere it can be found.  For
> example, Gecko will look in some built-in tables it has, in user
> preferences, in the list of past files the user has opened in helper
> applications, in the extension lists that NPAPI plug-ins install, and in
> the OS-wide extension registry.  This is the sort of sniffing that you
> presumably do not want, since it leads to poor interoperability (e.g.
> the results depend on the user's OS configuration and the filenames of
> past files the user has opened in helper applications).

Correct. So what is wrong with limiting sniffing to the table in the spec?
Or to the content-sniffing internet draft I pointed you to earlier?... I'm
not sure I'm understanding what you want me to specify here.
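
A rough sketch of what extension matching limited to the spec's table could
look like, in Python. The table entries below are only an illustrative
subset, the function and variable names are hypothetical, and matching the
extension case-insensitively is an assumption rather than something the
quoted step states. The point is that the result depends only on the entry
name and the fixed table, never on OS or user configuration:

```python
import posixpath
from typing import Optional

# Illustrative subset only; the authoritative file identification table
# is the one in the spec.
FILE_IDENTIFICATION_TABLE = {
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png",
    ".svg": "image/svg+xml",
}


def mime_type_from_extension(entry_name: str) -> Optional[str]:
    """Return the MIME type for a zip entry name, or None if no match.

    Only the fixed table above is consulted. Lower-casing the extension
    is an assumption made for this sketch.
    """
    _, extension = posixpath.splitext(entry_name)
    return FILE_IDENTIFICATION_TABLE.get(extension.lower())
```

When the function returns None, the caller would fall through to whatever
the later steps of the rules say (e.g., content sniffing).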
 
>>> That seems truly unfortunate, especially since from what I can tell ZIP
>>> libraries _also_ do whatever they want with character encodings.  If people
>>> are going to be forced to write their ZIP decompressors from scratch to
>>> implement this specification, what exactly are the benefits of using ZIP at
>>> all?
>> 
>> I guess the thing would be to lobby Microsoft, Apple, and others to
>> change/update their Zip implementations.
> 
> I'm not sure how that would help, since presumably widget UAs want to
> link to their own ZIP libraries to perform the various validation that
> the spec requires, as well as to allow in-memory operation as needed....

Oh ok, I was thinking about it from the developer's point of view: I imagined
most people would develop widgets locally, say, using Dreamweaver, and then
they would Zip the files up using whatever zipping tool is provided by the
OS. 

>   Certainly if Gecko were implementing this specification that's what we
> would do.  We wouldn't want to depend on whatever happens (or not) to be
> installed on the operating system.

Understood. However, wouldn't you have to deal with the fact that
non-conforming zip implementations are used to create the widgets in the
first place?

I'm unsure how to proceed with this issue.
 
>> The other thing is that this will only be a
>> problem in some small segments of the market. Most people will only
>> write widgets in one language and distribute it amongst people who use
>> the same character encoding on their systems.
> 
> Do we have any data to support this supposition?  That's certainly how
> things work with web pages, and in small market segments like Western
> Europe there are multiple encodings in common use (ISO-8859-1 and
> UTF-8).  

No, not directly. I only have anecdotal evidence: a podcast from the Harvard
Business Review about globalization and the internet, but I don't have a
pointer. In that podcast, some research was presented that indicated that
only 15% of internet traffic actually leaves the boundaries of a country, and
that this share is decreasing. That means that 85% or more of all
communication would, in theory, be done using the same language and, by
extension, the same character encoding. So, for example, this would mean that
widgets created on Windows using Big-5 would be primarily shared with people
using only Big-5.
So yes, 15% of people would still be affected, which may or may not be
significant in some markets.

> Not only that, but on Mac the default filesystem encoding is
> UTF-8, while on Windows that's not the case last I checked (and the
> situation is actually rather complicated in terms of what the default
> is, as I recall).

I reached similar conclusions through my own testing/research [1]. Note that
on Mac it is apparently some proprietary variant of UTF-8 in fully
decomposed canonical form. I'm not sure what different flavors of Linux use,
but again: things seem bad on the file name encoding front. In essence, you
can't reliably share Zip files across OSes if they contain file names with
characters outside the ASCII range.
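
To make the failure mode concrete, here is a small Python sketch. It assumes
one tool stores the file name as UTF-8 bytes and another decodes those bytes
as ISO-8859-1 (the choice of encodings and the helper name are only for
illustration), and it uses standard NFD normalization as a stand-in for the
decomposed form mentioned above, which may not match the Mac variant exactly:

```python
import unicodedata


def decode_entry_name(raw: bytes, encoding: str) -> str:
    """Decode a zip entry's raw file-name bytes using the given encoding."""
    return raw.decode(encoding)


# A file name as an author might type it, in precomposed (NFC) form.
name = "caf\u00e9.html"

# One packaging tool stores the name as UTF-8 bytes...
stored = name.encode("utf-8")

# ...and the reading side has to guess how to decode them.
as_utf8 = decode_entry_name(stored, "utf-8")          # round-trips
as_latin1 = decode_entry_name(stored, "iso-8859-1")   # wrong name, no error

# Decomposed form: "e" followed by U+0301 instead of a single U+00E9.
decomposed = unicodedata.normalize("NFD", name)
same_bytes = decomposed.encode("utf-8") == stored     # False
```

Note that the wrong guess does not raise an error; it silently yields a
different file name, so references to the file from other parts of the
package would no longer match.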
 
>> This would mirror today's reality I guess.
> 
> I'm not sure what this is referring to.  Are there particular widget UAs
> out there now that behave in this way (basically just copying bytes and
> then treating them in some "native" charset)?

I don't know. I need to test this. My suspicion is that this will hold for
any UA that decompresses widgets to the hard disk (e.g., Yahoo!'s widget
engine and Vista sidebar gadgets).

By "reality" I meant the reality about zip implementations - i.e., no
respect for encodings.
 
>> And as you said, it does open an opportunity
>> for a vendor to create conforming packaging tools.
> 
> I guess it's not clear to me why we think that adding work for everyone
> in this regard is worth it. What benefits do we gain, precisely from
> using ZIP instead of, say, the MHTML format that was recently suggested?

MHTML *may* be technically superior and architecturally better, but
there is more tool support for Zip than MHTML. AFAIK, MHTML packaging tools
do not ship with any operating system. Zipping tools do.

I don't have any statistics, but I assume Zip is used around the world - I
mean the fact that it is a standard tool on all OSes has to mean something
significant. Also, Mozilla uses it to ship add-ons, right? What, if any,
problems have you guys experienced wrt zip in internationalized contexts?
Do developers simply say, "crap! can only use ASCII. But, c'est la vie" or
do they use non-ASCII characters anyway and suffer the consequences? As I'm
not affiliated with any company that produces widgets, it is difficult for
me to gather that data. I rely on those who already have engines on
the market to provide me with implementation experience about how they deal
with such problems.

Again, I'm not sure how to proceed.
 
>>>> 3. In result, convert any sequence of one or more U+000A LINE FEED
>>>> (LF) or U+000D CARRIAGE RETURN (CR) or U+0009 CHARACTER TABULATION
>>>> (tab) character into a single U+0020 SPACE.
>>> You probably want to include U+0020 SPACE in your list of things which are
>>> to be collapsed.  That said, why not just use the existing "space
>>> characters" that's already defined in this spec?
>> 
>> I guess I wrote it that way so single spaces don't get replaced with
>> single spaces. However, you raise a good point (that there will be
>> sequences of two or more space characters after the substitutions in
>> step 3 above have taken place). I added the following as step 4 "In
>> result, convert any sequence of two or more U+0020 SPACE characters
>> into a single U+0020 SPACE."
> 
> Sure, but that still leaves the other whitespace characters (vertical
> tab, form feed, etc, etc) not being collapsed.  Is that really desired
> (and if so, why?), or is it just an oversight?

Oversight. I modified the text so it now reads:

"In result, excluding any U+0020 SPACE characters, convert any sequence of
one or more characters marked with the [Unicode] property "White_Space" into
a single U+0020 SPACE."

The next step collapses sequences of two or more U+0020 SPACE characters into
a single U+0020 SPACE.
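
As a rough illustration, the two steps could be sketched in Python as
follows. The function name is hypothetical, and the character class is
written out by hand from the code points that carry the White_Space property
in current Unicode data (an assumption; the set has changed slightly between
Unicode versions), since Python's built-in `\s` does not match that property
exactly:

```python
import re

# Code points with the Unicode White_Space property, excluding U+0020 SPACE
# (listed by hand for this sketch; check against the Unicode version in use).
_WHITE_SPACE_EXCEPT_SPACE = (
    "\u0009-\u000D\u0085\u00A0\u1680\u2000-\u200A"
    "\u2028\u2029\u202F\u205F\u3000"
)
_WHITE_SPACE_RUN = re.compile("[" + _WHITE_SPACE_EXCEPT_SPACE + "]+")
_SPACE_RUN = re.compile(" {2,}")


def normalize_whitespace(result: str) -> str:
    """Apply the two whitespace steps in order."""
    # First step: each run of non-SPACE White_Space characters becomes
    # a single U+0020 SPACE.
    result = _WHITE_SPACE_RUN.sub(" ", result)
    # Next step: each run of two or more U+0020 SPACE characters becomes
    # a single U+0020 SPACE.
    return _SPACE_RUN.sub(" ", result)
```

The second step is what catches cases like a tab followed by a space, which
the first step alone would leave as two consecutive spaces.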

Is that any better?

Kind regards,
Marcos

[1] http://datadriven.com.au/2008/12/08/zip-files-and-encoding-i-hate-you/

Received on Wednesday, 28 January 2009 12:49:00 UTC