Re: Request for Comments: Last Call WD of Widgets 1.0: Packaging & Configuration spec; deadline 31 Jan 2009

Hi Boris,
On Fri, Jan 9, 2009 at 10:20 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:
>
> A few comments:
>
> 1)  In section 7.3, boolean attributes are defined to use case-insensitive
> matching.  Why is that?  There doesn't seem to be a definition of
> case-insensitive here, which worries me, since case-folding is always tricky
> business (see below).  I would suggest requiring a case-sensitive match to
> "true" or "false" here.
>

Yikes! I thought I had fixed that! Thanks for pointing that out.

> 2)  Section 8.2, step 2, second list, item 8 has a similar issue for
> filenames.  For example, consider the following pairs of filenames:
>
>  a) "i" and "I"
>  b) "i" and "ı"
>  c) "i" and "İ"
>  d) "ı" and "I"
>  e) "ı" and "İ"
>  f) "I" and "İ"
>
> Here 'i' is U+0069, 'I' is U+0049, 'ı' is U+0131 and 'İ' is U+0130.
>
> Which of these pairs should be considered to "upon normalization, case
> insensitively match"?  Seems like (a) should, (c) should, (d) should, right?
>  But (b) and (e), and (f) maybe should not?  That means the matching
> relation is non-transitive, of course.  Or should these all match?  Or
> something else?
>
> I'm not sure what the reason for this case-insensitive check is exactly; if
> there's a strong reason for it it needs to be defined.  Otherwise it needs
> to be removed.

OK, I've removed it. This may cause implementations to overwrite files
on systems that don't support case-sensitive file names. This should
not be a real problem, as most file systems won't let you create files
with the same name but different cases. And, on Windows at least, if
you try to add a file to a zip archive that already contains a file
with the same name (regardless of case), it will ask whether you want
to overwrite the file.
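
To make the case-folding concern concrete, here is a small Python
sketch (Python is only my choice for illustration, and the names
CHARS, FOLDS, matches_under and report are made up for this example).
It compares the four characters from the question under three common
notions of "case-insensitive" and collects the pairs that match under
each one.

```python
# Sketch: which pairs "match" depends on the folding operation chosen.
from itertools import combinations

# The four characters from the example, keyed by a readable label.
CHARS = {
    "i": "\u0069",
    "I": "\u0049",
    "dotless i": "\u0131",
    "dotted I": "\u0130",
}

# Three candidate definitions of case-insensitive comparison.
FOLDS = {"lower": str.lower, "upper": str.upper, "casefold": str.casefold}


def matches_under(fold, a, b):
    """Return True if a and b are equal after applying fold to both."""
    return fold(a) == fold(b)


def report():
    """Map each fold name to the set of label pairs that match under it."""
    result = {}
    for fold_name, fold in FOLDS.items():
        result[fold_name] = {
            (x, y)
            for (x, cx), (y, cy) in combinations(CHARS.items(), 2)
            if matches_under(fold, cx, cy)
        }
    return result


if __name__ == "__main__":
    for name, pairs in report().items():
        print(name, sorted(pairs))
```

With Python's locale-independent mappings, only upper-casing makes the
dotless i match "i" and "I", and the dotted capital I matches nothing
under any of the three. A locale-aware (e.g. Turkish) comparison would
give yet another answer, which is why the spec would have had to pin
down an exact definition.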

> 3)  When parsing a non-negative integer (Section 8.2, step 8), what's the
> expected behavior for integers larger than 2^32?  2^64?  Are implementations
> of this specification required to do integer arithmetic on arbitrarily large
> integers?  If not, is the behavior just implementation-dependent?

I think that is an implementation detail.

> 4)  Section 8.2, step 8, it would be good to make sure that the image
> identification table matches the one in HTML5 (possibly by having both
> specifications refer to a single table, if that's workable).

The tables match (because I ripped the values straight from HTML5).
Hixie and A. Barth are working on a separate Internet-Draft for
sniffing [1]. We will probably end up referencing that.

> 5)  Section 8.2, step 8, I'm not sure why image/svg+xml is required to be
> processed according to SVGTiny.  This means that an SVG 1.1 or SVG 1.2 Full
> (whenever that happens) user-agent cannot implement this specification, as
> far as I can see.

Hmmm... that's not what I meant. Is SVGTiny a subset of 1.1 or 1.2?
How do you recommend we proceed here?

> 6)  Section 6.2 talks about using file extensions followed by content-type
> sniffing to determine MIME types.  This sounds to me like the exact process
> is up to the UA.  Then Section 8.2, step 8, has specific lists of extensions
> and magic numbers that UAs need to recognize.  Is the sniffing allowed in
> Section 6.2 required to be a superset of what Section 8.2 allows?  If so,
> this should be made clearer.

Understood. I added the following text:
"For sniffing the content type of image formats supported by this
specification, a widget user agent must use the Rules for Identifying
the MIME Type of an Image. For other file formats supported by the
specification, a widget user agent must use the Rules for Identifying
the MIME Type of a File."
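
As a rough illustration of what signature-based image sniffing looks
like, here is a Python sketch. The byte signatures are the widely
documented magic numbers for GIF, PNG and JPEG; they are my own
stand-ins and are not copied from the spec's Image Identification
Table. The names IMAGE_SIGNATURES and sniff_image_type are
hypothetical.

```python
# Sketch: identify an image type from the leading bytes of a file.
# Assumption: these well-known signatures stand in for the spec's table.
IMAGE_SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_image_type(data):
    """Return the MIME type whose signature prefixes data, or None."""
    for signature, mime_type in IMAGE_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None  # not a recognised image; fall back to the generic rules
```

Because the table is a fixed list, two user agents that use the same
table will reach the same answer for the same bytes, which is the
property the comment below is asking for.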

> If more sniffing is allowed than what's listed
> in 8.2, this can lead to security problems where two UAs (say a security
> checker and a web browser) treat the same file in a widget as having
> different types.  This is the sort of situation that HTML5 is trying very
> hard to avoid with its sniffing algorithm.  I feel that all sniffing that
> UAs are allowed to perform must be explicitly listed in the specification.
> If that means that not all files can have MIME types deduced, then an
> alternate mechanism needs to be provided to indicate MIME types for files.

We might need a manifest format... something like:
   <manifest>
      <resource type="some/type" src="/path/to/file" />
   </manifest>

Or, better still...

<mediatypes>
   <type name="some/type" extension="gif"/>
</mediatypes>

Or a mix of both solutions.

We had thought about deferring that feature to version 2.0 (no widget
engine on the market has required such a manifest thus far, because
they all seem to rely on sniffing).
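
For what it's worth, the second form would be cheap to consume. A
minimal Python sketch, assuming the purely hypothetical <mediatypes>
markup above (nothing here is specified anywhere, and
load_media_types is a made-up name):

```python
# Sketch: turn the hypothetical <mediatypes> element into a lookup table.
import xml.etree.ElementTree as ET


def load_media_types(xml_text):
    """Return a dict mapping file extension to declared MIME type."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for type_el in root.findall("type"):
        name = type_el.get("name")
        extension = type_el.get("extension")
        # Skip entries missing either attribute rather than guessing.
        if name and extension:
            mapping[extension] = name  # exact, case-sensitive key
    return mapping


if __name__ == "__main__":
    example = '<mediatypes><type name="some/type" extension="gif"/></mediatypes>'
    print(load_media_types(example))
```

The lookup is deliberately case-sensitive, to stay clear of the
case-folding problem raised in point 2.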

> 7)  It's not clear to me why Section 5.3 allows encoding of filenames using
> [CP437].  Why not just require UTF-8?

Because the Zip spec mandates CP437 unless the implementation supports
version 6.3 or above of the Zip spec. Sadly, most Zip implementations
do whatever they want when it comes to character encoding. This is
probably the biggest barrier to interoperability of packaging.
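
To show where the encoding decision lives in an archive, here is a
Python sketch using the standard zipfile module. It relies on general
purpose bit 11 (0x800), which the Zip spec uses from version 6.3 on to
flag a UTF-8 file name; without that flag the name is to be read as
CP437. The helper names filename_encoding and demo are made up for
this example.

```python
# Sketch: inspect the per-entry flag that says how a file name is encoded.
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name and comment are UTF-8


def filename_encoding(info):
    """Return the encoding the Zip spec implies for this entry's name."""
    return "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"


def demo():
    """Write two entries to an in-memory archive and report their encodings."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", b"a")        # ASCII-only name
        archive.writestr("caf\u00e9.txt", b"b")    # needs a non-ASCII byte
    with zipfile.ZipFile(buffer) as archive:
        return {i.filename: filename_encoding(i) for i in archive.infolist()}


if __name__ == "__main__":
    print(demo())
```

Python's writer sets the flag only when the name is not plain ASCII.
Archives produced by tools that ignore the flag are exactly the
interoperability problem described above.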

> 8)  The algorithm for getting text content in Section 8.2, step 2 doesn't
> look correct to me.  For example, consider an input element whose XML
> serialization looks like this:
>
>  <outer><inner1>First</inner1> <inner2>Second</inner2></outer>
>
> The text content of this input, according to the spec's algorithm, is
> the string "FirstSecond".  I would expect to get "First Second" as the text
> content in this case.  Is there a reason to not just use textContent here?
>  Note that even the example in the specification gets this wrong.  There the
> markup is:
>
>   <name>
>     The <blink>Awesome</blink>
>     <author email="dude@example.com">Super <blink>Dude</blink></author>
>     Widget</name>
>
> for which this algorithm gives "The AwesomeSuper Dude     Widget" and not
> what the spec claims (I have also removed the carriage returns for
> legibility).

True (see rewrite below).

> In the same algorithm, there's mention of "the input's text nodes". This
> relationship is not defined in this specification or elsewhere.  I assume
> you mean the text nodes which have input as their ancestor, right?

The "input" is the element being processed.

> In the same algorithm, rule 4 doesn't make sense to me.  What's "position"?
>  Is it a character, or an index?  Or something else?  If you mean to say
> that input's nodeValue is to be appended to result, just say that.
>
> In the informative section following this algorithm, there is mention of
>  "getTextContent() DOM3 Java interface", whatever that is.  I'm not sure why
> we need to drag Java into this.  If we want to say something about the
> node's DOM3 textContent property, we should just say that, in my opinion.
>  There's no language binding involved here; the property is defined in the
> relevant IDL and its definition is language-agnostic.
>

Agreed. OK, that section was totally screwed :) I've rewritten the
algorithm and added a new algorithm that normalizes the white space:

==Rules for Getting Text Content==
The rules for getting text content are given in the following
algorithm. The algorithm always returns a string, which may be empty.

1. Let input be the element to be processed.
2. If the widget user agent supports [ITS]: If the element has the dir
attribute from the [ITS] namespace with a valid its:dir value, then
process its text nodes in accordance with the [ITS] specification.
3. Return the value of the textContent [DOM3Core] property for input.

==Rules for Getting Text Content with Normalized White Space==

The rules for getting text content with normalized white space are
given in the following algorithm.

1. Let input be the element to be processed.
2. Let result be the result of applying the Rules for Getting Text
Content to input.
3. In result, convert any sequence of one or more U+000A LINE FEED
(LF), U+000D CARRIAGE RETURN (CR) or U+0009 CHARACTER TABULATION
(tab) characters into a single U+0020 SPACE.
4. If the first character in result is a U+0020 SPACE, remove the
first character.
5. If the last character in result is a U+0020 SPACE, remove the last
character.
6. Return result.

I've used the Rules for Getting Text Content with Normalized White
Space for elements where space characters are unwanted (i.e., name and
author).
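
The two sets of rules can be sketched in Python roughly as follows.
This is only an illustration: it uses ElementTree's itertext() as a
stand-in for the DOM3 textContent property, it ignores the optional
[ITS] step, and the function names are made up.

```python
# Sketch of the two text-content algorithms, following the steps literally.
import re
import xml.etree.ElementTree as ET


def get_text_content(element):
    """Concatenate all descendant text in document order (may be empty)."""
    return "".join(element.itertext())


def get_normalized_text_content(element):
    """Apply steps 2-6 of the normalized-white-space rules."""
    result = get_text_content(element)            # step 2
    result = re.sub(r"[\n\r\t]+", " ", result)    # step 3: LF/CR/tab runs
    if result.startswith(" "):                    # step 4
        result = result[1:]
    if result.endswith(" "):                      # step 5
        result = result[:-1]
    return result                                 # step 6


if __name__ == "__main__":
    outer = ET.fromstring(
        "<outer><inner1>First</inner1> <inner2>Second</inner2></outer>"
    )
    print(repr(get_text_content(outer)))
```

For the <outer> example from the comment, get_text_content gives
"First Second". Note that the steps as written leave existing runs of
U+0020 alone: only LF, CR and tab are converted, and at most one space
is trimmed from each end.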

> 9) In the "Rules for Removing Whitespace" section in Section 8.2, Step 2
> have the following language:
>
>   While position doesn't point past the end of input and the
>   character at position is not one of the space characters,
>   append character to the end of result and let position become
>   the character in input.
>
> Here "character" is a Unicode character the first and second time it's
> mentioned, and seems to be an integer the third time?  Or something?  If
> you're trying to say that the position should move to the next character in
> input, say that, please.
>

OK, it turns out that the Rules for Removing White Space are not
actually needed anywhere (and would have caused problems, because
"10 00 11" would have been interpreted as "100011" instead of an
error). I rewrote the Rules for Parsing a Non-Negative Integer to skip
space characters instead (as it should have been in the first place,
and as is defined in HTML5). Skipping white space is done as follows
(note that "space characters" is defined elsewhere in the spec):

1. Let input be the string being parsed.
2. Let result have the value 0.
3. If the length of input is 0, return an error.
4. Let position be a pointer into input, initially pointing at the
start of the string.
5. Let nextchar be the character in input at position.
6. If the nextchar is one of the space characters, increment position.
If position is past the end of input, return an error. Otherwise, go
to step 5 in this algorithm.
...algorithm continues as was already defined.
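
A Python sketch of the steps above, with two loud assumptions: the
set of space characters is taken to be HTML5's (space, tab, LF, FF,
CR), and the elided remainder of the algorithm is guessed to be "read
ASCII digits and treat anything else as an error", which is what makes
"10 00 11" fail. Neither is quoted from the spec, and the names
ParseError and parse_non_negative_integer are made up.

```python
# Sketch: skip leading space characters, then parse decimal digits.
SPACE_CHARACTERS = " \t\n\x0c\r"  # assumption: same set as HTML5


class ParseError(ValueError):
    """Raised where the algorithm says 'return an error'."""


def parse_non_negative_integer(input_string):
    result = 0                                        # step 2
    if len(input_string) == 0:                        # step 3
        raise ParseError("empty input")
    position = 0                                      # step 4
    # Steps 5-6: advance past space characters, failing if none remain.
    while input_string[position] in SPACE_CHARACTERS:
        position += 1
        if position >= len(input_string):
            raise ParseError("input contains only space characters")
    # Assumed continuation: every remaining character must be a digit.
    while position < len(input_string):
        nextchar = input_string[position]
        if nextchar not in "0123456789":
            raise ParseError("unexpected character %r" % nextchar)
        result = result * 10 + int(nextchar)
        position += 1
    return result


if __name__ == "__main__":
    print(parse_non_negative_integer("  42"))
```

Python integers are arbitrary precision, so this sketch never
overflows. An implementation with fixed-width integers would have to
decide what to do past 2^32 or 2^64, which is the implementation
detail from point 3.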

> 10) Is there a reason to not have any JPEG images in the Image
> Identification Table in Section 8.2, Step 2?  I would have thought widgets
> might wish to include such images.
>

Added.

Kind regards,
Marcos

[1] http://webblaze.cs.berkeley.edu/2009/mime-sniff/mime-sniff.txt
-- 
Marcos Caceres
http://datadriven.com.au

Received on Thursday, 22 January 2009 14:06:40 UTC