Re: Proposed update to pxp:unzip: charset attribute from Imsieke, Gerrit, le-tex on 2011-10-09 (xproc-dev@w3.org from October 2011)

From: Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de>
Date: Sun, 09 Oct 2011 11:31:52 +0200
To: xproc-dev@w3.org
Message-ID: <4E916A08.1000605@le-tex.de>

Since I asked for it, I of course support this proposal.

I just found this message (from 18 months ago) in which Vojtech 
considered the currently specified behavior a bug in the EXProc spec, 
advocating a p:data-like treatment of unzipped text content:
http://lists.w3.org/Archives/Public/xproc-dev/2010Apr/0095.html

The main drawback of p:data (http://www.w3.org/TR/xproc/#p.data) and its 
default result, c:data, is that the pipeline designer doesn’t have 
control over which charset is used. The XProc spec seems to expect that 
the input’s charset will somehow be known, for example announced by HTTP 
headers or detected by some implementation-dependent magic.

In your proposal, however, the pipeline developer will be able to 
specify the expected encoding. This is similar to the encoding argument 
of XSLT 2.0’s unparsed-text() function, and I think this is desirable.

In the absence of a user-supplied charset, Calabash might use what has 
been given by Java’s -Dfile.encoding option or whatever charset Java 
uses by locale or default.

For my intended use case, which is reading CSS files from an EPUB Zip 
archive, the situation is as follows:

Currently, I’ll get base64 output. I’d have to use my own decoder 
extension or use a commercial Saxon version in order to convert it to text.

I still wouldn’t know whether this text was UTF-8, ISO-8859-1 or whatever.
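
A minimal Python sketch of the situation described above, standing in for a decoder extension. The function name and the sample payload are made up for illustration; the point is that the caller still has to supply the charset, and a wrong guess may not even raise an error.

```python
import base64

def decode_base64_text(payload: str, charset: str) -> str:
    """Decode a base64 payload (as found in c:data) with a caller-supplied charset."""
    raw = base64.b64decode(payload)
    return raw.decode(charset)

# Hypothetical payload: a one-line style sheet stored as UTF-8, then base64-encoded.
payload = base64.b64encode('p { content: "é" }'.encode("utf-8")).decode("ascii")

print(decode_base64_text(payload, "utf-8"))       # decoded with the right charset
print(decode_base64_text(payload, "iso-8859-1"))  # no error, but the é is garbled
```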

In case of CSS, there may be a @charset rule at the beginning of the 
file. So a text parser might interpret the bytes of a purported text/ 
file as US-ASCII until the sequence @ c h a r s e t appears, and then 
interpret the whole text file according to what the @charset rule told 
it to do. So this might be a reasonable treatment for text/css. But the 
thing about @charset is: it doesn’t have to be present in a CSS file. 
Which leads us to the general case:
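
The @charset treatment described above could be sketched in Python as follows. This assumes that the rule, if present, sits at the very start of the file in the literal form `@charset "name";`, and that the encoding is ASCII-compatible (a UTF-16 file would not match). The function name is hypothetical.

```python
import re

# Assumed literal form at the very start of the byte stream: @charset "name";
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')

def sniff_css_charset(data: bytes):
    """Return the charset named by a leading @charset rule, or None if there is none."""
    match = _CHARSET_RULE.match(data)
    if match is None:
        return None
    # The name itself is read as US-ASCII, as in the paragraph above.
    return match.group(1).decode("ascii", errors="replace")

css = b'@charset "ISO-8859-1";\nbody { margin: 0 }'
name = sniff_css_charset(css)
text = css.decode(name) if name else None  # None: no rule, caller must decide
```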

For the generalized text file, I don’t know whether it should be up to 
the XProc processor or to the pipeline designer to guess the correct 
encoding. If the processor doesn’t use charset autodetection magic, the 
pipeline developer has to specify the expected encoding.

In the above-mentioned application (epubcheck implemented in XProc) this 
will amount to several nested try/catch clauses: first try to read the 
CSS as UTF-8 (which will work for US-ASCII, too) and, if that fails, as 
ISO-8859-1.
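
Outside XProc, the same try/catch idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline itself: the function name and the default candidate list are made up here.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (text, charset) for the first candidate that decodes without error."""
    for charset in candidates:
        try:
            return data.decode(charset), charset
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate charsets could decode the data")

print(decode_with_fallback(b"body { margin: 0 }"))   # plain ASCII passes as UTF-8
print(decode_with_fallback("é".encode("iso-8859-1")))  # invalid UTF-8, falls back
```

Note that ISO-8859-1 assigns a character to every byte value, so decoding with it never fails. It has to come last in the list, and a successful decode says nothing about whether the guess was right.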

Of course the processor might implement this try/catch approach in the 
background (if there’s no user-supplied charset attribute), thereby 
providing its own autodetection heuristics.

For the special case of text/css: An implementation may honor @charset 
rules in text/css files, but this should be overridden by an explicit 
charset attribute on pxp:unzip. (But if I were an implementor, I 
wouldn’t waste much time on honoring @charset rules in text/css.)

So for the few of you who have read this far, I’m suggesting the following:
- U+1F44D (thumbs up) for an optional charset attribute with pxp:unzip, 
just like Norm suggested
- otherwise: implementation-defined:
   Implementors should decide whether to use Java’s file encoding 
setting, to honor a BOM, to honor an @charset rule, to use heuristic 
detection, either by using a library or by trying a list of known 
charsets (US-ASCII, UTF-8, ISO-8859-1, ...) in turn.
   If “implementation-defined” is implemented well (good autodetection), 
more and more pipeline developers will omit the explicit charset 
declaration.
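
One possible order of precedence for the suggestions above, sketched in Python: an explicit charset wins, then a BOM, then trial decoding. The function name, the BOM table and the trial list are assumptions for illustration. UTF-32 BOMs and @charset rules are left out to keep the sketch short.

```python
import codecs

# (BOM bytes, charset) pairs checked in order; UTF-32 is deliberately omitted.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def choose_charset(data: bytes, explicit=None, trial=("utf-8", "iso-8859-1")):
    """Pick a charset: explicit value first, then a BOM, then trial decoding."""
    if explicit:
        return explicit  # a user-supplied charset overrides any detection
    for bom, charset in _BOMS:
        if data.startswith(bom):
            return charset
    for charset in trial:
        try:
            data.decode(charset)
            return charset
        except UnicodeDecodeError:
            pass  # not valid in this charset, try the next one
    return None  # nothing matched; the caller may fall back to base64
```

A caller would then decode with the returned name and strip a leading U+FEFF if a BOM was present.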

Good example of a seemingly simpler data format (plain text) being 
harder to read than XML.

Gerrit


On 2011-10-08 22:10, Norman Walsh wrote:
> Hi folks,
>
> I propose that we add a charset option to the pxp:unzip step. If the
> specified content type is not an XML content type but is a text
> content type (begins "text/") and a charset parameter is specified,
> then the result is a c:data element containing the characters of the
> extracted file.
>
> In other words, for text in a known charset, we decode the content
> instead of returning it as a base64 encoded chunk.
>
> Thoughts?
>
>                                          Be seeing you,
>                                            norm
>

-- 
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler

Received on Sunday, 9 October 2011 09:32:48 UTC