Re: Proposed update to pxp:unzip: charset attribute from Norman Walsh on 2011-10-11 (xproc-dev@w3.org from October 2011)

From: Norman Walsh <ndw@nwalsh.com>
Date: Tue, 11 Oct 2011 16:52:12 -0400
To: XProc Dev <xproc-dev@w3.org>
Message-ID: <m24nzfi743.fsf@nwalsh.com>

"Imsieke, Gerrit, le-tex" <gerrit.imsieke@le-tex.de> writes:
> As I’ve asked for it, I will of course support this proposal.
>
> I just found this evidence (from 18 months ago) in which Vojtech 
> considered the currently specified behavior as a bug in the EXProc spec, 
> advocating a p:data-like treatment of unzipped text content:
> http://lists.w3.org/Archives/Public/xproc-dev/2010Apr/0095.html

Ah, yes. That slipped off my radar, but I think Vojtech is entirely
correct. The trouble is that converting from text in the ZIP file to
Unicode in XML requires knowing the charset.

> The main difference between http://www.w3.org/TR/xproc/#p.data and its 
> default result c:data is that the pipeline designer doesn’t have control 
> over which charset should be used. The XProc spec seems to expect that 
> the input’s charset will somehow be known, for example announced by HTTP 
> headers or detected by some implementation-dependent magic.

HTTP makes the charset completely clear. Files off the filesystem
might make the charset clear, though in practice, I don't think they
do very often. Files in a ZIP are a whole other story, I think.

In any event, our current story about charset seems a bit scattered.

> In your proposal, however, the pipeline developer will be able to 
> specify the expected encoding. This is similar to xsl:unparsed-text, and 
> I think this is desirable.
>
> In the absence of a user-supplied charset, Calabash might use what has 
> been given by Java’s -Dfile.encoding option or whatever charset Java 
> uses by locale or default.

Yes. I think that rather than making a missing charset an error, we
should give implementations the freedom to guess. It'll all go
pear-shaped if they guess wrong, but realistically, if the attribute
is required, the pipeline author is just going to insert a guess to
work around that, and it'll go pear-shaped when they guess wrong too.
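
To make the "optional, with freedom to guess" behavior concrete, here
is a minimal Java sketch. It is not Calabash code: the class and method
names are hypothetical, and a null charsetName simply stands in for an
omitted charset attribute. When no charset is given it falls back to
the JVM default, which is one possible guess among many.

```java
import java.nio.charset.Charset;

class OptionalCharsetDecode {
    /**
     * Decodes bytes using the named charset if one was supplied.
     * A null charsetName (hypothetical stand-in for a missing charset
     * attribute) is not an error: the JVM default is used as a guess.
     */
    static String decode(byte[] bytes, String charsetName) {
        Charset cs = (charsetName != null)
                ? Charset.forName(charsetName)   // throws if the name is unknown
                : Charset.defaultCharset();      // the implementation's guess
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };         // U+00E9 encoded as ISO-8859-1
        String explicit = decode(latin1, "ISO-8859-1");
        String guessed = decode(latin1, null);
        // Whether the guess matches depends entirely on the platform default.
        System.out.println("guess matches explicit: " + explicit.equals(guessed));
    }
}
```

Note that new String(bytes, cs) silently substitutes U+FFFD for bytes
it cannot decode, so a wrong guess produces garbled text rather than
an exception.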

> For my intended use case, which is reading CSS files from an EPUB Zip 
> archive, the situation is as follows:

[ Useful summary of the sorts of problems that arise with encodings
  elided. ]

> So for the few of you that read until here, I’m suggesting the following:
> - U+1F44D (thumbs up) for an optional charset attribute with pxp:unzip, 
> just like Norm suggested
> - otherwise: implementation-defined:
>    Implementors should decide whether to use Java’s file encoding 
> setting, to honor a BOM, to honor an @charset rule, to use heuristic 
> detection, either by using a library or by trying a list of known 
> charsets (US-ASCII, UTF-8, ISO-8859-1, ...) in turn.
> If “implementation-defined” is implemented well (good autodetection), 
> more and more pipeline developers will omit explicit charset declaration.

Don't hold your breath:

http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream
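
For illustration only, part of the fallback chain in the quoted
proposal (honor a BOM, then try a list of known charsets in turn)
could be sketched in Java roughly as follows. The class and method
names are hypothetical, the CSS @charset check is left out, and the
UTF-32 BOMs are ignored, so treat it as a sketch of the idea rather
than a recommended detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetGuess {
    /** Returns the charset implied by a leading BOM, or null if there is none. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** True if the bytes decode in cs with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] b, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** BOM first, then the first candidate charset that decodes without error. */
    static Charset guess(byte[] b) {
        Charset bom = fromBom(b);
        if (bom != null) {
            return bom;
        }
        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1
        };
        for (Charset cs : candidates) {
            if (decodesCleanly(b, cs)) {
                return cs;
            }
        }
        // Not reached with the list above, since ISO-8859-1 accepts any byte.
        return Charset.defaultCharset();
    }
}
```

The weak spot is the last candidate: every byte sequence is valid
ISO-8859-1, so "decodes without error" says nothing about whether the
text was really written in that charset (it could as easily be
windows-1252 or ISO-8859-15).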

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
Received on Tuesday, 11 October 2011 20:52:52 UTC