- From: Norman Walsh <ndw@nwalsh.com>
- Date: Tue, 11 Oct 2011 16:52:12 -0400
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <m24nzfi743.fsf@nwalsh.com>
"Imsieke, Gerrit, le-tex" <gerrit.imsieke@le-tex.de> writes: > As I’ve asked for it, I will of course support this proposal. > > I just found this evidence (from 18 months ago) in which Vojtech > considered the currently specified behavior as a bug in the EXProc spec, > advocating a p:data-like treatment of unzipped text content: > http://lists.w3.org/Archives/Public/xproc-dev/2010Apr/0095.html Ah, yes. That slipped off my radar, but I think Vojtech is entirely correct. The trouble is that converting from text in the ZIP file to Unicode in XML requires knowing the charset. > The main difference between http://www.w3.org/TR/xproc/#p.data and its > default result c:data is that the pipeline designer doesn’t have control > over which charset should be used. The XProc spec seems to expect that > the input’s charset will somehow be known, for example announced by HTTP > headers or detected by some implementation-dependent magic. HTTP makes the charset completely clear. Files off the filesystem might make the charset clear, though in practice, I don't think they do very often. Files in a ZIP are a whole other story, I think. In any event, our current story about charset seems a bit scattered. > In your proposal, however, the pipeline developer will be able to > specify the expected encoding. This is similar to xsl:unparsed-text, and > I think this is desirable. > > In the absence of a user-supplied charset, Calabash might use what has > been given by Java’s -Dfile.encoding option or whatever charset Java > uses by locale or default. Yes. I think that rather than making a missing charset an error, we should give implementations the freedom to guess. It'll all go pear shaped if they guess wrong, but realistically, the pipeline author is just going to insert a guess to work around the fact that it's required and it's all going to go pear shaped when they guess wrong too. 
> For my intended use case, which is reading CSS files from an EPUB Zip
> archive, the situation is as follows:

[ Useful summary of the sorts of problems that arise with encodings elided. ]

> So for the few of you that read until here, I’m suggesting the following:
> - U+1F44D (thumbs up) for an optional charset attribute with pxp:unzip,
>   just like Norm suggested
> - otherwise: implementation-defined:
>   Implementors should decide whether to use Java’s file encoding
>   setting, to honor a BOM, to honor an @charset rule, to use heuristic
>   detection, either by using a library or by trying a list of known
>   charsets (US-ASCII, UTF-8, ISO-8859-1, ...) in turn.
> If “implementation-defined” is implemented well (good autodetection),
> more and more pipeline developers will omit explicit charset declaration.

Don't hold your breath:

http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

Be seeing you,
norm

--
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
Received on Tuesday, 11 October 2011 20:52:52 UTC