- From: Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de>
- Date: Sun, 09 Oct 2011 11:31:52 +0200
- To: xproc-dev@w3.org
As I’ve asked for it, I will of course support this proposal. I just found this evidence (from 18 months ago) in which Vojtech considered the currently specified behavior a bug in the EXProc spec, advocating a p:data-like treatment of unzipped text content:
http://lists.w3.org/Archives/Public/xproc-dev/2010Apr/0095.html

The main difference from p:data (http://www.w3.org/TR/xproc/#p.data) and its default c:data result is that with p:data the pipeline designer doesn’t have control over which charset is used. The XProc spec seems to expect that the input’s charset will somehow be known, for example announced by HTTP headers or detected by some implementation-dependent magic. In your proposal, however, the pipeline developer will be able to specify the expected encoding. This is similar to the encoding argument of XSLT’s unparsed-text() function, and I think this is desirable. In the absence of a user-supplied charset, Calabash might use what has been given by Java’s -Dfile.encoding option or whatever charset Java uses by locale or default.

For my intended use case, which is reading CSS files from an EPUB Zip archive, the situation is as follows: Currently, I’ll get base64 output. I’d have to use my own decoder extension or a commercial Saxon version in order to convert it to text. I still wouldn’t know whether this text was UTF-8, ISO-8859-1 or whatever.

In the case of CSS, there may be an @charset rule at the beginning of the file. So a text parser might interpret the bytes of a purported text/* file as US-ASCII until the sequence @ c h a r s e t appears, and then interpret the whole text file according to what the @charset rule told it to do. So this might be a reasonable treatment for text/css. But the thing about @charset is: it doesn’t have to be present in a CSS file.

Which leads us to the general case: For the generalized text file, I don’t know whether it should be up to the XProc processor or to the pipeline designer to guess the correct encoding.
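To make the @charset idea concrete, here is a minimal sketch in Python. It is not how Calabash or any XProc processor does it; the function names sniff_css_charset and decode_css and the UTF-8 default are assumptions made up for illustration. It only looks for an @charset rule at the very first byte of the file, and it lets an explicitly supplied charset win over whatever the file announces.

```python
import re

# Hypothetical helpers for illustration; not part of XProc, Calabash or any CSS library.
# Matches only a rule at the very start of the data: @charset "<name>";
_CHARSET_RULE = re.compile(rb'@charset "([\x20\x21\x23-\x7e]*)";')


def sniff_css_charset(data: bytes):
    """Return the encoding named by a leading @charset rule, or None if absent."""
    match = _CHARSET_RULE.match(data)  # match() anchors at the first byte
    if match is None:
        return None
    return match.group(1).decode("ascii")


def decode_css(data: bytes, explicit_charset=None, default="utf-8"):
    """Decode CSS bytes to text.

    Precedence (an assumption of this sketch): explicit charset,
    then the @charset rule, then the default.
    Raises LookupError for an unknown charset name and
    UnicodeDecodeError if the bytes do not fit the chosen charset.
    """
    charset = explicit_charset or sniff_css_charset(data) or default
    return data.decode(charset)


if __name__ == "__main__":
    latin1_css = '@charset "ISO-8859-1";\np::before { content: "é"; }'.encode("iso-8859-1")
    print(sniff_css_charset(latin1_css))
    print(decode_css(latin1_css))
```

A file without an @charset rule simply falls through to the default here, which is exactly the case where the question of who should guess remains open.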
If the processor doesn’t use charset autodetection magic, the pipeline developer has to specify the expected encoding. In the abovementioned application (epubcheck implemented in XProc) this will amount to several nested try/catch clauses: first try to read CSS as UTF-8 (which will work for US-ASCII, too) and if that fails, as ISO-8859-1. Of course the processor might implement this try/catch approach in the background (if there’s no user-supplied charset attribute), thereby providing its own autodetection heuristics.

For the special case of text/css: An implementation may honor @charset rules in text/css files, but this should be overridden by an explicit charset attribute on pxp:unzip. (But if I were an implementor, I wouldn’t waste much time on honoring @charset rules in text/css.)

So for the few of you that read until here, I’m suggesting the following:

- U+1F44D (thumbs up) for an optional charset attribute with pxp:unzip, just like Norm suggested
- otherwise: implementation-defined: Implementors should decide whether to use Java’s file encoding setting, to honor a BOM, to honor an @charset rule, or to use heuristic detection, either by using a library or by trying a list of known charsets (US-ASCII, UTF-8, ISO-8859-1, ...) in turn.

If “implementation-defined” is implemented well (good autodetection), more and more pipeline developers will omit explicit charset declarations.

Good example of a seemingly simpler data format (plain text) being harder to read than XML.

Gerrit

On 2011-10-08 22:10, Norman Walsh wrote:
> Hi folks,
>
> I propose that we add a charset option to the pxp:unzip step. If the
> specified content type is not an XML content type but is a text
> content type (begins "text/") and a charset parameter is specified,
> then the result is a c:data element containing the characters of the
> extracted file.
>
> In other words, for text in a known charset, we decode the content
> instead of returning it as a base64 encoded chunk.
>
> Thoughts?
>
> Be seeing you,
> norm

--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Managing Directors: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler
------------------------------------------------------------------------
Visit us at the Frankfurt Book Fair in Hall 4.2, Stand F410. More at www.le-tex.de/de/buchmesse.html
Received on Sunday, 9 October 2011 09:32:48 UTC