Re: encode-for-uri() and filenames?

Dear Gerrit, Martin, and XProc Dev,

Thank you for the quick responses, and for the reminder about p:urify().
The detail I had not understood is, as Gerrit writes:

> This means that 'test-1a%7%.xml' is not a valid URI. The (relative) URI
> that corresponds to this file name is 'test-1a%257%25.xml'. When it is
> used to store the file, the percent encoding will be undone, resulting
> in a file name 'test-1a%7%.xml'.


I knew that 'test-1a%7%.xml' was not a valid URI, which is why I tried to
pass it through encode-for-uri(), and the output I expected to emerge from
that was 'test-1a%257%25.xml'. Since the percent-encoded version is a valid
filename (even if not an especially user-friendly one), I expected that it
would be used as created, with the percent-encoding preserved in the
filename. I see, though, at
https://spec.xproc.org/master/head/steps/#c.store that the value of @href
on a <p:store> step is typed as xs:anyURI, and not as a string, which is
obvious and natural now that I think about it. My misunderstanding lay in
not expecting that the URI would be converted to a string by undoing the
percent encoding when the filename was created. Upon reflection, though, I
now think that's the behavior I should have expected, since applications
that ingest URIs and have to map them to file system resources need to undo
percent encoding as a matter of course.
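
To make the round trip concrete for myself, here is a minimal sketch in
Python. It uses urllib.parse only as a stand-in: quote() is not XPath's
encode-for-uri() or p:urify(), and unquote() is my assumption about what a
processor does when it maps the URI back to a filename, not a description
of how Morgana actually implements it.

```python
from urllib.parse import quote, unquote

# The filename as it should appear on disk.
filename = "test-1a%7%.xml"

# Percent-encode it so that it is usable as a (relative) URI reference.
uri_ref = quote(filename)

# What a consumer of the URI does when it maps the URI to a file:
# the percent-encoding is undone, which restores the original name.
stored_as = unquote(uri_ref)

# Encoding a string that is already a URI encodes the '%' signs again,
# which is the double-encoding effect Gerrit describes for p:urify()
# when its input is a file name rather than a URI.
double_encoded = quote(uri_ref)

print(uri_ref)         # test-1a%257%25.xml
print(stored_as)       # test-1a%7%.xml
print(double_encoded)  # test-1a%25257%2525.xml
```

The point of the sketch is only that the encoded form is what travels as
the URI, while the decoded form is what ends up on disk.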

This XProc inquiry was a follow-up on my earlier question on the exist-open
mailing list, where the eXist-db Java admin client refused to upload files
with names like "test-1a%7%.xml", with an error message to the effect that
the filename could not be converted to a URI. I think that behavior is
incorrect, since encode-for-uri() and p:urify() can convert the string
value of that filename to a percent-escaped URI, so the conversion does not
appear to be impossible. I asked on the eXist-db mailing list because I
thought that eXist-db should URI-ify the filename and upload the file, and
when it refused to do that, I then wondered whether the source of the
problem was that my XProc script should have used the percent-escaped value
of the filename when it created the file in the <p:store> step. I think I
am now back where I began, though: eXist-db declines to construct a URI
from a filename that encode-for-uri() and p:urify() are able to convert to
one. I don't think eXist-db should refuse to construct a URI from that
filename, but that's a question for the eXist-db mailing list, and I'll
move it over there.

Thank you all again for the quick and helpful responses.

Best,

David

On Sun, Nov 22, 2020 at 12:56 PM Imsieke, Gerrit, le-tex <
gerrit.imsieke@le-tex.de> wrote:

>
>
> On 22.11.2020 18:12, Martin Honnen wrote:
> >>
> >> Because the percent signs as they are used in the filename are
> >> incompatible with URI encoding, I expect them to be percent-encoded
> >> themselves, with the modified filename echoed to stderr (in the
> >> <p:identity> step) and used to save the test file (in the <p:store>
> >> step). What happens instead is that the percent encoded value is
> >> written, as expected, to stderr:
> >>
> >>     test-1a%257%25.xml
> >>
> >> but the file is saved to the local filesystem as if encode-for-uri()
> >> had not been applied, that is, as:
> >>
> >>     test-1a%7%.xml
> >
> > I don't have an explanation for that, perhaps ask Achim by raising an
> > issue on Morgana on Sourceforge.
> >
>
> This is the correct behavior that you are observing.
>
> Quoting https://tools.ietf.org/html/rfc3986#section-2.4:
>
>   Because the percent ("%") character serves as the indicator for
>   percent-encoded octets, it must be percent-encoded as "%25" for that
>   octet to be used as data within a URI.
>
> This means that 'test-1a%7%.xml' is not a valid URI. The (relative) URI
> that corresponds to this file name is 'test-1a%257%25.xml'. When it is
> used to store the file, the percent encoding will be undone, resulting
> in a file name 'test-1a%7%.xml'.
>
> Instead of encode-for-uri(), you can also use p:urify()
> (https://spec.xproc.org/master/head/xproc/#f.urify) that will only
> encode the parts of the file name (or URI) that need to be encoded.
>
> For example, p:urify('c:\Users\gerrit\test-1a%7%.xml') will result in
> 'file:///c:/Users/gerrit/test-1a%257%25.xml'
>
> p:urify('c:\Users\gerrit\test-1a%257%25.xml') →
> 'file:///c:/Users/gerrit/test-1a%25257%2525.xml' (the input isn’t a URI,
> therefore '%25' will be regarded as a literal part of the file name that
> must be percent-encoded as '%2525' in a URI).
>
> p:urify('file:///c:/Users/gerrit/test-1a%257%25.xml') →
> 'file:///c:/Users/gerrit/test-1a%257%25.xml' (no additional encoding of
> the '%25's because “Implementations must not percent-encode or decode
> the same string more than once” as stated in the same Sect. 2.4 of RFC
> 3986).
>
> Morgana reports 'file:///c:/Users/gerrit/test-1a%25257%2525.xml' as the
> result of the last invocation. I think this is incorrect. Otherwise,
> Morgana seems to implement p:urify() incredibly well.
>
> Gerrit
>
>
>
>

Received on Sunday, 22 November 2020 18:39:36 UTC