Re: CWM bug + RFE: N3 to XML RDF fails on non-trivial Unicode from Sean B. Palmer on 2002-07-05 (www-archive@w3.org from July 2002)

From: Sean B. Palmer <sean@mysterylights.com>
Date: Fri, 5 Jul 2002 22:36:09 +0100
To: "Dan Connolly" <connolly@w3.org>, "Sampo Syreeni" <decoy@iki.fi>
Cc: "Tim Berners-Lee" <timbl@w3.org>, <www-archive+n3bugs@w3.org>
Message-ID: <012301c2246b$fc6d0920$35560150@localhost>

>             write(re.sub(r'([\x80-\xff])',
>                   lambda m: '&#x%02X;' % ord(m.group(1)), str(s[i:])))
> [...]
>         <con_Street>K&#xC3;&#xA4;mnerintie 4 A 22</con_Street>

Heh, whoops; the following bit of code escapes the actual Unicode
characters (code points) rather than the individual UTF-8 bytes:-

            write(re.sub(ur'([\u0080-\uffff])',
                  lambda m: '&#x%02X;' % ord(m.group(1)),
                  unicode(s[i:], 'utf-8')))

output:-

        <con_Street>K&#xE4;mnerintie 4 A 22</con_Street>

sorry 'bout that.
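
For comparison, here is a minimal sketch of the two behaviours in
present-day Python 3 (the snippets above are Python 2). The function names
escape_bytes and escape_chars are hypothetical, chosen only for this
illustration, and the character class [^\x00-\x7f] is an assumption that
also covers code points above U+FFFF, which the range in the snippet above
does not.

```python
import re


def escape_bytes(raw: bytes) -> str:
    """Mistaken approach: emit one character reference per UTF-8 byte."""
    escaped = re.sub(rb'[\x80-\xff]',
                     lambda m: b'&#x%02X;' % ord(m.group(0)),
                     raw)
    return escaped.decode('ascii')


def escape_chars(raw: bytes) -> str:
    """Intended approach: decode UTF-8 first, then escape each code point."""
    text = raw.decode('utf-8')
    return re.sub(r'[^\x00-\x7f]',
                  lambda m: '&#x%02X;' % ord(m.group(0)),
                  text)


raw = 'K\u00e4mnerintie 4 A 22'.encode('utf-8')
print(escape_bytes(raw))   # K&#xC3;&#xA4;mnerintie 4 A 22
print(escape_chars(raw))   # K&#xE4;mnerintie 4 A 22
```

The only real difference is the decode step: once the bytes have been
decoded as UTF-8, ord() sees U+00E4 rather than the two bytes C3 and A4.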

Whilst I'm writing again, I should note that the test case causes another
Unicode conversion error when converting N3 to N3 (i.e. just running it
through the pretty printer). The traceback reveals:-

  File "/home/2000/10/swap/notation3.py", in strconst
    ustr = ustr + str[j:i]
UnicodeError: ASCII decoding error: ordinal not in range(128)
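
The error is what Python 2 produces when a unicode string is concatenated
with a byte string holding non-ASCII bytes: the byte string is implicitly
decoded as ASCII first. Python 3 no longer does that implicit step, so the
sketch below makes the ASCII decode explicit to show the same failure; the
helper name ascii_decode_fails is hypothetical.

```python
def ascii_decode_fails(data: bytes) -> bool:
    """Return True if decoding `data` as ASCII raises UnicodeDecodeError."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False


raw = b'K\xc3\xa4mnerintie'          # UTF-8 bytes for "K\u00e4mnerintie"
print(ascii_decode_fails(raw))       # True: 0xC3 is outside range(128)
print(ascii_decode_fails(b'Kamnerintie'))  # False: pure ASCII decodes fine
```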

When I changed:-

        ustr = u""   # Empty unicode string

to:-

        ustr = ""   # Empty string

it gave this output:-

         :con_Street "K\u00c3\u00a4mnerintie 4 A 22" .

which is still wrong. You want either the UTF-8 encoded output with the
bytes present, or the proper escape for the character, which is \u00e4.
As it is, it's just treating each UTF-8 byte as a separate Unicode
character (um... a bit like I mistakenly did with the code in my last
email). I managed to rig up a very silly and complex fix for this problem,
but I'm sure there's a better way, so I'll just bring it to your attention
instead.
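
One possible shape for a simpler approach, sketched in Python 3 and not
taken from cwm itself: decode the bytes as UTF-8 once, then emit a \u
escape per non-ASCII code point. The function name n3_escape and its
encoding parameter are hypothetical; the \U form for code points above
U+FFFF is an assumption borrowed from N-Triples; and escaping of quotes
and backslashes is deliberately left out.

```python
def n3_escape(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode `raw`, then escape every non-ASCII code point."""
    out = []
    for ch in raw.decode(encoding):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append('\\u%04x' % cp)     # e.g. \u00e4
        else:
            out.append('\\U%08x' % cp)     # assumed form for astral code points
    return ''.join(out)


raw = 'K\u00e4mnerintie 4 A 22'.encode('utf-8')
print(n3_escape(raw))              # K\u00e4mnerintie 4 A 22
print(n3_escape(raw, 'latin-1'))   # K\u00c3\u00a4mnerintie 4 A 22
```

The second call decodes the same bytes as Latin-1, which maps each byte to
one code point and so reproduces the wrong output quoted above.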

--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .

Received on Friday, 5 July 2002 17:36:20 UTC