- From: Sean B. Palmer <sean@mysterylights.com>
- Date: Fri, 5 Jul 2002 22:36:09 +0100
- To: "Dan Connolly" <connolly@w3.org>, "Sampo Syreeni" <decoy@iki.fi>
- Cc: "Tim Berners-Lee" <timbl@w3.org>, <www-archive+n3bugs@w3.org>
> write(re.sub(r'([\x80-\xff])',
>    lambda m: '&#x%02X;' % ord(m.group(1)), str(s[i:])))
> [...]
> <con_Street>Kämnerintie 4 A 22</con_Street>

Heh, whoops; the following bit of code encodes the characters as proper
Unicode rather than just encoding the bytes:-

   write(re.sub(ur'([\u0080-\uffff])',
      lambda m: '&#x%02X;' % ord(m.group(1)), unicode(s[i:], 'utf-8')))

output:-

   <con_Street>Kämnerintie 4 A 22</con_Street>

sorry 'bout that.

Whilst I'm writing again, I should note that the test case causes another
unicode conversion error when converting N3 to N3 (i.e. just running it
through the pretty printer). The traceback reveals:-

   File "/home/2000/10/swap/notation3.py", in strconst
     ustr = ustr + str[j:i]
   UnicodeError: ASCII decoding error: ordinal not in range(128)

When I changed:-

   ustr = u"" # Empty unicode string

to:-

   ustr = "" # Empty string

it gave this output:-

   :con_Street "K\u00c3\u00a4mnerintie 4 A 22" .

which is still wrong. You either want to give the UTF-8 encoded output with
the bytes present, or the proper code for the character, which is \u00e4.
As it is, it's just converting the bytes into unicode (um... a bit like I
mistakenly did with the code in my last email).

I managed to rig up a very silly and complex fix for this problem, but I'm
sure there's a better way, so I'll just bring your attentions to it instead.

--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .
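The difference between escaping bytes and escaping characters, described above, can be sketched in present-day Python 3 roughly as follows. This is only an illustration, not the code in notation3.py: the function names are hypothetical, the input is assumed to be UTF-8 encoded bytes, and the character class only covers U+0080 to U+FFFF, as in the snippet above.

```python
import re

def charrefs_from_bytes(raw: bytes) -> str:
    """Mistaken approach: escape each UTF-8 byte as if it were a character."""
    text = raw.decode("latin-1")  # maps every byte to exactly one character
    return re.sub(r"[\x80-\xff]",
                  lambda m: "&#x%02X;" % ord(m.group(0)), text)

def charrefs_from_text(raw: bytes) -> str:
    """Decode the UTF-8 first, then escape each non-ASCII character."""
    text = raw.decode("utf-8")
    return re.sub(r"[\u0080-\uffff]",
                  lambda m: "&#x%02X;" % ord(m.group(0)), text)

def n3_escape(raw: bytes) -> str:
    """Same idea, but emitting N3-style \\uXXXX escapes."""
    text = raw.decode("utf-8")
    return re.sub(r"[\u0080-\uffff]",
                  lambda m: "\\u%04x" % ord(m.group(0)), text)

# The street name from the test case, as UTF-8 bytes.
street = "K\u00e4mnerintie 4 A 22".encode("utf-8")

print(charrefs_from_bytes(street))  # two references, one per byte of the a-umlaut
print(charrefs_from_text(street))   # one reference for the a-umlaut character
print(n3_escape(street))            # one \u escape for the a-umlaut character
```

The first function yields one reference per byte (C3 and A4), while the other two yield a single escape for U+00E4, which is the behaviour asked for above.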
Received on Friday, 5 July 2002 17:36:20 UTC