- From: Sean B. Palmer <sean@mysterylights.com>
- Date: Fri, 5 Jul 2002 22:36:09 +0100
- To: "Dan Connolly" <connolly@w3.org>, "Sampo Syreeni" <decoy@iki.fi>
- Cc: "Tim Berners-Lee" <timbl@w3.org>, <www-archive+n3bugs@w3.org>
> write(re.sub(r'([\x80-\xff])',
> lambda m: '&#x%02X;' % ord(m.group(1)), str(s[i:])))
> [...]
> <con_Street>K&#xC3;&#xA4;mnerintie 4 A 22</con_Street>
Heh, whoops; the following bit of code emits one character reference per
Unicode character rather than one per UTF-8 byte:-
write(re.sub(ur'([\u0080-\uffff])',
             lambda m: '&#x%02X;' % ord(m.group(1)),
             unicode(s[i:], 'utf-8')))
output:-
<con_Street>K&#xE4;mnerintie 4 A 22</con_Street>
sorry 'bout that.
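
To make the difference between the two snippets concrete, here is a minimal
sketch in Python 3 (the snippets above are Python 2). The function names
escape_bytes and escape_chars are made up for illustration, and the input is
assumed to be a UTF-8 encoded byte string:

```python
import re

def escape_bytes(raw: bytes) -> str:
    """Escape every non-ASCII *byte* (the mistaken, byte-wise behaviour)."""
    escaped = re.sub(rb'[\x80-\xff]',
                     lambda m: b'&#x%02X;' % m.group(0)[0],
                     raw)
    return escaped.decode('ascii')

def escape_chars(raw: bytes) -> str:
    """Decode UTF-8 first, then escape every non-ASCII *character*."""
    text = raw.decode('utf-8')
    # [^\x00-\x7f] also covers characters beyond U+FFFF, unlike [\u0080-\uffff].
    return re.sub(r'[^\x00-\x7f]',
                  lambda m: '&#x%02X;' % ord(m.group(0)),
                  text)

raw = 'Kämnerintie 4 A 22'.encode('utf-8')
print(escape_bytes(raw))   # two references, one per UTF-8 byte of the a-umlaut
print(escape_chars(raw))   # one reference for the a-umlaut itself
```

The only real change is decoding before substituting, so that the regular
expression sees characters instead of bytes.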
Whilst I'm writing again, I should note that the test case causes another
Unicode conversion error when converting N3 to N3 (i.e. just running it
through the pretty printer). The traceback reveals:-
File "/home/2000/10/swap/notation3.py", in strconst
ustr = ustr + str[j:i]
UnicodeError: ASCII decoding error: ordinal not in range(128)
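
That error is what Python 2 raises when a unicode string is concatenated with
a byte string containing non-ASCII bytes, because the bytes are implicitly
decoded as ASCII. A rough Python 3 sketch of the same failure, where the
decode has to be spelled out (append_chunk is a hypothetical helper, not
something from notation3.py):

```python
raw = 'Kämnerintie'.encode('utf-8')   # the a-umlaut becomes the bytes C3 A4

def append_chunk(ustr: str, chunk: bytes, encoding: str = 'ascii') -> str:
    """Append a byte chunk to a text string, decoding it explicitly."""
    return ustr + chunk.decode(encoding)

try:
    append_chunk('', raw)             # mimics Python 2's implicit ASCII decode
    failed = False
except UnicodeDecodeError:
    failed = True

fixed = append_chunk('', raw, 'utf-8')   # decoding as UTF-8 succeeds
print(failed, fixed)
```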
When I changed:-
ustr = u"" # Empty unicode string
to:-
ustr = "" # Empty string
it gave this output:-
:con_Street "K\u00c3\u00a4mnerintie 4 A 22" .
which is still wrong. You want either the UTF-8 encoded output with the raw
bytes present, or the proper escape for the character, which is \u00e4.
As it is, it's just treating each UTF-8 byte as a separate Unicode character
(um... a bit like I mistakenly did with the code in my last email). I managed
to rig up a very silly and complex fix for this problem, but I'm sure there's
a better way, so I'll just bring it to your attention instead.
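
For what it's worth, one possible shape for a simpler fix, sketched in
Python 3 and not taken from notation3.py: decode the bytes as UTF-8 first,
then escape each non-ASCII character. The name n3_escape is invented here,
the \U form for characters beyond U+FFFF is an assumption about what the
output syntax accepts, and quote/backslash escaping is left out:

```python
def n3_escape(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode raw bytes, then escape each non-ASCII character."""
    out = []
    for ch in raw.decode(encoding):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                  # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append('\\u%04x' % cp)      # four hex digits
        else:
            out.append('\\U%08x' % cp)      # assumed form for larger code points
    return ''.join(out)

raw = 'Kämnerintie 4 A 22'.encode('utf-8')
print(n3_escape(raw))              # decoded as UTF-8: one escape for the a-umlaut
print(n3_escape(raw, 'latin-1'))   # wrong codec: one escape per byte, as in the bug
```

Passing 'latin-1' as the encoding reproduces the wrong output shown above,
which suggests the bug is a byte-per-character decode rather than anything in
the escaping step itself.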
--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .
Received on Friday, 5 July 2002 17:36:20 UTC