- From: <mats.gls@gmail.com>
- Date: Fri, 4 Jun 2010 06:15:42 +0200
- To: Kurt J <kurtjx@gmail.com>
- Cc: public-lod@w3.org
Received on Friday, 4 June 2010 04:16:13 UTC
>
> this is a data set i really want too!!!! somebody know a way around
> the unicode problem???
>
Maybe find sequences like "&#195;&#175;" with a regexp and then replace
them with the correct unicode chars.
In Python, something like this, looped through each line of the files,
should work, I think:
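import re
# Python 2 syntax (print statement, str.decode)
teststr = 'Tcha&#195;&#175;kovsky'
# Exactly two adjacent 3-digit character references, not part of a longer run
regex = re.compile(r'(?<!(&#\d{3};))(&#\d{3};){2}(?!(&#\d{3};))')
rObj = re.search(regex, teststr)
if rObj is not None:
    # The two numbers are the UTF-8 bytes of one character: 195, 175 -> 0xc3, 0xaf
    hexValues = [hex(int(rObj.group()[2:5])), hex(int(rObj.group()[8:11]))]
    newChar = ''.join([chr(int(c, 16)) for c in hexValues]).decode('utf8')
    print re.sub(regex, newChar, teststr)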
output>Tchaïkovsky
I've posted a more complete version here http://pastebin.com/vuq72irC
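For anyone on Python 3, here is a minimal sketch of the same idea. It is not the pastebin version; the names ENTITY_RUN, ENTITY, fix_run and fix_line are made up for illustration. Unlike the snippet above, it accepts runs of two or more references (so three-byte characters also work), decodes each run separately, and leaves a run untouched if its numbers are not valid UTF-8 bytes. It assumes the broken data always uses 3-digit decimal references.

```python
import re

# A run of two or more adjacent 3-digit decimal character references,
# e.g. "&#195;&#175;". (Assumption: the data only uses this form.)
ENTITY_RUN = re.compile(r'(?:&#\d{3};){2,}')
ENTITY = re.compile(r'&#(\d{3});')


def fix_run(match):
    """Treat the numbers in one run as UTF-8 bytes and decode them."""
    values = [int(n) for n in ENTITY.findall(match.group())]
    try:
        return bytes(values).decode('utf-8')
    except ValueError:
        # Covers values above 255 and invalid UTF-8 (UnicodeDecodeError
        # is a subclass of ValueError): keep the original text.
        return match.group()


def fix_line(line):
    """Repair every run in a line, each one decoded on its own."""
    return ENTITY_RUN.sub(fix_run, line)


print(fix_line('Tcha&#195;&#175;kovsky'))  # Tchaïkovsky
```

To process a file, call fix_line on each line as you read it and write the result out with encoding='utf-8'.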
Cheers,
Mats