- From: <mats.gls@gmail.com>
- Date: Fri, 4 Jun 2010 06:15:42 +0200
- To: Kurt J <kurtjx@gmail.com>
- Cc: public-lod@w3.org
Received on Friday, 4 June 2010 04:16:13 UTC
> > this is a data set i really want too!!!! somebody know a way around
> the unicode problem???
>

Maybe find sequences like "&#195;&#175;" with a regexp and then replace them
with the correct Unicode characters. In Python, something like this, looped
over each line of the files, should work, I think:

    # Python 2
    import re

    teststr = 'Tcha&#195;&#175;kovsky'
    # exactly two adjacent 3-digit character references, not part of a longer run
    regex = re.compile(r'(?<!(&#\d{3};))(&#\d{3};){2}(?!(&#\d{3};))')
    rObj = re.search(regex, teststr)
    if rObj is not None:
        # pull the two decimal byte values out of the match, e.g. 195 and 175
        hexValues = [hex(int(rObj.group()[2:5])), hex(int(rObj.group()[8:11]))]
        # turn them back into raw bytes and decode the pair as UTF-8
        newChar = ''.join([chr(int(c, 16)) for c in hexValues]).decode('utf8')
        print re.sub(regex, newChar, teststr)

    output>Tchaïkovsky

I've posted a more complete version here: http://pastebin.com/vuq72irC

Cheers,
Mats
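
For readers on Python 3, here is a minimal sketch of the same idea. It is not the pastebin version, and the names PAIR, fix_pair and fix_line are made up for illustration. It assumes the damaged text contains pairs of numeric character references such as &#195;&#175;, which are the two UTF-8 bytes 0xC3 0xAF of "ï". Unlike the snippet in the message, it decodes each match separately, so a line with several different broken characters gets the right replacement for each one. Pairs that are not valid UTF-8 are left untouched.

```python
import re

# Exactly two adjacent 3-digit numeric character references that are not
# part of a longer run (same restriction as the regex in the message).
PAIR = re.compile(r'(?<!&#\d{3};)&#(\d{3});&#(\d{3});(?!&#\d{3};)')


def fix_pair(match):
    """Decode one matched pair as a 2-byte UTF-8 sequence."""
    try:
        raw = bytes(int(g) for g in match.groups())
        return raw.decode('utf-8')
    except ValueError:
        # Value above 255 or not valid UTF-8: keep the original text.
        return match.group(0)


def fix_line(line):
    """Replace every decodable pair in one line of text."""
    return PAIR.sub(fix_pair, line)


if __name__ == '__main__':
    print(fix_line('Tcha&#195;&#175;kovsky'))
```

Because of the lookbehind and lookahead, runs of more than two references (for example two accented letters next to each other, or a character that needs three UTF-8 bytes) are skipped. Handling those would need a pattern that matches the whole run and decodes it in one go.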