Re: Discogs Linked Data

>
> this is a data set i really want too!!!!  somebody know a way around
> the unicode problem???
>
> Maybe find stuff like these "&#195;&#175;" with a regexp and then replace
> them with the correct unicode chars.

In Python, something like this, looped over each line of the files, should
work, I think:

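import re

# Mis-encoded input: the two UTF-8 bytes of "ï" (0xC3 0xAF), each written
# as its own numeric character reference.
teststr = 'Tcha&#195;&#175;kovsky'
# Exactly two consecutive 3-digit references that are not part of a longer run.
regex = re.compile(r'(?<!&#\d{3};)&#(\d{3});&#(\d{3});(?!&#\d{3};)')

def fix(match):
  # Treat the two numbers as raw bytes and decode the pair as UTF-8.
  return bytes(int(n) for n in match.groups()).decode('utf8')

# A function replacement decodes each match separately (Python 3 syntax).
print(regex.sub(fix, teststr))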

output>Tchaïkovsky

I've posted a more complete version here http://pastebin.com/vuq72irC
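
For the "looped through each line" part, here is a rough sketch of how it
could look. It is not the pastebin script, and the names BYTE_RUN,
decode_run, fix_line and fix_lines are made up for illustration. It assumes
Python 3, that the broken characters always show up as runs of decimal
references with values 128-255, and that any run which is not valid UTF-8
should be left untouched:

```python
import re

# A run of numeric references with values 128-255, i.e. candidate non-ASCII
# UTF-8 bytes. ASCII references such as &#038; are deliberately not matched.
BYTE_RUN = re.compile(r'(?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);)+')
NUMBER = re.compile(r'\d+')

def decode_run(match):
    """Decode a run of byte-valued references as UTF-8 when it is valid."""
    raw = bytes(int(n) for n in NUMBER.findall(match.group()))
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not a valid UTF-8 sequence (e.g. a lone &#233;): keep the text as is.
        return match.group()

def fix_line(line):
    """Repair every decodable run in a single line."""
    return BYTE_RUN.sub(decode_run, line)

def fix_lines(lines):
    """Yield repaired lines from any iterable of strings, e.g. an open file."""
    for line in lines:
        yield fix_line(line)

sample = ['Tcha&#195;&#175;kovsky\n', 'Caf&#233; &#038; Bar\n']
print(''.join(fix_lines(sample)), end='')
```

Unlike the two-reference regex above, this also covers three- and four-byte
sequences, and it skips anything that does not decode cleanly rather than
raising an error.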

Cheers,

Mats

Received on Friday, 4 June 2010 04:16:13 UTC