Re: Discogs Linked Data from mats.gls@gmail.com on 2010-06-04 (public-lod@w3.org from June 2010)

From: <mats.gls@gmail.com>
Date: Fri, 4 Jun 2010 06:15:42 +0200
To: Kurt J <kurtjx@gmail.com>
Cc: public-lod@w3.org
Message-ID: <AANLkTin5DyKHgzzFCb9ov-Rigx7qb-IC9Dmng40GA6cf@mail.gmail.com>

>
> this is a data set i really want too!!!!  somebody know a way around
> the unicode problem???
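>

Maybe find sequences like "&#195;&#175;" with a regexp and then replace
them with the correct Unicode characters. Those two numbers are the UTF-8
bytes of "ï" (0xC3 0xAF), each written out as its own character reference.

In Python, something like this, looped over each line of the files, should
work, I think: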

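import re

teststr = 'Tcha&#195;&#175;kovsky'
# Exactly two adjacent 3-digit references, not part of a longer run (Python 3).
regex = re.compile(r'(?<!&#\d{3};)&#(\d{3});&#(\d{3});(?!&#\d{3};)')

def fix(match):
    # Treat the two numbers as raw bytes and decode them together as UTF-8.
    try:
        return bytes(int(n) for n in match.groups()).decode('utf8')
    except ValueError:  # value above 255, or the pair is not valid UTF-8
        return match.group()

# A function as replacement decodes each match on its own.
print(regex.sub(fix, teststr))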

output>Tchaïkovsky
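
The same idea can probably be stretched to characters that take three or four UTF-8 bytes. The sketch below (Python 3) treats any run of adjacent references in the 128-255 range as raw bytes and decodes the whole run at once, leaving the run untouched if it is not valid UTF-8. The names `fix_line` and `fix_file`, and the assumption that the dump files are otherwise UTF-8 text, are mine for illustration; this is not the pastebin version, and a run of genuine Latin-1 references that happens to form valid UTF-8 would be converted too.

```python
import re

# A single reference whose value is a non-ASCII byte (128-255).
_BYTE = r'&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);'
# One or more such references in a row, e.g. "&#195;&#175;".
RUN = re.compile('(?:%s)+' % _BYTE)
NUM = re.compile(r'&#(\d+);')


def _decode_run(match):
    """Decode a run of references as UTF-8 bytes; leave it alone on failure."""
    raw = bytes(int(n) for n in NUM.findall(match.group()))
    try:
        return raw.decode('utf8')
    except UnicodeDecodeError:
        # Not a UTF-8 sequence, e.g. a lone "&#233;": keep the original text.
        return match.group()


def fix_line(line):
    """Return the line with byte-wise character references repaired."""
    return RUN.sub(_decode_run, line)


def fix_file(src_path, dst_path):
    """Hypothetical helper: repair src_path line by line into dst_path."""
    with open(src_path, encoding='utf8') as src, \
         open(dst_path, 'w', encoding='utf8') as dst:
        for line in src:
            dst.write(fix_line(line))


print(fix_line('Tcha&#195;&#175;kovsky'))  # Tchaïkovsky
```

With this, `fix_line('&#226;&#130;&#172;')` gives the euro sign, while a standalone `&#233;` or an ASCII reference such as `&#038;` is left as it is. A run is all-or-nothing, so a valid pair glued to an unrelated high reference stays unconverted.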

I've posted a more complete version here: http://pastebin.com/vuq72irC

Cheers,

Mats

Received on Friday, 4 June 2010 04:16:13 UTC