Re: XML problems with percent encoding

On Wed, Nov 18, 2009 at 16:14, Jeremy Carroll <jeremy@topquadrant.com> wrote:
> Sebastian Hellmann wrote:
>>
>> Dear all,
>> we (especially Matthias Weidl @ KAIST)  are currently working on producing
>> a Korean DBpedia.
>> We have again encountered a problem that we cannot really solve and can
>> only work around. The Korean property URIs consist entirely of special
>> characters. If we URL-encode them, serialisation in RDF/XML is bound to
>> fail.
>>
>> For a property like:
>> http://dbpedia.org/property/l%E3%A4ngengrad
>> Jena produces the following:
>> <ns0:ngengrad xmlns:ns0="http://dbpedia.org/property/l%E3%A4">
>> because % is not a valid character in an XML tag.
>> But if the property only contains special characters, it can not work any
>> more:
>> http://ko.dbpedia.org/property/%EA%B4%91%EC%9E%90
>>
>> In DBpedia we created a work around for this, replacing % with _percent_
>> but it is clearly not a satisfactory solution.
>>
>> How shall we resolve this matter?
>> Is XML conformity still necessary or is there a motion to only use turtle
>> in the future?
>>
>>
>
> Sorry I am late to this thread.
> Why are you percent-encoding the special characters? Why not just leave
> them in Korean?
> Semantic Web standards are based on IRIs, which allow all of these
> characters.
>
> Jeremy

Hi,

I don't know the details of IRIs, but that sounds like a good idea.

For a moment I thought that N-Triples wouldn't allow IRIs, but
then I realized that there's a difference between 'URI' and
'URI reference'. If I understand [1] and [2] correctly, we could
(and probably should) write non-ASCII characters as \u escapes
and generate N-Triples like the following:

<http://dbpedia.org/resource/Glinde%2C_Schleswig-Holstein>
<http://dbpedia.org/property/l\u00E4ngengrad> "10/12/40/E"@de .

instead of

<http://dbpedia.org/resource/Glinde%2C_Schleswig-Holstein>
<http://dbpedia.org/property/l%C3%A4ngengrad> "10/12/40/E"@de .
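
As a rough sketch of how such a conversion could look (this is not the
DBpedia extraction code; the function names are made up for illustration,
and I am assuming that only percent-escapes for non-ASCII bytes should be
decoded, so that something like %2C stays as it is):

```python
import re

# Runs of percent-escapes whose byte value is >= 0x80 (non-ASCII bytes).
# ASCII escapes such as %2C are deliberately not matched.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match):
    """Decode one run of escapes as UTF-8; keep it unchanged if invalid."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)


def _nt_escape(ch):
    """Escape one character the way N-Triples expects for non-ASCII."""
    cp = ord(ch)
    if cp < 0x80:
        return ch
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    return "\\U%08X" % cp


def to_ntriples_uriref(uri):
    """Hypothetical helper: percent-encoded URI -> N-Triples <uriref>."""
    decoded = _NON_ASCII_RUN.sub(_decode_run, uri)
    return "<" + "".join(_nt_escape(c) for c in decoded) + ">"


print(to_ntriples_uriref("http://dbpedia.org/property/l%C3%A4ngengrad"))
print(to_ntriples_uriref("http://ko.dbpedia.org/property/%EA%B4%91%EC%9E%90"))
```

Escape runs that are not valid UTF-8 are left untouched here, which is
only one possible policy; rejecting them outright might be safer.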


(Besides, the encoded property URIs we currently use are
broken - E3 A4 is not even a valid UTF-8 byte sequence.
The correct UTF-8 encoding of 'ä' is C3 A4.)
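
The byte-sequence claim is easy to check. A small sketch (the helper
name is just for illustration) that tests whether some bytes decode as
UTF-8 and prints the bytes that U+00E4 actually encodes to:

```python
def is_valid_utf8(raw):
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated upper-case hex."""
    return " ".join("%02X" % b for b in text.encode("utf-8"))


# E3 starts a three-byte sequence, but only one continuation byte (A4)
# follows before the ASCII letter 'n', so decoding fails.
print(is_valid_utf8(b"l\xe3\xa4ngengrad"))
print(is_valid_utf8(b"l\xc3\xa4ngengrad"))
print(utf8_hex("\u00e4"))
```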

Christopher

[1] http://www.w3.org/TR/rdf-testcases/#sec-uri-encoding
[2] http://www.w3.org/TR/rdf-concepts/#dfn-URI-reference

Received on Wednesday, 18 November 2009 17:30:30 UTC