Re: [charmodReview-17] replacing all URIs with IRIs

From: Martin Duerst <duerst@w3.org>
Date: Mon, 27 May 2002 15:31:02 +0900
Message-Id: <4.2.0.58.J.20020527144521.00aa1810@localhost>
To: Aaron Swartz <me@aaronsw.com>, Misha.Wolf@reuters.com
Cc: www-tag@w3.org, www-rdf-validator@w3.org

[I'm copying www-rdf-validator@w3.org, because there is an error
report for the validator, and some suggestions of how to fix it.]

At 19:18 02/05/24 -0500, Aaron Swartz wrote:
>On Friday, May 24, 2002, at 05:04 PM, Misha.Wolf@reuters.com wrote:

>>Which utilities?
>
>All the current RDF tools, I think. I don't think any of them have been 
>updated to support normalization or Unicode storage. Certainly all the 
>tools I've written don't support it. If you take a look at the RDF 
>Validator[1] you'll find that it %-encodes characters like ü, as most of 
>the RDF tools I know do.

How much work would it be for the RDF Validator to change this?
My guess is that it would be quite easy, and it would result in
overall less code. I would be very glad to help.

By the way, I just tested the RDF Validator with some simple input.
While it gets to the correct %hh escaping in URIs, it messes up the
literals. That's because the validator input page is labeled as being
in iso-8859-1, and the output is labeled as being in UTF-8, but for
literals, there is no conversion in between.
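
The mismatch described here can be reproduced in a few lines. This is
only a sketch in Python, not the validator's actual code; the sample
character and all variable names are assumptions made for illustration.
It shows the URI path (UTF-8 bytes, then %hh escaping) next to the
literal path (iso-8859-1 bytes passed through to a page labeled UTF-8).

```python
from urllib.parse import quote

# Sample non-ASCII character (an assumption; any character above 0x7F works).
text = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# URI path: the character is turned into UTF-8 bytes and %hh-escaped.
escaped_uri_part = quote(text, safe="")

# Literal path: the input form is iso-8859-1, so the browser sends one byte.
submitted_bytes = text.encode("iso-8859-1")

# Without a conversion, a reader of the UTF-8-labeled output sees an
# invalid byte sequence, shown here as U+FFFD (replacement character).
shown_without_conversion = submitted_bytes.decode("utf-8", errors="replace")

# With the missing conversion step, the literal survives intact.
converted_bytes = submitted_bytes.decode("iso-8859-1").encode("utf-8")
shown_with_conversion = converted_bytes.decode("utf-8")

print(escaped_uri_part)
print(repr(shown_without_conversion))
print(repr(shown_with_conversion))
```

If the input page itself is served as UTF-8, the literal arrives as
UTF-8 already and no conversion is needed on either path.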

To fix it, the following steps are needed:

- Set the encoding of http://www.w3.org/RDF/Validator/Overview.html
   to UTF-8. I can do that in about one minute. Please tell me when
   to do it.

- Find the place in the code where the URIs are converted from
   iso-8859-1 to UTF-8. Remove that conversion. This should be
   rather easy. Please tell me if you need help.

- Fix graphVis. This seems to currently run under the assumption that
   everything (.dot files,...) is in iso-8859-1. In the short run,
   it could be called by converting from UTF-8 to iso-8859-1 and
   replacing characters not representable in iso-8859-1 with something
   like a ? or so. In the long term, it should be changed so that it
   can correctly render more than just iso-8859-1. This applies only
   to PNG and GIF; for SVG, graphVis currently does gigo (garbage in,
   garbage out), but feeding it UTF-8 would do the right thing.
   For the others, the easiest would be to use a batch SVG renderer.

- Go through the collection of RDF saved for test purposes, and
   change the first line of anything that contains bytes higher
   than 0x7F from <?xml version="1.0"?> to
   <?xml version="1.0" encoding='iso-8859-1'?> and additionally
   check the data for garbage cases. I may be able to help with this,
   too.
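
Two of the steps above, the short-term graphVis workaround and the
relabeling of saved test files, could look roughly like the following
Python sketch. The function names are hypothetical and this is not
code from the validator. It also assumes that a file worth relabeling
starts with exactly the declaration quoted above, and it treats "high
bytes that happen to be valid UTF-8" as one possible kind of garbage
case that needs a manual look.

```python
# Declarations exactly as quoted in the message.
PLAIN_DECL = b'<?xml version="1.0"?>'
LATIN1_DECL = b"<?xml version=\"1.0\" encoding='iso-8859-1'?>"


def to_latin1_lossy(utf8_bytes: bytes) -> bytes:
    """Short-term workaround: UTF-8 in, iso-8859-1 out.

    Characters that iso-8859-1 cannot represent become '?'.
    """
    return utf8_bytes.decode("utf-8").encode("iso-8859-1", errors="replace")


def has_high_bytes(data: bytes) -> bool:
    """True if any byte is above 0x7F, i.e. the data is not pure ASCII."""
    return any(b > 0x7F for b in data)


def needs_manual_check(data: bytes) -> bool:
    """Assumed heuristic: high bytes that decode as valid UTF-8 are suspect,
    because labeling such a file iso-8859-1 would be wrong."""
    if not has_high_bytes(data):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def label_latin1(data: bytes) -> bytes:
    """Add the iso-8859-1 label to files that contain high bytes and
    start with the unlabeled XML declaration; leave everything else alone."""
    if has_high_bytes(data) and data.startswith(PLAIN_DECL):
        return LATIN1_DECL + data[len(PLAIN_DECL):]
    return data
```

A file for which needs_manual_check returns True should be inspected
by hand before label_latin1 is applied to it.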

My conclusions from this are:

- Yes, there are indeed problems with RDF tools and i18n.
- Such problems should be fixed asap.
- The problems start with literals, not with resource identifiers.
- Fixing the problems with literals will fix the problems with
   resource identifiers too, in most cases.
- For the most part, fixing the problems probably takes less time
   than this discussion.

Regards,    Martin.
Received on Monday, 27 May 2002 02:34:08 GMT
