- From: Joshua Allen <joshuaa@microsoft.com>
- Date: Thu, 6 Jun 2002 16:40:59 -0700
- To: "Aaron Swartz" <me@aaronsw.com>
- Cc: <www-rdf-interest@w3.org>, Bill de hÓra <dehora@eircom.net>, "Graham Klyne" <GK@ninebynine.org>
All this shows is that the GNU tools are hopelessly outdated. They are 30-year-old technologies, and every modern system today supports UTF-16. The web is massively populated with UTF-16 data, and this is the fastest-growing type of content. Within 10 years, more than half of the information on the web will be represented in Asian character sets that are best serialized in UTF-16. And even if we ignore the moral repugnance of the Euro-centric bias in some of our information-processing legacy, we have got to admit that the whole western world will be missing out on massive market opportunities (i.e. money) by having a substandard ability to process global information.

And besides, the GNU tools suck at more than just non-European languages, as your comment about "diff" points out. Even if you fixed all of the GNU tools to have a proper concept of "character" that includes a Unicode character, they would still be limited by the fact that they see the universe through "linefeed"-colored glasses. In XML, linefeeds are not significant, so GNU diff is unable to do a proper diff of XML data, even *if* that data is purely 7-bit. When Aho, Kernighan, Wu et al. wrote those command-line tools, I am sure they did not expect them to be turned into a "movement" (more like a "settlement") that would be standing in the way of progress and trying to squash solutions unrealistically into a 7-bit bucket.

And finally, I would dispute that you can really use "diff" to determine changes between international documents expressed in the n-triples syntax. First, you are assuming that the escaping syntax described by the n-triples doc is deterministic, reliable, and formal enough that various implementations will *always* arrive at the same 7-bit ASCII for the same UTF-16 graph. Like I said, the syntax doesn't inspire trust in me, and I wouldn't be at all surprised if people start getting false negatives once they begin testing internationalized graphs heavily.

But even then, *assume* that the n-triples syntax is a work of genius at canonicalizing UTF-16 into 7-bit ASCII, and assume that any two different n3 strings are guaranteed to represent different resources. Diff *still* fails to detect isomorphism. For starters, bNode names can be changed without affecting isomorphism. I can imagine all sorts of n-triples documents that represent exactly the same graph but would be reported by GNU diff as different (a small example is appended below the quoted message). So I think that catering to some ancient Unix command-line tools should be a very minor consideration (if a consideration at all) in a syntax recommendation.

> -----Original Message-----
> From: Aaron Swartz [mailto:me@aaronsw.com]
> Sent: Thursday, June 06, 2002 3:37 PM
> To: Joshua Allen
> Cc: www-rdf-interest@w3.org; Bill de hÓra; Graham Klyne
>
> On Thursday, June 6, 2002, at 05:33 PM, Joshua Allen wrote:
>
> > And FWIW, I think this is a major strike *against* the current n-triple
> > serialization as a good test tool. In order to gain broad acceptance,
> > RDF will have to handle languages like Chinese at least as good as XML
> > (and XML is no paragon). Imagine merging and testing graphs of mixed
> > Chinese, Arabic, and other Unicode languages.
>
> This is exactly why I'm glad it uses ASCII. That way I can use standard
> UNIX tools like diff to make sure the N-Triples files conform, and not
> have to worry about UNIX Unicode issues. (Replace for details of your
> operating system.)
>
> --
> Aaron Swartz [http://www.aaronsw.com]
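P.S. To make the bNode point concrete, here is a minimal sketch. The files, URIs, and labels are invented for illustration, and the \u escapes are just my reading of how the n-triples doc says a Chinese literal (here "\u4E2D\u6587") has to be written in 7-bit ASCII. The two files describe exactly the same graph; the only difference is the arbitrary choice of bNode labels. GNU diff still reports every statement as changed:

    a.nt (one tool's output):
    _:a <http://example.org/name> "\u4E2D\u6587" .
    _:a <http://example.org/knows> _:b .

    b.nt (another tool's output, same graph, different bNode labels):
    _:x <http://example.org/name> "\u4E2D\u6587" .
    _:x <http://example.org/knows> _:y .

    $ diff a.nt b.nt
    1,2c1,2
    < _:a <http://example.org/name> "\u4E2D\u6587" .
    < _:a <http://example.org/knows> _:b .
    ---
    > _:x <http://example.org/name> "\u4E2D\u6587" .
    > _:x <http://example.org/knows> _:y .

Deciding that these two files say the same thing requires comparing the graphs up to a consistent renaming of bNodes, which is a graph-matching problem. No line-oriented tool is going to do that for you.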
Received on Thursday, 6 June 2002 19:41:32 UTC