- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Mon, 4 Mar 2002 11:23:14 -0000
- To: <w3c-rdfcore-wg@w3.org>, <w3c-i18n-ig@w3.org>
- Message-ID: <CEECKEAMDAJDDEDGJNBEKEBNCAAA.jjc@hpl.hp.com>
I wish to make the NFC issue more real to the WG by constructing a realistic fraudulent piece of RDF which could be used as part of a scam. I have sent this as HTML so that I can set the character encoding to UTF-8. Still, I cannot promise that it displays correctly. The example is also attached in a zip file, along with some other cases. I suggest looking at them both in an environment with good Unicode support and in one with no Unicode support.

Consider the RDF file below (error002.rdf in the zip):

    <rdf:RDF>
      <!-- The é below is a single character #xE9 in NFC -->
      <rdf:Description rdf:about="#André">
        <eg:owes>2000</eg:owes>
      </rdf:Description>
      <!-- The é below is two characters an e followed by #x301. It should be displayed identically to é. -->
      <rdf:Description rdf:about="#André">
        <!-- eg:isOwed may be defined as the inverse of eg:owes at a higher layer, e.g. in DAML+OIL. -->
        <eg:isOwed>2000</eg:isOwed>
      </rdf:Description>
    </rdf:RDF>

The two identifiers "#André" and "#André" are visually indistinguishable in many environments (particularly those that have correctly implemented Unicode). Hence the reader of this RDF, when it is presented through any standard interface, may believe that the person identified by "#André" both owes and is owed the same amount. Unfortunately for him, the person who is owed 2000 has a different name, "#André", and is likely to be a fraudster.

If you are looking at this message in an environment that correctly implements Unicode, you will need to work hard to detect the difference between the two names. One is a string of 6 characters (including the #), the other a string of 7 characters.

I note that this example concerns normalization of URIrefs, not of literals. When combined with something like daml:UnambiguousProperty, a similar example could possibly be constructed using non-normalized literals.
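The length difference between the two identifiers can be checked mechanically. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module; the variable names and the is_nfc helper are illustrative and are not taken from the files in the zip.

```python
import unicodedata

# Assumption: these two strings stand in for the identifiers in error002.rdf.
composed = "#Andr\u00e9"      # é as the single code point U+00E9
decomposed = "#Andre\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", s) == s


print(len(composed), len(decomposed))        # 6 7
print(composed == decomposed)                # False
print(is_nfc(composed), is_nfc(decomposed))  # True False
# Normalizing the decomposed form yields the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Comparing a string with its own NFC form is the simplest possible check. It normalizes the whole string on every call, which should be acceptable for short identifiers and literals.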
error001.rdf in the zip shows how an informal unique names convention can be abused using non-normalized character strings, creating moral confusion ('moral' as in 'moral rights' of copyright law).

error003.rdf in the zip shows how a fraudulent promissory note could be constructed.

error001a.rdf and error002a.rdf are identical as RDF to error001 and error002 but use XML character references so that the XML is in NFC. Written in this way, the examples no longer appear fraudulent, just odd.

error001b.rdf is in NFC but uses XML comments to separate the two parts of the non-normalized characters. The current version of charmod makes the treatment of files like this language-dependent:

http://www.w3.org/TR/2002/WD-charmod-20020220#sec-fully-normalized

[[[
Formal languages define constructs, which are identifiable pieces occuring in instances of the language such as comments, element tags, processing instructions, runs of character data, etc. Which of those constructs need to be constrained not to begin with a composing character is language-dependent and depends on what processing the language undergoes.
]]]

I18N-IG: if we were to make non-normalized literals ill-formed RDF, how easy is it to check that a string is in NFC? Is there public domain code we can use? Pointers please.

I will try to give some analysis of the possible ways of addressing these cases in a later message.

Jeremy
Attachments
- application/x-zip-compressed attachment: all.zip
Received on Monday, 4 March 2002 06:23:25 UTC