Re: very preliminary TURTLE comments from Norbert Lindenberg on 2012-08-15 (www-international@w3.org from July to September 2012)

From: Norbert Lindenberg <w3@norbertlindenberg.com>
Date: Tue, 14 Aug 2012 23:08:31 -0700
To: Richard Ishida <ishida@w3.org>
Cc: Norbert Lindenberg <w3@norbertlindenberg.com>, "Phillips, Addison" <addison@lab126.com>, "www-international@w3.org" <www-international@w3.org>
Message-Id: <61F2DA86-E13E-4425-B568-6B0CA017C4F1@norbertlindenberg.com>

A few comments on Addison's and Richard's comments below, and two new items:

Section 6.4, long form Unicode escape sequence: It's not clear why this should take eight hex digits when the first two are required to be 0. Also, the trend seems to be going towards \u{xxxxx}:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#Escapes
http://unicode.org/reports/tr18/#Hex_notation

Section 6.4, both forms of Unicode escape sequence: The spec doesn't say at what stage the escape sequences are converted to their corresponding characters. Can \u0022 start or end a string literal (as it does in Java)?
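
To make the question concrete, here is a minimal Python sketch of the two possible processing orders. It is not taken from the Turtle draft: the function names and the toy string-literal tokenizer are assumptions made purely for illustration.

```python
import re

# Hypothetical helper: replace \uXXXX and \UXXXXXXXX with their characters.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def unescape(text):
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

# Toy tokenizer (assumption): a string literal is a double quote, a run of
# non-quote characters or backslash pairs, and a closing double quote.
_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')

def strings_escapes_first(source):
    """Java-style: decode escapes in the raw text, then tokenize."""
    return [m.group(1) for m in _STRING.finditer(unescape(source))]

def strings_tokens_first(source):
    """Alternative: tokenize first, then decode escapes inside each literal."""
    return [unescape(m.group(1)) for m in _STRING.finditer(source)]

source = r'"a\u0022b" "c"'
print(strings_escapes_first(source))  # ['a', ' ']
print(strings_tokens_first(source))   # ['a"b', 'c']
```

With escapes decoded first, \u0022 terminates the literal early and the rest of the line is tokenized differently; with tokenization first, it is just a quotation mark inside the string.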

Norbert


On Aug 7, 2012, at 9:13 , Richard Ishida wrote:

> Some notes on your notes below. Looking forward to discussing during the telecon. I think this was quite thorough - I didn't find anything else.
> 
> RI
> 
> 
> Richard Ishida
> Internationalization Activity Lead
> W3C (World Wide Web Consortium)
> 
> http://www.w3.org/International/
> http://rishida.net/
> 
> On 01/08/2012 07:28, Phillips, Addison wrote:
>> Just my jotted notes on http://www.w3.org/TR/2012/WD-turtle-20120710/
> 
> I've been trying to jot down some ideas for guidelines that arise from these comments.  Here are some related to examples:
> 
> + Ensure that examples use language declarations in all the appropriate places.
> + Use non-ASCII examples for IRIs.
> + Try to use examples that are valid around the world, rather than specific to a particular country.
> 
> I'll add one or two more below, each time prefixed by +, and I'll add a couple of links where we already have relevant guidelines.
> 
>> 
>> 1. Example in the introduction includes a non-ASCII/non-English value, which is nice to see. However this example and the later example in 2.3 omit any language declaration on the English name "Spiderman".
>> 
>> 2. No reference is given on how to label language. This should refer to BCP 47.
> 
> + Use BCP 47 for language values.  Require well-formed and valid language values.

The introduction has a reference to
http://www.w3.org/TR/2012/WD-rdf11-concepts-20120605/
which I assume will replace the normative reference to
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/

The newer one specifies BCP 47 for language tags, although it requires normalization to lower case, deviating from RFC 5646 section 4.5. It requires well-formedness, but not validity.
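
As an illustration of the difference, the following Python sketch contrasts plain lower-casing with the conventional casing described in RFC 5646 section 2.1.1 (lowercase everywhere, except two-letter subtags in upper case and four-letter subtags in title case when they are neither first nor after a singleton). The function name is hypothetical, and reading "after singletons" as "anywhere after the first singleton" is an assumption.

```python
def rfc5646_case(tag):
    """Format a well-formed tag using the case conventions of RFC 5646, 2.1.1."""
    formatted = []
    seen_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or seen_singleton:
            formatted.append(subtag.lower())
        elif len(subtag) == 2:
            formatted.append(subtag.upper())       # e.g. region: TW
        elif len(subtag) == 4:
            formatted.append(subtag.capitalize())  # e.g. script: Hant
        else:
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            seen_singleton = True
    return "-".join(formatted)

print(rfc5646_case("ZH-hant-tw"))  # zh-Hant-TW
print("ZH-hant-tw".lower())        # zh-hant-tw
```

Both results identify the same language tag, since tags are case-insensitive; only the presentation differs.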

>> 3. Section 2.4 discusses IRIs, but does not use a non-ASCII example.
> 
>> 
>> 4. Section 2.5.1 refers to several character sequences by quoting them, such as " / @ and ^^. These should probably list the Unicode character names and/or code points for clarity.
>> 
>> 5. Section 2.5.1 uses several examples that are highly culturally distinct to the United States. (One could quibble about Spiderman and Green Goblin as well, but at least these are part of a film seen internationally, as opposed to a television show of doubtful international distribution??)

Apparently it had some:
http://www.imdb.com/title/tt0165598/releaseinfo

But the French titles quoted in the spec seem to be made up...

>> 6. Section 2.5.2 has an example of a number:
>> 
>>      :censusYear 2007 ;              # xsd:integer
>> 
>> ... that is probably ill-advised. While years are represented by numbers, this might not be a good example of an "integer".
>> 
>> 6a. This example in the same section also is questionable?
>> 
>>     :gdpDollars 14074.2E9 ;         # xsd:double
>> 
>> 7. Section 2.6. These escapes are malformed or use a questionable syntax:
>> 
>> The characters -, \uB7, \u300 to \u36F and \u203F to 2040 are permitted anywhere except the first character.
> 
> + Use the U+XXXX or U+XXXXXX notation to refer to code points in the specification (rather than other escaped forms).
> 
> 
>> 8. Section 3. The reference to #xA; is making some assumptions about line terminators too :-)
> 
> + Don't make assumptions about line terminator characters.
> 
> 
>> An example of two identical triples containing literal objects containing newlines, written in plain and long literal forms. Assumes that line feeds in this document are #xA. (example3.ttl):
>> 
>> 9. Section 5.1. The following might need attention:
>> 
>> The media type of Turtle is text/turtle. The content encoding of Turtle content is always UTF-8. Charset parameters on the mime type are required until such time as the text/ media type tree permits UTF-8 to be sent without a charset parameter. See section B Internet Media Type, File Extension and Macintosh File Type for the media type registration form.
>> 
>> 10. Section 6. Refers to TURTLE documents as being encoded as UTF-8. In practice, UTF-8 is a serialization. The actual document should just be "a sequence of Unicode characters".

This probably needs to distinguish more clearly between processing and storage/transmission. For the former it's just a sequence of Unicode characters, for the latter it's UTF-8.
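
A small Python sketch of that separation, assuming hypothetical function names and a made-up one-triple document: the octets exist only at the storage/transmission boundary, and everything after decoding works on a sequence of Unicode characters.

```python
def read_turtle(octets):
    """Storage/transmission layer: the octets must be valid UTF-8."""
    return octets.decode("utf-8")  # raises UnicodeDecodeError otherwise

def write_turtle(text):
    """Serialize the character sequence back to UTF-8 octets."""
    return text.encode("utf-8")

# Made-up example document, for illustration only.
document = '<#s> <#p> "r\u00e9sum\u00e9"@fr .'
octets = write_turtle(document)

# The grammar would be applied to `document`, never to `octets`.
assert read_turtle(octets) == document
assert len(octets) > len(document)  # non-ASCII characters take several octets
```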

>> 11. Section 6.2. Says in part this: "continue to the end of line (marked by characters U+000D or U+000A)" which again makes assumptions about line terminators. Should there be a rule for line termination?

Such as this?
http://ecma-international.org/ecma-262/5.1/#sec-7.3
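
For reference, a rule modelled on that section could look like the Python sketch below. The set of terminators (LF, CR, LS, PS, with CR LF counted as one) follows my reading of ECMAScript 5.1 section 7.3; applying it to Turtle is an assumption, not something the draft says, and the function name is hypothetical.

```python
import re

# CR LF is listed first so that the pair is treated as a single terminator.
LINE_TERMINATOR = re.compile("\r\n|[\n\r\u2028\u2029]")

def split_lines(text):
    """Split text at every line terminator in the set above."""
    return LINE_TERMINATOR.split(text)

print(split_lines("a # x\r\nb\u2028c\rd"))  # ['a # x', 'b', 'c', 'd']
```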

>> 12. Section 6.4. The \u (lowercase u) syntax allows:
>> 
>> 	A Unicode codepoint in the range U+0 to U+FFFF inclusive corresponding to the value encoded by the four hexadecimal digits interpreted from most significant to least significant digit.
>> 
>> This is probably wrong, given that the surrogate code points fall into this range. No mention is made of surrogate pair handling.

And since there is a second form that can handle the complete Unicode character set, surrogates should not be allowed.
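
A minimal Python sketch of an unescaping step that enforces this, assuming a hypothetical function name and that errors are reported as exceptions:

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def _decode(match):
    code_point = int(match.group(1) or match.group(2), 16)
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not characters; reject them in both forms.
        raise ValueError("surrogate code point U+%04X in escape" % code_point)
    if code_point > 0x10FFFF:
        raise ValueError("U+%X is beyond the Unicode code space" % code_point)
    return chr(code_point)

def unescape_strict(text):
    """Decode \\uXXXX and \\UXXXXXXXX escapes, refusing surrogates."""
    return _ESCAPE.sub(_decode, text)
```

Under this sketch, a supplementary character has to be written as \U0001F600; the pair \uD83D\uDE00 raises ValueError instead of being combined silently.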

> W3C I18N Techniques: Developing specifications  > Including and excluding character ranges
> http://www.w3.org/International/techniques/developing-specs#ranges
> 
>> 13. Section 6.4 contains this Note:
>> 
>> --
>> %-encoded sequences are in the character range for IRIs and are explicitly allowed in local names. These appear as a '%' followed by two hex characters and represent that same sequence of three characters. These sequences are not decoded during processing. A term written as <http://a.example/%66oo-bar> in Turtle designates the IRI http://a.example/%66oo-bar and not IRI http://a.example/foo-bar. A term written as ex:%66oo-bar with a prefix @prefix ex: <http://a.example/> also designates the IRI http://a.example/%66oo-bar.
>> --
>> 
>> Does it violate IRI to make this distinction?
>> 
>> 14. Section 6.5 (Grammar) defines LANGTAG far more permissively than BCP 47 does--even in its obsolete forms, to wit:
>> 
>> [144s] 	LANGTAG 	::= 	'@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
>> 
>> It would be better to define this at least in terms of BCP 47's "obs-language-tag" production:
>> 
>>        obs-language-tag = primary-subtag *( "-" subtag )
>>        primary-subtag   = 1*8ALPHA
>>        subtag           = 1*8(ALPHA / DIGIT)

Since the RDF spec refers to BCP 47, this production should too - section 2.1. I'm sure Mark would want to have grandfathered tags excluded.
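
To show how much looser the draft production is, this Python sketch compares it with the obs-language-tag production quoted above (not with the full BCP 47 section 2.1 grammar, which is stricter still). The regular expressions are transcriptions of the two productions without the leading '@'; the function name is hypothetical.

```python
import re

# Turtle draft production [144s], minus the leading '@'.
TURTLE_LANGTAG = re.compile(r"[a-zA-Z]+(-[a-zA-Z0-9]+)*")
# obs-language-tag as quoted above: every subtag is 1 to 8 characters long.
OBS_LANGUAGE_TAG = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

def accepted_by(tag):
    """Report which of the two productions accepts the whole tag."""
    return {
        "turtle": TURTLE_LANGTAG.fullmatch(tag) is not None,
        "obs": OBS_LANGUAGE_TAG.fullmatch(tag) is not None,
    }

print(accepted_by("en-US"))        # {'turtle': True, 'obs': True}
print(accepted_by("abcdefghijk"))  # {'turtle': True, 'obs': False}
```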

>> 15. Same section. PN_CHARS_BASE erases various Unicode ranges without explanation. This appears to be an attempt to eliminate combining marks and the surrogates. This probably isn't how to do this?
>> 
>> 16. Appendix B contains this note:
>> 
>> Encoding considerations:
>>     The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [UTF-8].
>>     Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-Fa-f]
>> 
>> 17. Appendix B contains security considerations, that reference UTR#36 (good). Should there also be reference to UTS#39??
>> 
>> 18. In "References", the Unicode reference is to Unicode 4.0, which is well out of date. The current version is 6.1, for example.
> 
> W3C I18N Techniques: Developing specifications  > Referencing the Unicode Standard
> http://www.w3.org/International/techniques/developing-specs#unicoderef
>
Received on Wednesday, 15 August 2012 06:09:03 UTC