very preliminary TURTLE comments

Just my jotted notes on http://www.w3.org/TR/2012/WD-turtle-20120710/ 

1. The example in the introduction includes a non-ASCII/non-English value, which is nice to see. However, this example and the later example in 2.3 omit any language declaration on the English name "Spiderman".

2. No reference is given for how to label language. This should refer to BCP 47.

3. Section 2.4 discusses IRIs, but does not use a non-ASCII example.

4. Section 2.5.1 refers to several characters and character sequences, such as " / @ and ^^. These should probably be listed with their Unicode character names and/or code points for clarity.

5. Section 2.5.1 uses several examples that are culturally specific to the United States. (One could quibble about Spiderman and Green Goblin as well, but at least these are part of a film seen internationally, as opposed to a television show of doubtful international distribution.)

6. Section 2.5.2 has an example of a number:

     :censusYear 2007 ;              # xsd:integer

... that is probably ill-advised. While years are represented by numbers, this might not be a good example of an "integer".

6a. This example in the same section is also questionable:

    :gdpDollars 14074.2E9 ;         # xsd:double

7. Section 2.6. These escapes are malformed or use a questionable syntax:

The characters -, \uB7, \u300 to \u36F and \u203F to 2040 are permitted anywhere except the first character.
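
To make the complaint concrete, here is a rough Python sketch that checks each token in the quoted sentence. It assumes that a well-formed \u escape is a backslash, a lowercase "u", and exactly four hexadecimal digits (as Section 6.4 describes). The names WELL_FORMED_U_ESCAPE, is_well_formed_u_escape and QUOTED_TOKENS are mine, for illustration only.

```python
import re

# Assumption: a well-formed \u escape is a backslash, a lowercase "u",
# and exactly four hexadecimal digits.
WELL_FORMED_U_ESCAPE = re.compile(r"\\u[0-9A-Fa-f]{4}")


def is_well_formed_u_escape(token):
    """Return True if the whole token is a four-digit \\u escape."""
    return WELL_FORMED_U_ESCAPE.fullmatch(token) is not None


# The tokens as they appear in the quoted Section 2.6 sentence.
QUOTED_TOKENS = [r"\uB7", r"\u300", r"\u36F", r"\u203F", "2040"]

# Collect the tokens that do not have the assumed four-digit form.
malformed = [t for t in QUOTED_TOKENS if not is_well_formed_u_escape(t)]
print(malformed)
```

Under that assumption only \u203F passes. The others have too few digits, or (for 2040) no \u prefix at all.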

8. Section 3. The reference to #xA; is making some assumptions about line terminators too :-)

An example of two identical triples containing literal objects containing newlines, written in plain and long literal forms. Assumes that line feeds in this document are #xA. (example3.ttl):

9. Section 5.1. The following might need attention:

The media type of Turtle is text/turtle. The content encoding of Turtle content is always UTF-8. Charset parameters on the mime type are required until such time as the text/ media type tree permits UTF-8 to be sent without a charset parameter. See section B Internet Media Type, File Extension and Macintosh File Type for the media type registration form.

10. Section 6. Refers to TURTLE documents as being encoded as UTF-8. In practice, UTF-8 is a serialization. The actual document should just be "a sequence of Unicode characters".

11. Section 6.2. Says in part: "continue to the end of line (marked by characters U+000D or U+000A)", which again makes assumptions about line terminators. Should there be a rule for line termination?
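
As a rough illustration of why the choice matters, the Python sketch below finds the end of a comment under two different sets of line terminators. The broader set (adding U+0085, U+2028 and U+2029) is purely my assumption for comparison, not something the draft proposes, and the helper name comment_text is hypothetical.

```python
# Terminators named in the quoted Section 6.2 text: U+000D and U+000A only.
SPEC_TERMINATORS = "\r\n"
# Assumed broader set for comparison: adds NEL, LINE SEPARATOR and
# PARAGRAPH SEPARATOR. This is an illustration, not a proposal from the draft.
BROAD_TERMINATORS = "\r\n\u0085\u2028\u2029"


def comment_text(source, terminators):
    """Return the comment that starts at index 0, up to the first terminator."""
    for i, ch in enumerate(source):
        if ch in terminators:
            return source[:i]
    return source  # no terminator found: the comment runs to the end


# A comment followed by U+2028 (LINE SEPARATOR) and then a triple.
doc = "# a comment\u2028:s :p :o ."

print(repr(comment_text(doc, SPEC_TERMINATORS)))
print(repr(comment_text(doc, BROAD_TERMINATORS)))
```

With only U+000D and U+000A recognized, the comment swallows the triple that follows the U+2028. With the broader set, the comment ends before it.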

12. Section 6.4. The \u (lowercase u) syntax allows:

 A Unicode codepoint in the range U+0 to U+FFFF inclusive corresponding to the value encoded by the four hexadecimal digits interpreted from most significant to least significant digit.

This is probably wrong, given that the surrogate code points (U+D800 to U+DFFF) fall into this range. No mention is made of surrogate pair handling.
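
A minimal Python sketch of the kind of check a processor might need is shown below. It scans for four-digit \u escapes and reports those that name surrogate code points. The names U_ESCAPE, is_surrogate and surrogate_escapes are hypothetical, and the scan is deliberately naive (it ignores escaped backslashes, for example).

```python
import re

# Matches a backslash, a lowercase "u", and exactly four hex digits.
U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def is_surrogate(code_point):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def surrogate_escapes(text):
    """List the \\uXXXX escapes in text whose value is a surrogate."""
    return [
        m.group(0)
        for m in U_ESCAPE.finditer(text)
        if is_surrogate(int(m.group(1), 16))
    ]


# An escaped e-acute, an escaped surrogate pair, and U+FFFF.
sample = r"\u00E9 \uD83D\uDE00 \uFFFF"
print(surrogate_escapes(sample))
```

Here the two halves of the surrogate pair are flagged, while \u00E9 and \uFFFF are not. Whether such escapes should be rejected or combined into one supplementary character is exactly the question the draft leaves open.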

13. Section 6.4 contains this Note:

--
%-encoded sequences are in the character range for IRIs and are explicitly allowed in local names. These appear as a '%' followed by two hex characters and represent that same sequence of three characters. These sequences are not decoded during processing. A term written as <http://a.example/%66oo-bar> in Turtle designates the IRI http://a.example/%66oo-bar and not IRI http://a.example/foo-bar. A term written as ex:%66oo-bar with a prefix @prefix ex: <http://a.example/> also designates the IRI http://a.example/%66oo-bar.

--

Does making this distinction violate the IRI specification?
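
For reference, the distinction in the Note can be reproduced with a few lines of Python. This only shows that the two IRIs differ as character strings and coincide once %66 is decoded. It does not settle which comparison the IRI specification expects. The helper names same_as_strings and same_after_decoding are mine.

```python
from urllib.parse import unquote

ENCODED = "http://a.example/%66oo-bar"
DECODED = "http://a.example/foo-bar"


def same_as_strings(a, b):
    """Compare character by character, as the Note describes for Turtle."""
    return a == b


def same_after_decoding(a, b):
    """Compare after percent-decoding both sides (illustrative only)."""
    return unquote(a) == unquote(b)


print(same_as_strings(ENCODED, DECODED))
print(same_after_decoding(ENCODED, DECODED))
```

The first comparison is False and the second is True, since %66 decodes to "f".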

14. Section 6.5 (Grammar) defines LANGTAG far more permissively than BCP 47 does, even in its obsolete forms, to wit:

[144s]  LANGTAG  ::=  '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*

It would be better to define this at least in terms of BCP 47's "obs-language-tag" production:

       obs-language-tag = primary-subtag *( "-" subtag )
       primary-subtag   = 1*8ALPHA
       subtag           = 1*8(ALPHA / DIGIT)
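
The difference between the two productions can be sketched in Python by transcribing each into a regular expression. The transcriptions and the names TURTLE_LANGTAG, OBS_LANGUAGE_TAG, turtle_accepts and obs_accepts are my own and are for illustration only.

```python
import re

# Transcription of the draft's LANGTAG production (includes the leading '@').
TURTLE_LANGTAG = re.compile(r"@[a-zA-Z]+(-[a-zA-Z0-9]+)*")
# Transcription of obs-language-tag: every subtag is 1 to 8 characters long.
OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def turtle_accepts(tag):
    """True if '@' + tag matches the draft's LANGTAG production."""
    return TURTLE_LANGTAG.fullmatch("@" + tag) is not None


def obs_accepts(tag):
    """True if tag matches the obs-language-tag production."""
    return OBS_LANGUAGE_TAG.fullmatch(tag) is not None


# "en-US" is ordinary; the other two contain a subtag longer than 8 characters.
for tag in ["en-US", "abcdefghijkl", "en-abcdefghi1"]:
    print(tag, turtle_accepts(tag), obs_accepts(tag))
```

All three tags match the Turtle production, but only "en-US" matches obs-language-tag, because the other two contain a subtag longer than eight characters.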

15. Same section. PN_CHARS_BASE excludes various Unicode ranges without explanation. This appears to be an attempt to eliminate combining marks and the surrogates. This probably isn't the right way to do that.

16. Appendix B contains this note:

Encoding considerations:
    The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [UTF-8].
    Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-Fa-f]

17. Appendix B contains security considerations that reference UTR#36 (good). Should there also be a reference to UTS#39?

18. In "References", the Unicode reference is to Unicode 4.0, which is well out of date. The current version is 6.1, for example.

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 1 August 2012 06:29:36 UTC