Re: very preliminary TURTLE comments from Richard Ishida on 2012-08-07 (www-international@w3.org from July to September 2012)

From: Richard Ishida <ishida@w3.org>
Date: Tue, 07 Aug 2012 17:13:13 +0100
To: "Phillips, Addison" <addison@lab126.com>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <50213E99.6090602@w3.org>
Some notes on your notes below. Looking forward to discussing during the 
telecon.  I think this was quite thorough - I didn't find anything else.

RI


Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

On 01/08/2012 07:28, Phillips, Addison wrote:
> Just my jotted notes on http://www.w3.org/TR/2012/WD-turtle-20120710/

I've been trying to jot down some ideas for guidelines that arise from 
these comments.  Here are some related to examples:

+ Ensure that examples use language declarations in all the appropriate 
places.
+ Use non-ASCII examples for IRIs.
+ Try to use examples that are valid around the world, rather than 
specific to a particular country.

I'll add one or two more below, each time prefixed by +, and I'll add a 
couple of links where we already have relevant guidelines.

>
> 1. Example in the introduction includes a non-ASCII/non-English value, which is nice to see. However this example and the later example in 2.3 omit any language declaration on the English name "Spiderman".
>
> 2. No reference is given on how to label language. This should refer to BCP 47.

+ Use BCP 47 for language values.  Require well-formed and valid 
language values.

> 3. Section 2.4 discusses IRIs, but does not use a non-ASCII example.

>
> 4. Section 2.5.1 references several character sequences, such as " / @ and ^^. These should probably list the Unicode character names and/or code points for clarity.
>
> 5. Section 2.5.1 uses several examples that are highly culturally distinct to the United States. (One could quibble about Spiderman and Green Goblin as well, but at least these are part of a film seen internationally, as opposed to a television show of doubtful international distribution??)
>
> 6. Section 2.5.2 has an example of a number:
>
>       :censusYear 2007 ;              # xsd:integer
>
> ... that is probably ill-advised. While years are represented by numbers, this might not be a good example of an "integer".
>
> 6a. This example in the same section also is questionable?
>
>      :gdpDollars 14074.2E9 ;         # xsd:double
>
> 7. Section 2.6. These escapes are malformed or use a questionable syntax:
>
> The characters -, \uB7, \u300 to \u36F and \u203F to 2040 are permitted anywhere except the first character.

+ Use the U+XXXX or U+XXXXXX notation to refer to code points in the 
specification (rather than other escaped forms).
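
To make the U+ notation concrete, here is a minimal Python sketch that renders the code points from the quoted sentence in that form. The function name format_code_point is hypothetical, and the zero-padding to at least four hex digits is the usual convention rather than anything the Turtle draft states.

```python
def format_code_point(cp: int) -> str:
    """Return the U+XXXX notation for a code point (at least four hex digits)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    return f"U+{cp:04X}"


# The three ranges from the quoted sentence in section 2.6 (assumed intent).
ranges = [(0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040)]

formatted = [
    format_code_point(lo) if lo == hi
    else f"{format_code_point(lo)} to {format_code_point(hi)}"
    for lo, hi in ranges
]

print(", ".join(formatted))  # U+00B7, U+0300 to U+036F, U+203F to U+2040
```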


> 8. Section 3. The reference to #xA; is making some assumptions about line terminators too :-)

+ Don't make assumptions about line terminator characters.
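
As an illustration of why the assumption matters, the following Python sketch splits the same text two ways. The sample string is made up for this example. Whether a format should treat characters such as U+2028 as line terminators is a design decision for the specification, so this only shows that the two readings give different results.

```python
# Sample text using three different line terminators:
# CR LF, U+2028 LINE SEPARATOR, and a bare LF.
text = "one\r\ntwo\u2028three\nfour"

# Assuming every line ends with U+000A only.
naive = text.split("\n")

# Python's str.splitlines() recognises a wider set of line boundaries.
robust = text.splitlines()

print(naive)   # ['one\r', 'two\u2028three', 'four']
print(robust)  # ['one', 'two', 'three', 'four']
```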


> An example of two identical triples containing literal objects containing newlines, written in plain and long literal forms. Assumes that line feeds in this document are #xA. (example3.ttl):
>
> 9. Section 5.1. The following might need attention:
>
> The media type of Turtle is text/turtle. The content encoding of Turtle content is always UTF-8. Charset parameters on the mime type are required until such time as the text/ media type tree permits UTF-8 to be sent without a charset parameter. See section B Internet Media Type, File Extension and Macintosh File Type for the media type registration form.
>
> 10. Section 6. Refers to TURTLE documents as being encoded as UTF-8. In practice, UTF-8 is a serialization. The actual document should just be "a sequence of Unicode characters".
>
> 11. Section 6.2. Says in part this: "continue to the end of line (marked by characters U+000D or U+000A)" which again makes assumptions about line terminators. Should there be a rule for line termination?
>
> 12. Section 6.4. The \u (lowercase u) syntax allows:
>
>  A Unicode codepoint in the range U+0 to U+FFFF inclusive corresponding to the value encoded by the four hexadecimal digits interpreted from most significant to least significant digit.
>
> This is probably wrong, given that the surrogate code points fall into this range. No mention is made of surrogate pair handling.

W3C I18N Techniques: Developing specifications  > Including and 
excluding character ranges
http://www.w3.org/International/techniques/developing-specs#ranges
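
To show the kind of check point 12 is asking about, here is a minimal Python sketch of a \uXXXX decoder that rejects the surrogate range U+D800 to U+DFFF. The name decode_u_escape is hypothetical, and rejecting surrogates outright is one possible policy, not something the Turtle draft specifies.

```python
import re

# Matches a single \uXXXX escape with exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escape(escape: str) -> str:
    """Decode one \\uXXXX escape, refusing surrogate code points."""
    match = _U_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError(f"not a \\uXXXX escape: {escape!r}")
    cp = int(match.group(1), 16)
    # U+D800..U+DFFF are surrogate code points, not Unicode scalar values.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point U+{cp:04X} is not a character")
    return chr(cp)


print(decode_u_escape("\\u00E9"))  # é
```

Calling decode_u_escape("\\uD83D") raises ValueError under this policy.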

> 13. Section 6.4 contains this Note:
>
> --
> %-encoded sequences are in the character range for IRIs and are explicitly allowed in local names. These appear as a '%' followed by two hex characters and represent that same sequence of three characters. These sequences are not decoded during processing. A term written as <http://a.example/%66oo-bar> in Turtle designates the IRI http://a.example/%66oo-bar and not IRI http://a.example/foo-bar. A term written as ex:%66oo-bar with a prefix @prefix ex: <http://a.example/> also designates the IRI http://a.example/%66oo-bar.
> --
>
> Does it violate IRI to make this distinction?
>
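
The distinction in the quoted note can be illustrated with a short Python sketch. It assumes that terms are compared as plain character strings, which is how the note reads, and uses urllib.parse.unquote only to show what decoding would produce.

```python
from urllib.parse import unquote

# The IRI exactly as written in the Turtle document.
written = "http://a.example/%66oo-bar"

# What the IRI would become if the %-encoded sequence were decoded.
decoded = unquote(written)

print(decoded)             # http://a.example/foo-bar
print(written == decoded)  # False: without decoding, these are different terms
```
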
> 14. Section 6.5 (Grammar) defines LANGTAG far more permissively than BCP 47 does--even in its obsolete forms, to wit:
>
> [144s]  LANGTAG  ::=  '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
>
> It would be better to define this at least in terms of BCP 47's "obs-language-tag" production:
>
>         obs-language-tag = primary-subtag *( "-" subtag )
>         primary-subtag   = 1*8ALPHA
>         subtag           = 1*8(ALPHA / DIGIT)
>
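
To see how much looser the Turtle production is, here is a minimal Python sketch that transcribes both grammars quoted above into regular expressions. The function names and sample tags are made up for illustration, and matching either pattern only checks the quoted syntax, not BCP 47 validity.

```python
import re

# LANGTAG as quoted from the Turtle grammar (includes the leading '@').
TURTLE_LANGTAG = re.compile(r"@[a-zA-Z]+(-[a-zA-Z0-9]+)*")

# obs-language-tag as quoted above: every subtag is limited to 1-8 characters.
OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def accepted_by_turtle(tag: str) -> bool:
    """True if '@' + tag matches the Turtle LANGTAG production."""
    return TURTLE_LANGTAG.fullmatch("@" + tag) is not None


def accepted_by_obs(tag: str) -> bool:
    """True if tag matches the obs-language-tag production."""
    return OBS_LANGUAGE_TAG.fullmatch(tag) is not None


for tag in ["en", "en-US", "abcdefghijk", "en-abcdefghi"]:
    print(tag, accepted_by_turtle(tag), accepted_by_obs(tag))
```

The last two sample tags have a subtag longer than eight characters, so the Turtle production accepts them while obs-language-tag does not.
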
> 15. Same section. PN_CHARS_BASE erases various Unicode ranges without explanation. This appears to be an attempt to eliminate combining marks and the surrogates. This probably isn't how to do this?
>
> 16. Appendix B contains this note:
>
> Encoding considerations:
>      The syntax of Turtle is expressed over code points in Unicode [UNICODE]. The encoding is always UTF-8 [UTF-8].
>      Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-Fa-f]
>
> 17. Appendix B contains security considerations, that reference UTR#36 (good). Should there also be reference to UTS#39??
>
> 18. In "References", the Unicode reference is to Unicode 4.0, which is well out of date. The current version is 6.1, for example.

W3C I18N Techniques: Developing specifications  > Referencing the 
Unicode Standard
http://www.w3.org/International/techniques/developing-specs#unicoderef
Received on Tuesday, 7 August 2012 16:13:40 UTC