Re: N-triples white space question from Andy Seaborne on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Fri, 18 May 2012 15:31:49 +0100
To: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <4FB65D55.8060600@epimorphics.com>

On 18/05/12 14:45, Sandro Hawke wrote:
> On Fri, 2012-05-18 at 14:08 +0100, Andy Seaborne wrote:
>>
>> On 18/05/12 13:12, Eric Prud'hommeaux wrote:
>>> * Richard Cyganiak<richard@cyganiak.de>   [2012-05-18 12:35+0100]
>>>>
>>>> On 18 May 2012, at 11:34, Eric Prud'hommeaux wrote:
>>>>> Does the existing body of N-Triples permit a grammar with no default whitespace rules?
>>>>>
>>>>>    triples: triple (LF triple)* LF?
>>>>>    triple: subject HWS predicate HWS object '.'
>>>>>
>>>>> I.e., do all the N-Triples out there look like "<s>   <p>   <o>."?
>>>>
>>>> This is what N-Triples as currently defined requires. Isn't that sufficient?
>>>>
>>>> ntripleDoc ::= line*
>>>> line  ::= ws* ( comment | triple )? eoln 
>>>> triple  ::= subject ws+ predicate ws+ object ws* '.' ws*
>>>> ws  ::= space | tab 
>>>> eoln  ::= cr | lf | cr lf 
>>>
>>> I was just interested to see how much your SHOULD:
>>> [[
>>> * Richard Cyganiak<richard@cyganiak.de>   [2012-05-18 11:06+0100]
>>>> I would even go one step further and add some SHOULD-level guidance on where to put what whitespace. Perhaps something like: exactly one space between s and p; exactly one space between p and o; no WS before or after the period; no WS at the start of a line; CR+LF as EOL.
>>> ]]
>>> could be turned into a MUST.
>>
>> I don't think a MUST is a good idea, partially because it's too late,
>> but also despite being a dump format, it's not pure binary.  Blank lines
>> and comments do have a role here and the CR+LF is a mild inconvenience
>> in some text tools.
>>
>> There is variance in IRIs so from that point alone, NT has variations
>> enough to stop blindly processing with line-based tools.  I've seen the
>> :80 thing in messy data.
>>
>> Processing based on appearance needs an extra step to be safe at scale
>> (i.e. not need checking afterwards).
>>
>> What a canonical form is good for is as a target for simple tools to
>> process and output.  Hopefully, then tool makers will provide it by user
>> demand.
>
> So, no one would be writing a parser for n-triples that ONLY did
> canonical n-triples.  (At the point where you're writing something that
> can keep a table of b-nodes labels, scanning over multiple spaces
> between the subject and the predicate is pretty easy.)    But we'd say
> people SHOULD output "canonical" n-triples so that plain-text
> RDF-unaware tools like sort and grep would work.    Is that the
> proposal?

I wrote the original message for this thread as feedback on the 
nearly-LC-publishable Turtle document.  I included a way to resolve the 
confusion that has been pointed out to me elsewhere.

When Richard mentioned SHOULD-guidance on where to put what whitespace, 
I took that and tried to list the areas that I thought needed covering 
IF line-based tools were to process N-triples in an RDF-unaware fashion 
- that is, manipulating bytes.  It is a usage we have discussed here 
before.

I am suggesting we could describe a canonical form so people can use it if 
they want.  An extra step to get NT to that canonical form may be needed 
when working at scale because it's a nuisance to find the billion and 
first triple is formatted differently.
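
As a rough illustration of that extra step, here is a minimal sketch of a
line normaliser, assuming the whitespace layout Richard listed (one space
between s and p, one between p and o, nothing at the start of the line).
The name canonical_line, the "end" parameter and the simplified term
patterns are made up for this example; they are not taken from any spec
text or existing tool. It only rewrites white space: it does not touch
IRIs (so the :80 case is left alone) or literal escapes.

```python
import re

# Simplified term patterns, loosely following the 2004 N-Triples grammar.
# These are assumptions for this sketch, not a complete grammar.
IRI = r'<[^>\s]*>'
BNODE = r'_:[A-Za-z][A-Za-z0-9]*'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^<[^>\s]*>|@[a-z]+(?:-[a-z0-9]+)*)?'

# One triple per line: optional leading ws, s, ws+, p, ws+, o, ws*, '.', ws*
TRIPLE = re.compile(
    rf'[ \t]*({IRI}|{BNODE})[ \t]+({IRI})[ \t]+({IRI}|{BNODE}|{LITERAL})'
    rf'[ \t]*\.[ \t]*'
)


def canonical_line(line, end="."):
    """Return one triple in canonical spacing, or None for blank/comment lines.

    `end` is the text emitted after the object: "." for no white space
    before the period, " ." for the variant with a space.
    The result carries no end-of-line; the caller picks LF or CR+LF.
    """
    text = line.rstrip("\r\n")
    stripped = text.strip(" \t")
    if not stripped or stripped.startswith("#"):
        return None  # blank lines and comments are dropped
    match = TRIPLE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an N-Triples triple: {line!r}")
    s, p, o = match.groups()
    return f"{s} {p} {o}{end}"
```

Calling canonical_line(line, end=" .") gives the ' .' variant instead.
Run once over a dump, this would make later sort/grep style processing
independent of how each producer happened to lay out its lines.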

That is not going as far as SHOULD (= "there may exist valid reasons in 
particular circumstances to ignore a particular item, but the full 
implications must be understood and carefully weighed before choosing a 
different course.")

. I'm happy with not saying anything.
. My tools do output that form as far as I know (I've not checked today) 
except it's ' .' because I like that.
. When I write such line-based tools on a case-by-case basis

 Andy

>
>      -- Sandro
>
>
>>  Andy
>>
>>>
>>>
>>>> Richard
>>>>
>>>>
>>>>
>>>>> I note that Oracle has been vigilant about preserving backwards-compatibility. Souri, do you have a sense of what Oracle has been using?
>>>>>
>>>>>> I also note that RDF 2004 N-Triples allows comments (only at the start of a line). This makes sense for the use as a test case format, but not much sense for the use as a dump format.
>>>>>>
>>>>>> Best,
>>>>>> Richard
>>>>>>
>>>>>>
>>>>>> [1] http://www.w3.org/TR/rdf-testcases/#ntriples
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 18 May 2012, at 10:04, Andy Seaborne wrote:
>>>>>>
>>>>>>> Gavin, Eric,
>>>>>>>
>>>>>>> rdf-turtle says:
>>>>>>>
>>>>>>> [1] ntriplesDoc ::= (triple)? (EOL triple)* (EOL)?
>>>>>>> [2] triple ::= subject predicate object '.'
>>>>>>> [8] EOL  ::= ([#xD#xA])+
>>>>>>>
>>>>>>> What are the white space rules?
>>>>>>>
>>>>>>> Does it inherit white space processing from the rest of Turtle? Comments seem to come from Turtle.
>>>>>>>
>>>>>>> If it does not inherit white space rules,
>>>>>>>     what about horizontal white space inside triples?
>>>>>>>
>>>>>>> If it does inherit white space rules,
>>>>>>>    that includes newlines within triples between S/P or P/O.
>>>>>>>
>>>>>>> The simplest solution is to add text in section 12.3 to say that horizontal white space outside tokens is discarded (which is different to Turtle).
>>>>>>>
>>>>>>>  Andy
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> -ericP
>>>>>
>>>>
>>>
>>
>>
>
>
>
Received on Friday, 18 May 2012 14:33:45 UTC