Re: Proposed RAND() defn from Steve Harris on 2010-12-02 (public-rdf-dawg@w3.org from October to December 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 2 Dec 2010 13:51:05 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <B9EF3122-38E8-4C7D-A96E-B8F3BA1CF85B@garlik.com>

On 2010-12-02, at 11:38, Andy Seaborne wrote:

>>> 
>>> Maybe we can specify RAND(seed) by simply saying that it will generate a pseudorandom sequence with the suggestion ("SHOULD") generate the same sequence on each run as a debugging aid.  This decouples it from solution sequences.
>> 
>> A "SHOULD" is probably a good idea. It's not just a debugging aid though, it's for repeatability generally.
>> 
>>> An implementation can be simply a random number generator like srand(N).
>> 
>> I'm not sure who's / which srand(n) you're referring to.
> 
> This one:
> 
> http://www.gnu.org/s/libc/manual/html_node/ISO-Random.html
> 
>> The key thing is that you get the same return value twice if you do something like:
>>    FILTER(RAND(1) > 0.5 && RAND(1) < 0.6)
> 
> For me, that's not necessary.  For predictability, all I require is that each call of RAND(seed) returns the same number at the same point in execution across runs.
> 
> Maybe I don't understand RAND for SQL well enough but I thought that RAND() returns different numbers in
> 
>    FILTER(RAND() > 0.5 && RAND() < 0.6)

It does, but not if you provide a seed number: with a seed, you get a new number per row.

> (if you want the same number assign it in some way)

SQL doesn't have per-row assignment, and it's going to be problematic in SPARQL (see below).

> As RAND() returns different numbers, so
> 
>    FILTER(RAND(1) > 0.5 && RAND(1) < 0.6)
> 
> should, just the same numbers at the same invocation count every run.

That doesn't make me comfortable.

The implementation in SQL is something like: [in very naive terms, obviously]

  srand(row_num + seed);
  return (double)rand() / ((double)RAND_MAX + 1.0);

Otherwise you have issues about execution order, which might not be stable between executions, or even execution phases.
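
To make the per-row idea concrete, here is a minimal sketch in C. It only illustrates the naive scheme above and is not how any particular SQL engine does it; the function name rand_for_row and the idea of passing the row number as an argument are assumptions made for the example.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: the value RAND(seed) would take for one row.
 * The generator is re-seeded from (row_num + seed) on every call, so the
 * result depends only on the row and the seed, not on how many times or
 * in what order RAND(seed) has been evaluated before.
 */
double rand_for_row(unsigned int row_num, unsigned int seed)
{
    srand(row_num + seed);
    /* Dividing by RAND_MAX + 1.0 keeps the result in [0, 1). */
    return (double)rand() / ((double)RAND_MAX + 1.0);
}
```

With this, both RAND(1) calls in a single FILTER evaluated against the same row see the same value, whichever one happens to be evaluated first.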

Also,

OPTIONAL {
   ?x :a ?y
   FILTER(RAND(1) < 0.5)
}
OPTIONAL {
   ?s :b ?z
   FILTER(RAND(1) < 0.5)
}

is going to have both undesirable and unpredictable behaviour.
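
As a rough illustration of the order dependence, assume RAND(1) is instead backed by a single shared sequence that advances on each call. The sketch below uses a small linear congruential generator; seed_seq, next_seq and the constants are chosen for the example only and do not come from any implementation.

```c
#include <stdint.h>

/* Shared generator state: every call to next_seq() advances it. */
uint32_t seq_state;

/* Hypothetical: reset the shared sequence at the start of a query run. */
void seed_seq(uint32_t seed)
{
    seq_state = seed;
}

/* Hypothetical: next value of the shared sequence, in [0, 1). */
double next_seq(void)
{
    /* 32-bit LCG step; unsigned arithmetic wraps modulo 2^32. */
    seq_state = seq_state * 1664525u + 1013904223u;
    return (double)seq_state / 4294967296.0;
}
```

If the engine evaluates the first OPTIONAL's FILTER before the second, the first FILTER gets the first number in the sequence; if the order is swapped on another run, it gets the second number instead, so the same query over the same data can keep different rows.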

BIND(RAND(1) AS ?r)
OPTIONAL {
   ?x :a ?y
   FILTER(?r < 0.5)
}
...

Won't work, because of the scoping, right?

You could do something with nested OPTIONALs, but anyone who's familiar with SQL's behaviour is not going to be very happy.

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Thursday, 2 December 2010 13:51:41 UTC