Re: question about ShEx from Iovka Boneva on 2016-01-07 (public-data-shapes-wg@w3.org from January 2016)

From: Iovka Boneva <iovka.boneva@univ-lille1.fr>
Date: Thu, 7 Jan 2016 14:04:19 +0100
To: public-data-shapes-wg@w3.org
Message-ID: <568E6253.7070200@univ-lille1.fr>
Dear all,

this mail is very technical, and might be interesting only for very few 
people.

Le 04/01/2016 19:40, Peter F. Patel-Schneider a écrit :
> On 01/04/2016 08:13 AM, Iovka Boneva wrote:
>> Indeed, I've answered too quickly: representing the numbers in unary is not
>> enough to capture the size increase due to the unfolding of cardinality
>> expressions. The two things I called equivalent are actually not equivalent.
>>
>> Back to your question: for showing that validation is in NP, we need to
>> consider that a cardinality expression is a syntactic shortcut for the
>> corresponding unfolded expression (a syntactic shortcut that can be
>> exponentially smaller than the actual unfolded expression). We define the size
>> of the expression based on the unfolded one, and validation for ShEx w/o
>> arbitrary cardinalities (only *, +, ?) is in NP. I think that you agree with
>> this last statement.
> I think that this needs to be shown.  I also think that you need [1,1] as a
> cardinality.
Sure, but if you need [1,1], then you won't use the repetition construct 
at all.
>
> Consider something like
>
> { sub1, sub2 } *
>
> the size of the obvious guess could be arbitrarily large in the size of the
> expression as the number of repetitions of the subexpressions could be very large.
What do you mean by the "obvious guess" ?
Maybe you mean the guess of the candidate solution in the NP algorithm ? 
If yes, then actually we do not guess the number of repetitions.

Follows a sketch of an NP algorithm of validation.
I think you agree that the hard part is checking whether the 
neighbourhood of the focus node satisfies "locally" the shape 
expression, disregarding whether the references to other shapes (@ S) 
are satisfied. So, we show that, given N a set of triples that is the 
neighbourhood of the focus node, and E a shape expression, checking 
whether N locally satisfies E is in NP.

Consider the ShEx expression as an (unordered) regular expression on the 
alphabet of the triple constraints that appear in the expression. Note 
that we consider "pure" regular expressions, in the sense of the 
textbooks on regular languages. The allowed constructs are epsilon, 
letters of the alphabet, disjunction, concatenation, and Kleene star. 
Remark that a ShEX expression w/o arbitrary repetitions (only +, ?, *) 
can be transformed in such "pure" regular expression which size remains 
polynomial in the size of the initial expression.

For example:
((:a . | :b .), :a .) *
has 3 triple constraints (c1) :a . (c2) :b . (c3) :a .
Note that the two occurrences of :a . are considered distinct.
The corresponding (unordered) regular expression is then
((c1 | c2), c3)* where the comma stands for concatenation.
Note that such expression is single-occurrence (SORBE) as defined in 
[2], and in that same paper we showed that the membership problem for 
SORBE expressions is in NP. I'll give some details on it in the sequel.

Now, for the NP algorithm, a candidate solution is: for every triple in 
N, we guess the triple constraint to which it is mapped. The size of 
such candidate solution is polynomial in the size of N and E.

For example, consider these triples and the mapping (i.e. the triple 
constraint among c1, c2, c3 to which they were mapped).
<n> :a 1 .       -> c1
<n> :b 2 .       -> c2
<n> :b 3 .       -> c2
<n> :a 4 .       -> c3
<n> :a 5 .       -> c3
<n> :a 6 .       -> c3

Note that such mapping yields a bag of symbols, in the above example it 
is {{ c1, c2, c2, c3, c3, c3}}.

Then, using the SORBE algorithm from [2, Figure 4], we can check in 
polynomial time whether the candidate solution is indeed a solution. 
That is, check whether the bag of symbols belongs to the language of the 
unordered regular expression E.  Few words about this SORBE algorithm: 
it traverses the structure of E and constructs an interval (which is 
technical) based on the expression and on the bag which is being tested 
for membership. Note that the numbers that appear in this interval and 
in all the intermediate computations are bounded by the size of N. The 
algorithm works in a single traversal of E because, intuitively, the 
single-occurrence criterion discards the need of non-deterministic choices.

Does this sound convincing to you ?

Alternatively, for the NP proof we can use a result from [1], that shows 
that the membership problem for a vector (bag) to a semilinear set 
(Parikh set of a regular expression, which is basically an alternative 
representation of an unordered regular expression) is in NP, but for 
that proof I would need to argue that the size of the Parikh set is 
polynomial in the size of the expression, probably using arguments on 
solving linear equations, so the above sketch seems easier to explain.

[1] E. Kopczynski and A. To. Parikh images of grammars: Complexity and 
applications. In
LICS, pages 80–89, 2010.

[2] S. Staworko, I. Boneva, J. E. Labra Gayo, S. Hym, E. G. Prud’hom-
meaux, and H. Solbrig. Complexity and Expressiveness of ShEx for RDF.
In International Conference on Database Theory (ICDT) 2015.

>> The proof for NP-completeness is indeed made for the case w/o arbitrary
>> cardinalities (except on triple constraints, so no folding of arbitrary
>> cardinalities).
>>
>> Once again, we are talking here about a border line case, which would probably
>> never occur in practice. I am personally not concerned about the high
>> complexity of validation for this particular case.
> I am concerned.  Unexpected jumps in complexity are often indications that
> algorithms are incorrect.
>
>> Iovka
> peter
>
>
>> Le 04/01/2016 14:31, Peter F. Patel-Schneider a écrit :
>>> I am not convinced, particularly because of your explanation.
>>>
>>> The unfolding will result in an exponential size increase if numbers are
>>> represented in unary.
>>>
>>> Consider
>>>
>>> { ... { p @ a [n,n] } ... } [n,n]
>>>
>>> where the embedding depth is m.
>>>
>>> The expansion of this would need to either represent the number n^m or have
>>> n^m pieces.  Both of these would need size exponential in the size of the
>>> expression above.
>>>
>>> The ability to embed cardinalities inside cardinalities provides a way of
>>> efficiently representing large numbers even if numbers in expressions have to
>>> be written in unary.
>>>
>>> peter
>>>
>>>
>>> On 01/04/2016 04:16 AM, Iovka Boneva wrote:
>>>> Le 18/12/2015 18:00, Peter F. Patel-Schneider a écrit :
>>>>> With this construct I'm having some difficulty determining whether ShEx
>>>>> validation is in NP.  It seems to me that the outer numbers multiply the
>>>>> effort required to make the inner partitions, leading to an exponential
>>>>> number
>>>>> of choices.   Maybe there is some way to do this bottom up, but it certainly
>>>>> doesn't seem obvious to me.
>>>>>
>>>>> Is there an NP algorithm for ShEx validation with this feature?
>>>> Indeed, very good question. And a technical answer.
>>>> There is an NP algorithm if we consider that the min and max bounds
>>>> (1,2,3,4,6,7 in your example) are coded in unary (and not in binary) for
>>>> determining the size of the schema.
>>>>   From a complexity point of view, it is equivalent to consider that
>>>> p@a [1,2]
>>>> is a syntactic shortcut for
>>>> p@a , p@a?
>>>> i.e. equivalent to "unfolding" the multiplicities.
>>>>
>>>> Does it answer your question ?
>>>>
>>>> As Eric explained, such cases are not supposed to occur very often.
>>>>
>>>> Iovka
>>>>
>>>>> peter
>>>>>
>>>>>
>>>>> On 12/18/2015 01:12 AM, Eric Prud'hommeaux wrote:
>>>>>> * Peter F. Patel-Schneider <pfpschneider@gmail.com> [2015-12-17 22:20-0800]
>>>>>>> It appears to me that ShEx allows embeddings of cardinalities, as in
>>>>>>>
>>>>>>> twofoureight {
>>>>>>>      { { p @ a [2,2] } [2,2] } [2,2]
>>>>>>>     }
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> lots {
>>>>>>>      { p @a [2,4], p @b [1,3] } [6,7]
>>>>>>> }
>>>>>>>
>>>>>>> is this correct?
>>>>>> In principle, yes, but apart from optional groups, it's rare that one
>>>>>> wants to make sure that there are as many a's as there are b's. Optional
>>>>>> is quite reasonable, e.g.
>>>>>>
>>>>>>      <X> {
>>>>>>        :submitter @<UserShape>, :submissionDate xsd:dateTime,
>>>>>>        (:resolvedBy @<HackerShape>, :resolutionDate xsd:dateTime)?
>>>>>>      }
>>>>>>
>>>>>> The other place this frequently comes up is in repetitions of choices:
>>>>>>
>>>>>>      <Y> { (:observation .|:analysis .|:diagnosis .)+ }
>>>>>>
>>>>>> which can be compiled to
>>>>>>
>>>>>>      <Y> {
>>>>>>        (:observation .|:analysis .|:diagnosis .),
>>>>>>        :observation .*,
>>>>>>        :analysis .*,
>>>>>>        :diagnosis .*
>>>>>>      }
>>>>>>
>>>>>> with either a partion or a greedy semantics.
>>>>>>
>>>>>>
>>>>>>> peter
>>>>>>>
>>>>>>>
>>>>>>>
>>


-- 
Iovka Boneva
Associate professor (MdC) Université de Lille
http://www.cristal.univ-lille.fr/~boneva/
+33 6 95 75 70 25
Received on Thursday, 7 January 2016 13:04:53 UTC