Re: question about ShEx from Peter F. Patel-Schneider on 2016-01-07 (public-data-shapes-wg@w3.org from January 2016)

From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
Date: Thu, 7 Jan 2016 09:29:08 -0800
To: Iovka Boneva <iovka.boneva@univ-lille1.fr>, public-data-shapes-wg@w3.org
Message-ID: <568EA064.2020607@gmail.com>
Yes, this appears to work.

I think that you can do better, though.

The proof of Theorem 5 in [2] can be improved.  The first condition for I(E*)
can be removed if the second is replaced by
  [min(1,min I(E)), \inf] if I(E) /= \null
Similarly I(E+) can be improved.

I think that SORBEs can be augmented to allow arbitrary cardinalities and
still have the same complexity, using a combination of the rules for I(E*),
I(E+), and I(a[n,m]).

I think that this would eliminate the need to expand out cardinalities in the
complexity proof for ShEx.

peter


On 01/07/2016 05:04 AM, Iovka Boneva wrote:
> Dear all,
> 
> this mail is very technical, and might be interesting only for very few people.
> 
> Le 04/01/2016 19:40, Peter F. Patel-Schneider a écrit :
>> On 01/04/2016 08:13 AM, Iovka Boneva wrote:
>>> Indeed, I've answered too quickly: representing the numbers in unary is not
>>> enough to capture the size increase due to the unfolding of cardinality
>>> expressions. The two things I called equivalent are actually not equivalent.
>>>
>>> Back to your question: for showing that validation is in NP, we need to
>>> consider that a cardinality expression is a syntactic shortcut for the
>>> corresponding unfolded expression (a syntactic shortcut that can be
>>> exponentially smaller than the actual unfolded expression). We define the size
>>> of the expression based on the unfolded one, and validation for ShEx w/o
>>> arbitrary cardinalities (only *, +, ?) is in NP. I think that you agree with
>>> this last statement.
>> I think that this needs to be shown.  I also think that you need [1,1] as a
>> cardinality.
> Sure, but if you need [1,1], then you won't use the repetition construct at all.
>>
>> Consider something like
>>
>> { sub1, sub2 } *
>>
>> the size of the obvious guess could be arbitrarily large in the size of the
>> expression as the number of repetitions of the subexpressions could be very
>> large.
> What do you mean by the "obvious guess" ?
> Maybe you mean the guess of the candidate solution in the NP algorithm ? If
> yes, then actually we do not guess the number of repetitions.
> 
> Follows a sketch of an NP algorithm of validation.
> I think you agree that the hard part is checking whether the neighbourhood of
> the focus node satisfies "locally" the shape expression, disregarding whether
> the references to other shapes (@ S) are satisfied. So, we show that, given N
> a set of triples that is the neighbourhood of the focus node, and E a shape
> expression, checking whether N locally satisfies E is in NP.
> 
> Consider the ShEx expression as an (unordered) regular expression on the
> alphabet of the triple constraints that appear in the expression. Note that we
> consider "pure" regular expressions, in the sense of the textbooks on regular
> languages. The allowed constructs are epsilon, letters of the alphabet,
> disjunction, concatenation, and Kleene star. Remark that a ShEX expression w/o
> arbitrary repetitions (only +, ?, *) can be transformed in such "pure" regular
> expression which size remains polynomial in the size of the initial expression.
> 
> For example:
> ((:a . | :b .), :a .) *
> has 3 triple constraints (c1) :a . (c2) :b . (c3) :a .
> Note that the two occurrences of :a . are considered distinct.
> The corresponding (unordered) regular expression is then
> ((c1 | c2), c3)* where the comma stands for concatenation.
> Note that such expression is single-occurrence (SORBE) as defined in [2], and
> in that same paper we showed that the membership problem for SORBE expressions
> is in NP. I'll give some details on it in the sequel.
> 
> Now, for the NP algorithm, a candidate solution is: for every triple in N, we
> guess the triple constraint to which it is mapped. The size of such candidate
> solution is polynomial in the size of N and E.
> 
> For example, consider these triples and the mapping (i.e. the triple
> constraint among c1, c2, c3 to which they were mapped).
> <n> :a 1 .       -> c1
> <n> :b 2 .       -> c2
> <n> :b 3 .       -> c2
> <n> :a 4 .       -> c3
> <n> :a 5 .       -> c3
> <n> :a 6 .       -> c3
> 
> Note that such mapping yields a bag of symbols, in the above example it is {{
> c1, c2, c2, c3, c3, c3}}.
> 
> Then, using the SORBE algorithm from [2, Figure 4], we can check in polynomial
> time whether the candidate solution is indeed a solution. That is, check
> whether the bag of symbols belongs to the language of the unordered regular
> expression E.  Few words about this SORBE algorithm: it traverses the
> structure of E and constructs an interval (which is technical) based on the
> expression and on the bag which is being tested for membership. Note that the
> numbers that appear in this interval and in all the intermediate computations
> are bounded by the size of N. The algorithm works in a single traversal of E
> because, intuitively, the single-occurrence criterion discards the need of
> non-deterministic choices.
> 
> Does this sound convincing to you ?
> 
> Alternatively, for the NP proof we can use a result from [1], that shows that
> the membership problem for a vector (bag) to a semilinear set (Parikh set of a
> regular expression, which is basically an alternative representation of an
> unordered regular expression) is in NP, but for that proof I would need to
> argue that the size of the Parikh set is polynomial in the size of the
> expression, probably using arguments on solving linear equations, so the above
> sketch seems easier to explain.
> 
> [1] E. Kopczynski and A. To. Parikh images of grammars: Complexity and
> applications. In
> LICS, pages 80–89, 2010.
> 
> [2] S. Staworko, I. Boneva, J. E. Labra Gayo, S. Hym, E. G. Prud’hom-
> meaux, and H. Solbrig. Complexity and Expressiveness of ShEx for RDF.
> In International Conference on Database Theory (ICDT) 2015.
> 
>>> The proof for NP-completeness is indeed made for the case w/o arbitrary
>>> cardinalities (except on triple constraints, so no folding of arbitrary
>>> cardinalities).
>>>
>>> Once again, we are talking here about a border line case, which would probably
>>> never occur in practice. I am personally not concerned about the high
>>> complexity of validation for this particular case.
>> I am concerned.  Unexpected jumps in complexity are often indications that
>> algorithms are incorrect.
>>
>>> Iovka
>> peter
>>
>>
>>> Le 04/01/2016 14:31, Peter F. Patel-Schneider a écrit :
>>>> I am not convinced, particularly because of your explanation.
>>>>
>>>> The unfolding will result in an exponential size increase if numbers are
>>>> represented in unary.
>>>>
>>>> Consider
>>>>
>>>> { ... { p @ a [n,n] } ... } [n,n]
>>>>
>>>> where the embedding depth is m.
>>>>
>>>> The expansion of this would need to either represent the number n^m or have
>>>> n^m pieces.  Both of these would need size exponential in the size of the
>>>> expression above.
>>>>
>>>> The ability to embed cardinalities inside cardinalities provides a way of
>>>> efficiently representing large numbers even if numbers in expressions have to
>>>> be written in unary.
>>>>
>>>> peter
>>>>
>>>>
>>>> On 01/04/2016 04:16 AM, Iovka Boneva wrote:
>>>>> Le 18/12/2015 18:00, Peter F. Patel-Schneider a écrit :
>>>>>> With this construct I'm having some difficulty determining whether ShEx
>>>>>> validation is in NP.  It seems to me that the outer numbers multiply the
>>>>>> effort required to make the inner partitions, leading to an exponential
>>>>>> number
>>>>>> of choices.   Maybe there is some way to do this bottom up, but it
>>>>>> certainly
>>>>>> doesn't seem obvious to me.
>>>>>>
>>>>>> Is there an NP algorithm for ShEx validation with this feature?
>>>>> Indeed, very good question. And a technical answer.
>>>>> There is an NP algorithm if we consider that the min and max bounds
>>>>> (1,2,3,4,6,7 in your example) are coded in unary (and not in binary) for
>>>>> determining the size of the schema.
>>>>>   From a complexity point of view, it is equivalent to consider that
>>>>> p@a [1,2]
>>>>> is a syntactic shortcut for
>>>>> p@a , p@a?
>>>>> i.e. equivalent to "unfolding" the multiplicities.
>>>>>
>>>>> Does it answer your question ?
>>>>>
>>>>> As Eric explained, such cases are not supposed to occur very often.
>>>>>
>>>>> Iovka
>>>>>
>>>>>> peter
>>>>>>
>>>>>>
>>>>>> On 12/18/2015 01:12 AM, Eric Prud'hommeaux wrote:
>>>>>>> * Peter F. Patel-Schneider <pfpschneider@gmail.com> [2015-12-17
>>>>>>> 22:20-0800]
>>>>>>>> It appears to me that ShEx allows embeddings of cardinalities, as in
>>>>>>>>
>>>>>>>> twofoureight {
>>>>>>>>      { { p @ a [2,2] } [2,2] } [2,2]
>>>>>>>>     }
>>>>>>>>
>>>>>>>> or
>>>>>>>>
>>>>>>>> lots {
>>>>>>>>      { p @a [2,4], p @b [1,3] } [6,7]
>>>>>>>> }
>>>>>>>>
>>>>>>>> is this correct?
>>>>>>> In principle, yes, but apart from optional groups, it's rare that one
>>>>>>> wants to make sure that there are as many a's as there are b's. Optional
>>>>>>> is quite reasonable, e.g.
>>>>>>>
>>>>>>>      <X> {
>>>>>>>        :submitter @<UserShape>, :submissionDate xsd:dateTime,
>>>>>>>        (:resolvedBy @<HackerShape>, :resolutionDate xsd:dateTime)?
>>>>>>>      }
>>>>>>>
>>>>>>> The other place this frequently comes up is in repetitions of choices:
>>>>>>>
>>>>>>>      <Y> { (:observation .|:analysis .|:diagnosis .)+ }
>>>>>>>
>>>>>>> which can be compiled to
>>>>>>>
>>>>>>>      <Y> {
>>>>>>>        (:observation .|:analysis .|:diagnosis .),
>>>>>>>        :observation .*,
>>>>>>>        :analysis .*,
>>>>>>>        :diagnosis .*
>>>>>>>      }
>>>>>>>
>>>>>>> with either a partion or a greedy semantics.
>>>>>>>
>>>>>>>
>>>>>>>> peter
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
> 
>
Received on Thursday, 7 January 2016 17:29:41 UTC