Re: Proposed EXPath module: resource collections from Hans-Juergen Rennau on 2015-02-18 (public-expath@w3.org from February 2015)

From: Hans-Juergen Rennau <hrennau@yahoo.de>
Date: Wed, 18 Feb 2015 22:18:56 +0000 (UTC)
To: "mike@saxonica.com" <mike@saxonica.com>, "jonathan.robie@emc.com" <jonathan.robie@emc.com>, "ndw@nwalsh.com" <ndw@nwalsh.com>, "christian.gruen@gmail.com" <christian.gruen@gmail.com>, "public-expath@w3.org" <public-expath@w3.org>, "msokolov@gmail.com" <msokolov@gmail.com>, "hrennau@yahoo.de" <hrennau@yahoo.de>
Message-ID: <350625034.1884648.1424297936539.JavaMail.yahoo@mail.yahoo.com>
I understand and appreciate the approach of starting with a minimalistic function which is “general enough to achieve similar levels of capability by layering things on top”. But it is also important to keep in mind from early on the key capabilities we aspire to, as otherwise the hoped-for layering may become impossible. What interests me above all is the possibility of handling huge node collections, selecting a manageable node sequence, and selecting it in a meaningful way. This perspective suggests to me several objections to the approach as it has been sketched and discussed so far. Although at first sight these objections may seem to make things a little less simple (a sequence of map items is indeed wonderfully simple), I think that things might quickly turn out to be far simpler to manage, implement and extend, compared with the apparently simpler but more rigid alternative. Here are my thoughts, ordered by headlines. I apologize for the volume.

“do not equate resource descriptors with XDM map items”
=======================================================
I think that we should define the resource descriptors (= property maps, in Michael’s wording) as a logical set of name/value pairs, which may be *represented* as a map item, but is not equal to a map item. It may also be represented by an XML fragment, or even not represented at all within the XDM! The latter variant may be appropriate when using non-XML technologies to store the resource descriptors and possibly also the resources themselves. It is no problem to apply a SQL SELECT to 20 GB of relational data, but it may be a problem to filter 20 GB of map items. We should avoid any necessity to model the operation of filtering as a filtering of map items. Such a filtering should be one possibility, not more, not less.
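
To make the distinction between a logical descriptor and its representation concrete, here is a minimal sketch in Python (used only for illustration; it is not part of the proposal). It assumes a descriptor is a flat set of string-valued properties. The element names `resource` and `property` and the two helper functions are hypothetical. The same descriptor is held once as a map-like object and once as an XML fragment, and neither form is privileged.

```python
import xml.etree.ElementTree as ET


def descriptor_to_xml(props):
    """Represent a logical set of name/value pairs as an XML fragment.

    The element names 'resource' and 'property' are assumptions made
    for this sketch only.
    """
    root = ET.Element("resource")
    for name, value in props.items():
        child = ET.SubElement(root, "property", {"name": name})
        child.text = str(value)
    return root


def descriptor_from_xml(elem):
    """Recover the name/value pairs from the XML representation."""
    return {p.get("name"): p.text for p in elem.findall("property")}


# One logical descriptor, two representations.
log_event = {"timestamp": "2015-02-18T22:18:56Z", "errorCode": "ERR12345"}
as_xml = descriptor_to_xml(log_event)
print(ET.tostring(as_xml, encoding="unicode"))
assert descriptor_from_xml(as_xml) == log_event
```

A third representation, rows in a relational table, would carry the same name/value pairs without ever materializing them inside the XDM.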

To illustrate this refusal to equate resource descriptors with map items, consider a use case. The resources are serialized XML log events stored as blobs in a relational database, accompanied by other columns (within the same or other, linked tables) providing the resource properties, e.g. timestamp, error code and message names. The filtering, expressed on the level of XQuery code as a set of conditions applied to the properties (e.g. “t>2015-02-18 and error=ERR12345”), should be pushed into the SQL layer and turned into a SELECT statement producing a (hopefully not too large) result set from which the document texts are retrieved. The goal of being able to deal with huge collections makes it essential that we can push the query operation beyond the boundary of XML technology, returning from over there with serialized nodes and over here parsing them into the result set ready for further processing.
The filter should be expressed by a generic query syntax (like “a=1 and b=x”) which is translated behind the scenes into a construct appropriate for the technology actually used, e.g. a predicate applied to a sequence of maps, or a predicate applied to an XML document, or a SQL SELECT, etc.

“collection-specific models of resource properties”
===================================================
It is difficult for me to imagine important scenarios in which a collection with resource descriptors restricted to generic properties is very interesting. How would it enable me to select the log events with particular error codes? To find the schema documents containing address-related type definitions?
We should consider taking a fundamental step: recognize collections as new entities of information, components with properties of their own. In particular, it is interesting to associate a collection with a collection-specific model of resource properties. So a collection of log events might confer on its member resources properties like timestamp, processID, errorCode, clientIP, msgName, etc. And an inventory of schema documents might confer properties like targetNamespace, simpleTypeNames, complexTypeNames, elemNames, attNames, etc. Working with such collections opens completely different possibilities, compared to a collection decorating its resources with generic properties only. The unease and worry about what is generic enough etc. stems exactly from this lack of distinction between *individual* collections; it becomes obsolete when we regard a collection as an individual entity, marked by an individual model of resource properties.
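
A collection-specific property model can be pictured as plain data that drives a generic ingestion step. The following is a minimal sketch in Python, assuming that each property is given by a name, a way to compute it from the resource, and an expected type. Python callables stand in for XQuery expressions here, and `SCHEMA_MODEL` and `build_descriptor` are hypothetical names. The property names are the ones suggested above for an inventory of schema documents.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical property model for a collection of schema documents:
# (property name, extractor standing in for an XQuery expression, type).
SCHEMA_MODEL = [
    ("targetNamespace",
     lambda root: root.get("targetNamespace", ""), str),
    ("complexTypeNames",
     lambda root: [e.get("name") for e in root.findall(XS + "complexType")], list),
    ("elemNames",
     lambda root: [e.get("name") for e in root.findall(XS + "element")], list),
]


def build_descriptor(xml_text, model):
    """Generic ingestion step: derive a resource descriptor from a resource."""
    root = ET.fromstring(xml_text)
    descriptor = {}
    for name, extract, expected_type in model:
        value = extract(root)
        if not isinstance(value, expected_type):
            raise TypeError(f"property {name!r} is not a {expected_type.__name__}")
        descriptor[name] = value
    return descriptor


doc = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                     targetNamespace="http://example.com/address">
  <xs:complexType name="AddressType"/>
  <xs:element name="address" type="AddressType"/>
</xs:schema>"""
print(build_descriptor(doc, SCHEMA_MODEL))
```

The ingestion function knows nothing about schema documents; a collection of log events would only need a different model.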

Of particular interest here are “canonical properties”, described as a triple consisting of a property name, an XQuery expression and a type constraint, e.g. {targetNamespace, /*/@targetNamespace, xs:string}. Thinking of the example properties enumerated above, it becomes clear how very simple the definition of such a collection-specific properties model is, and it does not appear difficult to write generic software which ingests resources into a collection, creating the associated resource descriptor.

“make resource reference purely data based”
===========================================
I believe the relationship of a resource descriptor to a resource should be captured by data, rather than a function, ensuring simplicity and portability. Proposed concept: a node constructor, consisting of two strings, one identifying the type of constructor (typically “URI” or “serialized node”), the other providing the value from which the node can be constructed. So the node constructor is just another name/value pair (name=type, value=value), accompanied by the name/value pairs representing resource properties. Adhering to conventions concerning the used name, the function (“fetch”) becomes superfluous. Data is sufficient and simple.
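
A minimal sketch of the node constructor idea, in Python for illustration only. The key names `constructorType` and `constructorValue` and the function `construct_node` are assumptions of this sketch, not proposed names; the two constructor types are the ones mentioned above.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def construct_node(descriptor):
    """Build the resource node from the two strings held in the descriptor.

    The key names 'constructorType' and 'constructorValue' are hypothetical.
    """
    kind = descriptor["constructorType"]
    value = descriptor["constructorValue"]
    if kind == "serialized node":
        # The value is the serialized XML itself.
        return ET.fromstring(value)
    if kind == "URI":
        # The value is a URI from which the document can be retrieved.
        with urlopen(value) as response:
            return ET.parse(response).getroot()
    raise ValueError(f"unknown constructor type: {kind!r}")


# The descriptor is pure data: resource properties plus the constructor pair.
descriptor = {
    "errorCode": "ERR12345",
    "constructorType": "serialized node",
    "constructorValue": "<event code='ERR12345'/>",
}
print(construct_node(descriptor).get("code"))
```

Because the descriptor carries no function item, it can be stored in and moved between a map, an XML fragment or a table row without change.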

“collection filtering as atomic operation”
==========================================
Remembering the goal to deal with huge collections, the filtering should be pulled into the collection access function, rather than being kept separate and applied afterwards. At any rate, it seems to me that the implementation of a two-parameter function
    filteredCollection('coll-uri', 'filter-descriptor')
is so much easier, given the task of extracting from 20 GB a result of 10 MB, compared to a rc:resource-collection(…) followed by a predicate.
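
A minimal sketch of such an atomic access function, in Python with SQLite standing in for the relational store of the log event use case. Everything concrete here is an assumption made for illustration: the table `log_events`, its columns, the representation of the filter as (property, operator, value) triples instead of a parsed query string, and the function name `filtered_collection`. The point is only that the conditions are turned into a SELECT, so that just the matching documents are retrieved and parsed.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical property model of the collection: which columns and
# comparison operators a filter may use.
ALLOWED_COLUMNS = {"timestamp", "errorCode", "msgName"}
ALLOWED_OPS = {"=", ">", "<", ">=", "<="}


def filtered_collection(conn, conditions):
    """Push the filter into SQL and parse only the selected documents."""
    clauses, params = [], []
    for column, op, value in conditions:
        if column not in ALLOWED_COLUMNS or op not in ALLOWED_OPS:
            raise ValueError(f"unsupported condition: {column} {op}")
        clauses.append(f"{column} {op} ?")  # the value is bound, not inlined
        params.append(value)
    sql = "SELECT doc FROM log_events"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY timestamp"
    return [ET.fromstring(doc) for (doc,) in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_events (timestamp TEXT, errorCode TEXT, msgName TEXT, doc TEXT)"
)
conn.executemany(
    "INSERT INTO log_events VALUES (?, ?, ?, ?)",
    [
        ("2015-02-17T10:00:00", "ERR12345", "getOrder", "<event id='1'/>"),
        ("2015-02-18T23:00:00", "ERR12345", "getOrder", "<event id='2'/>"),
        ("2015-02-18T23:30:00", "ERR00001", "putOrder", "<event id='3'/>"),
    ],
)
hits = filtered_collection(
    conn, [("timestamp", ">", "2015-02-18"), ("errorCode", "=", "ERR12345")]
)
print([e.get("id") for e in hits])
```

With the two-step alternative, every row would have to be turned into a descriptor before the predicate could discard it.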

Wrapping up
===========
Once more, my apologies for this undue length. I understand the attitude of starting simple and layering on top later. But these objections are motivated by the feeling that it matters how we define such a simple start, and I doubt whether the proposed approach (equating resource properties with map items, etc.) can put us on the track along which I would like us to move.
Hans-Juergen Rennau
Received on Wednesday, 18 February 2015 22:19:29 UTC