Re: Streams in an unreliable world. from Jean-Paul on 2013-10-07 (public-rsp@w3.org from October 2013)

From: Jean-Paul <jp.calbimonte@upm.es>
Date: Mon, 7 Oct 2013 16:53:31 +0200
To: Rinne Mikko <mikko.rinne@aalto.fi>
Cc: Andy Seaborne <andy@apache.org>, "public-rsp@w3.org" <public-rsp@w3.org>
Message-ID: <CA+vpjuXbBqBhZW9UagT7Rcwr2sorC0V+kcVAQ4LkfzC=Am8D_w@mail.gmail.com>
Hi all,

I see that after having these first discussions on models and query
processing efforts for RDF streams, we might need to work as a group on
defining together, and in more detail, our 'working use cases'.

Although some of us have already contributed to this in the wiki, it would
be useful to have a set of (agreed) use cases from which we can draw the
requirements of whatever RDF stream+processing model we would like to
propose.

Now coming back to the issues and design points noted by Andy, I think many
of them need to be considered. For instance, out-of-order arrivals or
temporary unavailability are quite common in the use cases we've had with
sensor networks.
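
To make the out-of-order case concrete, here is a minimal sketch (Python)
of how a consumer might use a per-stream sequence counter, along the lines
Mikko suggests in the quoted mail below, to flag late arrivals and gaps.
The function name and the status labels are my own assumptions for
illustration; they do not come from any existing RSP engine or vocabulary.

```python
def classify_arrivals(seq_numbers):
    """Label each arriving event using its sequence counter.

    Hypothetical helper: assumes the producer attaches a counter that
    increases by 1 per event, which is an assumption of this sketch.
    """
    highest_seen = None
    report = []
    for seq in seq_numbers:
        if highest_seen is None or seq == highest_seen + 1:
            status = "in-order"
        elif seq > highest_seen + 1:
            # At least one earlier event is missing or still in flight.
            status = "gap"
        else:
            # Counter not above the highest seen: late arrival or duplicate.
            status = "late"
        if highest_seen is None or seq > highest_seen:
            highest_seen = seq
        report.append((seq, status))
    return report


# Event 3 arrives after event 4: 4 is flagged as a gap, 3 as late.
print(classify_arrivals([1, 2, 4, 3, 5]))
```

Whether a "gap" later turns out to be a loss or just a delay is exactly the
kind of decision that depends on the use case (e.g. how long one is
willing to wait).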

For example, one interesting thing about query processing over RDF streams
is that sometimes you can have several different correct answers for a
query. This may depend on things like the starting time of a windowing
operation or the operational semantics of a query engine. On occasion,
different results can also be the product of differently ordered results, etc.
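
As a rough illustration of the window starting time point, the sketch below
(Python) counts events per tumbling window of the same width, but with two
different starting times. The timestamps, the function name and the
half-open window convention are assumptions made for this example only;
real engines may define windows differently.

```python
from collections import Counter


def tumbling_counts(timestamps, width, start):
    """Count events per tumbling window [start + k*width, start + (k+1)*width).

    Hypothetical helper: returns a dict mapping each window's opening
    time to the number of events that fall inside it.
    """
    counts = Counter()
    for t in timestamps:
        if t < start:
            continue  # event precedes the first window, so it is ignored
        k = (t - start) // width
        counts[start + k * width] += 1
    return dict(sorted(counts.items()))


events = [1, 4, 5, 9, 10, 11]  # made-up event timestamps

# Same stream, same query ("count per window of width 5"), different start.
print(tumbling_counts(events, width=5, start=0))  # {0: 2, 5: 2, 10: 2}
print(tumbling_counts(events, width=5, start=1))  # {1: 3, 6: 2, 11: 1}
```

Both outputs are correct answers for the same query over the same stream;
they only differ in when the first window was opened.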

Jp



2013/10/7 Rinne Mikko <mikko.rinne@aalto.fi>

>
> Hi Andy!
>
> > I picked that out as a representative and wanted to ask you if there is
> behind your comments a sense that the work of the CG is defining the
> processing model at a single place?
>
>
> Technically the group hasn't defined any scope or requirements as of yet,
> so I only speak for myself. :-)
>
> Thank you for the MillWheel reference! It has a lot of good ideas about
> building a large-scale streaming system (e.g. timer implementation and
> impacts), but what I didn't find on a quick read, was how system-wide
> issues would influence the syntax or the semantics of the streams
> themselves? MillWheel seemed to be mainly working on timestamped tuples.
>
> There can be a lot of different uses of an RDF stream. I confess to be
> primarily thinking of a case, where the producer is broadcasting a stream
> without direct knowledge of whether there are 0, 1 or 1,000 consumers.
> Equally well the stream can exist inside a single computer (= perfect
> delivery) or over a point-to-point connection without strict delay
> limitations (possibility for retransmission and guaranteed in-order
> delivery). To be able to start somewhere, it would seem logical to start
> RDF stream processing by defining what the stream looks like, *unless*
> there are system- or transmission-based impacts to the format of the
> stream, in which case they need to be handled together. It would be helpful
> for me to see some examples on this.
>
> If streams are defined infinite, we always have to assume that we are only
> processing a segment of the stream and all parts of the system have to
> tolerate that. For signalling out-of-order situations, I wrote some
> examples in my previous email. In addition to that, we can always insert
> counters, which can be used to detect out-of-order and missing event
> situations. Such generic tools should be based on need, a documented use
> case and related requirements, but what would be the additional
> requirements to the stream set by a more complete system model? Or is a use
> case actually just a loosely defined system model. :-)
>
> As applications are many, there will certainly be a lot of questions
> related to flexibility and scalability.
>
> BR,
>
> Mikko
>
>
> On 6. Oct 2013, at 8:00 PM, Andy Seaborne wrote:
>
> > Hi Mikko,
> >
> >> If the group can give the tools, everyone can build their fault
> > > tolerance into the platforms and / or the queries.
> >
> > There is a lot behind that statement!  I picked that out as a
> representative and wanted to ask you if there is behind your comments a
> sense that the work of the CG is defining the processing model at a single
> place?
> >
> > That's not a trivial scope but I don't think that done in isolation
> assuming that the characteristics of the web, unreliability in many forms,
> are taken care of elsewhere will lead to take-up.  These characteristics do
> show through and affect the processing model (e.g. [1] but also the stream
> systems it references).
> >
> > In deploying a real system, much of the effort is going to be in dealing
> with the imperfections arising from scale.  Different applications will
> want different tradeoffs (timeliness vs in-order delivery for example; or
> synchronization between two streams).
> >
> >       Andy
> >
> > [1] http://research.google.com/pubs/pub41378.html
> >
> > On 05/10/13 20:40, Rinne Mikko wrote:
> >>
> >> Hi Andy & al!
> >>
> >> Excellent points! I would look at this from a couple of angles:
> >>
> >> 1) Imperfections known at the stream producer
> >>
> >> These could be e.g. due to aggregating inputs from multiple sensors. I
> did some initial drafting on a stream description vocabulary and example
> datasets earlier this week. Testing different ideas, I also looked at some
> quality-related information, which could be stream-specific and known a
> priori:
> >>
> >> a) Ordering of events or facts in the stream
> >> - strict / approximate / unordered
> >> - if not strictly ordered, possibility to define some "maximum
> out-of-order delay"?
> >>
> >> b) Sequence information fields (timestamps, counters etc.)
> >> - ordering reliability estimate
> >> - sequence information field x present: sometimes / usually / always
> (e.g. it is known that some sensors include it while some don't)
> >>
> >> I think next telco was dedicated to querying, but perhaps we could
> resume the stream discussion in Sydney?
> >>
> >> 2) Robustness of the stream against errors in transmission
> >>
> >> The stream can be made more robust against transmission errors; e.g. by
> transmitting multiple copies of the same events / facts in cases where
> losses are more critical than ordering. If this is needed, it could also be
> a part of the stream description to help the receiver to understand what is
> going on?
> >>
> >> 3) Imperfections created during transmission
> >>
> >> This information cannot be explicitly included into the stream or the
> events in it, as it is not known at stream generation time. The protocols
> to transmit RDF streams were mentioned in the first telco, but we haven't
> discussed them after that. Are transmission protocols expected to impact
> how streams are generated other than generic imperfections (loss, jitter,
> out-of-order)? If yes, they should definitely be in scope and discussed
> together with stream construction. If not, I don't have a strong opinion on
> scope, but would prefer to make progress with stream formats first and then
> move onto transmission.
> >>
> >> 4) Fault tolerance in the consuming platform (= receiver)
> >>
> >> This I wouldn't immediately count into group scope. If the group can
> give the tools, everyone can build their fault tolerance into the platforms
> and / or the queries.
> >>
> >> 5) Producer and consumer variations
> >>
> >> Stream descriptions can help both in producer and consumer variations:
> >> - can be referenced by catalogues, which can help to manage changes in
> producers
> >> - can contain all format and prefix information to help consumers to
> join streams at any time.
> >>
> >>
> >> Just some initial thoughts on this.
> >>
> >> Cheers,
> >>
> >> Mikko
> >>
> >>
> >> On 5. Oct 2013, at 7:48 PM, Andy Seaborne wrote:
> >>
> >>> On the web, and indeed in any system of more than modest size and in
> one management domain, issues of
> >>>
> >>> * out of order delivery, including arbitrarily late arrival
> >>> * new stream producers coming online, old stream producers ending
> >>>     (discovery, joining, leaving)
> >>> * consumers joining and leaving
> >>> * streams becoming unavailable
> >>> * ... then restarting (with or without loss of potential events)
> >>>
> >>> leading to design points on
> >>>
> >>> + choice of timestamps
> >>> + delivery ordering semantics
> >>> + delivery guarantees (at least once, exactly once, at most once)
> >>> + persistence, and for how long
> >>>   (forwards, for guaranteed delivery and backwards for consumers to
> >>>    catch up).
> >>>
> >>> What are your thoughts on these issues?  In-scope or out-of-scope of
> the CG? Necessary or optimal to consider?
> >>>
> >>>     Andy
> >>>
> >>>
> >>
> >
>
>
>


-- 
Jean-Paul Calbimonte
Ontology Engineering Group
Universidad Politécnica de Madrid
Received on Monday, 7 October 2013 14:54:03 UTC