Re: Streams in an unreliable world.

Hi Andy!

> I picked that out as a representative and wanted to ask you if there is behind your comments a sense that the work of the CG is defining the processing model at a single place?


Technically the group hasn't defined any scope or requirements as of yet, so I only speak for myself. :-)

Thank you for the MillWheel reference! It has a lot of good ideas about building a large-scale streaming system (e.g. the timer implementation and its impacts), but what I didn't find on a quick read was how system-wide issues would influence the syntax or the semantics of the streams themselves. MillWheel seemed to be working mainly on timestamped tuples.

There can be a lot of different uses of an RDF stream. I confess that I am primarily thinking of a case where the producer is broadcasting a stream without direct knowledge of whether there are 0, 1 or 1,000 consumers. Equally well, the stream can exist inside a single computer (= perfect delivery) or run over a point-to-point connection without strict delay limitations (allowing retransmission and guaranteed in-order delivery). To be able to start somewhere, it would seem logical to begin RDF stream processing by defining what the stream looks like, *unless* there are system- or transmission-based impacts on the format of the stream, in which case they need to be handled together. It would be helpful for me to see some examples of this.
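For concreteness, one minimal shape for such a stream element, along the lines of MillWheel's timestamped tuples, might look like the sketch below. The field names and example IRIs are purely illustrative, not a proposed format:

```python
# Hypothetical sketch of an RDF stream element as a timestamped triple.
# All names and IRIs here are illustrative assumptions, not part of any
# agreed vocabulary or serialization.
from dataclasses import dataclass

@dataclass
class StreamElement:
    subject: str
    predicate: str
    obj: str
    timestamp: float  # producer-assigned time, seconds since epoch

e = StreamElement(
    "http://example.org/sensor/7",
    "http://example.org/vocab#temperature",
    '"21.4"^^xsd:decimal',
    1381154400.0,
)
print(e.timestamp)
```

Whether the timestamp is producer-assigned or transport-assigned is exactly the kind of question where the system model would show through into the format.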

If streams are defined as infinite, we always have to assume that we are only processing a segment of the stream, and all parts of the system have to tolerate that. For signalling out-of-order situations, I gave some examples in my previous email. In addition to that, we can always insert counters, which can be used to detect out-of-order and missing-event situations. Such generic tools should be based on need, a documented use case and related requirements, but what additional requirements would a more complete system model set on the stream? Or is a use case actually just a loosely defined system model? :-)
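The counter idea can be sketched very simply. Assuming each event carries a per-stream sequence counter (a hypothetical `seq` field, not from any agreed vocabulary), a consumer can classify arrivals as in-order, gap-revealing, or late:

```python
# Hypothetical sketch: detecting gaps and out-of-order delivery with a
# per-stream sequence counter. The event shape and the "seq" field name
# are illustrative assumptions.

def classify(events):
    """Classify each event by its 'seq' counter as in-order,
    revealing a gap (missing events), or a late arrival."""
    expected = None
    report = []
    for ev in events:
        seq = ev["seq"]
        if expected is None or seq == expected:
            report.append((seq, "in-order"))
            expected = seq + 1
        elif seq > expected:
            # Some events were skipped before this one arrived.
            report.append((seq, f"gap: {expected}..{seq - 1} missing"))
            expected = seq + 1
        else:
            # Counter lower than expected: a late (out-of-order) arrival.
            report.append((seq, "out-of-order (late arrival)"))
    return report

stream = [{"seq": 1}, {"seq": 2}, {"seq": 5}, {"seq": 3}]
for seq, status in classify(stream):
    print(seq, status)
```

A "maximum out-of-order delay" of the kind mentioned below would bound how long the consumer waits before treating a gap as a permanent loss.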

As there are many applications, there will certainly be a lot of questions related to flexibility and scalability.

BR,

Mikko


On 6. Oct 2013, at 8:00 PM, Andy Seaborne wrote:

> Hi Mikko,
> 
>> If the group can give the tools, everyone can build their fault
>> tolerance into the platforms and / or the queries.
> 
> There is a lot behind that statement!  I picked that out as a representative and wanted to ask you if there is behind your comments a sense that the work of the CG is defining the processing model at a single place?
> 
> That's not a trivial scope but I don't think that done in isolation assuming that the characteristics of the web, unreliability in many forms, are taken care of elsewhere will lead to take-up.  These characteristics do show through and affect the processing model (e.g. [1] but also the stream systems it references).
> 
> In deploying a real system, much of the effort is going to be in dealing with the imperfections arising from scale.  Different applications will want different tradeoffs (timeliness vs in-order delivery for example; or synchronization between two streams).
> 
> 	Andy
> 
> [1] http://research.google.com/pubs/pub41378.html
> 
> On 05/10/13 20:40, Rinne Mikko wrote:
>> 
>> Hi Andy & al!
>> 
>> Excellent points! I would look at this from a couple of angles:
>> 
>> 1) Imperfections known at the stream producer
>> 
>> These could be e.g. due to aggregating inputs from multiple sensors. I did some initial drafting on a stream description vocabulary and example datasets earlier this week. Testing different ideas, I also looked at some quality-related information, which could be stream-specific and known a priori:
>> 
>> a) Ordering of events or facts in the stream
>> - strict / approximate / unordered
>> - if not strictly ordered, possibility to define some "maximum out-of-order delay"?
>> 
>> b) Sequence information fields (timestamps, counters etc.)
>> - ordering reliability estimate
>> - sequence information field x present: sometimes / usually / always (e.g. it is known that some sensors include it while some don't)
>> 
>> I think the next telco was dedicated to querying, but perhaps we could resume the stream discussion in Sydney?
>> 
>> 2) Robustness of the stream against errors in transmission
>> 
>> The stream can be made more robust against transmission errors; e.g. by transmitting multiple copies of the same events / facts in cases where losses are more critical than ordering. If this is needed, it could also be a part of the stream description to help the receiver to understand what is going on?
>> 
>> 3) Imperfections created during transmission
>> 
>> This information cannot be explicitly included into the stream or the events in it, as it is not known at stream generation time. The protocols to transmit RDF streams were mentioned in the first telco, but we haven't discussed them after that. Are transmission protocols expected to impact how streams are generated other than generic imperfections (loss, jitter, out-of-order)? If yes, they should definitely be in scope and discussed together with stream construction. If not, I don't have a strong opinion on scope, but would prefer to make progress with stream formats first and then move onto transmission.
>> 
>> 4) Fault tolerance in the consuming platform (= receiver)
>> 
>> This I wouldn't immediately count into group scope. If the group can give the tools, everyone can build their fault tolerance into the platforms and / or the queries.
>> 
>> 5) Producer and consumer variations
>> 
>> Stream descriptions can help both in producer and consumer variations:
>> - can be referenced by catalogues, which can help to manage changes in producers
>> - can contain all format and prefix information to help consumers to join streams at any time.
>> 
>> 
>> Just some initial thoughts on this.
>> 
>> Cheers,
>> 
>> Mikko
>> 
>> 
>> On 5. Oct 2013, at 7:48 PM, Andy Seaborne wrote:
>> 
>>> On the web, and indeed in any system of more than modest size and in one management domain, issues of
>>> 
>>> * out of order delivery, including arbitrarily late arrival
>>> * new stream producers coming online, old stream producers ending
>>>     (discovery, joining, leaving)
>>> * consumers joining and leaving
>>> * streams becoming unavailable
>>> * ... then restarting (with or without loss of potential events)
>>> 
>>> leading to design points on
>>> 
>>> + choice of timestamps
>>> + delivery ordering semantics
>>> + delivery guarantees (at least once, exactly once, at most once)
>>> + persistence, and for how long
>>>   (forwards, for guaranteed delivery and backwards for consumers to
>>>    catch up).
>>> 
>>> What are your thoughts on these issues?  In-scope or out-of-scope of the CG? Necessary or optional to consider?
>>> 
>>> 	Andy
>>> 
>>> 
>> 
> 

Received on Monday, 7 October 2013 13:57:31 UTC