Live vs. recorded streams (was RSP next calls) from Rinne Mikko on 2013-11-22 (public-rsp@w3.org from November 2013)

From: Rinne Mikko <mikko.rinne@aalto.fi>
Date: Fri, 22 Nov 2013 12:47:37 +0000
To: "public-rsp@w3.org" <public-rsp@w3.org>
Message-ID: <03FD6311-DE72-4189-B5AE-EA8182B6F449@aalto.fi>

Hello, Manfred!

In the concepts and definitions document<http://www.w3.org/community/rsp/wiki/Concepts_and_Definitions> I tried to list some characteristics which differentiate a recorded stream from a live one:

• Possibility for non-realtime (faster / slower) processing
• Possibility for repeated processing of the same stream
• Need for explicit time information for streaming objects (either in the streaming objects themselves or specified for the whole stream), unless the elements are static objects.
• Finite length with both a beginning and an end.

I do not see any conceptual difference between a live or a recorded stream. Why differentiate? Am I missing something?

To me the two have partially differing problematics and solutions. Giving two examples:

1. Defining t(0) e.g. for the purpose of windowing on a recorded stream can be trivial: t(0) is taken from the timestamp of the first streaming object in the file. This is important with e.g. benchmarking, if we want to get the same results by calculating something over those windows. Defining a common t(0) on an infinite live stream is not equally trivial, and it may not even matter, because we usually cannot expect the same results from two platforms processing a live stream (unless we can guarantee that they start from the same object).

2. Assigning timestamps based on the reception time of objects in a live stream may still be reasonably well connected to the occurrence times of the data in those objects. For a recorded stream no such connection exists.
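
The first example above (taking t(0) from the first streaming object of a recorded stream) can be sketched as follows. This is only an illustration under assumptions of my own: the function name assign_windows, the use of tumbling (non-overlapping) windows and the sample timestamps are hypothetical and do not come from the wiki or from any existing engine.

```python
from datetime import datetime, timedelta

def assign_windows(timestamps, width):
    """Map each timestamp to a tumbling-window index.

    t(0) is taken from the first element of the (recorded) stream,
    so every platform reading the same file gets the same windows.
    """
    if not timestamps:
        return []
    t0 = timestamps[0]
    # timedelta // timedelta yields an int (floor division)
    return [(t - t0) // width for t in timestamps]

# Hypothetical recorded stream: four objects, 10-second windows
recorded = [
    datetime(2013, 11, 22, 12, 0, 0),
    datetime(2013, 11, 22, 12, 0, 4),
    datetime(2013, 11, 22, 12, 0, 10),
    datetime(2013, 11, 22, 12, 0, 25),
]
width = timedelta(seconds=10)

full_run = assign_windows(recorded, width)
# A platform that joins one object later picks a different t(0)
late_start = assign_windows(recorded[1:], width)
```

In this sketch the objects at 12:00:04 and 12:00:10 fall into different windows in the full run, but into the same window for the platform that started one object later, which is the reason a common t(0) matters for repeatable benchmarking.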

Summarizing, I would agree that by stating that "streams incorporate all the characteristics of both live and recorded streams" we should be able to capture everything, but I'd still argue that the distinction between live and recorded streams is useful and illustrative in defining requirements, as it gives a clearer connection to how a given requirement is derived.

I suggest we try to avoid the distinction in writing the requirements and if it turns out that it wasn't needed at all, we'll just merge the definitions.

BR,

Mikko

On 22. Nov 2013, at 12:56 PM, Manfred Hauswirth wrote:

Hi Rinne, all,

a) We have no definition of "time-series". The problem with the one
given in email ("a non decreasing timestamp") is that it effectively
hides all problematics of a distributed system, where streaming objects
may both have timestamps assigned by non-synchronized clocks and the
objects may arrive to the stream processor out-of-order. Restricting to
non-decreasing timestamps would effectively limit us to cases, where
timestamps are assigned by a singular place per stream (a stream
processing agent). Ability to properly support distributed,
heterogeneous environments is a big motivator for using RDF in the first
place. Therefore I fear that by restricting ourselves to non-decreasing
timestamps we lose more than may first appear.
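
To make the concern in (a) concrete, here is a minimal sketch, assuming a stream is represented as a list of (statement, timestamp) pairs. The helper name is_non_decreasing and the two sensor streams are hypothetical illustrations, not part of any proposed definition.

```python
def is_non_decreasing(stream):
    """Return True if timestamps never decrease along the stream."""
    timestamps = [t for _statement, t in stream]
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))

# Two producers, each well ordered on its own
sensor_a = [("a1", 1), ("a2", 4)]
sensor_b = [("b1", 2), ("b2", 3)]

# Order of arrival at an aggregator after network delays (assumed)
arrival_order = [sensor_a[0], sensor_b[0], sensor_a[1], sensor_b[1]]

each_source_ok = is_non_decreasing(sensor_a) and is_non_decreasing(sensor_b)
merged_ok = is_non_decreasing(arrival_order)
```

Each producer satisfies the restriction individually, but the merged stream does not, unless a single stream processing agent re-assigns the timestamps.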

I'm still uncertain about whether to include implicit timestamps to the
scope. On one hand it seems overkill to require every low-level
streaming object to have an explicit timestamp, especially if the next
stream processing agent on the path is going to assign it a new one
anyway, but on the other hand it is hard to imagine a device capable of

This may be the case but cannot be required - and may not be necessary. This is a parameter that any deployment may decide on implicitly.

producing RDF without any kind of clock circuit. Perhaps it is not so
much the capability of the devices, but rather the extra trouble of
trying to keep the clocks of e.g. 5,000 sensors in sync, if the
transmission network to the stream aggregator has very low delay
variation. As a result I still believe that implicit timestamps should
be in the scope, but it would be fairly easy to convince me otherwise.

Distributed time sync is one of the hardest problems in CS. We cannot and will not solve it in this XG. What we need is a way of describing the various models for capturing time and then letting the processor (under guidance of the user) decide how to handle it (and this will differ from system to system). E.g., one application may require re-calculations, while in another application a late-arriving value may be treated as irrelevant and thrown away. We should not try to support everything automatically - that usually leads to a telephone-book standard that nobody can use. Pragmatics over completeness, IMHO.
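
The two ways of handling a late-arriving value mentioned above (re-calculate vs. throw away) could be exposed as an application-level choice roughly as follows. This is a sketch under stated assumptions: the function mean_with_late_policy, the policy names "drop" and "recompute", and the rule that "late" means "older than the newest timestamp seen so far" are all hypothetical.

```python
def mean_with_late_policy(arrivals, policy):
    """Average the values of (timestamp, value) pairs given in arrival order.

    policy == "drop":      late values are discarded
    policy == "recompute": late values are included in the result
    """
    if policy not in ("drop", "recompute"):
        raise ValueError("unknown policy: " + repr(policy))
    accepted = []
    latest = None  # newest timestamp seen so far
    for ts, value in arrivals:
        late = latest is not None and ts < latest
        if late and policy == "drop":
            continue
        accepted.append(value)
        if latest is None or ts > latest:
            latest = ts
    return sum(accepted) / len(accepted) if accepted else None

# The value stamped 2 arrives after the value stamped 3
arrivals = [(1, 10.0), (3, 30.0), (2, 50.0)]
dropped = mean_with_late_policy(arrivals, "drop")
recomputed = mean_with_late_policy(arrivals, "recompute")
```

Neither result is "the correct one"; which policy is appropriate depends on the application, and the sketch only shows that the choice can be left to the user.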

b) "Infinite" rules out recorded streams. Since recorded streams are the
most verifiable part of this work, it seems strange to proceed without
them. Also, I'm not sure how the inclusion of recorded streams would
slow down the work, because a segment of a live stream can always be
recorded and played back. The biggest difference is that a recorded
stream requires explicit knowledge of timestamps, but if our live stream
work acknowledges the presence of explicit timestamps, the "feature" is
already included by default.

I do not understand why this is an issue. Whether something is recorded and replayed or "infinite" is a mere matter of definition and of purely theoretical interest; it should not change the model to be developed.

*2. Ordered / not ordered*

We had some discussion with Alasdair on the wiki, but there's not a
proposal on the table (afaik) for what to write into the
characterization of a recorded stream. For me the issue of ordering
itself is relative: Any stream processing agent (including stream
producers) can insert a counter, which can be used to restore order
observed at that point in time. Whether that is the "correct" order, is
a much more philosophical question. In theory the "correct" order of
observations is the one in which the observations were made, but due to
system inaccuracies we may not have the data to restore that order. Is
our stream ordered if the timestamps are properly increasing, but the
sound of lightning is observed before the flash?

Again - that depends on the application semantics: If you calculate an average of something which does not change quickly, order is of low relevance. If the order of your data items is important for describing dependencies, then the game changes. But why are we discussing things here that distributed systems guys have discussed for 30 years? We should not enshrine a specific model but rather allow the user to model everything as needed, with guidance on what that means for the results that the processor produces. There is no wrong or right - it depends on what level of consistency an application needs or can tolerate.

Without other proposals on the table my suggestion would be to remove
"ordered / not ordered" from the definition of recorded streams (it
doesn't differentiate recorded from live - live can also arrive
out-of-order), and instead write something about the capability of the
stream to support order restoration once we get to requirements.

I do not see any conceptual difference between a live or a recorded stream. Why differentiate? Am I missing something?
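
The counter idea from item 2 (any stream processing agent can insert a counter that later allows restoring the order observed at that point) can be illustrated with a small sketch. The function names stamp_with_counter and restore_observed_order and the sample observations are hypothetical, chosen only for this illustration.

```python
from itertools import count

def stamp_with_counter(objects, start=0):
    """Attach a sequence number to each object in the order seen here."""
    return [(seq, obj) for seq, obj in zip(count(start), objects)]

def restore_observed_order(stamped):
    """Re-establish the order in which the stamping agent saw the objects."""
    return [obj for _seq, obj in sorted(stamped, key=lambda pair: pair[0])]

# Order observed by the stamping agent (thunder before flash!)
observed = ["thunder", "flash", "rain"]
stamped = stamp_with_counter(observed)

# Downstream, the objects arrive out of order
shuffled = [stamped[2], stamped[0], stamped[1]]
restored = restore_observed_order(shuffled)
```

Note that the counter only restores the order seen by the agent that inserted it. Whether that is the "correct" order of the underlying observations is exactly the open question discussed above.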

*3. Streams and stream elements lacking reference to RDF*

Emanuele was commenting on this. I can, of course, do the following
replacements:

Streams => RDF Streams

(Live) Data Stream: An unbounded sequence of time-varying data elements.
=> (Live) RDF Data Stream: An unbounded sequence of time-varying data
elements encoded in RDF.

Recorded Data stream: A stream saved e.g. to a computer file.
=> Recorded RDF Data stream: An RDF stream saved e.g. to a computer file.

Elements in A Stream => Elements in An RDF Stream

...etc. etc. But is there something more substantial that we need to
say? Does it add value to insert RDF everywhere, or is it enough that
the group scope is RDF streams?

RDF is just a format IMHO that allows you to easily integrate the data and results with other data. Nothing more. All streams are the same in whatever notation they come. What RDF gives you is a general framework and a nice general data model, e.g., a graph which changes its shape and constituting data over time and RDF as a nice formalism to capture this.

*4. Streaming and background information, hybrid objects*

First-off, this morning I finally managed to answer "yes" to my old
question in the wiki on whether we are going to have hybrid objects,
i.e. objects that are both state objects and event objects at the same
time. It is actually rather easy to come up with a scenario, where two
or more stream producers will send state objects (temporally valid
data), which will get single timestamps from an aggregator stream
processing agent when merging the streams. And perhaps another set of
timestamps assigned by a stream consumer upon reception. At this point
every streaming object will contain both single timestamps and intervals
and whether a streaming object is seen as a state object or event object
no longer depends on the object itself, but rather on what a stream
processing agent wants to do with it.

=> This is what I mean by application semantics. The model should allow you to do this.

This is what I was looking for to break my earlier proposal, which was
to have state objects and event objects as special cases of streaming
objects. I'll try to merge the current state object and event object
definitions somehow under the streaming object definition. As this is a
bit bigger repair, I'll do it by deprecating the old "Elements in A
Stream" section, copying into a new version of the section and editing
that. We will then have the option of reverting to the old one, if the
meeting thinks it was a bad idea or didn't come out right. I hope this
change at least aligns with half of the comment from Emanuele requesting
to only distinguish data as streaming and background.

On the other half of Emanuele's comment, "background information", I
unfortunately somewhat disagree. To me "background information" stands
for static datasets, which are typically retrieved all at once and can
be processed with "normal" SPARQL semantics. I have no problem adding a
definition for "background information" (we are contribution-driven!
:-), but I wouldn't think of that as "an element in a stream".
"Background" to me refers only to the data, not the method of delivery.
The streaming of background data is possible, of course. Also, I would
still keep the "static object" at least to:
a) indicate that we understand the difference
b) indicate that static objects can theoretically also be sent as a stream
c) be able to say what we don't do in the first phase.
As to problems with the word "static" I'm happy to discuss other
proposals. My requirement would just be to keep it compact and
understandable. To me "static" is fine with the interpretation "true
until stated otherwise", which also differentiates it nicely from facts
with temporal or instantaneous validity. "static or slowly changing" is
too long. "semi-static" is ok, but in my opinion doesn't really add
value in this case.

Why is it important what we call it? Whether something is "background" is in the eye of the beholder, i.e., the application designer, IMHO. All we need is streams + static data IMHO. If you send static data in a streaming fashion (e.g. for efficiency reasons), it becomes stream data. We are splitting hairs here. Let's be pragmatic.

*5. Time as annotation vs. time as data*

The intention with defining the streaming objects was to encapsulate
both. I'll try to include the annotation aspect of time when updating
the streaming object definition. At the same time I'm trying to avoid
building assumptions of the solution into the definitions, because these
definitions are only there to help us write requirements and should not
prematurely fix solutions.

The same issue applies to location. You should be able to model it as meta-data (static sensor) or as data (moving sensor).

Best,

Manfred

--------------------------

Those were the things I had in mind, next I'll try to work a bit on the
Wiki. One observation is that we haven't had many proposals for new
definitions since the initial set. It's nice if we can manage with
these, but more likely we will just have to revisit this document once
we start work on the requirements and see what we really need.

BR,

Mikko

On 14. Nov 2013, at 4:41 PM, Emanuele Della Valle wrote:

Hello Jean-Paul, Eva, Avi and all,

what Eva says and Jean-Paul confirms is also what I think; sorry for
creating confusion.

My answer to Avi was putting emphasis on the fact that by
"time-series" I do not mean a *sequence of numbers ordered by
recency*, but a *sequence of RDF triples ordered by recency*.

Concerning how to describe time from the application perspective, my
position is the following one:
- 0 timestamps (i.e., relying on the temporal distance between the
received triples) makes compatibility with RDF straightforward, but
it may hide problems (e.g., the temporal distance between two triples
may be influenced by network delays)
- 1 (point-in-time semantics for application time) allows for handling
out-of-order arrivals and for basic temporal operators (e.g., follows,
precedes, contemporaryWith)
- 2 (interval based semantics for events) allows for expressive
temporal operators, but, at least in many scenarios I target, it is an
overkill

Most of the commercial DSMS/CEP take either 0 or 1. The only
commercial CEP that I know supporting 2 is Microsoft StreamInsight.
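
The three options above (0, 1 or 2 timestamps per element) could be captured in one data structure along these lines. This is a rough sketch only: the class StreamElement, its fields start and end, and the simplified precedes operator are assumptions made for illustration and are not taken from any DSMS/CEP product or from the wiki definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

@dataclass(frozen=True)
class StreamElement:
    triple: Triple
    start: Optional[float] = None  # point in time, or interval start
    end: Optional[float] = None    # interval end, if any

    def time_model(self) -> int:
        """Number of application timestamps carried: 0, 1 or 2."""
        if self.start is None:
            return 0
        return 1 if self.end is None else 2

def precedes(a: StreamElement, b: StreamElement) -> bool:
    """True if a is over before b begins (needs at least 1 timestamp each)."""
    if a.start is None or b.start is None:
        raise ValueError("precedes needs explicit timestamps on both elements")
    a_end = a.end if a.end is not None else a.start
    return a_end < b.start

implicit = StreamElement(("ex:s1", "ex:temp", "21"))
point = StreamElement(("ex:s1", "ex:temp", "22"), start=1.0)
interval = StreamElement(("ex:door", "ex:state", "open"), start=2.0, end=5.0)
```

With 0 timestamps the temporal operator cannot be evaluated at all, which corresponds to the point that relying on arrival distance alone may hide problems.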

Time from the system perspective is a different issue. Whether system
time should be externalised is something I still wonder.

Cheers,

Emanuele

On Nov 14, 2013, at 1:58 AM, Jean-Paul <jp.calbimonte@upm.es> wrote:

Hello all,

Yes, I think Oscar's diagram (check it here:
http://www.w3.org/community/rsp/wiki/Meeting_22.10.2013) more or less
reflects part of the discussion we had about the scope.

We seem to agree that ordered streams of elements (infinite or
'recorded' streams as well) are in scope (green ticks in the
diagram). In these cases the order might be of different natures but
we agreed to focus on time-based order. I don't think we agreed yet
on focusing only on point-in-time timestamps or on intervals. For the
moment it is just time-based order, I believe.

Then there are datasets which may not be streaming in nature but
might need to be processed in a streaming fashion (e.g. a very
large dataset). I understood we are not ruling this case out, but
might not focus on it in a first stage.

Thanks to Emanuele for the input about the scope. As Eva pointed out,
there are some discrepancies that we can fix in the wiki. I am also a
bit uncomfortable with calling the streams in our scope
'time-series'; I think this term has other connotations in related areas.

well, this is just a personal comment as well, but I'm happy to
continue this discussion. We can also continue modifying the wiki
until we have the Telco, and afterwards.

best
jp

2013/11/13 Eva Blomqvist <evabl444@gmail.com>

Hi!
I think that some of those who were in the meeting also might
have slightly differing interpretations of what was said... I
agree that there were two alternative interpretations of "data
stream" discussed, but as far as I understood those differed in
the sense that 1) was an *infinite* stream, where the elements of
the stream could somehow be *associated with time* (whether a
timestamped triple, a timestamped graph, or just a stream where
time is implicit from the arrival times of elements etc), and 2)
was a *finite* stream of elements where *time is not necessarily
an aspect*, e.g. triples from a data store being processed in a
streaming fashion.

I would be reluctant to limit ourselves at this stage to a
specific model, e.g. RDF statements with a single timestamp each.
Just my 2c..
/Eva

On 12/11/2013 17:33 , Emanuele Della Valle wrote:
Hi Abram,

I mean a list of tuples <s,t> where s is an RDF statement and t
is a non decreasing timestamp.

Cheers,

Emanuele

--
prof. Emanuele Della Valle
DEIB - Politecnico di Milano
m. +393389375810
w. http://emanueledellavalle.org

On Nov 12, 2013, at 12:27 PM, Abraham Bernstein
<bernstein@ifi.uzh.ch> wrote:

Emanuele, all

I am slightly confused... so just to clarify: when you talk
about time-series: do you mean a series of numbers (expressed
in triples) or a time-ordered series of triples?

Cheers

Avi

On 12.11.2013, at 03:05, Jean-Paul <jp.calbimonte@upm.es> wrote:

Hello,

Thanks for your input. The 4th Telco will be on Nov 22 at 15:00 CET.
We will be discussing the Streams concepts and
definitions that we have started drafting in the wiki.
Please feel free to provide your input there already:

http://www.w3.org/community/rsp/wiki/Concepts_and_Definitions

...especially if there is a key concept missing that you
consider we should include.

Cheers,
jp

PS
Please, if Danh or Manfred can help us again with Webex, we
will be very thankful.

2013/11/6 Jean-Paul <jp.calbimonte@upm.es>

Yes, I see. That will make everyone's life easier.
We'll discuss it.

thanks

2013/11/6 Axel Polleres <axel@polleres.net>

Thanks, BTW, may I suggest that instead of a single
doodle per Telco, to doodle for one fixed timeslot per
week, e.g. "Tue 15:00" or alike, as usual in other
WGs? I think this should make planning easier. Maybe
we can discuss this in the Telco.

thanks & best regards,
Axel

--
Prof. Dr. Axel Polleres
Institute for Information Business, WU Vienna
url: http://www.polleres.net/ twitter: @AxelPolleres

On Nov 2, 2013, at 11:40 PM, Jean-Paul
<jp.calbimonte@upm.es> wrote:

> Hello All,
>
> Thanks to all who could attend the meeting at ISWC,
> and especially to those who made it through WebEx
> (although they couldn't interact too much, unfortunately)
>
> The meeting went quite well, and we received input
> from people of other sub-communities and with
> different backgrounds. Others showed interest, at least
> as 'observers' of what we are trying to do.
>
> One result of the meeting is the intention of
> clarifying the scope of our work. A first step to do
> this is to write down some of the key concepts and
> definitions that we should agree on. Mikko has already
> provided a first version, as he already commented, and
> the purpose of the next telecon will be to discuss them:
>
> http://www.w3.org/community/rsp/wiki/Concepts_and_Definitions
>
> Until then, I invite you all to contribute to that
> (I see some have already started, great!) so that we
> can have material for discussion.
>
> Please, also indicate your preferences for the next
> calls:
>
> http://doodle.com/a8ggni2v4su7c88b
>
> http://doodle.com/6i97qvmaqiwnwvsa
>
> http://doodle.com/hixgfbv9drxbu4in
>
>
> Thanks to all,
>
> jp
>
> --
> Jean-Paul Calbimonte
> Ontology Engineering Group
> Universidad Politécnica de Madrid

--
Jean-Paul Calbimonte
Ontology Engineering Group
Universidad Politécnica de Madrid

-----------------------------------------------------------------
| Professor Abraham Bernstein, PhD
| University of Zürich, Department of Informatics
| web: http://www.ifi.uzh.ch/ddis/bernstein.html

--
Prof. Manfred Hauswirth
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway (NUIG)
http://www.manfredhauswirth.org/

Received on Friday, 22 November 2013 12:48:11 UTC