- From: Paolo Martini <paolo.martini.relex@chello.be>
- Date: Fri, 13 Jul 2007 11:25:59 +0200
- To: <www-multimodal@w3.org>
- Message-Id: <db0f4f3e717c7d8c838dad99a39c28f0@chello.be>
Thanks, Michael and all, for your explanation.

While the intended use of emma:start and emma:end inside an emma:arc is now clear to me, I cannot say the same about the overall issue of time-anchoring an emma:lattice.

As I understand it, the current version of the specification allows two mechanisms for anchoring an emma:node: an absolute anchoring, by means of emma:start and emma:end, and a relative anchoring, by means of emma:offset-to-start related to the emma:time-ref-uri and emma:time-ref-anchor-point set for the whole emma:lattice. If, as you announce, the absolute anchoring is removed, it will still be possible to anchor an emma:node thanks to the relative mechanism. But, as the whole relative anchoring is at "risk of removal due to potential lack of implementation" (from "Status of this Document"), the answer to the question "Can an emma:node be anchored to time?" seems to depend more on current and known implementation needs than on a solid ontological model of the "signified" of an emma:lattice.

I will therefore try to elaborate my concerns and my personal view on this issue, a view which I understand could go well beyond the focus or the interest of the Working Group.

I see at least three conceptually distinct time axes, which we can call "input time axis", "model time axis" and "output time axis". The input time axis is the time axis of the input signal as received by the sensor, while the output time axis is the time of "replay" of the annotated signal (a trivial example is subtitle presentation). In the middle is the model time axis, which is the temporality, if any, of the object model representing the event source of the signal: here, the emma:lattice.

Leaving aside for the moment whether the emma:lattice has an internal temporality, I now understand that emma:arc.emma:start and emma:arc.emma:end anchor the emma:arc to the input time axis, that is, they identify the part of the input motivating the label of the emma:arc.
If annotation is a function, the input time axis is its domain, while the values of the labels are in the codomain. Symmetrically, a label could be anchored to an output time axis. In the case of subtitles, to allow time to read it, the presentation of a label often lasts longer than the segment of the replayed signal motivating that label.

This distinction of time axes between object model and input (output) signal was at the origin of my misunderstanding of the emma:arc.emma:start and emma:arc.emma:end periods, more than the nature of the emma:lattice itself. It is a point easy to lose sight of, and actually in your answer even you seem to end up mixing the two when, after pointing out that time pertains to the label, you comment "the end of the arc [..] may be later than...". Please forgive me for being so picky about your words, but they show how much that point needs to be stressed in the presentation of EMMA, perhaps even by suggesting tags which refer more explicitly to the input and, why not, to the output signal time axis.

A couple of side notes on this, about timescales and the number of input signals.

The current specification prescribes timestamps in "milliseconds for ease in processing timestamps". This seems to me quite a strong restriction, fitting an API more than a "data interchange format", especially in a multimodal environment. Wouldn't it be enough to have a default of "milliseconds" while leaving the possibility to specify other units (including samples, percentage, etc.)?

About the number of input signals, the source event could be received through multiple inputs (from a couple of microphones up to more complex sensor arrays). There are multiple situations where the recognition of a communication event is the result of elaboration and fusion of the projections of that event on different sensors.
I would say that EMMA cannot currently handle that situation, as it is focused more on annotating the event projection inside a single signal than on representing the event as pre-existing its registration by any single sensor. It would be useful to leave room for this. Just as a teaser, think about describing the McGurk effect in a "human recogniser", also taking into account the delay of the visual information, whose transduction takes some 20 msec more than the acoustic one.

We come then to the temporality of the object model. I'm afraid I cannot agree with you when you mention "a typed text string", suggesting there the lack of "a time sequence", as it is not a simple set but a true sequence of discrete linguistic units, and the order in the sequence is definitely a temporal one. Tense relationships are well defined there, though it is true that those units cannot, alone, be anchored to an external Time Of Day absolute or relative timescale.

The way I see it is as if the syntagmatic axis of the string were a "shower curtain" with its well-defined holes for the hooks, independently of actual hooks "anchoring" each hole, and therefore the curtain, to the bar, which would be a ruler-like time axis. The anchors (the hooks) relate the (internal) time axis of the string (the curtain) to an external time axis (the bar), which provides a more useful timescale.

Every single path in an acyclic directed graph has the same kind of ordered discrete units and internal timescale that a "typed text string" has. Actually, it seems to me that, conceptually, the graph of an emma:lattice and the graph of an ATLAS Annotation Graph coincide when there are no alternative paths, for EMMA, and only one annotation layer, for ATLAS.
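
To make the point about paths concrete, here is a minimal sketch in Python that enumerates the paths of the 'to boston' / 'two blouses' lattice from your reply. It is only an illustration: the tuple layout (from, to, label) and the function name enumerate_paths are assumptions of mine and not part of EMMA. Each path comes out as an ordered sequence of labels, even though no timestamps are used at all.

```python
from collections import defaultdict

# Arcs as (from_node, to_node, label). Hypothetical layout for illustration,
# mirroring the 'from'/'to' attributes of emma:arc; not an EMMA API.
ARCS = [
    (1, 2, "to"),
    (1, 3, "two"),
    (2, 4, "boston"),
    (3, 4, "blouses"),
]


def enumerate_paths(arcs, initial, final):
    """Return every label sequence leading from initial to final.

    Assumes the graph is acyclic, as a lattice is.
    """
    outgoing = defaultdict(list)
    for src, dst, label in arcs:
        outgoing[src].append((dst, label))

    paths = []

    def walk(node, labels):
        if node == final:
            paths.append(list(labels))
            return
        for dst, label in outgoing[node]:
            labels.append(label)
            walk(dst, labels)
            labels.pop()

    walk(initial, [])
    return paths


paths = enumerate_paths(ARCS, initial=1, final=4)
for path in paths:
    print(" ".join(path))
```

The order of the labels inside each path is fixed by the topology alone, which is the "internal" ordering I refer to; external timestamps would only add hooks to it.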
I argue that an emma:lattice does "necessarily have to correspond to a time sequence", because the expression of the "range of possible interpretations of a signal", corresponding to the paradigmatic axis, takes up only part of an emma:lattice, as I understand it, while most of the emma:lattice produced by a good (robust?) recogniser has syntagmatic meaning.

Still at the conceptual level, I think that EMMA and ATLAS Annotation Graphs could converge to a unified data interchange format able to account at the same time for:
- temporality of the syntagmatic axis of the object model
- anchoring of the syntagmatic units to input and output time axes
- paradigmatic choices (both in production and in recognition)
- multilayer annotation

Out of such a unified format, if necessary, restrictive profiles could be defined to simplify the treatment in specific applications (serving the same purpose as the millisecond restriction in the current EMMA specification). This would allow a much broader range of applications than the ones foreseen in the W3C Multimodal Interaction Framework, reaching out to linguistic human-to-human and psycholinguistic annotations.

I stressed that this is at a conceptual level, with the object models, because at a representation level, with the data models, the ontological status of arcs seems to be quite different in EMMA and in ATLAS. As I understand it, emma:lattice is kept very close to an automaton, while I guess that a unified representation would demand the automaton to be extracted from a potentially more complex graph. Anyway, I suggest that in "making clear that lattices represent abstractions of the signal", you elaborate the rationale for these "limitations" imposed on the emma:lattice.

Thanks again for your attention and for your work,
Paolo Martini

On 4 Jul 2007, at 16:41, <johnston@research.att.com> wrote:
>
> Dear Paolo Martini,
>
> Thanks Paolo for your thoughtful comments on the EMMA specification.
> The W3C Multimodal Working group has discussed these comments in some detail and formulated the following response.
>
> You are correct that emma:node elements are intended to correspond to instants.
> Regarding 1., we agree that as it stands the ability to place both emma:start and emma:end on emma:node appears to allow a duration. This is an error in the current draft as we did not intend for emma:start and emma:end to be used on emma:node. In the next draft of the EMMA specification and the corresponding schema we will remove the emma:start and emma:end attributes from emma:node.
>
> With respect to point 2. The primary motivation for the addition of emma:node was to provide a place for annotations which apply specifically to nodes rather than to arcs. For example, in some representations of speech recognition lattices, confidences or weights are placed on nodes in addition to arcs. For this reason we define both nodes and arcs.
> It is critical that we have both timestamps and node start end annotations on arcs as they serve different purposes. The role of the 'from' and 'to' annotations on arcs is to define the topology of the graph. On the other hand the timestamps emma:start and emma:end are annotations which describe temporal properties associated not necessarily with the arc but with the label on the arc. There is in fact no guarantee that the emma:end on 'flights' in your example will be equivalent to the emma:start on 'to'. If they were required to be the same, the transition point from one arc to the next would have to be assigned to an arbitrary point in the silence between the two words. Similarly if there is no silence between two words in sequence and in fact they may share a geminate consonant, for example "well lit" "gas station" word timings from the recognizer may in fact overlap, that is the end of the arc for the word "well" may be later than beginning of "lit".
>
> Perhaps the even stronger case for having both time and the 'from' 'to' annotations is that in the lattice representation being at a particular time point does not guarantee that you are on the same node in the lattice.
> For example, imagine a lattice representing two possible strings:
>
> 'to boston'
> 'two blouses'
>
> The lattice representation:
>
> <emma:lattice initial="1" final="4">
>   <emma:arc from="1" to="2" start="1000" end="2000">to</emma:arc>
>   <emma:arc from="1" to="3" start="1000" end="2000">two</emma:arc>
>   <emma:arc from="2" to="4" start="2000" end="4000">boston</emma:arc>
>   <emma:arc from="3" to="4" start="2000" end="4000">blouses</emma:arc>
> </emma:lattice>
>
> Note that even though the first two arcs end at the same time point those arcs lead to different states 2 vs. 3, encoding which path has been taken in the graph.
>
> The critical factor here is that the lattice representation does not necessarily have to correspond to a time sequence. The lattice representation is used to encode a range of possible interpretations of a signal. It is often the case that the left to right sequence of symbols in the lattice corresponds to time but there is no guarantee. For example, the lattice may represent interpretations of a typed text string rather than speech. It is also possible that a semantic representation encoded as a lattice could have time annotations on the first arc which are later than time annotations on the final arc.
> Since lattices represent abstractions over the signal we cannot assume that time annotations define their topology.
>
> In order to clarify this we will add text to the specification making clear that lattices represent abstractions of the signal, and that time annotations may describe labels rather than arcs.
>
> We would greatly appreciate if you would review this response and respond within three weeks indicating whether this resolves your concern.
> If we do not receive a response within three weeks we will assume that this response resolves your concern.
>
> best
> Michael Johnston
> W3C Multimodal Working Group
>
>
> Dear W3C Multimodal working group,
>
> I approached only recently EMMA and I have some problems understanding the temporal anchoring of an emma:node.
>
> I would instinctively expect a node to correspond to what ISO 8601 calls an "instant", a "point on the time axis".
>
> With reference to paragraph 3.4, if I read correctly the document:
>
> 1. An emma:node can be anchored with absolute or relative timestamps. In the absolute mode, the optional emma:start and emma:end attributes seem to allow a duration, while in the relative mode, the optional emma:offset-to-start (with emma:duration not allowed) seems to force an instant status.
> If, conceptually, a node is allowed to correspond to a segment of the signal, I would welcome a comment on the rationale for that. If not, I would suggest to replace emma:start and emma:end with a single "time point"-like attribute or, at least, to forbid emma:end, implicitly adding ambiguity in the semantics of emma:start.
>
> 2. An emma:arc implicitly asserts the existence of two nodes, but I would say that the temporal attributes of the arcs, if present, define those nodes. A node could be therefore defined more than once. I simplify the example in 3.4.2:
> <emma:arc from="1" to="2"
>   emma:start="1087995961542" emma:end="1087995962042">flights</emma:arc>
> <emma:arc from="2" to="3"
>   emma:start="1087995962042" emma:end="1087995962542">to</emma:arc>
> Being node 2 the same, what if emma:end in the "flights" arc and emma:start in the "to" arc do not have the very same value?
> Again, if this is conceptually allowed, I would welcome an explanation of the rationale. Otherwise, I would prefer enforcing a coherent description directly in the language instead of relying on validity checks.
> For example, restricting the "definition" of nodes inside node:element, i.e. forbidding timestamps in arcs.
>
> I went through the document and the list archive and I wasn't able to find answers to these doubts. Nevertheless, I apologize if these points have already been addressed.
> Thanks for your help and your work,
>
> Paolo Martini
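
As a rough illustration of point 2 in the original message quoted above, and of the working group's answer to it, the following Python sketch measures the gap between the emma:end of the arc entering a node and the emma:start of the arc leaving it. It is a sketch under stated assumptions, not a validity rule from the specification: the namespace URI is assumed, the enclosing emma:lattice element with its initial and final values is added only so the fragment parses, and node_time_gaps is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed EMMA namespace URI; the sketch only needs it to match the document.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# The two arcs from the quoted message, wrapped in an emma:lattice element
# (the wrapper and its initial/final values are added for illustration).
LATTICE_XML = f"""
<emma:lattice xmlns:emma="{EMMA_NS}" initial="1" final="3">
  <emma:arc from="1" to="2"
            emma:start="1087995961542" emma:end="1087995962042">flights</emma:arc>
  <emma:arc from="2" to="3"
            emma:start="1087995962042" emma:end="1087995962542">to</emma:arc>
</emma:lattice>
"""


def node_time_gaps(xml_text):
    """For each pair of arcs meeting at a node, return a tuple
    (node, incoming label, outgoing label, gap in ms), where the gap is
    the outgoing arc's start minus the incoming arc's end."""
    root = ET.fromstring(xml_text)
    arcs = []
    for arc in root.iter(f"{{{EMMA_NS}}}arc"):
        arcs.append({
            "from": arc.get("from"),
            "to": arc.get("to"),
            "start": int(arc.get(f"{{{EMMA_NS}}}start")),
            "end": int(arc.get(f"{{{EMMA_NS}}}end")),
            "label": arc.text,
        })
    gaps = []
    for incoming in arcs:
        for outgoing in arcs:
            if incoming["to"] == outgoing["from"]:
                gaps.append((incoming["to"], incoming["label"],
                             outgoing["label"],
                             outgoing["start"] - incoming["end"]))
    return gaps


gaps = node_time_gaps(LATTICE_XML)
for node, left, right, gap in gaps:
    # gap == 0: the labels abut; gap > 0: silence in between;
    # gap < 0: the word timings overlap (the "well lit" case).
    print(f"node {node}: '{left}' -> '{right}' gap = {gap} ms")
```

Under the reading given in the reply, a nonzero gap is not an error: the timestamps describe the labels, while 'from' and 'to' alone define the topology.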
Received on Friday, 13 July 2007 18:17:32 UTC