Re: emma:node anchoring on signal time axis from JOHNSTON, MICHAEL J (MICHAEL J) on 2007-08-03 (www-multimodal@w3.org from August 2007)

From: JOHNSTON, MICHAEL J (MICHAEL J) <johnston@research.att.com>
Date: Fri, 3 Aug 2007 13:29:24 -0400
To: "Paolo Martini" <paolo.martini.relex@chello.be>
Cc: <www-multimodal@w3.org>
Message-ID: <0C50B346CAD5214EA8B5A7C0914CF2A45CF64D@njfpsrvexg3.research.att.com>
Dear Paolo Martini,

Thank you for your detailed and thoughtful contributions on emma:lattice.
The EMMA subgroup has discussed your comments in detail and formulated
the following responses.

Regarding your first point, about the relative time mechanism on emma:node:
we agree that, like the absolute timestamps, the relative timestamp mechanism
was not intended to apply to emma:node, and we will remove relative
timestamps from emma:node in the specification.

Regarding the three different time axes you describe (input, model, output),
the scope of the EMMA specification addresses only the input axis at this
point in its development. In the longer term we hope to extend EMMA to represent
system output as well as user input, but for EMMA 1.0 we address only input.
Your comments regarding the output time axis are particularly relevant for output
representation in EMMA and will provide valuable input for future versions of EMMA.

Regarding the connections between emma:lattice representations and annotation
graphs such as ATLAS, again this is very good feedback. The initial intention behind
emma:lattice is to capture and provide a standard representation for the graph outputs
that vendors of speech recognition and other modality processing components
currently provide in proprietary representations. Over the course of this work more use cases
have come up, and there is growing interest in the potential use of EMMA more broadly
for annotation of speech corpora and other resources. The initial scope of EMMA is to provide
a mechanism for communication among the components of interactive systems, such
as spoken and multimodal dialog systems. In future versions of EMMA beyond 1.0 we
hope to provide more support for annotation and corpus use cases, and
your input on relations with annotation schemes such as ATLAS will be extremely
valuable for that work.

We would greatly appreciate it if you could respond within the next two weeks
indicating whether this response addresses your concerns. Thanks again for
such detailed feedback.

best
Michael Johnston
On behalf of W3C Multimodal working group

Thanks, Michael and all, for your explanation.
 
While the intended use of emma:start and emma:end inside an emma:arc is 
now clear to me, I cannot affirm the same about the overall issue of 
time anchoring an emma:lattice.
 
As I understand it, the current version of the specification allows two 
mechanisms for anchoring an emma:node:
an absolute anchoring, by means of emma:start and emma:end, and a 
relative anchoring, by means of emma:offset-to-start related to 
emma:time-ref-uri and emma:time-ref-anchor-point set for the whole 
emma:lattice.
 
If, as you announce, the absolute anchoring is removed, the possibility 
to anchor emma:node will remain thanks to the relative mechanism. But, 
as the whole relative anchoring is at "risk of removal due to potential 
lack of implementation" (from "Status of this Document"), the answer to 
the question "Can an emma:node be anchored to time?" seems to depend 
more on current and known implementation needs than on a solid 
ontological model of the "signified" of an emma:lattice.
 
I will therefore try to elaborate my concerns and my personal view on 
this issue, a view which I understand could go well beyond the focus or 
the interest of the Workgroup.
 
I see at least three conceptually distinct time axes, which we can 
call "input time axis", "model time axis" and "output time axis".
 
The input time axis is the time axis of the input signal as received by 
the sensor, while the output time axis is the time of "replay" of the 
annotated signal (a trivial example can be subtitle presentation).
In the middle is the model time axis, which is the temporality, if any, 
of the object model representing the event source of the signal: here, 
the emma:lattice.
 
Leaving aside for the moment whether the emma:lattice has an internal 
temporality, I now understand that emma:arc.emma:start and 
emma:arc.emma:end anchor the emma:arc to the input time axis, that is, 
they identify the part of the input motivating the label of the 
emma:arc. If annotation is a function, the input time axis is its 
domain, while the values of the labels are in the codomain.
 
Symmetrically, a label could be anchored to an output time axis. In the 
case of subtitles, to allow time to read it, the presentation of a 
label often lasts longer than the segment of the replayed signal 
motivating that label.
 
This distinction of time axes between object model and input (output) 
signal was at the origin of my misunderstanding of the 
emma:arc.emma:start and emma:arc.emma:end periods, more than the nature 
of the emma:lattice itself. It is a point easy to lose sight of, and 
actually in your answer even you seem to end up mixing the two when, 
after pointing out that time pertains to the label, you comment "the 
end of the arc [..] may be later than...". Please forgive me for being 
so picky about your words, but they show how much that point needs to 
be stressed in the presentation of EMMA. I would even suggest tags that 
refer more explicitly to the input and, why not, to the output signal 
time axis.
 
A couple of side notes on this, about timescales and number of input 
signals.
 
The current specification prescribes timestamps in "milliseconds for 
ease in processing timestamps". It seems to me quite a strong 
restriction, fitting more an API than a "data interchange format", 
especially in a multimodal environment. Wouldn't it be enough to have a 
default of "milliseconds" while leaving the possibility to specify 
other units (including samples, percentage, etc.)?
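
The suggestion above can be sketched roughly as follows. This is only an illustration of a "milliseconds by default, other units on request" policy, not anything defined by EMMA: the function name `to_milliseconds`, the unit labels, and the `sample_rate_hz` and `duration_ms` parameters are all hypothetical.

```python
def to_milliseconds(value, unit="ms", sample_rate_hz=None, duration_ms=None):
    """Normalise a timestamp to milliseconds (hypothetical helper).

    unit="ms"      : value is already in milliseconds (the default)
    unit="s"       : value is in seconds
    unit="samples" : value is a sample index; needs sample_rate_hz
    unit="percent" : value is a percentage of the signal; needs duration_ms
    """
    if unit == "ms":
        return float(value)
    if unit == "s":
        return value * 1000.0
    if unit == "samples":
        if sample_rate_hz is None:
            raise ValueError("sample_rate_hz is required for unit='samples'")
        return value * 1000.0 / sample_rate_hz
    if unit == "percent":
        if duration_ms is None:
            raise ValueError("duration_ms is required for unit='percent'")
        return value * duration_ms / 100.0
    raise ValueError(f"unknown unit: {unit!r}")


# Sample index 8000 in a 16 kHz signal lies half a second into the signal.
half_second = to_milliseconds(8000, "samples", sample_rate_hz=16000)
```

A consumer that only understands milliseconds could apply such a conversion on input, which is the kind of restrictive profile discussed later in this message.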
About the number of input signals, the source event could be received 
through multiple inputs (from a couple of microphones up to more 
complex sensor arrays). There are multiple situations where the 
recognition of a communication event is the result of elaboration and 
fusion of the projections of that event on different sensors. I would 
say that EMMA cannot currently handle that situation, as it focuses 
more on annotating the event projection inside a single signal than on 
representing the event as pre-existing its registration by any single 
sensor. It would be useful to leave room for this. Just as a teaser, 
think about describing the McGurk effect in a "human recogniser", 
taking into account also the delay of the visual information, whose 
transduction takes some 20 msec more than the acoustic one.
 
We arrive then at the temporality of the object model. I'm afraid I 
cannot agree with you when you mention "a typed text string", suggesting 
there the lack of "a time sequence", as it is not a simple set but a 
true sequence of discrete linguistic units, and the order in the 
sequence is definitely a temporal one. Tense relationships are well 
defined there, though it is true that those units cannot, alone, be 
anchored to an external Time Of Day absolute or relative timescale. The 
way I see it is as if the syntagmatic axis of the string were a "shower 
curtain" with its well defined holes for the hooks, independently of 
actual hooks "anchoring" each hole, and therefore the curtain, to the 
bar, which would be a ruler-like time axis. The anchors (the hooks) 
relate the (internal) time axis of the string (the curtain) to 
an external time axis (the bar) which provides a more useful timescale.
 
Every single path in an acyclic directed graph has the same kind of 
ordered discrete units and internal time scale that a "typed text 
string" has. Actually, it seems to me that, conceptually, the graph of 
an emma:lattice and the graph of an ATLAS Annotation Graph coincide 
when there are no alternative paths, for EMMA, and only one annotation 
layer, for ATLAS.
I argue that an emma:lattice does "necessarily have to correspond to a 
time sequence", because the expression of the "range of possible 
interpretations of a signal", corresponding to the paradigmatic axis, 
takes up only part of an emma:lattice, as I understand it, while most of 
the emma:lattice produced by a good (robust?) recogniser has 
syntagmatic meaning.
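
To illustrate the point about paths, here is a minimal sketch that enumerates every path of the two-string lattice ('to boston' / 'two blouses') from the quoted message further down. Each path comes out as an ordered sequence of labelled arcs (the syntagmatic axis), while the alternatives between paths form the paradigmatic axis. The tuple layout and the name `enumerate_paths` are assumptions made for this example and are not part of EMMA.

```python
from collections import defaultdict

# Arcs as (from_node, to_node, start_ms, end_ms, label), copied from the
# quoted lattice example; the tuple layout itself is an assumption.
ARCS = [
    (1, 2, 1000, 2000, "to"),
    (1, 3, 1000, 2000, "two"),
    (2, 4, 2000, 4000, "boston"),
    (3, 4, 2000, 4000, "blouses"),
]


def enumerate_paths(arcs, initial, final):
    """Yield every path from initial to final as an ordered list of arcs.

    Assumes the graph is acyclic, as a lattice is.
    """
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)

    def walk(node, prefix):
        if node == final:
            yield list(prefix)
            return
        for arc in outgoing[node]:
            prefix.append(arc)
            yield from walk(arc[1], prefix)
            prefix.pop()

    yield from walk(initial, [])


paths = list(enumerate_paths(ARCS, initial=1, final=4))
labels = [[arc[4] for arc in path] for path in paths]
# labels == [["to", "boston"], ["two", "blouses"]]
```

Within each path the order of the arcs is fixed by the 'from' and 'to' topology alone; the timestamps are additional anchors to the input time axis and are not needed to recover that order.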
 
Still at the conceptual level, I think that EMMA and ATLAS Annotation 
Graphs could converge to a unified data interchange format able to 
account at the same time for:
- temporality of the syntagmatic axis of the object model
- anchoring of the syntagmatic units to input and output time axes
- paradigmatic choices (both in production and in recognition)
- multilayer annotation
 
Out of such a unified format, if necessary, restrictive profiles could 
be defined to simplify the treatment in specific applications (serving 
the same purpose as the millisecond restriction in the current EMMA 
specification).
This would allow much broader applications than the ones foreseen in 
the W3C Multimodal Interaction Framework, reaching out to linguistic 
human-to-human and psycholinguistic annotations.
 
I stressed that this is at a conceptual level, with the object models, 
because at a representation level, with the data models, the 
ontological status of arcs seems to be quite different in EMMA and in 
ATLAS. As I understand it, emma:lattice is kept very close to an 
automaton, while I guess that a unified representation would demand the 
automaton to be extracted from a potentially more complex graph. 
Anyway, I suggest that in "making clear that lattices represent 
abstractions of the signal", you elaborate the rationale for the 
"limitations" imposed on the emma:lattice.
 
Thanks again for your attention and for your work,
 
        Paolo Martini
 
 
On 4 July 2007, at 16:41, <johnston@research.att.com> wrote:
 
> 
> Dear Paolo Martini,
>  
> Thanks Paolo for your thoughtful comments on the EMMA specification. The
> W3C Multimodal Working group has discussed these comments in some detail
> and formulated the following response.
>  
> You are correct that emma:node elements are intended to correspond to instants.
> Regarding 1., we agree that as it stands the ability to place both
> emma:start and emma:end on emma:node appears to allow a duration. This
> is an error in the current draft, as we did not intend for emma:start and emma:end
> to be used on emma:node. In the next draft of the EMMA specification and
> the corresponding schema we will remove the emma:start and emma:end
> attributes from emma:node.
>  
> With respect to point 2., the primary motivation for the addition of
> emma:node was to provide a place for annotations
> which apply specifically to nodes rather than to
> arcs. For example, in some representations of speech recognition lattices,
> confidences or weights are placed on nodes in addition to arcs.
> For this reason we define both nodes and arcs.
> It is critical that we have both timestamps and start-node and end-node
> annotations on arcs, as they serve different purposes. The role of
> the 'from' and 'to' annotations on arcs is to define the topology of the
> graph. On the other hand, the timestamps emma:start and emma:end are annotations
> which describe temporal properties associated not necessarily with the arc but
> with the label on the arc. There is in fact no guarantee that the emma:end on
> 'flights' in your example will be equivalent to the emma:start on 'to'.
> If they were required to be the same, the transition point from one arc to the next
> would have to be assigned to an arbitrary point in the silence between the two
> words. Similarly, if there is no silence between two words in sequence, and
> in fact they may share a geminate consonant (for example "well lit" or "gas station"),
> word timings from the recognizer may in fact overlap; that is, the end of
> the arc for the word "well" may be later than the beginning of "lit".
>  
> Perhaps an even stronger case for having both time and the 'from' 'to'
> annotations is that in the lattice representation being at a particular
> time point does not guarantee that you are on the same node in the lattice.
> For example, imagine a lattice representing two possible strings:
>  
> 'to boston'
> 'two blouses'
>  
> The lattice representation:
>  
> <emma:lattice initial="1" final="4">
> <emma:arc from="1" to="2" emma:start="1000" emma:end="2000">to</emma:arc>
> <emma:arc from="1" to="3" emma:start="1000" emma:end="2000">two</emma:arc>
> <emma:arc from="2" to="4" emma:start="2000" emma:end="4000">boston</emma:arc>
> <emma:arc from="3" to="4" emma:start="2000" emma:end="4000">blouses</emma:arc>
> </emma:lattice>
>  
> Note that even though the first two arcs end at the same time point
> those arcs lead to different states 2 vs. 3, encoding which path
> has been taken in the graph.
>  
> The critical factor here is that the lattice representation does not
> necessarily have to correspond to a time sequence. The lattice representation
> is used to encode a range of possible interpretations of a signal. It is
> often the case that the left to right sequence of symbols in the lattice corresponds to
> time but there is no guarantee. For example, the lattice may represent
> interpretations of a typed text string rather than speech. It is also possible that
> a semantic representation encoded as a lattice could have time annotations
> on the first arc which are later than time annotations on the final arc.
> Since lattices represent abstractions over the signal we cannot assume
> that time annotations define their topology.
>  
> In order to clarify this we will add text to the
> specification making clear that lattices represent abstractions of the
> signal, and that time annotations may describe labels rather than arcs.
>  
> We would greatly appreciate it if you would review this response and
> respond within three weeks indicating whether this resolves
> your concern. If we do not receive a response within three weeks we
> will assume that this response resolves your concern.
>  
>  
> best
> Michael Johnston
> W3C Multimodal Working Group
>  
>  
>  
>  
> Dear W3C Multimodal working group,
>  
> I only recently approached EMMA and I have some problems understanding
> the temporal anchoring of an emma:node.
>  
> I would instinctively expect a node to correspond to what ISO 8601
> calls an "instant", a "point on the time axis".
>  
> With reference to paragraph 3.4, if I read correctly the document:
>  
> 1. An emma:node can be anchored with absolute or relative timestamps.
> In the absolute mode, the optional emma:start and emma:end attributes
> seem to allow a duration, while in the relative mode, the optional
> emma:offset-to-start (with emma:duration not allowed) seems to force an
> instant status.
> If, conceptually, a node is allowed to correspond to a segment of the
> signal, I would welcome a comment on the rationale for that. If not, I
> would suggest to replace emma:start and emma:end with a single "time
> point"-like attribute or, at least, to forbid emma:end, implicitly
> adding ambiguity in the semantics of emma:start.
>  
> 2. An emma:arc implicitly asserts the existence of two nodes, but I
> would say that the temporal attributes of the arcs, if present, define
> those nodes. A node could be therefore defined more than once. I
> simplify the example in 3.4.2:
> <emma:arc from="1" to="2"
> emma:start="1087995961542" emma:end="1087995962042">flights</emma:arc>
> <emma:arc from="2" to="3"
> emma:start="1087995962042" emma:end="1087995962542">to</emma:arc>
> Node 2 being the same, what if emma:end in the "flights" arc and
> emma:start in the "to" arc do not have the very same value?
> Again, if this is conceptually allowed, I would welcome an explanation
> of the rationale. Otherwise, I would prefer enforcing a coherent
> description directly in the language instead of relying on validity
> checks. For example, restricting the "definition" of nodes inside
> node:element, i.e. forbidding timestamps in arcs.
>  
> I went through the document and the list archive and I wasn't able to
> find answers to these doubts. Nevertheless, I apologize if these points
> have already been addressed.
> Thanks for your help and your work,
>  
>    Paolo Martini
>
Received on Friday, 3 August 2007 17:30:10 UTC