Re: Embedded Content from Robert Sanderson on 2014-10-14 (public-annotation@w3.org from October 2014)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Tue, 14 Oct 2014 09:08:54 -0700
To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Annotation WG <public-annotation@w3.org>
Message-ID: <CABevsUFNkYZ8QEmeaT8NeLBZPL9LmsjfrsVMajA5L5nXyN6gFQ@mail.gmail.com>

Hi Stian,

Thanks for the detailed response!

On Tue, Oct 14, 2014 at 8:38 AM, Stian Soiland-Reyes <
soiland-reyes@cs.manchester.ac.uk> wrote:

> At first, I like the simplicity of your approach.
>
> Ideally I would have hoped for Content-in-RDF to be adopted by this WG
> so that we could just put it out (~ as-is) and make it official for
> anyone else - but I guess that is not allowed by our charter.
>

I'm not sure that we'd be allowed to co-opt their namespace regardless of
whether they're doing anything with it.


> One feature of the Content-in-RDF model is that it allows both string
> and binary representations concurrently, indicating character set for
> interpreting the bytes.


My concerns with both representations at once are:

1. What is a system supposed to do when they're different?
2. When would a system ever use the Base64 version when it has the
decoded characters already?

If the answer to these is that bytes and chars can be different
representations, then I think that's a bug rather than a feature.
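
To make question 1 concrete, here is a minimal Python sketch of the check a
consumer would be forced to run if both forms were present. The function
name and the UTF-8 default are my own assumptions for illustration; nothing
here comes from Content-in-RDF or from the proposal itself.

```python
import base64


def representations_agree(chars: str, b64: str, encoding: str = "utf-8") -> bool:
    """Return True if the base64 payload decodes to exactly the given characters."""
    try:
        decoded = base64.b64decode(b64, validate=True).decode(encoding)
    except ValueError:
        # Covers malformed base64 (binascii.Error) and bytes that are not
        # valid in the stated encoding (UnicodeDecodeError).
        return False
    return decoded == chars


print(representations_agree("Hello world", "SGVsbG8gd29ybGQ="))
print(representations_agree("Hello there", "SGVsbG8gd29ybGQ="))
```

Even with such a check, the sketch only detects a mismatch; it does not say
which of the two representations should win.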

And my concerns with characterEncoding:

3.  The serialization should have a character encoding, not a single
literal within the graph.
4.  The encoding should be UTF-8 regardless.
5.  I'm not sure that there would be many systems that actually did
anything with the value, if it was supplied.



> This is very useful for embedding content from
> non-web resources (e.g. files on a USB stick or stdout from a command
> line tool), as one cannot always be sure about the "stringiness" of
> the value and in particular of the character set of the bytes:
>

I have some sympathy, but I find it hard to construct a convincing use case
in the context of annotation.


> Your approach is using rdf:value, where only one representation is
> possible. A duality of representations might not be needed as much in
> annotation systems, and could alternatively be asserted using
> prov:alternateOf (and prov:wasDerivedFrom) statements to a secondary
> oa:Content - I must admit this makes it clearer the direction of
> provenance:
>

+1 to splitting into two resources as below.


> :value1 a oa:Content ;
>     rdf:value "Hello world" ;
>     prov:alternateOf :value1Bytes ;
>     prov:wasDerivedFrom :value1Bytes .
>
> :value1Bytes a oa:Content ;
>     rdf:value "SGVsbG8gd29ybGQ="^^xsd:base64Binary ;
>     prov:atLocation <file:///tmp/annotation.txt> .
>



> It was mentioned earlier in another forum the challenge of embedding
> resources which have alternative representations (e.g. image/svg+xml
> and image/png). This might be a better way to handle any
> representation dualities - should oa:Content describe such relations?
>

We have oa:Choice for handling that.

Currently it would look like the following, though see
https://github.com/w3c/web-annotation/issues/2:

{
  "@type": "oa:Choice",
  "default" : {
    "@type": "oa:Content",
    "format" : "image/svg+xml",
    "value": "<svg:svg ...>"
  },
  "item" : {
    "@type": "oa:ContentAsBase64",
    "format" : "image/png",
    "value" : "91843709tuhasdfglkjdhfg..."
  }
}
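
As a rough sketch of how a client might consume such a choice, the Python
below picks the first option whose format it can render, preferring the
default. The function name, the `supported` set and the placeholder values
are hypothetical; only the "default", "item" and "format" keys come from
the example above.

```python
from typing import Optional

# Hypothetical body mirroring the oa:Choice structure; values are placeholders.
choice = {
    "@type": "oa:Choice",
    "default": {"@type": "oa:Content", "format": "image/svg+xml", "value": "<svg:svg/>"},
    "item": {"@type": "oa:ContentAsBase64", "format": "image/png", "value": "aGVsbG8="},
}


def pick_representation(choice: dict, supported: set) -> Optional[dict]:
    """Return the first option the client supports, trying the default first."""
    items = choice.get("item", [])
    if isinstance(items, dict):
        items = [items]  # a single item may appear as an object rather than a list
    for option in [choice.get("default"), *items]:
        if option is not None and option.get("format") in supported:
            return option
    return None


print(pick_representation(choice, {"image/png"})["format"])
```

A client that only handles PNG would skip the SVG default and fall through
to the base64 item.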



> dc:format is a fairly weak property. It has been commonly used with
> IANA media types as in your example, but it is very poorly defined.
> Other valid dc:format strings are "book", "VHS" and "poster".
>

Agreed. The question to me is whether it's going to conflict with other
systems making "VHS" assertions?
Alternatively, is there a better property that already exists, or would we
be minting our own?

> It is also unclear if parameters can be included, e.g.
> "application/ld+json; profile=http://example.com/p1".
>

This is more problematic.  I would like this to be possible.  Given the
looseness of dc:format, I think it's okay?
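
If parameters were allowed in the string, a consumer would need to separate
them from the bare media type. A minimal Python sketch follows; the function
name is hypothetical, and the naive split on ";" is an assumption that
ignores quoted parameter values containing semicolons.

```python
from typing import Dict, Tuple


def split_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; k=v' into the bare media type and its parameters."""
    media_type, *raw_params = [part.strip() for part in value.split(";")]
    params = {}
    for raw in raw_params:
        name, sep, val = raw.partition("=")
        if sep:  # skip fragments that have no '=' at all
            params[name.strip().lower()] = val.strip().strip('"')
    return media_type.lower(), params


print(split_media_type("application/ld+json; profile=http://example.com/p1"))
```

The alternative of a separate property for parameters would make this
parsing step unnecessary.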


> If a type is known and identifiable with a URI, but not officially
> registered with IANA, it might be odd for a third party to mint an
> x-type. An example of such a type is the Systems Biology Markup
> Language (SBML), whose identifier includes version, compliance level, etc.:
>
> http://identifiers.org/combine.specifications/sbml.level-3.version-1.core.release-1
>

What would a client system do with this information?  And when would such a
thing be embedded in an annotation?



> I have previously used dct:format instead. You lose some niceness as
> it gets a bit more verbose if you want it to be complete:
>
> <http://example.com/page.html> dcterms:format
> <http://purl.org/NET/mediatypes/text/html> .
>
> <http://purl.org/NET/mediatypes/text/html> a dcterms:FileFormat ;
>     dcam:memberOf dcterms:IMT ;
>     rdf:value "text/html" ;
>     rdfs:isDefinedBy <http://mediatypes.appspot.com/dump.rdf> ;
>     rdfs:label "HTML document" .
>

In the minimal case where we want to only record media type...

{
  "dct:format": { "value" : "text/html" }
}

Yes?  And then this pattern allows other systems to include URIs as it's a
resource.
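
For what it's worth, a consumer could tolerate both the plain-string form
and the resource form with very little code. This Python sketch assumes the
key names "format", "dct:format" and "value" from the examples in this
thread; the function name is hypothetical.

```python
from typing import Optional


def media_type_of(body: dict) -> Optional[str]:
    """Return the media type whether format is a plain string or a resource."""
    fmt = body.get("dct:format", body.get("format"))
    if isinstance(fmt, dict):
        # Resource form: the media type string sits in the "value" property.
        return fmt.get("value")
    return fmt


print(media_type_of({"dct:format": {"value": "text/html"}}))
print(media_type_of({"format": "text/plain"}))
```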



> But overall I would much prefer the use of dct:format (or a
> sub-property oa:format which we say has range dcterms:FileFormat) to
> be able to have a resource that:
>
>  a) I can extend with additional properties
>

+0 ... unless there's a use case (other than label)?


>  b) Can be a non-IANA type, e.g.
>
> http://identifiers.org/combine.specifications/sbml.level-3.version-1.core.release-1


+0 ... not sure what a client would do with this, but I see the attraction.


>  c) Can have a classical IANA media type main/sub value with rdf:value
>  (TODO: parameters allowed?)
>

+1, and +1 to parameters being allowed ... or in a separate property?

>  d) Can have a human-readable rdfs:label - e.g. "Microsoft Word document"
>

+1 to this property, this is almost convincing by itself.


>  e) A common URI pattern for any registered IANA type - e.g.
> http://www.iana.org/assignments/media-types/application/pdf (If they
> agree) or http://purl.org/NET/mediatypes/text/plain
>

-1 out of scope for us to fix this if IANA don't care?


> On the other side I think dct:language (with range
> dct:LinguisticSystem) would be too verbose, so I would keep
> dc:language as long as we also recommend RFC 4646 for identifying the
> language.
>

Yes, though RFC 5646 [1] obsoletes RFC 4646, so with that slight tweak, I agree.


However, to play devil's advocate, there is some use of lexvo for identifying
languages as linked data, which similarly allows labels and so forth to be
included. This would not be too ugly given a sensible JSON-LD context that
hides the complexity...
    {"dct:language" : "lang:en"}

Thanks Stian!

Rob

[1] http://tools.ietf.org/html/rfc5646





> On 12 Oct 2014 21:43, "Robert Sanderson" <azaroth42@gmail.com> wrote:
> > One of the most significant changes that we need to make is what to do
> > about the use of the seemingly abandoned Content in RDF specification.
> >
> > The issue:
> > * https://github.com/w3c/web-annotation/issues/3
> > * http://www.w3.org/annotation/track/issues/1
> >
> > The proposal in the github issue is to create two new classes for
> > embedded plain text and embedded base64 encoded text, corresponding to
> > cnt:ContentAsText and cnt:ContentAsBase64 respectively.
> >
> > These classes would use the properties:
> > * rdf:value -- for recording the content (required)
> > * dc:format -- for the media type of the content (optional)
> > * dc:language -- for the language of the content (optional)
> >
> > In JSON-LD this might look like:
> >
> > {
> >   "@type": "oa:Content",
> >   "value": "I love this book!",
> >   "format": "text/plain",
> >   "language": "en"
> > }
> >
> > Comments?
> >
> > Thanks!
> >
> > Rob
> >
> >
> > --
> > Rob Sanderson
> > Technology Collaboration Facilitator
> > Digital Library Systems and Services
> > Stanford, CA 94305
>



-- 
Rob Sanderson
Technology Collaboration Facilitator
Digital Library Systems and Services
Stanford, CA 94305
Received on Tuesday, 14 October 2014 16:09:23 UTC