Re: Embedded Content from Stian Soiland-Reyes on 2014-10-14 (public-annotation@w3.org from October 2014)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Tue, 14 Oct 2014 16:38:06 +0100
To: Robert Sanderson <azaroth42@gmail.com>
Cc: Annotation WG <public-annotation@w3.org>
Message-ID: <CAPRnXtnVoNuh==TwmeXGdHck52Z4gQTksoQsfgUSM0dCXuxY=A@mail.gmail.com>
First, I like the simplicity of your approach.

Ideally I would have hoped for Content-in-RDF to be adopted by this WG
so that we could just put it out (~ as-is) and make it official for
anyone else - but I guess that is not allowed by our charter.


One feature of the Content-in-RDF model is that it allows both string
and binary representations concurrently, indicating character set for
interpreting the bytes. This is very useful for embedding content from
non-web resources (e.g. files on a USB stick or stdout from a command
line tool), as one cannot always be sure about the "stringiness" of
the value and in particular of the character set of the bytes:

:value1 a cnt:ContentAsText, cnt:ContentAsBase64 ;
  cnt:bytes "SGVsbG8gd29ybGQ="^^xsd:base64Binary ;
  cnt:chars "Hello world" ;
  cnt:characterEncoding "ASCII" .
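
As a rough illustration of why the declared character set matters when
both forms are present, the following Python sketch checks that a
cnt:bytes value decodes to the cnt:chars value under the stated
encoding. The function name bytes_match_chars is hypothetical and is
not part of Content-in-RDF; only the standard library is assumed.

```python
import base64


def bytes_match_chars(b64_bytes: str, chars: str, encoding: str) -> bool:
    """Return True if the base64 payload decodes to `chars` under `encoding`.

    Hypothetical helper for illustration only: it mirrors the
    cnt:bytes / cnt:chars / cnt:characterEncoding triple above.
    """
    raw = base64.b64decode(b64_bytes)
    try:
        return raw.decode(encoding) == chars
    except (UnicodeDecodeError, LookupError):
        # Bytes are not valid in this encoding, or the encoding name is unknown.
        return False


print(bytes_match_chars("SGVsbG8gd29ybGQ=", "Hello world", "ASCII"))
```

If the bytes came from a source with an unknown or wrong character
set, the check fails, which is exactly the case where keeping the
binary form alongside the string is useful.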



Your approach uses rdf:value, where only one representation is
possible. A duality of representations might not be needed as much in
annotation systems, and could alternatively be asserted using
prov:alternateOf (and prov:wasDerivedFrom) statements to a secondary
oa:Content. I must admit this makes the direction of provenance
clearer:


:value1 a oa:Content ;
    rdf:value "Hello world" ;
    prov:alternateOf :value1Bytes ;
    prov:wasDerivedFrom :value1Bytes .

:value1Bytes a oa:Content ;
    rdf:value "SGVsbG8gd29ybGQ="^^xsd:base64Binary ;
    prov:atLocation <file:///tmp/annotation.txt> .

(Describing the character set, checksums etc. would require additional
vocabularies and perhaps a PROV activity)



The challenge of embedding resources which have alternative
representations (e.g. image/svg+xml and image/png) was mentioned
earlier in another forum. This might be a better way to handle any
representation dualities - should oa:Content describe such relations?



dc:format is a fairly weak property. It has been commonly used with
IANA media types as in your example, but it is very poorly defined.
Other valid dc:format strings are "book", "VHS" and "poster".

It is also unclear if parameters can be included, e.g.
"application/ld+json; profile=http://example.com/p1".
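
To make the question concrete: a consumer that receives such a string
would have to split the bare type from its parameters itself. The
sketch below is a deliberately simplified splitter (the name
split_media_type is hypothetical, and it does not handle quoted
parameter values that contain ";"), shown only to illustrate the extra
parsing a literal dc:format pushes onto clients.

```python
from typing import Dict, Tuple


def split_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; k=v; ...' into the bare type and its parameters.

    Simplified for illustration: quoted values containing ';' are not
    supported, so this is not a complete media type parser.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params: Dict[str, str] = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


print(split_media_type("application/ld+json; profile=http://example.com/p1"))
```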

If a type is known and identifiable with a URI, but not officially
registered with IANA, it might be odd for a third party to mint an
x- type. An example of such a type is the Systems Biology Markup
Language (SBML), whose identifier includes version, compliance level
etc.:
http://identifiers.org/combine.specifications/sbml.level-3.version-1.core.release-1

With dc:format pointing to a literal, we have to resort to an
xsd:anyURI literal, which does not make it Linked Data. (On the other
hand dc:format is so poorly defined you could also use it as an object
property to a resource!)

I have previously used dct:format instead. You lose some niceness as
it gets a bit more verbose if you want it to be complete:


from https://gist.github.com/stain/4635250


<http://example.com/page.html> dcterms:format
<http://purl.org/NET/mediatypes/text/html> .

<http://purl.org/NET/mediatypes/text/html> a dcterms:FileFormat ;
    dcam:memberOf dcterms:IMT ;
    rdf:value "text/html" ;
    rdfs:isDefinedBy <http://mediatypes.appspot.com/dump.rdf> ;
    rdfs:label "HTML document" .


(note - rdf:value there again)





Secondly, the URIs for IANA media types are not quite in this century
yet - see my suggestion to IANA.

http://www.ietf.org/mail-archive/web/media-types/current/msg00617.html



But overall I would much prefer the use of dct:format (or a
sub-property oa:format which we say has range dcterms:FileFormat), so
that the format is a resource that:

 a) can be extended with additional properties
 b) can be a non-IANA type, e.g.
http://identifiers.org/combine.specifications/sbml.level-3.version-1.core.release-1
 c) can carry a classical IANA media type main/sub value with rdf:value
 (TODO: parameters allowed?)
 d) can have a human-readable rdfs:label - e.g. "Microsoft Word document"
 e) follows a common URI pattern for any registered IANA type - e.g.
http://www.iana.org/assignments/media-types/application/pdf (if they
agree) or http://purl.org/NET/mediatypes/text/plain
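
The common URI pattern in the last point could be applied mechanically.
The Python sketch below assumes the purl.org base shown above as the
default; the function name media_type_uri is hypothetical, and whether
either base URI actually resolves for every registered type is not
something this sketch checks.

```python
PURL_BASE = "http://purl.org/NET/mediatypes/"


def media_type_uri(media_type: str, base: str = PURL_BASE) -> str:
    """Build a format URI from a bare main/sub media type (no parameters)."""
    main, sep, sub = media_type.strip().lower().partition("/")
    if not sep or not main or not sub or ";" in sub:
        # Rejects values such as "book" or types that still carry parameters.
        raise ValueError(f"not a bare main/sub media type: {media_type!r}")
    return f"{base}{main}/{sub}"


print(media_type_uri("text/plain"))
print(media_type_uri("application/pdf",
                     base="http://www.iana.org/assignments/media-types/"))
```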





On the other hand, I think dct:language (with range
dct:LinguisticSystem) would be too verbose, so I would keep
dc:language as long as we also recommend RFC 4646 for identifying the
language.



On 12 Oct 2014 21:43, "Robert Sanderson" <azaroth42@gmail.com> wrote:
>
>
> One of the most significant changes that we need to make is what to do about the use of the seemingly abandoned Content in RDF specification.
>
> The issue:
> * https://github.com/w3c/web-annotation/issues/3
> * http://www.w3.org/annotation/track/issues/1
>
> The proposal in the github issue is to create two new classes for embedded plain text and embedded base64 encoded text, corresponding to cnt:ContentAsText and cnt:ContentAsBase64 respectively.
>
> These classes would use the properties:
> * rdf:value -- for recording the content (required)
> * dc:format -- for the media type of the content (optional)
> * dc:language -- for the language of the content (optional)
>
> In JSON-LD this might look like:
>
> {
>   "@type": "oa:Content",
>   "value": "I love this book!",
>   "format": "text/plain",
>   "language": "en"
> }
>
> Comments?
>
> Thanks!
>
> Rob
>
>
> --
> Rob Sanderson
> Technology Collaboration Facilitator
> Digital Library Systems and Services
> Stanford, CA 94305
Received on Tuesday, 14 October 2014 15:38:55 UTC