Semantic Tags (was several threads)

To try to pull the threads together ...

Issue:  If there is a document which an annotator wants to use as a
semantic tag, then it is not possible to say that it's an oa:Tag, as
that information is specific to the Annotation.

Use cases: Many use cases, especially in bioinformatics.

Severity:  Difficult to determine and somewhat mitigated by the
(unanimous?) consensus that it is bad modeling and against the
architecture of the WWW to have a URI identify both a concept and a
document at the same time.  Severe enough in communities that need it
that it would be great if it was addressed.

Current:  The spec does not say exactly how to solve the problem, but
recommends minting a new URI for the tag and relating it "somehow" to
the document. It also has a single oa:Tag class, and relies on the
presence or absence of cnt:chars to tell textual tags from semantic ones.

First, regarding oa:Tag versus oa:SemanticTag:

* Under the open world assumption, the absence of cnt:chars only
means "we don't know if it's a semantic tag or not".
* cnt:chars is not our predicate, so it's not ours to attach
additional semantics to its presence, or lack thereof.
* If you get an HTTP URI that calls itself a tag, and has cnt:chars,
it's unclear what to do.

Thus the proposal is to have a subclass, oa:SemanticTag to avoid these
situations.
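
A rough sketch of how that might look in Turtle. The ex: resources and
the tag text are made up for illustration, and the shape of the textual
tag is my assumption based on the current draft's use of cnt:chars:

```turtle
@prefix oa:   <http://www.w3.org/ns/oa#> .
@prefix cnt:  <http://www.w3.org/2011/content#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Proposed: an explicit subclass, so nothing has to be inferred
# from the presence or absence of cnt:chars.
oa:SemanticTag rdfs:subClassOf oa:Tag .

# A textual tag: typed oa:Tag and carrying its text inline.
ex:tag1 a oa:Tag, cnt:ContentAsText ;
    cnt:chars "paris" .

# A semantic tag: typed explicitly, whatever other properties it has.
ex:tag2 a oa:SemanticTag .
```

With the explicit class, a consumer can test for oa:SemanticTag directly
rather than guessing from missing data.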


There are several implicit proposals as to the model, all of which
further clarify the current recommendation:

1.  (Rob) Use Specific Resource with an oa:SemanticTag class. Then the
object of oa:hasSource is the document.  Objection from Antoine: This
is abusing Specific Resources.

2.  (Antoine) Use an oa:SemanticTag class, with foaf:isPrimaryTopicOf.
Objection from Rob: it's inverse functional, so the same document
couldn't be used for different semantic concepts. As the URI for the
tag resource is likely going to be a UUID or a blank node, this could
have unfortunate repercussions.

3.  (Rob) Use an oa:SemanticTag class, with foaf:page.  This is the same
as 2. but with a looser predicate that isn't inverse functional.
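
To make the three shapes concrete, here is a rough Turtle sketch. The
ex: tag URIs are made up, the OMIM page is borrowed from Paolo's example
further down the thread, and option 2 is written with FOAF's property
name foaf:isPrimaryTopicOf. Treat this as illustration, not as agreed
spec text:

```turtle
@prefix oa:   <http://www.w3.org/ns/oa#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Option 1: a Specific Resource typed as a semantic tag;
# the document is its source.
ex:tagA a oa:SpecificResource, oa:SemanticTag ;
    oa:hasSource <http://omim.org/entry/104760> .

# Option 2: a newly minted tag whose primary-topic page is the document.
# The predicate is inverse functional, so a second tag pointing at the
# same page would be inferred to be the same resource as ex:tagB.
ex:tagB a oa:SemanticTag ;
    foaf:isPrimaryTopicOf <http://omim.org/entry/104760> .

# Option 3: same shape with the looser foaf:page, which allows several
# distinct tags to share one document.
ex:tagC a oa:SemanticTag ;
    foaf:page <http://omim.org/entry/104760> .
```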


Thanks all!  Please correct and add to this if I misunderstood or
misrepresented anything :)

Rob


On Mon, Feb 4, 2013 at 5:59 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> Hi Stian,
>
> Indeed there's not much way CNT could constrain the use of cnt:chars, maybe
> it's difficult to write a formal spec of what would qualify as "content" in
> an RDF environment. It just requires that users would "get it right"--just
> as many other elements in OA or elsewhere (OA motivations, for a start).
>
> Now, if we don't trust CNT to be used right, nothing prevents us from
> coining a new (sub)property to replace cnt:chars.
>
> Antoine
>
>
>
>> I know this is taking it a bit out on an edge. I am primarily just
>> worried about having implied semantics based on the presence or not of
>> a property which is not even ours.  Such usage may mainly sound
>> stupid in the examples we make up, but it is not disallowed by other
>> specifications, and I don't think we can mandate how other
>> vocabularies should be used on non-OA resources.
>>
>>
>> On Mon, Feb 4, 2013 at 11:16 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
>>>
>>> Hi Stian,
>>>
>>> All this is leading us into deep ontological thinking...
>>> The baseline is that Content in RDF is for "Content", ie. just encoding
>>> of stuff, the content of a file. When somebody with no knowledge of
>>> biology types "GATTTTTTTTTTACA" it's not a nucleotide sequence, it's a
>>> string. The T there has as much semantics as the t in "Stian".
>>>
>>> Even if a nucleotide sequence may not need to refer to molecules to be
>>> operational, bioinformaticians still assume something more than a string
>>> of literals. You're expected to do something with it that has certain
>>> semantics, even if they are low-level: ie., the main splitting level is
>>> the one of individual symbols (letters), you can't have an X in it, etc.
>>>
>>> As you say the string represents the sequence, and that still hints at a
>>> quite important difference in level. The value of cnt:chars does not
>>> represent content, it is the content.
>>>
>>> Antoine
>>>
>>>
>>>
>>>> On Fri, Feb 1, 2013 at 5:18 PM, Robert Sanderson <azaroth42@gmail.com>
>>>> wrote:
>>>>
>>>>> http://dbpedia.org/resource/Paris doesn't identify a document, so
>>>>> there's no confusion as to whether to dereference it or not.
>>>>
>>>>
>>>>
>>>> No, here we are lucky in that dbpedia.org is playing by the rules.
>>>>
>>>>> Using documents as *semantic* tags is simply bad modeling.  Do you
>>>>> mean the document or the semantic concept (eg my home page or me).
>>>>> Surely this has been discussed long enough in other contexts that we
>>>>> don't have to rehash it here?
>>>>
>>>>
>>>>
>>>> Of course. I am not saying that it is not bad modelling. I am just
>>>> trying to say you would find this in the wild, and it would not be
>>>> against the current specifications for HTTP, HTML, RDF, etc.
>>>>
>>>> In particular you would find hash-URIs like
>>>> <http://example.com/aDocument.rdf#concept> - now is that covered by
>>>> the not-recommended "URI of a document"? That is unclear from the
>>>> current wording.
>>>>
>>>> Also you would find examples like <http://omim.org/entry/104760> by
>>>> Paolo, of course here the omim.org site is 'innocent' in that they
>>>> never intended to mint a semantic concept. That should not preclude
>>>> users of OA to use it as such.
>>>>
>>>>> But to assert that a non information resource, the city of Paris, has
>>>>> content is clearly wrong.
>>>>
>>>>
>>>>
>>>> I agree that would be silly for Paris. But we don't know what other
>>>> users of other concepts have done using Content-in-RDF, which is
>>>> another specification. There is nothing in the Content-in-RDF spec
>>>> that would not allow it to be used such. cnt:Content does not mandate
>>>> that the resource is an information resource.
>>>>
>>>>> The cnt:Content class is an overarching class for any content that
>>>>> could be found on the Web, in an Intranet or in local storage media,
>>>>> for example. It is recommended always to use one of its subclasses.
>>>>> There is no restriction within the vocabulary scope on what can be
>>>>> represented with this class: textual content, XML files, binary files
>>>>> (e.g., images or movies), etc.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>> For instance,
>>>>>> semantic tags identifying genome sequences might very well be
>>>>>> including the actual genome sequence (like "GATTATTATATATATAGATTACA")
>>>>>> as cnt:chars.
>>>>>
>>>>>
>>>>> And that too would be wrong.  The biological genome in the real world
>>>>> does not contain a string of characters in UTF-8 like that.
>>>>
>>>>
>>>>
>>>> No, but they are commonly represented as such.  Just like a person's
>>>> name is not a string of characters in UTF-8. A nucleotide sequence is
>>>> the primary representation that they are recognized as. I asked two
>>>> bioinformaticians separately:
>>>>
>>>>
>>>> [10:18:59] Stian Soiland-Reyes: What would you call this (type of)
>>>> thing?
>>>> GATTTTTTTTTTTTTTTACCCACACACACA
>>>> [10:35:51] Stian Soiland-Reyes: ignoring finer details such as introns
>>>> etc
>>>> [10:35:55] Kristina Hettne: a DNA sequence
>>>>
>>>>
>>>> [10:18:56] Stian Soiland-Reyes: What would you call this (type of)
>>>> thing?
>>>> GATTTTTTTTTTTTTTTACCCACACACACA
>>>> [10:19:19] Katy Wolstencroft: a nucleotide sequence
>>>>
>>>>
>>>> So just like you would call "Paris" a city (or the name of a city),
>>>> they would identify it as a sequence, and that's the abstraction level
>>>> they work on, not on particular molecules inside a cell found inside a
>>>> particular organism in this lab.
>>>>
>>>>
>>>>
>>>>
>>>>>  From Content-in-RDF:
>>>>
>>>>
>>>>
>>>>> cnt:chars
>>>>> The character sequence of the given content.
>>>>
>>>>
>>>>
>>>>
>>>> So I think there is nothing stopping anyone from doing:
>>>>
>>>>
>>>> <http://example.com/gene/1337>   a :NucleotideSequence ;
>>>>       :sequence "GATTTTTTTTTTACA" .
>>>>
>>>> :sequence a owl:DatatypeProperty ;
>>>>       rdfs:subPropertyOf cnt:chars ;
>>>>       rdfs:domain :NucleotideSequence .
>>>>
>>>> Their reason for using cnt:chars here could be that a GATC letter
>>>> transcription of a genome sequence is the primary representation of
>>>> the abstract concept of a nucleotide sequence in the field.
>>>>
>>>>
>>>>
>>>> But now I (who we can pretend did not write the above) can't use
>>>> <http://example.com/gene/1337> as an OA semantic tag, because it
>>>> happens to have an (implied) cnt:chars property, and I would be
>>>> seeming to say that the user has tagged "GATTTTTTTTTTACA" as a text.
>>>> The example.com guys should not be required to read the OA specs to
>>>> prevent this, they just follow Content-in-RDF.
>>>>
>>>>
>>>>> Yes, but that particular plague makes everything practically unusable.
>>>>>    Does this specific resource have a state? I don't know! How many
>>>>> targets are there for the Annotation? I don't know, there could be
>>>>> others that I don't know about! Does this Annotation have a body? I
>>>>> don't know, please just let me get on with my job! etc. :)
>>>>
>>>>
>>>>
>>>> I know, we don't want to go there. However it is one thing to go from
>>>> "unspecific to specific" (as in adding state), another to totally
>>>> change the semantic "if unspecified, it's X, otherwise it's Y (which
>>>> is not Y!)".
>>>>
>>>>
>>>>> <anno1> a oa:Annotation ;
>>>>>     oa:hasSemanticTag <composite1> ;
>>>>>     oa:hasTarget <target1> .
>>>>>
>>>>> <composite1> isn't intended as a semantic tag. But if we allow any
>>>>> URI to be used as a tag, nothing prevents someone from saying it is.
>>>>> So already we have trouble.
>>>>
>>>>
>>>>
>>>> Ah, I had not thought about this case. Yes, now oa:hasSemanticTag is
>>>> very misleading. So we would have to disallow both Composite and
>>>> Specific Resource indirections in my proposal, which would make it
>>>> very special case.
>>>>
>>>>> Here, <textualbody1> is the resource that <semantictag1> was
>>>>> extracted from.  The semantics of Composite are that all of the items
>>>>> are required, which is what the publisher wants to convey.
>>>>> Except textualbody isn't a tag. Nor is composite1.  This is the same
>>>>> argument as against a new predicate for literals as bodies.
>>>>
>>>>
>>>>
>>>> If you want to annotate that I would propose that as an independent
>>>> provenance statement (<composite1>/<anno1> pav:importedFrom
>>>> <textualbody1>), and not conflate it into the very same annotation.
>>>>
>>>> If you are trying to say that the user typed in the <textualbody1> as
>>>> an annotation on <target1>, and the system has subsequently found
>>>> some semantic tag in the <textualbody1>, then I would try to do the
>>>> second step as a second annotation <anno2> with targets both
>>>> <textualbody1> and <target1> (with an optional provenance trace of
>>>> <anno2> pav:importedFrom <textualbody1> ; pav:derivedFrom <anno1>)
>>>>
>>>>
>>>>> If there's a solution that allows a mix of body types, I would be
>>>>> overjoyed!  But I can't see how to do that without introducing any of:
>>>>> 1. a node in between (as current spec for documents); 2. a class or
>>>>> other property (as current spec for non documents); or 3. a new
>>>>> predicate (that gets us in trouble)
>>>>
>>>>
>>>>
>>>> I like the suggestion in your next email, which is to subclass/type a
>>>> SpecificResource for this purpose. This solves nicely the problems
>>>> above, and also avoids introducing a new, independent concept.  It
>>>> does structurally mean that we have to split or move the Tagging
>>>> section.
>>>>
>>>> Perhaps, counter to my previous reply, the best solution would be a
>>>> split. Let the Tagging section stay where it is - textual tagging is a
>>>> quite primary type of annotation we should support at "level 1".
>>>> Semantic tagging is a more advanced feature, and can be presented with
>>>> the specifiers as a new section 3.6 - a specialization of the level 1
>>>> tagging.  The first section will then just say "For semantic tagging;
>>>> see section X.X."
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Received on Monday, 4 February 2013 17:19:09 UTC