Re: Schema.org property cardinality and use of plural (WAS Re: SoftwareApplication proposal for schema.org)

From: Dan Brickley <danbri@danbri.org> · Date: Fri, 2 Mar 2012 16:58:50 +0100

Hi Tantek,

On 2 March 2012 01:41, Tantek <tantek@cs.stanford.edu> wrote:
> tl;dr: making all properties optional and plural at a
> syntactical/schema level is more future proof.
>
> slightly longer:
>
> I saw the wiki page: http://www.w3.org/wiki/WebSchemas/Singularity but
> I have such fundamental disagreements with its content that I'm
> following up in email instead (wiki follow-up is my preference by default).

I hope some of that disagreement comes from poor description on my
part, rather than deeper differences.

Yesterday I asked the W3C team to add a syntax highlighter to the main
W3C wiki (apparently some groups were already using it), and they've
done that already :) This means we can now post markup examples there
much more easily; I've just made a start on this.

So I've added a couple of examples to the top of
http://www.w3.org/wiki/WebSchemas/Singularity to give a bit more
context and concrete illustration of our current problem.

> 1. bad idea to put cardinality into formats/schemas.
> 2. better to make each property plural at a format/schema level.
> 3. move any notion of "singular" semantics up to the per application
> level (use first occurrence/value of a property if necessary).
>
> Our experience with microformats has shown all this and thus we are
> moving forward with 1-3 above in microformats-2 (explicitly making all
> properties both optional and plural at a syntax level, permitting
> applications to apply any needed singular semantics to first instance
> of a property).

The schema.org schema is essentially a dictionary; it doesn't have a
notion of mandatory properties. It does hint that using common
properties (url etc) is useful, but there is no notion of a schema.org
description that is invalid due to missing information. This (as
you've found with microformats) generally works better at the level of
applications. So while Rich Snippets or Yandex tools might complain
that they don't have enough information to go on, missing out
information is not 'wrong' in some broader schema.org-sense.
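
To make the point concrete, here is a minimal Python sketch of the
"everything optional and plural, applications decide" approach. The item
layout and the helper names (first_value, missing_for_app) are made up
for illustration; they are not part of schema.org, microdata or
microformats-2.

```python
# Hypothetical parsed item: every property maps to a list of values,
# and no property is required by the vocabulary itself.
item = {
    "type": "Person",
    "properties": {
        "name": ["Alice Example"],
        "spouse": ["Bob Example", "Carol Example"],
    },
}


def first_value(item, prop, default=None):
    """Application-level 'singular' reading: take the first occurrence, if any."""
    values = item["properties"].get(prop, [])
    return values[0] if values else default


def missing_for_app(item, wanted):
    """List the properties this particular application wants but did not find."""
    return [p for p in wanted if not item["properties"].get(p)]


print(first_value(item, "spouse"))             # first of several values
print(first_value(item, "url"))                # absent property, not an error
print(missing_for_app(item, ["name", "url"]))  # this application's own wishlist
```

A different application could pass a different wishlist, or read all the
values of 'spouse', without the description itself ever being 'invalid'.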

I haven't yet looked through, property by property, to see which of
the would-be 'singular' properties are most plausible vs questionable.
Personally I find it a bad idea to encode such decisions in the
spelling of the property, because it makes any change vastly more
expensive.

I suspect I'm more comfortable than you with the idea of expressing
such information in schemas, because the schema languages I generally
work with are in the passive/declarative family (RDFS/OWL).
Schema.org's approach to schema is much in that tradition, as it
expresses information about a vocabulary rather than a concrete
format.  In this style of schema, if you say that e.g. dateOfBirth is
a functional property, you're saying something about the meaning of
the term, rather than setting an expectation that every description
should include 1 of those.
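
As a rough illustration of that declarative reading, the Python sketch
below treats "dateOfBirth is functional" as a statement a tool can use
to spot conflicting data, not as a requirement that every description
carries the property. The triple layout, the FUNCTIONAL set and the
function name are assumptions for the example, not an RDFS/OWL API.

```python
from collections import defaultdict

# Assumption for illustration: the vocabulary declares these terms functional.
FUNCTIONAL = {"dateOfBirth"}


def functional_conflicts(triples):
    """Map (subject, property) to its values when a functional property
    has more than one distinct value. Absent properties are never reported."""
    seen = defaultdict(set)
    for subject, prop, value in triples:
        if prop in FUNCTIONAL:
            seen[(subject, prop)].add(value)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


triples = [
    ("#alice", "name", "Alice"),            # no dateOfBirth at all: fine
    ("#bob", "dateOfBirth", "1970-01-01"),
    ("#bob", "dateOfBirth", "1971-02-03"),  # second distinct value: worth a look
]
conflicts = functional_conflicts(triples)
print(conflicts)
```

Only '#bob' is flagged; '#alice' simply has nothing said about her date
of birth.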

When dealing with format schemas (XSD etc.) I also prefer to see
cardinality and other rules expressed more at the application level,
with Schematron being a fine example of an (XML) system that allows
different rules to be policed (or skipped) in different workflow
stages or applications.

So +1 on everything optional, and perhaps a difference of perspective
around the value of expressing things in (some kind of) schema.

I think (from our discussion around expected values for the 'gender'
property in microformats, vCard, FOAF) we also share a concern that
engineers tend to push too hard to get a complex world into tidier,
rigid data structures, at the expense sometimes of the people they're
describing. So often the impulse to express a rigid constraint is
worth questioning, particularly when a vocabulary seeks wide-scale use
by very diverse parties and applications.

> See point (2.) here:
>
> http://microformats.org/wiki/microformats-2#Summary
>
> Related: making properties optional has also been a hard lesson
> learned, with repeated examples, from hCard to hAtom - which to be
> fair took their notions of "required" properties from vCard and Atom,
> though a mistake propagated is still a mistake.

Yup, I think we talked about this at SocialGraphFoo a while back. The
more widely a vocabulary aspires to be used, the less it can be strict
about the exact shape of descriptions that use it.

> In short: people will omit properties when publishing, and often think
> of ways that it makes sense to do so - beyond what the original
> vocabulary designer(s) was/were thinking. Better to define what that
> means on a per vocabulary / application level than make it something
> syntactical.

I agree; missing isn't always broken.

> More inline:
>
> On Thu, Mar 1, 2012 at 10:01 AM, Dan Brickley <danbri@danbri.org> wrote:
>> On 24 February 2012 21:25, Will Norris <will@willnorris.com> wrote:
>>> I had the same question when I first started looking at this.  There is a
>>> certain simplicity in not requiring microdata vocabularies to define
>>> cardinality of properties, and leaves the door open to interesting use cases
>>> that may not have been initially imagined.
>>
>> Yes. Well there are two things here: do we define a cardinality
>
> Experience with microformats has shown attempts to define cardinality
> in formats for publishing on the web create an unnecessary point of
> failure/fragility.
>
> So, no, don't bother.

I should probably have avoided the word 'cardinality'; it has at least
a couple of quite different meanings, depending on the underlying
format or data model.

>> do we also bake that cardinality into the property name with an
>> English plural 's' (or its absence)?
>
> Even worse idea.

I agree.

> Simple answer (using hRecipe as a real world designed/implemented
> source of examples)
>
> http://microformats.org/wiki/hrecipe
>
> 1. use singular forms of English nouns, even for (expected)
> multivalued properties. E.g. "ingredient" is the property name even
> though pretty much all recipes have multiple ingredients (and thus
> instances of that property)

I agree.

> 2. plural forms of English nouns should only be used when it implies
> specific meaning about any instance of the property (e.g. amounts),
> not some implication that the property is or may be multivalued. E.g.
> "instructions" is the property because a property value itself likely
> contains multiple human readable instructions, and in practice recipes
> have a single instance of this property. Related: the "calories"
> extension property (which is an amount).

Yup

>> I'm in favour of defining cardinality when it makes sense to do so
>
> It never makes sense to do so at a format level.

If we can distinguish format vs vocabulary levels, perhaps we can agree here?

>> (and in the full expectation people will ignore or mess up whatever we
>> try to impose).
>
> Exactly why. Instead define application processing of such cases.
>
>
>> But I don't think the experiment of using plural 's'
>> markers has worked well.
>
> Yes, the use of plural forms to indicate anything syntactically or
> semantically automatically was/is a mistake. Let's stop propagating it
> (and shame on whoever thought that experiment was a good idea :P)

Re 'let's stop propagating it', that was a goal of
http://www.w3.org/wiki/WebSchemas/Singularity

>> http://schema.org/Person has 'spouse' rather than 'spouses'. Are we
>> really to assume the property can have at most one singular value?
>> What about re-marriage, or societies (the Web having global reach)
>> where multiple spouses are common?
>
> Great example.
>
> Cultural differences are easily a source of disagreements of
> cardinality. Avoid this problem by leaving out cardinality.
>
>
>> I'd much rather see cardinality expressed schematically
>> than through spelling,
>
> Both are bad.

I believe schemas can be managed and evolved more easily than such
spelling rules, not least because you can change a schema and
associated documentation more easily than changing millions of Web
pages. But our main concern here is whether to move various property
names, e.g. 'actor' to 'actors'. We can debate the level, formality
and content of schema separately.

>> since changing the expected spelling has impact
>> on a *lot* of instance data.
>
> Separate issue: unnecessary renaming in general is bad, and I'd advise
> anyone who makes decisions on schema property names to consider
> re-using existing property names, perhaps singularized as necessary
> (as we've done with microformats), rather than using new names (as is
> rampant throughout schema.org - lots of unnecessary NIH,
> even/especially where we (previous to schema) had format convergence
> on the web e.g. Person vs. vCard/hCard, Event vs. iCalendar/hCalendar
> etc.).

Noted in wiki.

The list at http://www.w3.org/wiki/WebSchemas/Singularity#Details
isn't yet cross-referenced to associated classes and other
documentation, but if you have e.g. a 'top 10' wishlist for name
changes that could be made part of this proposal, please record it in
the wiki page. Fixing by removing the final 's' is the simplest option
but it would be good to see a specific proposal for any name
improvements that could happen instead.

>>> I think the same would apply to cardinality.  We provide guidance on
>>> expected cardinality of properties, but always do the best we can with
>>> whatever we get.
>>
>> Yes. With FOAF we declared some properties as having 'at most one
>> proper value', or implying that
>> there can be at most one entity with any given value. Sites got it
>> wrong all the time, but at least the
>> declaration helped track down some data problems.
>
> And before that vCard (up through v3) made the same mistake of
> declaring many properties to have at most one value.
>
> Much of this was addressed in vCard4, where previously singular
> properties were made plural.
>
> In short, vocabulary designers get cardinality wrong all the time, so
> you might as well give up trying. Seriously, y'all are not that smart.
> None of us are. ;)

The FOAF usage was generally more to express 'at most one thing has
any given value for this property' (indirect identification by
description) than to say 'there's at most one value of this property'.
We say that things have at most one gender; that documents can have at
most one primaryTopic; people one birthday and age. That's all, for
reasons I think we largely agree on. The rest were 'inverse
functional' constraints, e.g. that a homepage is the homepage of at most
one 'thing' (ditto primary mailbox).
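
A small Python sketch of how an 'inverse functional' declaration gets
used in practice: descriptions that share a homepage (or mailbox) value
become candidates for being about the same thing. The data layout, the
INVERSE_FUNCTIONAL set and the function name are assumptions made up
for this example, not FOAF tooling.

```python
from collections import defaultdict

# Assumption for illustration: these properties are declared inverse functional.
INVERSE_FUNCTIONAL = {"homepage", "mbox"}


def same_thing_candidates(descriptions):
    """Group description ids that share a value for an inverse functional property."""
    by_value = defaultdict(set)
    for desc_id, props in descriptions.items():
        for prop, values in props.items():
            if prop in INVERSE_FUNCTIONAL:
                for value in values:
                    by_value[(prop, value)].add(desc_id)
    return {key: sorted(ids) for key, ids in by_value.items() if len(ids) > 1}


descriptions = {
    "site-a#p1": {"name": ["Dan"], "homepage": ["http://example.org/dan/"]},
    "site-b#p7": {"name": ["D. B."], "homepage": ["http://example.org/dan/"]},
    "site-c#p3": {"name": ["Dan"], "homepage": ["http://example.net/other/"]},
}
groups = same_thing_candidates(descriptions)
print(groups)
```

Note that sharing a name ('Dan') proves nothing here; only the shared
homepage links the first two descriptions.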

> It's easier (and more future-proof) to simply allow every property to
> be plural, and then define any perceived singular semantics at a
> higher application level (which is where any notion of singular vs
> plural actually matters if at all). If it changes, changing the
> application is much easier than the format.
>
>> If we have to choose between the JSON being a bit weird, or the
>> HTML-based markups being a bit weird, I would go for the former. JSON
>> feeds are relatively invisible, whereas the HTML source has a wider
>> and more varied audience.
>
> Agreed. HTML impacts more authors, thus takes design precedence over JSON.

Noted

>>> This same problem occurred with PortableContacts when you compare the XML
>>> and JSON
>>> representations: http://portablecontacts.net/draft-schema.html#anchor5.  For
>>> what it's worth, PoCo used plural naming where properties were expected to
>>> be multi-valued.
>
> Which was also a mistake.
>
>> Yup, it's hard designing a schema to work nicely in two quite
>> different syntaxes.
>
> Yes it is hard, but not impossible.
>
> We've taken a shot at doing so for HTML and JSON in microformats 2.0 [1].
>
> Comments appreciated (though perhaps better redirected to microformats-new[2]).

I'll take a look. I liked the general direction it was taking last year.

cheers,

Dan

> Thanks,
>
> Tantek
>
> [1] http://microformats.org/wiki/microformats-2
> [2] http://microformats.org/mailman/listinfo/microformats-new/