[whatwg] Please review use cases relating to embedding micro-data in text/html from Eduard Pascual on 2009-04-28 (public-whatwg-archive@w3.org from April 2009)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Tue, 28 Apr 2009 23:31:53 +0200
Message-ID: <6ea53250904281431x5d32ecp358083c81b35de76@mail.gmail.com>
On Thu, Apr 23, 2009 at 10:46 PM, Ian Hickson <ian at hixie.ch> wrote:
[...]
> Exposing known data types in a reusable way
>
>   USE CASE: Exposing calendar events so that users can add those events to
>   their calendaring systems.
[...]
>   REQUIREMENTS:
[...]
>     * Should be unlikely to get out of sync with prose on the page.
>     * Machine-readable event data shouldn't be on a separate page than
>       human-readable dates.
[...]
>   ---------------------------------------------------------------------------
>
>   USE CASE: Exposing contact details so that users can add people to their
>   address books or social networking sites.
[...]
>   REQUIREMENTS:
[...]
>     * Data should not need to be duplicated between machine-readable and
>       human-readable forms (i.e. the human-readable form should be
>       machine-readable).
>     * Machine-readable contact information shouldn't be on a separate page
>       than human-readable contact information.
[...]
>   ---------------------------------------------------------------------------
>
>   USE CASE: Allow users to maintain bibliographies or otherwise keep track
>   of sources of quotes or references.
[...]
>   REQUIREMENTS:
>
>     * Machine-readable bibliographic information shouldn't be on a separate
>       page than human-readable bibliographic information.
[...]
>   ---------------------------------------------------------------------------
>
>   USE CASE: Help people searching for content to find content covered by
>   licenses that suit their needs.
[...]
>   REQUIREMENTS:
[...]
>     * License information should be able to survive from one site to another
>       as the data is transfered.
[...]
>     * Machine-readable licensing information shouldn't be on a separate page
>       than human-readable licensing information.
[...]
> ==============================================================================
>
> Annotations
>
>   USE CASE: Annotate structured data that HTML has no semantics for, and
>   which nobody has annotated before, and may never again, for private use or
>   use in a small self-contained community.
> [...]
>   REQUIREMENTS:
> [...]
>     * Machine-readable annotations shouldn't be on a separate page than
>       human-readable annotations.
[...]
>     * The syntax for adding this data should encourage the data to remain
>       accurate when the page is changed.
>     * The syntax should be resilient to intentional copy-and-paste
>       authoring: people copying data into the page from a page that already
>       has data should not have to know about any declarations far from the
>       data.
>     * The syntax should be resilient to unintentional copy-and-paste
>       authoring: people copying markup from the page who do not know about
>       these features should not inadvertently mark up their page with
>       inapplicable data.
>
>   ---------------------------------------------------------------------------
[...]
>   USE CASE: Site owners want a way to provide enhanced search results to the
>   engines, so that an entry in the search results page is more than just a
>   bare link and snippet of text, and provides additional resources for users
>   straight on the search page without them having to click into the page and
>   discover those resources themselves.
[...]
>   REQUIREMENTS:
>
>     * Information for the search engine should be on the same page as
>       information that would be shown to the user if the user visited the
>       page.
>
> ==============================================================================
>
> Cross-site communication
>
>   USE CASE: Copy-and-paste should work between Web apps and native apps and
>   between Web apps and other Web apps.

I have noticed (as highlighted by the quoted fragments above) that some
of the requirements recur quite a bit, namely:
- Information for the machine / agent / whatever should be on the same
page as information for the (human) user.
- Copy-paste resilience.
- (In some cases) Data shouldn't be duplicated for humans and for
machines (although this is not always achievable, for example with
dates).

There is a requirement that has been put forward previously [1], which
IMO may interact with these, and which didn't show up in Ian's original
mail:
- Meta-data (or any additional markup or data used to allow the
machine to understand the actual information) shouldn't need to be
repeated redundantly.

Examples:

-> An author puts up a page with contact information for several
people (for example, the people responsible for the website; a list of
entities that are somehow related to the website, like sponsors; or a
list of friends in a restricted-access social website, such as in
Microsoft's "Live Spaces"). Let's say that author puts this info in a
table, with the contact name in the first column, the e-mail address
in the second column, and so on, just because that's the kind of job
tables are for. Of course, the first row in the table would hold the
headers describing what each column means.
   The author *should* be able to tell the machine something like "the
first column (or the first cell in each row) holds the names, the second
column (or 2nd cell in each row) holds the e-mail addresses, ...",
rather than, for each contact, having to repeat "this is the name",
"this is the e-mail address", and so on.

-> A website lists a series of software projects or products (from
something as huge as SourceForge to something as small as a company's
site listing its own few products), stating each product's title,
author/s (in case the products have different authors), license,
version, and date of the last release. Again, the author of that site
should be able to tell the machine something like "these are the
products, these the authors, these the licenses, ...", rather than
stating "this is the product's name, this is the product's author,
this is the product's license, ..." for each and every product listed.
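
To make the first example more concrete, here is a rough sketch (not a
proposed syntax) of what "telling the machine once" could look like
from the consumer's side. The contact table carries no per-cell
annotations; a single column-level declaration is applied to every row.
The names PAGE, COLUMN_FIELDS, TableRows and extract_contacts, and the
sample contacts, are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a plain contact table, no per-cell metadata.
PAGE = """
<table>
  <tr><th>Name</th><th>E-mail</th></tr>
  <tr><td>Alice Example</td><td>alice@example.org</td></tr>
  <tr><td>Bob Example</td><td>bob@example.org</td></tr>
</table>
"""

# Hypothetical declaration, stated once: what each column means.
COLUMN_FIELDS = ["name", "email"]


class TableRows(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # The header row holds only <th> cells, so it stays empty here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def extract_contacts(html, fields):
    """Apply the single column declaration to every data row."""
    parser = TableRows()
    parser.feed(html)
    return [dict(zip(fields, row)) for row in parser.rows]


contacts = extract_contacts(PAGE, COLUMN_FIELDS)
for contact in contacts:
    print(contact)
```

Adding a new contact then means adding one plain row; there is no
per-entry metadata to forget.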

Rationale:

I hope it is clear that ignoring this need would raise some serious
issues. First and foremost, having to repeat the meta-information for
each "entry" is tedious and error-prone: if an author forgets to add a
meta-data field to a new entry s/he has just added, the whole purpose
of using metadata is ruined, since users would need to retrieve the
information manually anyway (wasn't error-proneness the main reason to
require keeping the metadata as close as possible to the actual
information?).
Next, redundant data means larger files, which translates directly into
slower page loads for the user and higher bandwidth costs for the
publisher. There may be some secondary issues from this (for example,
some search engines tend to "truncate" large files and ignore
everything beyond a certain threshold), but those come from the
needless enlargement of the file; so file bloat is the actual issue to
keep in mind and deal with here.
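
As a rough illustration of the bloat argument, the sketch below builds
the same contact table twice, once with an annotation on every cell and
once with a single declaration on the table, and compares the
uncompressed sizes. The attribute names "data-field" and "data-columns"
are invented for this comparison and are not part of any proposal; the
entries are generated placeholders.

```python
def row_markup(name, email, annotate):
    """One table row, optionally with a (made-up) annotation on each cell."""
    if annotate:
        return (f'<tr><td data-field="name">{name}</td>'
                f'<td data-field="email">{email}</td></tr>')
    return f"<tr><td>{name}</td><td>{email}</td></tr>"


def table_markup(entries, per_entry):
    """Whole table: metadata either repeated per entry or declared once."""
    rows = "".join(row_markup(n, e, per_entry) for n, e in entries)
    if per_entry:
        return f"<table>{rows}</table>"
    # Hypothetical single declaration covering every row.
    return f'<table data-columns="name email">{rows}</table>'


entries = [(f"Person {i}", f"person{i}@example.org") for i in range(1000)]

repeated = len(table_markup(entries, per_entry=True).encode("utf-8"))
declared_once = len(table_markup(entries, per_entry=False).encode("utf-8"))
overhead = repeated - declared_once

print(f"per-entry metadata: {repeated} bytes")
print(f"declared once:      {declared_once} bytes")
print(f"extra bytes:        {overhead}")
```

The extra bytes grow with the number of entries, while the single
declaration costs the same no matter how long the table gets.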

Additional considerations:

- Fulfilling this requirement could make it harder to deal with the
copy-paste tasks, but not impossible. Some browsers can preserve the
formatting applied from an external CSS when copying, so preserving
the metadata when it has been defined upon structure would be equally
achievable.
- There *are* cases where repeating the metadata a few times can be
better than having it centralized. I have nothing against any solution
that *allows redundancy*, as long as it *does not enforce redundancy*.
- I want to make clear that there is a difference between having the
human-readable and machine-readable information in the same place
(even reusing the same info for both consumers when doable) and having
the metadata (the data that defines how to interpret the actual data)
there as well. There might even be cases where having the metadata
somewhere else makes sense (for example, in the SourceForge example
above, it would be quite reasonable to have a single file defining how
to retrieve the useful data for each SERP (Search Engine Results Page),
rather than defining it on every SERP). Again, I feel that the ideal
solution should allow either practice and force neither (after all, from
an author's PoV, more choice means more power, which is always better
for us).

References:

[1] http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/016037.html
(You might want to review other messages on that thread as well, but I
think this is the one that best describes the actual issue. Also,
keep in mind that, while my intention with this post is to bring the
problem/need into consideration, that thread evolved into discussing
some solution ideas. I think we should have the list of needs and
use-cases properly defined before we start discussing solutions.)

Regards,
Eduard Pascual
Received on Tuesday, 28 April 2009 14:31:53 UTC