Re: vocabs, metadata set, datasets

From: ZENG, MARCIA <mzeng@kent.edu>
Date: Sat, 8 Jan 2011 12:59:53 -0500
To: Mark van Assem <mark@cs.vu.nl>
CC: "public-xg-lld@w3.org" <public-xg-lld@w3.org>
Message-ID: <C94E1049.12FC8%mzeng@kent.edu>
Mark,
Thanks for the discussion. See below for a little more explanation (my replies are marked with ">mz:").

On 1/8/11 5:22 AM, "Mark van Assem" <mark@cs.vu.nl> wrote:

Hi Marcia,

If I understand correctly, your point is that some resources such as
gazetteers can be dataset, value vocab and metadata schema (because the
gazetteer entries can have attributes themselves, and the values of
these attributes may come from another code list defined in the gazetteer).

>mz: This is 70% correct; it just needs 'metadata schema' taken off. Every vocabulary (such as a thesaurus) has a schema that defines its attributes; some schemas are universal (the thesaurus world followed ISO 2788 for a long time, and now follows ISO 25964 and BS 8723 [ref][1]) and some are locally defined.
I use the word 'attributes' because that is the term used in the standards, which are still based on the entity-relationship model.
So, corrected, your statement should be:
"some resources such as gazetteers can be dataset and/or value vocab."

I would definitely see TGN and MeSH as value vocabs, even though they
specify their own metadata elements, and describe their own entries with
elements and values (making them like a dataset) and may have "sub
vocabularies".
>mz: [continued from above]: All value vocabularies have their own set of attributes. TGN and other Getty vocabularies have their standardized attributes and a number of controlled lists (e.g., those 'flags'). So does MeSH, which controls its attributes. It also has a controlled list of allowable qualifiers.
TGN's content is almost identical to that of a gazetteer. "Almost" means it has some things others do not have, and others have some things it does not provide.
Dublin Core and others always define it as a value encoding vocabulary. Outside of the bibliographic field, Getty's vocabularies are considered knowledge references.

I've tried to cover this problem through the "Confusions" points. If
they do not succeed in doing this, what would you add/remove in the text
to fix this?

>mz: I am providing the following suggestions for the "Value Vocabularies" Confusions part.
Before:  a value vocabulary often also defines metadata elements. For example, GeoNames defines elements for coordinates, names and postal codes of places. These can be referred to as the GeoNames metadata elements. Similarly, VIAF defines elements to describe authorities (corporations, people).

After: A value vocabulary often employs a schema that is derived from a model underlying its data structure. Some of the models are universal and have been defined in international and national standards, e.g., for thesauri [ref], while others are implementation-specific or yet to become widely adopted. For example, GeoNames defines elements for coordinates, names and postal codes of places. Similarly, VIAF defines elements to describe corporate bodies and people.

Confusion #2
Before: We classify VIAF and GeoNames as value vocabularies instead of datasets because they are used (or are meant to be used) extensively as value vocabularies in record collections, while their metadata elements are not widely reused (as are DC elements). We acknowledge that this distinction is dependent on the role that the dataset/vocabulary plays instead of its inherent characteristics. Our viewpoint is indeed debatable, but sufficient for the purposes of our report.

After: We classify VIAF and GeoNames as value vocabularies instead of datasets because they are used (or are meant to be used) extensively as value vocabularies in building other record collections (datasets). This distinction is dependent on the role that the dataset/vocabulary plays instead of its inherent characteristics.
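
As a rough illustration of this role-dependent view, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class names, field names, the "place:N" identifiers, the list of place types, the record title, and the approximate coordinates are all hypothetical and are not taken from GeoNames, VIAF, TGN, or any standard. It only shows the same gazetteer being queried on its own (as a dataset) and being referenced from a record in another collection (as a value vocabulary).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical controlled list of place types (illustrative values only).
PLACE_TYPES = {"city", "mountain range"}


@dataclass
class GazetteerEntry:
    """A gazetteer entry described with its own attributes."""
    entry_id: str                      # made-up identifier scheme
    preferred_name: str
    place_type: str                    # must come from PLACE_TYPES
    variant_names: List[str] = field(default_factory=list)
    footprint: Optional[Tuple[float, float]] = None  # (latitude, longitude)

    def __post_init__(self) -> None:
        # Reject type values that are not in the controlled list.
        if self.place_type not in PLACE_TYPES:
            raise ValueError(f"uncontrolled place type: {self.place_type!r}")


@dataclass
class CatalogRecord:
    """A record in some other dataset that points at a gazetteer entry."""
    title: str
    place_ref: str                     # holds a GazetteerEntry.entry_id


gazetteer: Dict[str, GazetteerEntry] = {
    e.entry_id: e
    for e in [
        GazetteerEntry("place:1", "Santa Barbara", "city",
                       footprint=(34.42, -119.70)),  # approximate point
        GazetteerEntry("place:2", "Rocky Mountains", "mountain range",
                       variant_names=["Rockies"]),
    ]
}

# Role 1: used alone as a dataset, queried by its own attributes.
cities = [e.preferred_name for e in gazetteer.values() if e.place_type == "city"]

# Role 2: used as a value vocabulary by a record in another dataset.
record = CatalogRecord(title="Example title", place_ref="place:2")
place = gazetteer[record.place_ref]
print(cities, place.preferred_name)
```

Nothing in the entries themselves changes between the two uses; only the way they are consumed differs, which is the sense in which the distinction depends on role.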

[ref]
ISO 2788: Documentation -- Guidelines for the establishment and development of monolingual thesauri. 1974, 1986.
ISO 25964, Part 1: Thesauri and Interoperability with Other Vocabularies. Clause 15: Data model. 2010.
BS 8723: Structured Vocabularies for Information Retrieval. Part 5: Exchange formats and protocols for interoperability. 2008.

[1] http://schemas.bs8723.org/Model.aspx

If I still didn't get your point I apologize!
Mark.

Hope this helps. Thanks.
Marcia

On 7-1-2011 16:16, ZENG, MARCIA wrote:
> Mark,
> Re: your question
>> Re Marcia's point [["For example, in digital gazetteers not only the
>> place names are controlled but also the place features, type,
>> coordinates, and even maps are included."]]
>>
>> I'm not sure I get what you mean with the "also controlled",
>
> I am giving the following text to explain further [ref]:
>
> 1. The concept of a geographic place is fuzzy (e.g., Rocky Mountains) and we
> use place names differently according to the circumstances (e.g., using
> "Santa Barbara" generally to mean the whole general area or specifically
> to mean just the incorporated city area).
> 2. When locations are named, they can be in a gazetteer. A place can have
> more than one name: name variants, names in different languages, etc.
> 3. In a geospatially referenced gazetteer, each entry has a "footprint"
> consisting of latitude and longitude coordinates. This footprint can be
> a point (most current gazetteer footprints are points)...
> 4. Each entry in a digital gazetteer must also be categorized according
> to a formal typing system (a controlled vocabulary of type terminology).
>
> #2 is what most thesauri would do, to control the synonyms and equivalents.
> #3 is especially the approach used in a thesaurus to eliminate
> ambiguities. But here they are not like a GPS, which focuses on
> coordinates and uses bounding boxes to provide a precise location. These
> points in a gazetteer serve more as a qualifier that provides context for a
> place.
> #4 is to provide a TYPE for each named place. This is similar to the
> Medical Subject Headings, where each concept is given a TYPE code
> according to a formal typing system (see example [1]). In the Getty
> Thesaurus of Geographic Names, place types are also an important
> component in each entry. Those TYPE values are usually from a
> controlled vocabulary [2]. So they could use other building blocks.
> However, the general function and purpose of the digital gazetteer is to
> serve as a "spatial dictionary of named and typed places".
>
> Quite a lot of projects have used ADL gazetteers as value vocabularies, but
> a gazetteer is also used as a reference in itself, e.g., [3].
> Marcia
>
> [1]
> http://www.nlm.nih.gov/cgi/mesh/2011/MB_cgi?mode=&index=8264&view=expanded
>
> [2] http://www.alexandria.ucsb.edu/~lhill/FeatureTypes/ver070302/index.htm
> [3] http://clients.alexandria.ucsb.edu/globetrotter/ (try to find a
> place then see the catalog record.)
> [Ref] JCDL 2002 NKOS Workshop on Digital Gazetteers.
> http://nkos.slis.kent.edu/DL02workshop.htm
>
>
> On 1/7/11 5:05 AM, "Mark van Assem" <mark@cs.vu.nl> wrote:
>
>     Thanks all for the feedback!
>
>     I've tried to address all your points in the value vocab description:
>
>     - "A dataset is a collection of structured metadata records"
>
>     - added some more "similar terms", including KOS, gazetteer, authority
>     file, concept scheme
>
>     - "They are "building blocks" with which metadata records can be built."
>
>     Re Marcia's point [["For example, in digital gazetteers not only the
>     place names are controlled but also the place features, type,
>     coordinates, and even maps are included."]]
>
>     I'm not sure I get what you mean with the "also controlled", but I think
>     indeed that this is the same as the VIAF situation: the values in a
>     value vocabulary can be described with elements and values themselves,
>     which would make them "datasets" also. However, we can still see VIAF as
>     a value vocab and not a dataset, as its main role is to be a building
>     block for metadata records.
>
>     Mark
>
>
>     On 6-1-2011 18:15, ZENG, MARCIA wrote:
>     >  I like the way Karen used in terms of building block or not... Also
>     >  agree with Jeff's use of SKOS 'concept scheme' to define VIAF.
>     >
>     >  * Regarding 'data sets': To me, the 'data sets' we are talking about
>     >  are structured data. Elsewhere, 'data sets' could be
>     >  unstructured or semi-structured data (e.g., data.gov's raw data
>     >  sets).
>     >  * Regarding 'value vocabularies': In the conventional way we have
>     >  used "knowledge organization systems (KOS)" for concept schemes
>     >  (broader than "controlled vocabularies"). Most of the vocabulary
>     >  types are clear such as pick lists, taxonomies, thesauri, subject
>     >  headings. But there is a group of 'metadata-like' KOS such as
>     >  authority files and digital gazetteers. They are/can be
>     >  constructed as thesauri (e.g., The Getty Thesaurus of Geographic
>     >  Names (TGN) and Union List of Artist Names (ULAN)). Or, they can
>     >  be in other structures. It is the contents they include that lead
>     >  them to also be referred to as 'data sets'. For example, in digital
>     >  gazetteers not only the place names are controlled but also the
>     >  place features, type, coordinates, and even maps are included.
>     >  Digital gazetteers can be used alone as data sets or be the value
>     >  vocabularies used in structured data sets. This might be like the
>     >  VIAF situation, depending on how it is constructed or on how it is
>     >  used.
>     >
>     >  My 2 cents.
>     >  Marcia
>     >
>     >  On 1/6/11 11:37 AM, "Karen Coyle" <kcoyle@kcoyle.net> wrote:
>     >
>     >  Quoting Emmanuelle Bermes <emmanuelle.bermes@bnf.fr>:
>     >
>     >
>     >  > As for myself, I do have a few more comments :
>     >  > - I think the emphasis on value vocabs is too important in the current
>     >  > definition of dataset. It's actually creating confusion, in my view.
>     >  > - I'm wondering if we could use the term "instance" (a dataset is a
>     >  > collection of instance descriptions) or is it too implementation oriented ?
>     >  >
>     >
>     >
>     >  I'm not sure that the term "instance" will work -- even a value in a
>     >  list could be considered an instance, no?
>     >
>     >  Somehow, the concept for a dataset is that it consists of the
>     >  descriptions of entities that you need for an application or function,
>     >  rather than the building blocks for creating such a description.
>     >  (Which gets back to Mark's statement about "A record for Derrida's
>     >  book in dataset X ...")
>     >
>     >  Essentially, one person's dataset could be another person's building
>     >  block. But I think the key is that a dataset is complete for an
>     >  application, while a value vocabulary needs to be combined with other
>     >  data to be useful.
>     >
>     >  No, I'm not satisfied with that explanation... I'll ruminate on this
>     >  and see if I can find better words.
>     >
>     >  kc
>     >
>     >  > Emmanuelle
>     >  >
>     >  > On Thu, Jan 6, 2011 at 5:13 PM, Mark van Assem <mark@cs.vu.nl> wrote:
>     >  >
>     >  > > Hi Emma,
>     >  > >
>     >  > > I saw you had already followed up on our action to clarify "value
>     >  > > vocabularies".
>     >  > >
>     >  > > I saw that you think we should clarify how value vocabularies actually
>     >  > > appear in metadata records (as literals, codes, identifiers). While I kinda
>     >  > > feel we should try to stay agnostic to that I kept it in, but rewrote it
>     >  > > slightly:
>     >  > >
>     >  > > "In actual metadata records, the values used can be literals, codes, or
>     >  > > identifiers (including URIs), as long as these refer to a specific concept
>     >  > > in a value vocabulary. "
>     >  > >
>     >  > > I also moved your point re "closed list" up to the initial definition; this
>     >  > > is indeed central to what a value vocab is.
>     >  > >
>     >  > > Mark.
>     >  > >
>     >  > >
>     >  > > On 06/01/2011 16:34, Mark van Assem wrote:
>     >  > >
>     >  > >> Hi Jodi,
>     >  > >>
>     >  > >> X and Y would be two collections ("datasets") from two different
>     >  > >> libraries. It could also be two subcollections or within one collection,
>     >  > >> but I think making them separate ones will make it more illustrative.
>     >  > >>
>     >  > >> Do you have a suggestion on how to clarify or replace X and Y with
>     >  > >> specific existing collections/libraries as examples?
>     >  > >>
>     >  > >> Mark
>     >  > >>
>     >  > >>
>     >  > >> On 06/01/2011 16:21, Jodi Schneider wrote:
>     >  > >>
>     >  > >>> Thanks for this, Mark! I especially like the 'confusions' area -- that
>     >  > >>> will make this quite useful.
>     >  > >>>
>     >  > >>> In this, it would be helpful if you'd explain what datasets X and Y
>     >  > >>> might be. Particular collections? Subcollections of a larger whole?
>     >  > >>> "in some cases records in a dataset are themselves used as values in
>     >  > >>> other datasets. For example, Derrida wrote a book that comments on
>     >  > >>> Heidegger's book "Sein und Zeit". A record for Derrida's book in dataset
>     >  > >>> X can state this by relating it to a record for Heidegger's book in
>     >  > >>> dataset Y. This statement in the Derrida record could consist of the
>     >  > >>> Dublin Core Subject with as value a reference to the Heidegger record.
>     >  > >>> In this case we would still term X and Y datasets, not value
>     >  > >>> vocabularies."
>     >  > >>>
>     >  > >>> -Jodi
>     >  > >>>
>     >  > >>> On 6 Jan 2011, at 08:00, Mark van Assem wrote:
>     >  > >>>
>     >  > >>>
>     >  > >>>> Hi all,
>     >  > >>>>
>     >  > >>>> As per my action I have written some text [1] to explain the terms
>     >  > >>>> "dataset, metadata element set, value vocabulary" with feedback from
>     >  > >>>> Karen and Antoine to address the things that don't fit very nicely.
>     >  > >>>>
>     >  > >>>> Please let me know what you think, after I've had your input we'll put
>     >  > >>>> it on the public list to get shot at.
>     >  > >>>>
>     >  > >>>> Mark.
>     >  > >>>>
>     >  > >>>> [1] http://www.w3.org/2001/sw/wiki/Library_terminology_informally_explained#Vocabularies.2C_Element_sets.2C_Datasets
>     >  > >>>>
>     >  > >>>>
>     >  > >>>> On 28/12/2010 18:40, Karen Coyle wrote:
>     >  > >>>>
>     >  > >>>>> I have been organizing the vocabularies and technologies on the
>     >  > >>>>> archives cluster page [1] and it was a very interesting exercise trying to
>     >  > >>>>> determine what category some of the "things" fit into. This could turn
>     >  > >>>>> out to be a starting place for our upcoming discussion of our
>     >  > >>>>> definitions since it has real examples. The hard part seems to be value
>     >  > >>>>> vocabularies v. datasets, and I have a feeling that there will not be a
>     >  > >>>>> clear line between them.
>     >  > >>>>>
>     >  > >>>>> kc
>     >  > >>>>> [1] http://www.w3.org/2005/Incubator/lld/wiki/Cluster_Archives#Vocabularies_and_Technologies
>     >  > >>>>>
>     >  > >>>>>
>     >  > >>>>>
>     >  > >>>>>
>     >  > >>>>
>     >  > >>>
>     >  > >>
>     >  >
>     >  >
>     >  > --
>     >  > =====
>     >  > Emmanuelle Bermès - http://www.bnf.fr
>     >  > Manue - http://www.figoblog.org
>     >  >
>     >
>     >
>     >
>     >  --
>     >  Karen Coyle
>     >  kcoyle@kcoyle.net http://kcoyle.net
>     >  ph: 1-510-540-7596
>     >  m: 1-510-435-8234
>     >  skype: kcoylenet
>     >
>     >
>     >
>
Received on Saturday, 8 January 2011 18:00:40 GMT
