[Metadata] Re: As an aside, a possibly interesting read.... from Ivan Herman on 2014-09-26 (public-digipub-ig@w3.org from September 2014)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 26 Sep 2014 13:39:08 +0200
To: Bill Kasdorf <bkasdorf@apexcovantage.com>
Cc: W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-Id: <18ACEF9B-CBC7-4C15-9C75-7315AB7D7A14@w3.org>
(switching to the interest group)

I think that it was a major misjudgment on my part when I put the subject as "As an aside, a possibly interesting read..." :-)

Seriously, this was a really interesting set of discussions. The question is: is there some sort of conclusion that we could or should put into the metadata task force report?

Ivan


On 26 Sep 2014, at 24:06 , Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:

> True, good point.
>  
> From: Laura Dawson [mailto:Laura.Dawson@bowker.com] 
> Sent: Thursday, September 25, 2014 6:05 PM
> To: Bill Kasdorf
> Cc: Graham Bell; David (Standards) Singer; Todd Carpenter (Gmail); Koji Ishii; Ivan Herman; Laura Dawson; Phil Madans; W3C Public Digital Publishing IG Mailing List
> Subject: Re: As an aside, a possibly interesting read....
>  
> Yes, it's worth remembering that these sorts of identifiers are numerical representations of THINGS - book products, names, journals; DOIs function a little differently.
> 
> On Sep 25, 2014, at 5:56 PM, "Bill Kasdorf" <bkasdorf@apexcovantage.com> wrote:
> 
> +1 Great explanation.
>  
> Two things I like to point out wrt the ISBN:
>  
> --It is fundamentally a _product_ identifier . . .
>  
> and thus the following rule of thumb helps test whether the same or a different ISBN is required:
>  
> --If you ordered a product with a given ISBN, are you assured of getting the thing you ordered?
> 
> Thus if you order the paperback, you don't want to get the audiobook or the hardcover. Thus they get separate ISBNs.
> Thus if you order the EPUB, you don't want to get the PDF or the KF8. Thus they get separate ISBNs.
> If you ordered the paperback with the old cover, and get the same paperback but with a new cover, you still got "the thing you ordered."
> Etc.
>  
> That's how _the ISBN_ is designed to work.
>  
> It is not how the DOI is designed to work, or the ISSN, or the ISTC, or the ISNI, or the ORCID, etc. etc.
>  
> Each of those identifiers has specific metadata associated with it to enable it to serve the purpose it was created for.
>  
> The minimum metadata required by the ISBN is designed to ensure that the above—"I got the product I ordered"—works.
>  
> And of course for that to "work" there has to be some system, some registry, some authority (centralized or distributed) that actually maintains, manages, and uses that metadata. I like Graham's concept that the ID is really just a sort of "proxy" for the metadata.
>  
> --Bill K
>  
> From: Graham Bell [mailto:graham@editeur.org] 
> Sent: Thursday, September 25, 2014 5:40 PM
> To: David (Standards) Singer
> Cc: Bill Kasdorf; Laura Dawson; Todd Carpenter (Gmail); Koji Ishii; Ivan Herman; Laura Dawson; Phil Madans; W3C Public Digital Publishing IG Mailing List
> Subject: Re: As an aside, a possibly interesting read....
>  
> Hi David
>  
> Ultimately this 'promise' grows from the governance built into the standard, which builds trust, and from the minimum amount of metadata that must be associated with each ID, which Todd mentioned earlier in the thread...
>  
> The real power is in the associated metadata related to that identifier.
>  
> So if we look at the ISBN, for example, there is a minimum set of metadata elements that is supposed to be collected by the various national ISBN agencies (not by the International ISBN Agency -- there is no central registry). The set of metadata elements defined within the ISBN standard essentially sets out the 'promise', or the scope within which the ID is unique. If any part of the metadata is different, the ID is different. If all elements of metadata are identical, the ID should be identical too.
>  
> So different editions (3rd ed, 4th ed) get different ISBNs because 'edition number' is part of that minimum metadata set. Different bindings (hb, pb) get different ISBNs because the binding is part of that minimum metadata set. Different covers on otherwise identical paperbacks don't always get different ISBNs because the cover image is not part of that minimum metadata set (though for practical stock control purposes, publishers may well assign different ISBNs anyway).
>  
> Now looked at from this perspective, the ID itself is not the important part of the discussion -- it is the metadata that is the key, and an ID is simply a shorthand (or a link, or a hash -- pick your terminology) for one particular set of values for that minimum set of metadata elements.
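The "ID as shorthand for one set of metadata values" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (title, edition, binding), the `Registry` class, and the sequential `ID-0001` style values are hypothetical, and they are not the actual ISBN minimum metadata set or any agency's system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MinimumMetadata:
    """Hypothetical minimum metadata set; the real ISBN set differs."""
    title: str
    edition: int
    binding: str  # e.g. "hb" or "pb"


class Registry:
    """Toy registry: one ID per distinct set of minimum metadata values."""

    def __init__(self) -> None:
        self._ids: Dict[MinimumMetadata, str] = {}

    def identifier_for(self, record: MinimumMetadata) -> str:
        # Same metadata values -> same ID; any difference -> a new ID.
        if record not in self._ids:
            self._ids[record] = f"ID-{len(self._ids) + 1:04d}"
        return self._ids[record]


registry = Registry()
pb_3rd = registry.identifier_for(MinimumMetadata("Calculus", 3, "pb"))
hb_3rd = registry.identifier_for(MinimumMetadata("Calculus", 3, "hb"))
# The cover is not part of the (hypothetical) minimum set, so a paperback
# with a new cover has identical metadata and resolves to the same ID.
pb_3rd_new_cover = registry.identifier_for(MinimumMetadata("Calculus", 3, "pb"))
```

In this sketch the paperback and hardback receive different IDs because `binding` is in the set, while the re-covered paperback keeps its ID. Changing which fields belong to `MinimumMetadata` changes what the ID "promises".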
>  
> Identifier schemes are characterised by their minimum set of metadata elements, and the choices made when defining that set are guided by the purpose of (or use cases for) the identifier -- the functionality the ID is designed to support. ISBN was designed for the book supply chain (originally the physical supply chain, though it mostly works for digital too), and all items with the same ISBN should be functionally identical for the purposes of the book supply chain (but not necessarily identical for other functions). If there are three unsold copies with the same ISBN sitting on a shelf, it does not matter which particular one of the three you purchase.
>  
> But the ISBN is not the solution to every problem -- it doesn't help much with rights trading (which by and large operates at indecs / ISTC work level, or the FRBR expression level), it doesn't solve your problems if you are in a publisher's reprint department (because reprints use the same ISBN), and it doesn't solve all the issues in libraries (which is why they use accession numbers to identify individual copies of books, for example).
>  
> Graham
>  
>  
> Graham Bell
> EDItEUR
>  
> Tel: +44 20 7503 6418
>  
>  
> EDItEUR Limited is a company limited by guarantee, registered in England no 2994705. Registered Office: United House, North Road, London N7 9DP, UK. Website: http://www.editeur.org
>  
>  
> 
> 
> 
>  
> On 25 Sep 2014, at 18:44, David (Standards) Singer wrote:
> 
> 
> 
> I am wondering whether we have historically focused on the wrong question, notably “is the ID unique?”.  Of the projects I know about, I think too little time was spent on what the ‘promise’ was, and hence ‘unique in what sense?’.
> 
> Looking at a specific example, say I have a scheme to give IDs to physical books. If I re-publish the exact same text but with a different page or font size, so the pagination is different, does that get the same ID or a different one?  Well, it must be different if you expect to be able to refer to text by page and line number — did the promise include that that would be stable?
> 
> This failure mode — the assigner thought that the promise was X, the user Y — has been the death of labeling systems. If you cannot reliably use the label for a purpose, then it may be use-less.
> 
> “Do I have this item in stock?”
> “Can I refer to parts of it stably?”
> and so on...
> 
> On Sep 25, 2014, at 8:04 , Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:
> 
> 
> 
> I also want to point out that what we really need is not just about books.
>  
> Even though there has been frequent discussion on the IG about whether we can _focus_ on books (and the consensus, which I reluctantly went along with, is yes), for something this fundamental we really need to think in terms of a _publication_ or even a _resource_.
>  
> Even in traditionally book-dominated sectors like educational publishing, there is a rapid movement away from the concept of a "book" at all. Professors increasingly are willing to let students use any of a range of "textbooks" as a resource for, say, calculus or microbiology, as long as they are useful and have information that is relevant to the course. Increasingly those "books" themselves are being deconstructed, and more importantly most big educational publishers are moving toward a vision in which they develop resources first and books (or parts of books) are just one of many ways of associating, combining, and distributing those resources. And that is done in the context of _all the other stuff out there_ (mostly but not exclusively on the Web).
>  
> All that stuff has to be able to be identified, cited, annotated, etc. etc.
>  
> I could have written that description just as well in the context of magazines, for which _exactly the same dynamic_ is happening. Right now.
>  
> Same for scholarly/STM publishing (where publishing _data_--and citing datasets--is a very live issue). And even in the humanities, where "Digital Humanities" is becoming mainstream (and which is about "works" in the FRBR sense).
>  
> And think of all the resources needed in corporate publishing, training, etc.
>  
> All of that is "publishing." No publication exists in a closed system. It may think it is in a walled garden but there is a giant jungle outside its walls.
>  
> I really think in the pursuit of this identifier issue we MUST take the broadest possible vision or we will come up with something that is useful in one sector (perhaps) but not truly interoperable in the publishing ecosystem and the web in general (the context in which the publishing ecosystem increasingly lives and works) and will thus ultimately prove inadequate.
>  
> This is not to replace domain-specific or purpose-built identifiers like the DOI, the ISBN, etc.--those that, as Todd and others pointed out, have metadata and systems associated with them to DO THINGS. Any identifier we come up with should not make those obsolete and ideally should not conflict with them at all. It should make them more interoperable and more useful. This is not a Battle of Identifiers, and those who think One and Only One Identifier is the goal are mistaken. Many identifiers are needed because we need to do many different things with them.
>  
> But the identifier we are looking for here--enabling annotation and a myriad other related things on the Web (citation, previews, chunking, etc.)--needs to be radically widely applicable, completely agnostic as to the type of publication or resource it identifies, the format in which that publication or resource is disseminated, and yet durable, persistent, and reliable across formats and across time.
>  
> --Bill Kasdorf
>  
> -----Original Message-----
> From: Laura Dawson [mailto:Laura.Dawson@bowker.com]
> Sent: Thursday, September 25, 2014 9:01 AM
> To: Todd Carpenter (Gmail); Koji Ishii
> Cc: Ivan Herman; David (Standards) Singer; Laura Dawson; Bill Kasdorf; Graham Bell; Phil Madans; W3C Public Digital Publishing IG Mailing List
> Subject: Re: As an aside, a possibly interesting read....
>  
> Todd, I think you're absolutely right about the difference between librarianship and the trade. It has been the function of libraries to archive, curate, and canonize information since their inception. Trade is about one thing and one thing only - sales. In building infrastructure, we need to support both. What both have in common is a need for effective discovery - directing a reader to the book they want. So much of the metadata will be shared: the metadata that describes the book. The metadata describing the terms by which a reader may have it will differ depending on... well, the terms - the environment in which the reader is discovering the book.
>  
> That all said, I can envision a world where - for the purposes of curation and archiving - there exists a "canonical" version of a book at a URI that could well consist of the ISBN for that book (as Koji described), but if you want to own the book, you are directed to whichever platforms support it, and you choose which one you want to read on. But that presupposes an authority to govern that system. I would say the ISBN-International Agency could be that authority, but there is one important issue that prevents that - no publisher is required to report back to ISBN-IA which ISBNs get assigned to which books. ISBNs are issued in blocks - and in the case of larger publishers, many never see the light of day. ISBN-IA does not maintain a database of the ISBNs that get assigned - that is down to the registration agencies (such as Bowker, Nielsen, national libraries). And the publishers don't always report back to the RA's which numbers they are assigning to which things.
>  
> Also to be considered - in a world of self-publishing, ISBNs frequently are not assigned at all. Books are available in proprietary systems only (Kindle), and not easily discoverable. Amazon is said to be publishing about 2000 of these per week. We have no idea what they are, if they are books or "shorts", fiction, memoir, cookbooks - only Amazon has that data, and the data is provided by author/publishers who are not necessarily familiar with metadata conventions and effective description.
>  
> So, to be succinct, whether distributed or centralized, we need to break down the specific problems based on audience and the pain we're trying to solve. Probably won't be a single solution.
>  
> On 9/25/14, 2:58 AM, "Todd Carpenter (Gmail)" <tcarpenter@niso.org> wrote:
>  
> There is a tremendous problem with distributed systems when it comes to
> canonical information and standard identifiers: the metadata that is
> associated with the identifier. An identifier is (or,
> better put, should be) just a dumb (i.e., without embedded meaning),
> unique string of characters. The structure of that string, while
> systematically important, is beside the point. Whether an identifier is
> expressed as a 16-digit string, as a URI, or as anything else is not ultimately the point.
>  
> The real power is in the associated metadata related to that identifier.
> While there is tremendous overhead in a centralized system, such systems are
> critically important in a well-functioning ID system. Without a
> controlling system, there will be no standard set of associated
> metadata.  Now, how well that metadata is created, managed, curated and
> controlled are open questions (as Laura certainly knows), but without
> some authority driving compliance, there will inevitably be an
> increasing divergence of metadata quality, practice and interoperability.
>  
>  
> Also to Ivan's question about work-level IDs, there is work being done
> by OCLC to develop a true FRBR Work-level identifier based on their
> data store of libraries' bibliographic data. This ID is derived by
> analysis of the collection once the items are released then catalogued.
> I am not certain that a similar level work ID would be possible in
> trade, outside of being done by the author, agent or rights manager to
> truly combine all of the works (in a FRBR sense) under a single ID.
> Identifying, say, the hardcover book of a story, the comic book version of
> that same story, the Blu-ray disc of that story, the Broadway play of
> that story, and the Swedish translation of the book into a single
> Work-level ID is only something that can be done after the fact,
> because their expressions are very, very different. The closest that we
> might come to identifying that pre-production is to ID the rights
> associated with a particular intellectual property. And while it may be
> useful in practice, I don't know it would be useful in application.
> Which, I expect in the end would only serve the purpose of making lots of IP lawyers very wealthy.
>  
> Todd
>  
>  
>  
>  
> On Sep 25, 2014, at 5:07 AM, Koji Ishii <kojiishi@gluesoft.co.jp> wrote:
>  
> Maybe this was already discussed, but I'm in favor of a distributed
> ID system rather than a single, central system.
>  
> Take DNS. Or Java namespaces. Their prefixes come from domain names
> that authors own, which are unique; authors can then define the rest however they like.
> If a publisher wants to use ISBN, they could use, for instance,
> <epub://isbn-international.org/123456789>.
>  
> Since what we want is to identify publications, as long as authors or
> publications agree to use consistent domains/postfixes, I guess we can
> guarantee the uniqueness.
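A minimal Python sketch of this DNS-style scheme follows. The `epub` scheme is taken from the example URI above and is not a registered URI scheme; the function names and the `example.org` authority are hypothetical. The only point illustrated is that uniqueness is delegated to whoever owns the domain.

```python
from urllib.parse import quote, urlsplit


def make_identifier(authority: str, local_id: str, scheme: str = "epub") -> str:
    """Build an ID whose uniqueness is delegated to the owner of `authority`."""
    if not authority or "/" in authority:
        raise ValueError("authority must be a bare domain name")
    # Percent-encode the local part so it cannot alter the URI structure.
    return f"{scheme}://{authority.lower()}/{quote(local_id, safe='')}"


def authority_of(identifier: str) -> str:
    """Return the domain that vouches for the local part of the identifier."""
    return urlsplit(identifier).netloc


# The same local string under two authorities yields two distinct IDs.
isbn_based = make_identifier("isbn-international.org", "123456789")
publisher_based = make_identifier("example.org", "123456789")
```

Nothing in this sketch says what metadata stands behind an identifier; it only guarantees that two authorities cannot collide.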
>  
> Maybe there are use cases for the ID beyond identifying
> publications? The use cases I have in mind are links between
> publications and OA, which I think a distributed system can handle.
>  
> /koji
>  
> On Sep 25, 2014, at 12:51 PM, Ivan Herman <ivan@w3.org> wrote:
>  
>  
> On 24 Sep 2014, at 23:14 , Laura Dawson <Laura.Dawson@bowker.com>
> wrote:
>  
> True. It's a cluttered road.
>  
> We are in a really dangerous business!
>  
> Ivan
>  
>  
> On 9/24/14, 5:12 PM, "David (Standards) Singer" <singer@apple.com>
> wrote:
>  
>  
> On Sep 24, 2014, at 12:16 , LAURA DAWSON <ljndawson@gmail.com> wrote:
>  
> Yes, Bowker were a DOI registration agency and I can tell you
> that the  associated systems and metadata were the primary reason
> DOIs for trade  books (as opposed to STEM/scholarly) never took
> off.
>  
> So you see, Ivan, the road to book URIs is littered with a couple
> of corpses.
>  
> It's not just books.  I was on a project that needed something for
> recordings many years ago, and that road was also strewn with
> corpses.
>  
>  
> On 9/24/14, 3:13 PM, "Bill Kasdorf" <bkasdorf@apexcovantage.com>
> wrote:
>  
> Actually, the DOI _is_ used for this, mainly by scholarly/STM
> publishers,  as well as for chapters of books--typically one DOI
> for the book and a  DOI for each chapter (and sometimes DOIs at
> even lower component  levels,  most often for figures and
> tables). And these are _agnostic_ as to  format, they typically
> mean "the book" and "the chapter" in the  abstract  sense. When
> you click on one of these DOIs you are usually then given  your
> choice of what format, whether you have access, how to obtain
> access, etc.
>  
> But it requires the associated systems, metadata, registration
> agency,  etc. to make it work. To belabor a point, though, in
> that context it  does  work. There are a gazillion of them. The
> whole scholarly/STM ecosystem  is  now dependent on DOIs.
>  
> Those that use the DOI for this use CrossRef DOIs, which
> _should_ be  expressed as URIs (and increasingly are).
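As a small illustration of expressing a DOI as a URI, here is a Python sketch. The DOI strings are hypothetical placeholders, the helper name is invented, and the default resolver prefix `https://doi.org/` is an assumption about the current resolver rather than something stated in this thread.

```python
from urllib.parse import quote


def doi_to_uri(doi: str, resolver: str = "https://doi.org/") -> str:
    """Express a bare DOI (prefix/suffix) as a resolvable URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' literal; percent-encode characters that are unsafe in a URI path.
    return resolver + quote(doi, safe="/")


# Hypothetical DOIs: one for the book, one for a chapter of it.
book_uri = doi_to_uri("doi:10.1234/example-book")
chapter_uri = doi_to_uri("10.1234/example-book.ch1")
```

Both URIs are format-agnostic in the sense described above: what a reader gets after resolution is decided by the publisher's landing page, not by the identifier.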
>  
> But all that is purely under the control of the publisher
> (including  what  the DOI links to and what that destination
> provides--not necessarily  the  content itself); it doesn't
> address "work" in the way librarians mean  "work," and it
> requires the systems I mentioned (including the Handle  system on
> which DOI is based). It would not work for our need to point  to
> the "work itself" or some component of the work. So the answer in
> a  purely standard web-world sense is still no.
>  
> --Bill K
>  
> -----Original Message-----
> From: Laura Dawson [mailto:Laura.Dawson@bowker.com]
> Sent: Wednesday, September 24, 2014 2:55 PM
> To: Ivan Herman; Graham Bell
> Cc: Laura Dawson; Phil Madans; Bill Kasdorf; W3C Public Digital
> Publishing IG Mailing List
> Subject: Re: As an aside, a possibly interesting read....
>  
> As it stands now, no. So a book's "home" on the web (regardless of
> edition) is not standardizable at this point unless you want to
> go down the DOI road (please let's not go down the DOI road).
>  
> On 9/24/14, 4:13 AM, "Ivan Herman" <ivan@w3.org> wrote:
>  
> Thanks for all the interesting discussion...
>  
> However: all this is to say that there does not seem to be any
> existing (and viable) option to uniquely identify (preferably
> through a URI) a 'work' (whether in the ISTC or the FRBR sense).
> Which is a problem for metadata as well as for archiving. :-(
> Tell me I am wrong, please...
>  
> Ivan
>  
>  
> On 24 Sep 2014, at 24:19 , Graham Bell <graham@editeur.org> wrote:
>  
> And they can be treated this way in ONIX too. As I said,
>  
> they are not (strictly) an attribute of the ISBN, though they
> may be  presented as such in various systems
>  
> G
>  
> NB repeatable because the ISBN is associated directly with
> only one  work, but can be indirectly associated (through that
> work) with  several other works.
>  
>  
> On 23 Sep 2014, at 21:12, LAURA DAWSON wrote:
>  
> Yes, even at Bowker we made them a repeatable attribute on
> the ISBN  record.
>  
> From: "Madans, Phil" <Phil.Madans@hbgusa.com>
> Date: Tuesday, September 23, 2014 at 3:13 PM
> To: Laura Dawson <ljndawson@gmail.com>, Graham Bell
> <graham@editeur.org>, Bill Kasdorf
> <bkasdorf@apexcovantage.com>,  Ivan  Herman <ivan@w3.org>, W3C
> Public Digital Publishing IG Mailing List
> <public-digipub-ig-comment@w3.org>
> Subject: Re: As an aside, a possibly interesting read....
>  
> I stand corrected on the assignment of the ISTC. Bad choice
> of  words.
> I was speaking more on how I would have to manage them
> internally on  the systems level―that's how I think about
> these things―and that  would be as an attribute.  That  all
> depends on how titles systems  are structured, and I'm not
> saying ours is the best way to do  things,  but I think the
> way we do it is how most do it these days. From a  practical
> standpoint, I'm not sure how else I could handle them. If I
> publish an English and Spanish edition of a work, and the
> ISTC's are  different, then they would be attributes of the
> ISBNs so that I  could  keep them linked internally. We are
> already doing this, as is most  everyone else, and I think
> that is why the ISTC was such a hard  sell.
>  
> ------------------------------------------------------------
> Phil Madans | Executive Director of Digital Publishing
> Technology |  Hachette Book Group | 237 Park Avenue NY 10017
> |212-364-1415 |  phil.madans@hbgusa.com
>  
> From: LAURA DAWSON <ljndawson@gmail.com>
> Date: Tuesday, September 23, 2014 at 2:22 PM
> To: Graham Bell <graham@editeur.org>, Phil Madans
> <phil.madans@hbgusa.com>, Bill Kasdorf
> <bkasdorf@apexcovantage.com>,
> Ivan Herman <ivan@w3.org>, W3C Public Digital Publishing IG
> Mailing  List <public-digipub-ig-comment@w3.org>
> Subject: Re: As an aside, a possibly interesting read....
>  
> Bowker was an ISTC registration agency until recently. We
> pulled out  because of the lack of support in the US, and
> refer the few curious  to Nielsen.
>  
> From: Graham Bell <graham@editeur.org>
> Date: Tuesday, September 23, 2014 at 2:09 PM
> To: Phil Madans <Phil.Madans@hbgusa.com>, Laura Dawson
> <ljndawson@gmail.com>, Bill Kasdorf
> <bkasdorf@apexcovantage.com>,
> Ivan Herman <ivan@w3.org>, W3C Public Digital Publishing IG
> Mailing  List <public-digipub-ig-comment@w3.org>
> Subject: Re: As an aside, a possibly interesting read....
>  
> What Phil and Laura have written certainly summarises -- and
> illustrates -- the debate over identifiers.
>  
> But the text below (from Phil) is a little misleading.
>  
> Whether an ISTC
> is a real work Identifier or not is a matter of debate. I
> disagree that it is. It is actually an attribute of the
> ISBN--that is how they are assigned.
> Different ISBNs of the same master content might have
> different ISTC's.
> Translations for instance.
>  
> The 'rules' of the ISTC say that translations are by
> definition different works, and MUST have different ISTCs
> (though those ISTCs  will be related to each other -- one is a
> 'derived work', and this  close relationship is recorded in
> the registration metadata for the  ISTCs themselves). This
> contrasts with library practice, where  'work'
> is something at a higher level and two translations are
> actually  termed two 'expressions' of the same 'work'. In
> library terms, the  ISTC is an expression identifier. See the
> attached PDF (a slide from  a training session that I deliver
> fairly regularly) for a summary of  how the <indecs> model on
> which ISTC and ONIX are based compares with  the FRBR library
> model. There is -- as far as I know -- no public  identifier
> that works at the FRBR:work level, though libraries may  have
> internal IDs.
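The difference between the two models can be sketched as a small Python data structure. Everything named here is hypothetical: the `TextualWork` class, the placeholder ISTC strings (not real ISTCs), and the `frbr_work` field, which stands in for an internal library ID since no public FRBR work identifier is known to exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextualWork:
    """A 'work' in the ISTC sense (an 'expression' in library terms)."""
    istc: str                           # placeholder value, not a real ISTC
    language: str
    frbr_work: str                      # hypothetical internal library ID
    derived_from: Optional[str] = None  # ISTC of the source work, if any


# An original and its translation: two ISTCs, linked as source and
# derived work, but one and the same work in the FRBR sense.
english = TextualWork(istc="ISTC-EN", language="en", frbr_work="W1")
spanish = TextualWork(istc="ISTC-ES", language="es", frbr_work="W1",
                      derived_from=english.istc)
```

The `derived_from` link mirrors the relationship recorded in the ISTC registration metadata, while the shared `frbr_work` value mirrors the higher-level grouping used in library practice.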
>  
> And I'm pretty sure ISTCs can be assigned without an ISBN
> (and without any product ID at all, in fact) -- they are not
> (strictly) an attribute of the ISBN, though they may be
> presented as such in various systems.
> They can be registered based on a manuscript, prior to there
> being a product.
>  
> On the other hand, there's no doubt that ISTC has so far
> proved  unpopular among publishers, for some of the reasons
> Laura and Phil  list, and its actual usage is minimal.
>  
>  
> Graham
>  
>  
>  
>  
>  
> Graham Bell
> EDItEUR
>  
> Tel: +44 20 7503 6418
> Mob: +44 7887 754958
>  
> EDItEUR Limited is a company limited by guarantee, registered
> in England no 2994705. Registered Office: United House, North
> Road, London
> N7 9DP, UK. Website: http://www.editeur.org
>  
>  
>  
>  
>  
>  
>  
>  
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> GPG: 0x343F1A3D
> WebID: http://www.ivan-herman.net/foaf#me
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
> David Singer
> Manager, Software Standards, Apple Inc.


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me
Received on Friday, 26 September 2014 11:39:41 UTC