Are shared data standards and shared repositories the future?

I keep having or hearing similar conversations about shared repositories and shared data standards in places like the SWTT, Antiquist, the Museums Computers Group, the mashed museum group and the HEIRNET Data Sans Frontières. The mashed museum hack day also got me excited about the infinite possibilities for mashups and new content creation that accessible and reliable feeds, web services or APIs into cultural heritage content would enable.

So this post is me thinking aloud about the possible next steps – what might be required; what might be possible; and what might be desired but is beyond the scope of any of those groups to resolve, and so must be worked around. I'll probably say something stupid but I'll be interested to see where these conversations go.

I might be missing lots of the subtleties, but it seems to me that there are a few basic things we need: shared technical and semantic data standards, or the ability to map between institutional standards consistently and reliably; and shared data, whether in a central repository or through a service (or services) such as federated search that can bring individual repositories together into a virtual shared repository. The implementation details should be hidden from the end user either way – it should Just Work.
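To make the "map between institutional standards" idea a bit more concrete, here is a minimal sketch of translating local field names into a shared schema. Everything in it is an assumption for illustration: the institution keys (`museum_a`, `unit_b`), the local field names and the three shared fields are invented, not taken from any real standard.

```python
from typing import Any, Dict

# Hypothetical mappings from each institution's local field names
# to a small, invented shared schema (title, date, identifier).
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "museum_a": {"obj_name": "title", "date_made": "date", "acc_no": "identifier"},
    "unit_b": {"FindName": "title", "Period": "date", "ContextID": "identifier"},
}


def to_shared(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Translate one local record into the shared schema.

    Fields with no mapping are kept under "unmapped" rather than
    silently dropped, so nothing is lost in translation.
    """
    mapping = FIELD_MAPS[source]
    shared: Dict[str, Any] = {"source": source}
    unmapped: Dict[str, Any] = {}
    for local_name, value in record.items():
        if local_name in mapping:
            shared[mapping[local_name]] = value
        else:
            unmapped[local_name] = value
    if unmapped:
        shared["unmapped"] = unmapped
    return shared
```

Calling `to_shared({"FindName": "Clay pipe", "Trench": "3"}, "unit_b")` would give a record with a shared `title` plus the leftover `Trench` field under `unmapped`. A real mapping would also need to deal with semantics (date formats, controlled vocabularies), which a simple rename like this does not touch.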

My preference is for shared repositories (virtual or real) because the larger the group, the better the chance that it will be able to provide truly permanent and stable URIs; and because we'd gain efficiencies when introducing new partners, as well as enabling smaller museums or archaeological units who don't have the technical skills or resources to participate. One reason I think stable and permanent URIs are so important is that they're a requirement for the semantic web. They also mean that people re-using our data, whether in their bookmarks, in mashup applications built on top of our data or on a Flickr page, have a reliable link back to our content in the institutional context.

As new partners join, existing tools could often be re-used if the new partner has a collections management system or database already used by a current partner. Tools like those created for project partners to upload records to the PNDS (People's Network Discovery Service, read more at A Standards Framework For Digital Library Programmes) for Exploring 20th Century London could be adapted so that organisations could upload data extracted from their collections management, digital asset or excavation databases to a central source.

But I also think that each (digital or digitised) object should have a unique 'home' URI. This is partly because I worry about replication issues with multiple copies of the same object used in various places and projects across the internet. We've re-used the same objects in several Museum of London projects and partnerships, but the copy of the record used in a project might not be updated when the original record changes (for example, if a date is refined or a location changed). Generally this only applies to older projects, but it's still an issue across the sector.

Probably more importantly for the cultural heritage sector as a whole, a central, authoritative repository or shared URL means we can publish records that should come with a certain level of trust and authority by virtue of their inclusion in the repository. It does require playing a 'gate keeper' role but there are already mechanisms for determining what counts as a museum, and there might also be something for archaeological units and other cultural heritage bodies. Unfortunately this would mean that the Framley Museum wouldn't be able to contribute records – maybe we should call the whole thing off.

If a base record is stored in a central repository, it should be easy to link every instance of its use back to the 'home' URI, or to track discoverable instances and link to them from the home URI. If each digital or digitised object has a home URI, any related content (information records, tags, images, multimedia, narrative records, blog posts, comments, microformats, etc) created inside or outside the institution or sector could link back to the home URI, which would mean the latest information and resources about an object are always available, as well as any corrections or updates which weren't replicated across every instance of the object.
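As a thought experiment, the home URI idea might look something like the sketch below: one authoritative record per URI, a list of known re-uses, and updates made in one place only. The class and method names are all hypothetical, and the `example.org` URIs are placeholders; this is an in-memory toy, not a proposal for how a real repository would be built.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HomeRecord:
    """The authoritative record that lives at an object's home URI."""
    home_uri: str
    data: Dict[str, str]
    version: int = 1
    # URLs of known re-uses (project pages, blog posts, Flickr pages, ...)
    instances: List[str] = field(default_factory=list)


class Repository:
    """Toy in-memory repository keyed by home URI (hypothetical design)."""

    def __init__(self) -> None:
        self._records: Dict[str, HomeRecord] = {}

    def publish(self, home_uri: str, data: Dict[str, str]) -> None:
        self._records[home_uri] = HomeRecord(home_uri, dict(data))

    def register_use(self, home_uri: str, instance_url: str) -> None:
        """Record that the object has been re-used at instance_url."""
        record = self._records[home_uri]
        if instance_url not in record.instances:
            record.instances.append(instance_url)

    def update(self, home_uri: str, **changes: str) -> None:
        """Apply a correction once, at the home record."""
        record = self._records[home_uri]
        record.data.update(changes)
        record.version += 1

    def latest(self, home_uri: str) -> Dict[str, str]:
        """What anyone following a link back to the home URI would see."""
        return dict(self._records[home_uri].data)

    def uses(self, home_uri: str) -> List[str]:
        return list(self._records[home_uri].instances)

    def version(self, home_uri: str) -> int:
        return self._records[home_uri].version
```

The point of the design is that a refined date is changed once via `update`, and every re-use that links back through the home URI gets the corrected version from `latest`, instead of each project keeping its own stale copy.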

Obviously the responses to Michelangelo's David are going to differ from those to a clay pipe, but I think it'd be really interesting to be able to find out how an object was described in different contexts, how it inspired user-generated content or how it was categorised in different environments.

I wonder if you could include the object URL in machine tags on sites like Flickr? [Yes, you could. Or in the description field]
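For what it's worth, Flickr machine tags take the form namespace:predicate=value, so a tag carrying an object's URL could be built and read back roughly like this. The namespace `museum` and predicate `object` are made up for illustration, the URL is a placeholder, and the name check is my own cautious assumption rather than Flickr's exact rule.

```python
import re
from typing import Tuple

# Conservative check on namespace/predicate names (an assumption,
# not Flickr's documented rule): lowercase letter first, then
# lowercase letters, digits or underscores.
_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def machine_tag(namespace: str, predicate: str, value: str) -> str:
    """Build a namespace:predicate=value machine tag."""
    for part in (namespace, predicate):
        if not _NAME.match(part):
            raise ValueError(f"invalid machine tag name: {part!r}")
    return f"{namespace}:{predicate}={value}"


def parse_machine_tag(tag: str) -> Tuple[str, str, str]:
    """Split a machine tag back into (namespace, predicate, value).

    Splitting on the first "=" keeps any ":" or "=" inside a URL
    value intact.
    """
    head, eq, value = tag.partition("=")
    namespace, colon, predicate = head.partition(":")
    if not eq or not colon or not namespace or not predicate:
        raise ValueError(f"not a machine tag: {tag!r}")
    return namespace, predicate, value
```

An application harvesting photos could then call `parse_machine_tag` on each tag and follow the value back to the object's home record.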

There are obviously lots of questions about how standards would be agreed, where repositories would be hosted, how the scope of each is decided, blah blah blah, and I'm sure all these conversations have happened before, but maybe it's finally time for something to happen.

[Update – Leif has two posts on a very similar topic at HEIR tonic and News from the Ouse.

Also I found this wiki on the business case for web standards – what a great idea!]

[Update – this was written in June 2007, but recent movements for Linked Open Data outside the sector mean it's becoming more technically feasible. Institutionally, on the other hand, nothing seems to have changed in the last year.]

I came across a mention of 'Digital Object Identifiers' in a paper on digital humanities, and discovered DOI.org:

A DOI name – a digital identifier for any object of intellectual property. A DOI name provides a means of persistently identifying a piece of intellectual property on a digital network and associating it with related current data in a structured extensible way.

A DOI name can apply to any form of intellectual property expressed in any digital environment. DOI names have been called "the bar code for intellectual property": like the physical bar code, they are enabling tools for use all through the supply chain to add value and save cost.

A DOI name differs from commonly used internet pointers to material such as the URL because it identifies an object as a first-class entity, not simply the place where the object is located. The DOI name identifies an entity directly, not some attribute of an object (an address is an attribute of a thing, whereas the thing itself is a first class object).
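To illustrate the difference between the name and the location: a DOI name has a prefix beginning "10." and a suffix, separated by a slash, and it only becomes a web address when handed to a resolver such as https://doi.org/. The sketch below is a minimal take on that, with function names of my own invention and only very light validation; the sample DOI name in the usage note is there purely as a sample value.

```python
from typing import Tuple
from urllib.parse import quote

# Public DOI resolver; the identifier itself does not contain this.
RESOLVER = "https://doi.org/"


def split_doi(doi_name: str) -> Tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, slash, suffix = doi_name.strip().partition("/")
    if not slash or not suffix or not prefix.startswith("10.") or len(prefix) <= 3:
        raise ValueError(f"not a DOI name: {doi_name!r}")
    return prefix, suffix


def doi_to_url(doi_name: str) -> str:
    """Turn the identifier into one possible location via the resolver."""
    prefix, suffix = split_doi(doi_name)
    # Percent-encode characters that are unsafe in a URL path,
    # leaving slashes in the suffix as they are.
    return RESOLVER + quote(f"{prefix}/{suffix}", safe="/")
```

So `doi_to_url("10.1000/182")` gives a resolver link, but the thing being identified is still `10.1000/182`: where the resolver sends you can change without the name changing.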

At some stage I have a big post to write about stable, permanent URIs for museum objects, and I'll be re-visiting this site when I start that.