open data – Open Objects

'An (even briefer) history of open cultural data' at GLAM-Wiki 2013

These are some of my notes for my invited plenary talk at GLAM-Wiki 2013 (Galleries, Libraries, Archives, Museums & Wikimedia, #GLAMWiki), held at the British Library on April 12-13, 2013. I don't think I stuck that closely to them on the day, and in the interests of brevity I've left out the 'timeline' bits (but you can read about some of them in a related MuseumID article, 'Where next for open cultural data in museums?') to focus on the lessons to be learnt from changes so far. There were lots of great talks and discussion at the event; you can view some of the presentations on Wikimedia UK's YouTube channel.

A (now very) brief history of open cultural data

Firstly, thank you for the invitation to speak… This morning I want to highlight some key moments of change in the history of open cultural data – a history not only of licenses and data, but also of conversations, standards, and collaborations, of moments where things changed… I've included key moments from funders, legislative influences and the commercial sector too, as they create the context in which change happens and often have an effect on what's considered possible. I'll close by considering some of the lessons learnt.

[Please help improve this talk]

A caveat – there may well be a bias towards the English-speaking world (and to museums, because of my background). If you know of an open GLAM (gallery, library, archive, museum) data source I've missed, you can add it to the open cultural data/GLAM API wiki… or to Lotte Belice's timeline of open culture milestones.

Definitions

'open cultural data' is data from cultural institutions that is made available for use in a machine-readable format under an open licence. But each word in open, cultural, data is slightly more complicated so I'll unpack them a little…

Open

Office clerks, FNV. Voorlichting.

While the degree of openness required to be 'open' data can be contentious, at its simplest, 'open' refers to content that is available for use outside the institution that created it, whether for school homework projects, academic monographs or mobile phone apps. 'Open' may refer to licences that clarify the permissions and restrictions placed on data, or to the use of non-proprietary digital technologies, or ideally, to a combination of both open licences and technologies.

Ideally, open data is freely available for use and redistribution by anyone for any purpose, but in reality there are often restrictions. GLAMs may limit commercial use by licensing content for 'non-commercial use only', but as there is no clear definition of 'non-commercial use' in Creative Commons licences, some developers may choose not to risk using a dataset with an unclear licence. GLAMs may also release data for commercial use but still require attribution, either to help retain the provenance of the content, to help people find their way to related content or just because they'd like some credit for their work. GLAMs might also release data under custom licences that deal with their specific circumstances, but they are then difficult to integrate with content from other openly-licensed datasets.

Hybrid licensing models are a pragmatic solution for the current environment. They at least allow some use and may contribute to greater use of open cultural data while other issues are being worked out. For example, some institutions in the UK are making lower resolution images available for re-use under an open licence while reserving high resolution versions for commercial sales and licensing. Or they may differentiate between scholarly and commercial use, or use more restrictive licences for commercially valuable images and release everything else openly.

I think this type of access is better than nothing, particularly if organisations can learn from the experience and release more data next time. Because these hybrid models are often experimental, their reception is important, and it's helpful for GLAMs to be able to show they've had a positive impact and hopefully helped create relationships with groups like Wikipedia.

Cultural

Cultural data is data about objects, publications (such as books, pamphlets, posters or musical scores), archival material, etc, created and distributed by museums, libraries, archives and other organisations.

Data

It's a useful distinction to discuss early with other cultural heritage staff as it's easy to be talking at cross-purposes: data can refer to different types of content, from metadata or tombstone records (the basic titles, names, dates, places, materials, etc of a catalogue record), to entire collection records (including data such as researched and interpretive descriptions of objects, bibliographic data, related themes and narratives) to full digital surrogates of an object, document or book as images or transcribed text. Some organisations release open metadata, others release all their data including their images. If you can't do open data (full content or 'digital surrogates' like photographs or texts) then at least open up the metadata (data about the content) as e.g. CC0 and the rest with another licence. Releasing data may involve licensing images, offering downloads from catalogue sites; 'content donations', APIs and machine-facing interfaces; term lists, etc. Much of the data that isn't images isn't immediately interesting, and may be designed for inter-collections interoperability or mashups rather than media commons.

Why is open cultural data important?

Before I go on, why do we care? Open cultural data is the foundation on which many projects can be built. It helps achieve organisational goals and mission; it can help increase engagement with content; it can create a 'network effect' with related institutions; and it can be re-used by people who share your goals around access to knowledge and information – people like Wikipedians.

Some key moments in open cultural data

Events I discussed included the founding of Wikimedia, Europeana and Flickr Commons, previous GLAM-Wiki conferences, changes in licences for art images, library catalogue records and museum content, GLAM APIs and linked data services and the launch of the Digital Public Library of America next week.

Lessons learnt

Many of the changes are the results of years of conversation and collaboration – change is slow but it does happen. GLAMs work through slow iterations – try something, and if no-one dies, they'll try something else. We are all ambassadors, and we are all translators, helping each domain understand the other.

Contradictory things GLAMs are told they must do

Give content away for the benefit of all
Monetise assets; protect against loss of potential income; protect against mis-use of collections; conserve collections in perpetuity; protect the IP of artists; demonstrate ROI on digitisation

It's not easy for GLAMs to release all their data under an entirely open licence, but they don't do it just to be annoying – it's important to understand some of the pressures they're under. For example, GLAMs usually need to be able to track uses of their data and content to show the impact of digitising and publishing content, so they prefer attribution licences.

The issue of potential lost income – imaginary money that could be made one day if circumstances change, or profit that someone else makes off their opened data – is particularly hard to deal with [and here I ad-libbed, saying that it was like worrying about failing to meet the love of your life because you got on a different tube carriage – you can't live your life chasing ghosts]. Ideally, open data needs to be understood as an input to the creative economy rather than an item on the balance sheet of an individual GLAM.

GLAMs worry about reputational damage, whether appearing on the front page of a tabloid newspaper for the 'wrong' reasons, questions being asked in Parliament, or critique from Wikipedians. Over time, their mindset is changing from keeping 'our data' to being holders, custodians of our shared heritage.

Conversations, communities, collaborations

Conversations matter… we're all working towards the same goal, but we have different types of anxieties and different problems we have to address.

GLAMs are about collections, knowledge, and audiences. Unlike most online work, they are used to seeing the excitement people experience walking through their door – help GLAMs understand what Wikipedians can do for different audiences by making those audiences real to them. GLAMs are also used to being wined and dined before you lay the hard word on them. Just because you don't need to ask for permission to use content doesn't mean you shouldn't start a conversation with an organisation. There are lots of people with similar goals inside organisations, so try to find them and work with them. Trust is a currency, don't blow it!

Being truly collaborative sometimes means compromising (or picking your battles) and it definitely means practising empathy. Open data people could stop talking about open data as something you *do* to GLAMs, and GLAMs could stop thinking open data people just want to make your life difficult.

The role of higher powers

Government attitudes to open data make a big difference and they can also change the risks associated with publishing orphan works. Governments can also help GLAMs open up their content by indemnifying them against the chance that someone else will monetise their data – consider it not a failure of the GLAM but a contribution to the creative and digital economy.

Things that are better than a poke in the eye with a sharp stick

Kittens (and puppies)
Cultural data that's available online but isn't (yet) openly licensed
Cultural data online that is licensed for non-commercial use

Yes, the last two aren't ideal, but they are a great deal better than nothing.

Into the future…

GLAMs and Wikipedians may move at different paces, and may have different priorities and different ways of viewing the world, but we're all working towards the same goals. Not everything is open, but a lot more is open than it used to be. I sensed yesterday [the first day of the conference] that there are still some tensions between Wikimedians and GLAMers, moments when we need to take a deep breath and put empathy before a pithy put-down, but I loved that Kat Walsh's welcome yesterday described how Wikipedia used to focus on how different it was from others but now focuses on reaching out to others and figuring out how we're the same.

GLAMs and Wikipedians have already used open cultural data to make the world a better place. Let's celebrate the progress we've made and keep working on that…

GLAM-WIKI 2013 Friday attendees photograph by Mike Peel (www.mikepeel.net).

Congratulations to everyone who helped make it a great event, but particularly to Daria Cybulska and Andrew Gray (@generalising) for making everything work so smoothly, and Liam Wyatt (@wittylama) for the original invitation to speak.

On releasing museum data and the importance of licenses

I've been preparing for the workshop on 'Hacking and mash-ups for beginners' I'm running at the Museum Computer Network conference (MCN2011) this year, which as always means poking around the GLAM APIs, linked and open data services page for some nice datasets to use in exercises. Meanwhile, people have been using NMSI data at Culture Hack North this weekend, and a question from that event made me realise I never blogged here about the collections data released by NMSI (i.e. the UK Science Museum, National Media Museum and National Railway Museum) back in March 2011.

There's more in the post I wrote on the museum developers blog at the time, Collections data published, but in summary:

We’ve released the files [218,822 object records, 40,596 media records and 173 event records] as a lightweight experiment – we’d like to understand whether, and if so, how, people would use our data. We’d also like to explore the benefits for the museum and for programmers using our data – your feedback will inform decisions about future investment in more structured data as well as helping shape our understanding of the requirements of those users. The files are in CSV format – because it’s a really simple format, viewable in a text editor, we hope that it will be usable by most people.

And since someone asked for some background on how I dealt with the organisational issues, the short answer is – I was pragmatic, figured any reasonable data was better than none, and kept it simple. Or, as I wrote at the time in Update on collections data and geocoded NRM data:

A few people have commented on the licence (Creative Commons Attribution-NonCommercial-ShareAlike, CC BY-NC-SA) and on the format (CSV). As tomorrow is my last day, I can’t really speak for the museum but the intention is to learn from how people use the data – the things they make, the barriers they face, etc – and iterate (as resources allow) until we get to an optimal solution (or solutions). So please get in touch if you’ve got requests or think you can help clear up some of the issues these kinds of projects face, because there’s a good chance you’ll help make a difference.

The licence is a pragmatic solution – it’s clarification of existing terms rather than a change to our terms, because this avoided a need for legal advice, policy review, etc, that would have added several months to the process.

And yes, I know CSV is quick and dirty, but it’s effective. The museum sector is still working out how to match the resources available with the needs of mash-up type developers who work best with JSON and those who are aiming for linked open data; my hope is that your feedback on this will help museums figure out how to support people using open data in various forms. A simple solution like this also means it’s easy for the museum to re-run the export to update the data as time goes on, and that anyone, geek or not, can open the files without being startled by angle brackets and acronyms. Also, did I mention it was quick?
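As a rough illustration of how small the gap between CSV and JSON can be for a mash-up developer, here is a minimal Python sketch that turns CSV rows into a JSON array. The column names and sample rows are hypothetical, made up for this example; they are not the headers or records from the released files.

```python
import csv
import io
import json

# Hypothetical sample: these column names and rows are illustrative
# assumptions, not the headers used in the released NMSI files.
SAMPLE_CSV = """object_id,title,date_made
1,Locomotive model,1850
2,Camera,
"""


def csv_to_json(csv_text):
    """Convert CSV text into a JSON array with one object per row."""
    # DictReader uses the first line as field names for every later row.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(csv_to_json(SAMPLE_CSV))
```

Every value comes out as a string, including empty cells, so anything that needs real dates or numbers would still have to clean the data after conversion.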

In some ways, 2011 has been the year I really understood how much of a barrier a 'non-commercial' license is to re-use ('Wired releases images via Creative Commons, but reopens a debate on what “noncommercial” means' is quite a useful article for understanding the confusion though the LOD-LAM Summit was really where it came together for me). Even I've struggled with questions like 'does a non-commercial license mean I can or can't upload the data to Google Fusion Tables to clean it?', let alone 'can a widget made with non-commercial data be displayed on an ad-supported blog site?'.

Most people who want to play with heritage data want to do the right thing, so an ambiguous 'non-commercial' license effectively prevents them using it (people who want to do bad things with it would probably just scrape the data anyway). I get the sense that museums (and other GLAM orgs) are strongly loss averse, so a full 'commercial use ok' statement might be a bit much, but maybe we can do more to define exactly what's reasonable 'commercial' use and what's not? The Wired article provides some useful starting questions, as does Europeana's discussion of their Data Exchange Agreement. Maybe 2012 will be the year we start to provide answers…

Update, January 2013: I've been writing a piece on open cultural data in museums so have been coming across more material on confusion about 'non-commercial'. The Danger of Using Creative Commons Flickr Photos in Presentations discusses one case where the owner of a photograph was confused about whether it was being used commercially or not. While that may turn out to be a case of mistaken identity, one commenter, Michael, says:

'Commercial and non-commercial are very difficult to determine. As such, I make a point of never using photos that have a non-commercial license. Too much hassle. (I also now do not use photos with a share-alike provision. Same reason, too much hassle.)'

A post on the Creative Commons blog, Library catalog metadata: Open licensing or public domain? discusses the case for and against requesting vs requiring attribution.

Rockets, Lockets and Sprockets – towards audience models about collections?

This is something I wrote for my MSc dissertation ('Playing with difficult objects: game designs for crowdsourcing museum metadata', view the games I built for it at http://museumgam.es/ or check out the paper (Playing with Difficult Objects – Game Designs to Improve Museum Collections) I wrote for Museums and the Web 2011) about the role of 'distinctiveness' in mental models about collections, that's potentially relevant to discussions around telling stories with and collecting metadata about museum collections. I'm posting it here for reference in the conversation about instances vs classes of objects that arose on the UKMCG list after the release of NMSI (Science Museum, National Media Museum, National Railway Museum) data as CSV. One reason I've been thinking about 'distinctiveness' is because I'm wondering how we help people find the interesting records – the iconic objects, the intriguing stories – in a collection of 240,000 objects.

I'm interested in audiences' mental models about when a record refers to the type of object vs the individual object – my sense is that 'rockets', in the model below, are generally thought of as the individual object, and that 'sprockets' are thought of as the type of object, but that it varies for 'lockets', depending how distinctive they are in relation to the person.

I'm also generally curious about the utility of the model, and would love to know of references that might relate to it (whether supporting or otherwise) – if you can think of any, let me know in the comments.

Not all objects are created equal

Both museum objects and the records about them vary in quality. Just as the physical characteristics of one object – its condition, rarity, etc – differ from another, the strength of its associations with important people, events or concepts will also vary. To complicate things further, as the Collections Council of Australia (2009) states, this 'significance' is 'relative, contingent and dynamic'.

When faced with hundreds of thousands of objects, a museum will digitise and describe objects prioritised by 'technical criteria (physical condition of the original material), content criteria (representativeness, uniqueness), and use criteria (demand)' (Karvonen, 2010). In theory, all objects are registered by the collecting institution, so a basic record exists for each. Hopefully, each has been catalogued and the information transcribed or digitised to some extent, but this is often not the case. Records are often missing descriptions, and most lack the contextual histories that would help the general visitor understand an object's significance. Some objects may only have an accession number and a one word label, while those on display in a museum generally have well-researched metadata, detailed descriptions and related narratives or contextualised histories. Variable image quality (or lack of images) is an issue in collections in general. This project excludes object records without images but does include many poor-quality images as a result of importing records from a bulk catalogue.

This project posits that objects can be placed on a scale of 'distinctiveness' based on their visual attributes and the amount and quality of information about them. Within this project, bulk collections with minimal metadata and distinctiveness have been labelled 'sprockets', the smaller set of catalogued objects with some distinctiveness have been labelled 'lockets', and the unique, iconic objects with a full contextual history have been labelled 'rockets'. This concept also references the English Heritage 'building grades' model (DCMS, 2010). During the project, the labels 'heroic', 'semi-heroic' and 'bulk' objects were also used.

These labels are not concerned with actual 'significance' or other valuation or priority placed on the object, but relate only to the potential mental models around them and data related to them – the potential for players to discover something interesting about them as objects, or whether they can just tag them on visual characteristics.

In theory there is a correlation between the significance of an object and the amount of information available about it; there may be particular opportunities for games where this is not the case.

Project label | Information type | Amount of information | Proportion of collection
Rockets | Subjective | Contextual history ('background, events, processes and influences') | Tiny minority
Lockets | Mostly objective, may be contextual to collection purpose | Catalogued (some description) | Minority
Sprockets | Objective | Registered (minimal) | Majority

Table 1 Objects grouped by distinctiveness

This can also be represented visually as a pyramid model:

Figure 2 A figurative illustration of the relative numbers of different levels of objects in a typical history museum.

References

Department for Culture, Media and Sport (DCMS) (2010) Principles of Selection for Listing Buildings [Online] Available from: http://www.english-heritage.org.uk/content/imported-docs/p-t/principles-of-selection-for-listing-buildings-2010.pdf

Karvonen, M. (2010). "Digitising Museum Materials – Towards Visibility and Impact". In Pettersson, S., Hagedorn-Saupe, M., Jyrkkiö, T., Weij, A. (Eds) Encouraging Collections Mobility In Europe. Collections Mobility. [Online] Available from: http://www.lending-for-europe.eu/index.php?id=167

Russell, R., and Winkworth, K. (2009). Significance 2.0: a guide to assessing the significance of collections. Collections Council of Australia. [Online] Available from: http://significance.collectionscouncil.com.au/

Notes from Culture Hack Day (#chd11)

Culture Hack Day (#chd11) was organised by the Royal Opera House (the team being @rachelcoldicutt, @katybeale, @beyongolia, @mildlydiverting, @dracos – and congratulations to them all on an excellent event). As well as a hack event running over two days, they had a session of five minute 'lightning talks' on Saturday, with generous time for discussion between sessions. This worked quite well for providing an entry point to the event for the non-technical, and some interesting discussion resulted from it. My notes are particularly rough this time as I have one arm in a sling and typing my hand-written notes is slow.

Lightning Talks
Tom Uglow @tomux “What if the Web is a Fad?”
'We're good at managing data but not yet good at turning it into things that are more than points of data.' The future is about the physical world, making things real and touchable.

Clare Reddington, @clarered, “What if We Forget about Screens and Make Real Things?”
Some ace examples of real things: Dream Director; Nuage Vert (Helsinki power station projected power consumption of city onto smoke from station – changed people's behaviour through ambient augmentation of the city); Tweeture (a conch, 'permission object' designed to get people looking up from their screens, start conversations); National Vending Machine from Dutch museum.

Leila Johnston, @finalbullet talked about why the world is already fun, and looking at the world with fresh eyes. Chromaroma made Oyster cards into toys, playing with our digital footprint.

Discussion kicked off by Simon Jenkins about helping people get it (benefits of open data etc) – CR – it's about organisational change, fears about transparency, directors don't come to events like this. Understand what's meant by value – cultural and social as well as economic. Don't forget audiences, it has to be meaningful for the people we're making it (cultural products) for.

Comment from @fidothe: Cultural heritage orgs have been screwed over by software companies. There's a disconnect between beautiful hacks around the edges and things that make people's lives easier. [Yes! People who work in cultural heritage orgs often have to deal with clunky tools, difficult or vendor-dependent data export processes, agencies that over-promise and under-deliver. In my experience, cultural orgs don't usually have internal skills for scoping and procuring software or selecting agencies so of course they get screwed over.]

TU: desire to be tangible is becoming more prevalent, data to enhance human experience, the relationship between culture and the way we live our lives.

CR: don't spend the rest of the afternoon reinforcing silos, shouldn't be a dichotomy between cultural heritage people and technologists. [Quick plug for http://museum30.ning.com/, http://groups.google.com/group/antiquist, http://museum-api.pbwiki.com/ and http://museumscomputergroup.org.uk/email-list/ as places where people interested in intersection between cultural heritage and technology can mingle – please let me know of any others!] Mutual respect is required.

Tom Armitage, @infovore “Sod big data and mashups: why not hack on making art?”
Making culture is more important than using it. 3 trends: 1) collection – tools to slice and dice across time or themes; 2) magic materials 3) mechanical art, displays the shape of the original content; 3a) satire – @kanyejordan 'a joke so good a machine could make it'.

Tom Dunbar, @willyouhelp – story-telling possibilities of metadata embedded in media e.g. video [check out Waisda? for a game designed to get metadata added to audio-visual archives]. Metadata could be actors, characters, props, action…

Discussion [?]: remixing in itself isn't always interesting. Skillful appropriation across formats… Universe of editors, filterers, not only creators. 'In editing you end up making new things'.

Matthew Somerville, @dracos, Theatricalia, “What if You Never Needed to Miss a Show?”
'Quite selfish', makes things he needs. Wants not to miss theatre productions with people he likes in/working on them. Theatricalia also collects stories about productions. [But in discussion it came up that the National Theatre asked him to remove data – why?! A recommendation system would definitely get me seeing more theatre, and I say that as a fairly regular but uninformed theatre-goer who relies on word-of-mouth to decide where to spend ticket money.]

Nick Harkaway, @Harkaway on IP and privacy
IP as a way of ringfencing intangible ideas, requiring consent to use. Privacy is the same. Not exciting, kind of annoying but need to find ways to make it work more smoothly while still providing protection. 'Buying is voting', if you buy from Tesco, you are endorsing their policies. 'Code for the change you want to see in the world', build the tools you want cultural orgs to have so they can do better. [Update: Nick has posted his own notes at Notes from Culture Hack Day. I really liked the way he brought ethical considerations to hack enthusiasm for pushing the boundaries of what's possible – the ability to say 'no' is important even if a pain for others.]

Chris Thorpe, @jaggeree. ArtFinder, “What if you could see through the walls of every museum and something could tell you if you’d like it?”

Culture for people who don't know much about culture. Cultural buildings obscure the content inside, stop people being surprised by what's available. It's hard if you don't know where to start. Go for user-centric information. Government Art Collection Explorer – ace! Wants an angel for art galleries to whisper information about the art in his ear. Wants people to look at the art, not the screen of their device [museums also have this concern]. SAP – situated audio platform. Wants a 'flight data recorder' for trips around cultural places.

Discussion around causes of fear and resistance to open data – what do cultural orgs fear and how can they learn more and relax? Fear of loss of provenance – response was that for developers displaying provenance alongside the data gives it credibility; counter-response was that organisations don't realise that's possible. [My view is that the easiest way to get this to change is to change the metrics by which cultural heritage organisations are judged, and resolve the tension between demands to commercialise content to supplement government grants and demands for open access to that same data. Many museums have developed hybrid 'free tombstone, low-res, paid-for high-res' models to deal with this, but it's taken years of negotiation in each institution.] I also ranted about some of these issues at OpenTech 2010, notes at 'Museums meet the 21st century'.

Other discussion and notes from twitter – re soap/drama characters tweeting – I managed to out myself as a Neighbours watcher but it was worth it to share that Neighbours characters tweet and use Facebook. Facebook relationship status updates and events have been included as plot points, and references are made to twitter but not to the accounts of the characters active on the service. I wonder if it's script writers or marketing people who write the characters' tweets? They also tweet in sync with the Australian showings, which raises issues around spoilers and international viewers.

Someone said 'people don't want to interact with cultural institutions online. They want to interact with their content' but I think that's really dependent on the definition of content – as pointed out, points of data have limited utility without further context. There's a catch-22 between cultural orgs not yet making really engaging data and audiences not yet demanding it; hopefully hack days like CHD11 help bridge the gap and turn data into stories and other meaningful content. We're coming up against the limits of what can be done programmatically, especially given variation in quality and extent of cultural heritage data (and most of it is data rather than content).

[Update: after writing this I found a post The lightning talks at Culture Hack Day about the day, which happily picks up on lots of bits I missed. Oh, and another, by Roo Reynolds.]

After the lightning talks I popped over the road to check out the hacking and ended up getting sucked in (the lure of free pizza had a powerful effect!). I worked on a WordPress plugin with Ian Ibbotson @ianibbo that lets you search for a term on the Culture Grid repository and imports the resulting objects into my museum metadata games so that you can play with objects based on your favourite topic. I've put the code on github [https://github.com/mialondon/mmg-import] and will move it from my staging server to live over the next few days so people can play with the objects. It's such a pain only having one hand, and I'm very grateful to Ian for the chance to work together and actually get some code written. This work means that any organisation that's contributed records to the Culture Grid can start to get back tags or facts to enhance their collections, based on data generated by people playing the games. The current 300-ish objects have about 4400 tags and 30 facts, so that's not bad for a freebie. OTOH, I don't know of many museums with the ability to display content created by others on their collections pages or store it in their collections management systems – something for another hack day?

Something I think I'll play around with a bit more is the idea of giving cultural heritage data a quality rating as it's ingested. We discussed whether the ratings would be local to an app (as they could be based on the particular requirements of that application) or generalised and recorded in the CultureGrid service. You could record the provenance of a rating, which might combine the benefits of both approaches. At the moment, my requirements for a 'high quality' record would be: title (e.g. 'The Ashes trophy', if the object has one), name or type of object (e.g. cup), date, place, decent sized image, description.
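To make the idea a bit more concrete, here is a minimal Python sketch of a rating based on the requirements listed above. It is only an illustration under stated assumptions: the field names, the 300 pixel threshold for a 'decent sized' image and the `rated_by` label are all hypothetical, and for simplicity it counts a title as a criterion even though not every object has one.

```python
# Criteria follow the requirements listed above; the key names and the
# pixel threshold are assumptions made for this sketch.
REQUIRED_FIELDS = ("title", "object_type", "date", "place", "description")
MIN_IMAGE_WIDTH = 300  # hypothetical threshold for a 'decent sized' image


def quality_rating(record, source="local-app"):
    """Count how many of the six criteria a record meets and note who rated it."""
    # A text field counts only if it is present and not blank.
    met = [f for f in REQUIRED_FIELDS if str(record.get(f) or "").strip()]
    # The image criterion is judged on width alone in this sketch.
    if (record.get("image_width") or 0) >= MIN_IMAGE_WIDTH:
        met.append("image")
    return {
        "score": len(met),
        "out_of": len(REQUIRED_FIELDS) + 1,
        "met": met,
        "rated_by": source,  # provenance of the rating
    }
```

Keeping `rated_by` alongside the score is one way a rating made for a particular app could later be stored in a shared service without losing track of whose requirements it reflects.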

Finally, if you're interested in hacking around cultural heritage data, there's also historyhackday next weekend. I'm hoping to pop in (dependent on fracture and MSc dissertation), not least because in March I'm starting a PhD in digital humanities, looking at participatory digitisation of geo-located historical material (i.e. getting people to share the transcriptions and other snippets of ad hoc digitisation they do as part of their research) and it's all hugely relevant.

What would Phar Lap do? AKA, what happens when Facebook and museum URIs meet a dead horse?

Phar Lap was a famous race horse. After he died (in film-worthy suspicious circumstances), bits of Phar Lap ended up in three different museums – his skin is at Melbourne Museum, his skeleton is at Te Papa in Wellington, NZ, and his heart is in Canberra at the National Museum of Australia.

I've always been fascinated by the way the public respond to Phar Lap – when I worked at Museum Victoria, the outreach team would regularly get emails written to Phar Lap by people who had seen the film or somehow come across his story. (I was also never quite sure why they thought emailing a dead horse would work). So when I first heard that Phar Lap was on Facebook, I was curious to see which museum would have 'claimed' Phar Lap. Does possession of the most charismatic object (the hide) make it easier for Melbourne Museum to step up as the presence of Phar Lap on social media, or were they just the first to be in that space? The issues around 'ownership' and right to speak for an iconic object like Phar Lap make a brilliant case study for how museums represent their collections online.

And today, when I came across three posts (Responses to "Progress on Museum URIs", Progress on Museum URIs by @sebastianheath, Identifing Objects in Museum Collections by @ekansa) on movements towards stable museum URIs that problematised the "politics of naming and identifying cultural heritage" and the concept of the "exclusive right of museums to identify their objects", I thought of Phar Lap. (Which is nice, cos 80 years and one day ago he won the Melbourne Cup).

Of the three museums that own bits of the dead horse, which gets to publish the canonical digital record about Phar Lap? I hope the question sounds silly enough to highlight the challenges and opportunities in translating physical models to the digital realm. Of course each museum can publish a record (specifically, mint a URI) about Phar Lap (and I hope they do) but none of the museums could prevent the others from publishing (and hopefully they wouldn't want to).

Or as the various blog posts said, "many agents can assert an identity for an object, with those identities together forming a distributed and diverse commentary on the human past", and museums need to play their part: "a common identifier promoted by and discoverable at the holding institution will ease the process of recognizing that two or more identifiers refer to the 'same thing'".

Of course it's not that simple, and if you're interested in the questions the museum sector (by which I hopefully don't only mean me) is grappling with, the museums and the machine-processable web page on Permanent IDs has links to discussions on the MCG list, and I've wrestled a bit with how URIs might look at the Science Museum/NMSI (and I need to go back and review the comments left by various generous people). I'd love to know what other museums are planning, and what consumers of the data might need, so that we can come up with a robust common model for museum URIs.

And to reward you for getting this far, here is a picture of Phar Lap on Facebook as his skin and bones are about to be re-united:

UK Culture Grid wants to know what developers need – get in!

Neil Smith from Knowledge Integration dropped by the Museums and the machine-processable web wiki to ask what users (developers) need to get data in and out of the Culture Grid:

To support the ambitious targets for increasing the number of item records in Culture Grid, we thought now would be a good time to review the venerable old application profile we use for importing metadata into the Grid. I've added a discussion page reviewing options at http://museum-api.pbworks.com/w/page/Culture-Grid-Profile.

We really want the community to be involved in helping ensure that whatever profile (or profiles) we support will meet the needs of users – not only for getting things into the grid but also for getting things out in a format that is useful to them. Although the paper focusses mainly on XML representations of metadata, we're also interested in your views on whether non-XML representations (e.g. RDF or JSON) need to be supported.

So whether you work in a museum or are an external developer who'd like to use museum data, I'd encourage you to think about the four options Neil outlines, and to comment, ask questions, share sample data, vote for your favourite option, whatever, on the Culture Grid Profile page. One of the options is to develop a new model – definitely more time-consuming, but a great opportunity to make your needs known.

As an indication of the type of content that's available through the Culture Grid, I've copied this text from some of their about pages: "It contains over 1 million records from over 50 UK collections, covering a huge range of topics and periods. Records mostly refer to images but also text, audio and video resources and are mostly about museum objects with library, archive and other kinds of collections also included." So, that's:

"information about items in collections (referencing the images, video, audio or other material you offer online about the things in your collections)
information about collections as a whole (their scope, significance and access details)
information about collecting organisations (contact and access details)"

There's a lot of cultural heritage and tech jargon involved on the Culture Grid Profile discussion page – don't hold back on asking for clarifications where needed. I'm certainly not an expert on the various schemas and it's a very long time since I helped work out the Exploring 20th Century London extensions for the original PNDS, but I've given it a go.

If you've read this far, you might also be interested in the first ever Culture Grid Hack Day in Newcastle Upon Tyne on December 3, 2010.

Notes on 'User Generated Content' session, Open Culture Conference 2010

My notes from the 'user generated content' parallel track on first day of the Open Culture 2010 conference. The session started with brief presentations by panellists, then group discussions at various tables on questions suggested by the organisers. These notes are quite rough, and of course any mistakes are mine. I haven't had a chance to look for the speakers' slides yet so inevitably some bits are missing, and I can only report the discussion at the table I was at in the break-out session. I've also blogged my notes from the plenary session of the Open Culture 2010 conference.

User-generated content session, Open Culture, Europeana – the benefits and challenges of UGC.
Kevin Sumption, User-generated content, a MUST DO for cultural institutions
His background – originally a curator of computer sciences. One of first projects he worked on at Powerhouse was D*Hub which presented design collections from V&A, Brooklyn Museum and Powerhouse Museum – it was for curators but also for general public with an interest in design. Been the source of innovation. Editorial crowd-sourcing approach and social tagging, about 8 years ago.

Two years ago he moved to National Maritime Museum, Royal Observatory, Greenwich. One of the first things they did was get involved with Flickr Commons – get historic photographs into public domain, get people involved in tagging. c1000 records in there. General public have been able to identify some images as Alan Villiers images – specialists help provide attribution for the photographer. Only for tens of the thousand or so records, but it was a good introduction to the power of UGC.

Building hybrid exhibition experiences – astronomy photographer of the year – competition on Flickr with real world exhibition for the winners of the competition. 'Blog' with 2000 amateur astronomers, 50 posts a day. Through power of Flickr has become a significant competition and brand in two years.

Joined citizen science consortia. Galaxy Zoo. Brainchild of Oxford – getting public engaged with real science online. Solar Stormwatch c 3000 people analysing and using the data. Many people who get involved gave up science in high school… but people are getting re-engaged with science *and* making meaningful contributions.

Old Weather – helping solve real-world problems with crowdsourcing. Launched two months ago.
Passion for UGC is based around where projects can join very carefully considered consortia, bringing historical datasets with real scientific problems. Can bring large interested public to the project. Many of the public are reconnecting with historical subject matter or sciences.

Judith Bensa-Moortgat, Nationaal Archief, Netherlands, Images for the Future project
Photo collection of more than 1 million photos. Images for the future project aims to save audio-visual heritage through digitisation and conservation of 1.2 million photos.

Once digitised, they optimise by adding metadata and context. Have own documentalists who can add metadata, but it would take years to go through it all. So decided to try using online community to help enrich photo collections. Using existing platforms like Wikipedia, Flickr, Open Street map, they aim to retrieve contextual info generated by the communities. They donated political portraits to Wikimedia Commons and within three weeks more than half had been linked to relevant articles.

Their experiences with Flickr Commons – they joined in 2008. Main goal was to see if community would enrich their photos with comments and tags. In two weeks, they had 400,000 page views for 400 photos, including peaks when on Dutch TV news. In six months, they had 800 photos with over 1 million views. In Oct 2010, they are averaging 100,000 page views a month; 3 million overall.

But what about comments etc? Divided them into categories of comments [with percentage of overall contributions]:

factual info about location, period, people 5%;
link to other sources eg Wikipedia 5%;
personal stories/memories (e.g. someone in image was recognised);
moral discussions;
aesthetical discussions;
translations.

The first two are most important for them.
13,000 tags in many languages (unique tags or total?).
10% of the contributed UGC was useful for contextualisation; tags ensure accessibility [discoverability?] on the web; increased (international) visibility. [Obviously the figures will vary for different projects, depending on what the original intent of the project was]

The issues she'd like to discuss are – copyright, moderation, platforms, community.

Mette Bom, 1001 Stories about Denmark
Story of the day is one of the 1001 stories. It's a website about the history and culture of Denmark. The stories have themes, are connected to a timeline. Started with 50 themes, 180 expert writers writing the 1001 stories, now it's up to the public to comment and write their own stories. Broad definition of what heritage is – from oldest settlement to the 'porn street' – they wanted to expand the definition of heritage.

Target audiences – tourists going to those places; local dedicated experts who have knowledge to contribute. Wanted to take Danish heritage out of museums.

They've created the main website, mobile apps, widget for other sites, web service. Launched in May 2010. 20,000 monthly users. 147 new places added, 1500 pictures added.

Main challenges – how to keep users coming back? 85% new, 15% repeat visitors (ok as aimed at tourists but would like more comments). How to keep press interested and get media coverage? Had a good buzz at the start cos of the celebrities. How to define participation? Is it enough to just be a visitor?

Johan Oomen, Netherlands Institute for Sound and Vision, Vrije Universiteit Amsterdam. Participatory Heritage: the case of the Waisda? video labelling game.
They're using game mechanisms to get people to help them catalogue content. [sounds familiar!]
'In the end, the crowd still rules'.
Tagging is a good way to facilitate time-based annotation [i.e. tag what's on the screen at different times]

Goal of game is consensus between players. Best example in heritage is steve.museum; much of the thinking about using tagging as a game came from Games with a Purpose (gwap.com). Basic rule – players score points when their tag exactly matches the tag entered by another within 10 seconds. Other scoring mechanisms. Lots of channels with images continuously playing.
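The basic matching rule is simple enough to sketch in a few lines of Python. This is my own illustration, not the Waisda? code: the data layout, the one point per match and the decision to ignore case and surrounding whitespace when comparing tags are all assumptions, and the other scoring mechanisms mentioned in the talk are left out.

```python
MATCH_WINDOW_SECONDS = 10  # from the rule described in the talk


def normalise(tag):
    # Assumption: 'exactly matches' ignores case and surrounding whitespace.
    return tag.strip().lower()


def find_matches(entries, window=MATCH_WINDOW_SECONDS):
    """entries is a list of (player, tag, seconds_into_video) tuples.

    Returns the indexes of entries whose tag was also entered by a
    different player within the time window.
    """
    matched = set()
    for i, (player_a, tag_a, time_a) in enumerate(entries):
        for j in range(i + 1, len(entries)):
            player_b, tag_b, time_b = entries[j]
            if (player_a != player_b
                    and normalise(tag_a) == normalise(tag_b)
                    and abs(time_a - time_b) <= window):
                matched.update((i, j))
    return matched


def score(entries, points_per_match=1):
    """Give each player points_per_match (an assumed value) per matched entry."""
    scores = {}
    for index in find_matches(entries):
        player = entries[index][0]
        scores[player] = scores.get(player, 0) + points_per_match
    return scores


# Made-up example: the 'horse' tags are 6.5 seconds apart, the 'jockey' tags 15.
entries = [
    ("anna", "horse", 12.0),
    ("ben", "Horse", 18.5),
    ("anna", "jockey", 30.0),
    ("ben", "jockey", 45.0),
]
print(score(entries))
```

In this example only the 'horse' tags fall inside the window, so each player gets one point and the 'jockey' tags score nothing.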

Linking it to twitter – shout out to friends to come join them playing. Generating traffic – one of the main challenges. Altruistic message 'help the archive' 'improve access to collections' came out of research with users on messages that worked. Worked with existing communities.

Results, first six months – 44,362 pageviews. 340,000 tags to 604 items, 42,068 unique tags.
Matches – 42% of tags entered more than 2 times. Also looked at vocab (GTAA, Cornetto), 1/3 words were valid Dutch words, but only a few part of thesauruses. Tags evaluated by documentalists. Documentary film 85% – tags were useful; for reality series (with less semantic density) tags less useful.

Now looking at how to present tags on the catalogue Powerhouse Museum style. Experimenting with visualising terms, tag clouds when terms represented, also makes it easy to navigate within the video – would have been difficult to do with professional metadata. Looking at 'tag gardening' – invite people to go back to their tags and click to confirm – e.g. show images with particular tags, get more points for doing it.

Future work – tag matching – synonyms and more specific terms – will get more points for more specific terms.

Panel overview by Costis Dallas, research fellow at Athena, assistant professor at Panteion University, Athens.
He wants to add a different dimension – user-generated content as it becomes an object for memory organisations. New body of resources emerging through these communication practices.
Also, we don't have a historiography anymore; memory resides in personal information devices. Mashups, changes in information forms, complex composed information on social networks – these raise new problems for collecting – structural, legal, preservation in context, layered composition. What do we need to do now in order to be able to make use of digital technologies in appropriate, meaningful ways in the future? New kinds of content, participatory curation are challenges for preservation.

Group discussion (breakout tables)
Discussion about how to attract users. [It wasn't defined whether it was how to attract specifically users who'll contribute content or just generally grow the audience and therefore grow the number of content creators within the usual proportions of levels of participation e.g. Nielsen, Forrester; I would also have liked to discuss how to encourage particular kinds of contributions, or to build architectures of participation that provided positive feedback to encourage deeper levels of participation.]

Discussion and conclusions included – go with the strengths of your collections e.g. if one particular audience or content-attracting theme emerges, go with it. Norway has a national portal where people can add content. They held lots of workshops for possible content creators; made contact with specialist organisations [from which you can take the lesson that UGC doesn't happen in a vacuum, and that it helps to invest time and resources into enabling participants and soliciting content]. Recording living history. Physical presence in gallery, at events, is important. Go where audiences already are; use existing platforms.

Discussion about moderation included – once you have comments, how are they integrated back into collections and digital asset management systems? What do you do about incorrect UGC displayed on a page? Not an issue if you separate UGC from museum/authoritative content in the interface design. In the discussion it turned out that Europeana doesn't have a definition of 'moderation'. IMO, it should include community management, including acknowledging and thanking people for contributions (or rather, moderation is a subset of community management). It also includes approving or reviewing and publishing content, dealing with corrections suggested by contributors, dealing with incorrect or offensive UGC, adding improved metadata back to collections repositories.

User-generated content and trust – British Library apparently has 'trusted communities' on their audio content – academic communities (by domain name?) and 'everyone else'. Let other people report content to help weed out bad content.

Then we got onto a really interesting discussion of which country or culture's version of 'offensive' would be used in moderating content. Having worked in the UK and the Netherlands, I know that what's considered a really rude swear word and what's common vocabulary is quite different in each country… but would there be any content left if you considered the lowest common standards for each country? [Though thinking about it later, people manage to watch films and TV and popular music from other countries so I guess they can deal with different standards when it's in context.] To take an extreme content example, a Nazi uniform as memorabilia is illegal in Germany (IIRC) but in the UK it's a fancy dress outfit for a member of the royal family.

Panel reporting back from various table discussions
Kevin's report – discussion varied but similar themes across the two tables. One – focus on the call to action, why should people participate, what's the motivation? How to encourage people to participate? Competitions suggested as one solution, media interest (especially sustained). Notion of core group who'll energise others. Small groups of highly motivated individuals and groups who can act as catalysts [how to recruit, reward, retain]. Use social media to help launch project.

1001 Danish Stories promotional video effectively showed how easy the process of contributing content was, and that it doesn't have to be perfect (the video includes celebrities working the camera [and also being a bit daggy, which I later realised was quite powerful – they weren't cool and aloof]).
Giving users something back – it's not a one-way process. Recognition is important. Immediacy too – if participating in a project, people want to see their contributions acknowledged quickly. Long approval processes lose people.
Removal of content – when different social, political backgrounds with different notions of censorship.

Mette's report – how to get users to contribute – answers mostly to take away the boundaries, give the users more credit than we otherwise tend to. We always think users will mess things up and experts will be embarrassed by user content but that's not the case. In 1001 they had experts correcting other experts. Trust users more, involve experts, ask users what they want. Show you appreciate users, have a dialogue, create community. Make it a part of life and environment of users. Find out who your users are.

Second group – how Europeana can use the content provided in all its forms. Could build web services to present content from different places, linking between different applications.
How to set up goals for user activity – didn't get a lot of answers but one possibility is to start and see how users contribute as you go along. [I also think you shouldn't be experimenting with UGC without some goal in mind – how else will you know if your experiment succeeded? It also focusses your interaction and interface design and gives the user some parameters (much more useful than an intimidating blank page)].

Judith's report (including our table) – motivation and moderation in relation to Europeana – challenging as Europeana are not the owners of the material; also dealing with multilingual collections. Culturally-specific offensive comments. Definition and expectations of Europeana moderation. Resources needed if Europeana does the moderation.
Incentives for moderation – improving data, idealism, helping with translations – people like to help translate.

Johan's report – rewards are important – place users in social charts or give them a feeling of contributing to larger thing; tap into existing community; translate physical world into digital analogue.
Institutional policy – need a clear strategy for e.g. how to integrate the knowledge into the catalogue. Provide training for staff on working with users and online tools. There's value in employing community managers to give people feedback when they leave content.
Using Amazon's Mechanical Turk for annotations…
Doing the projects isn't only of benefit in enriching metadata but also for giving insight into users – discover audiences with particular interests.

Costis commenting – if Europeana only has thumbnails and metadata, is it a missed opportunity to get UGC on more detailed content?

Is Europeana highbrow compared to other platforms like Flickr, FB, so would people be afraid to contribute? [probably – there must be design patterns for encouraging participation from audiences on museum sites, but we're still figuring out what they are]
Business model for crowdsourcing – producing multilingual resources is perfect case for Europeana.

Open to the floor for questions… Importance of local communities, getting out there, using libraries to train people. Local newspapers, connecting to existing communities.

Notes from Europeana's Open Culture Conference 2010

The Open Culture 2010 conference was held in Amsterdam on October 14 – 15. These are my notes from the first day (I couldn't stay for the second day). As always, they're a bit rough, and any mistakes are mine. I haven't had a chance to look for the speakers' slides yet so inevitably some bits are missing. If you're in a hurry, the quote of the day was from Ian Davis: "the goal is not to build a web of data. The goal is to enrich lives through access to information".

The morning was MCd by Costis Dallas and there was a welcome and introduction from the chair of the Europeana Foundation before Jill Cousins (Europeana Foundation) provided an overview of Europeana. I'm sure the figures will be available online, but in summary, they've made good progress in getting from a prototype in 2008 to an operational service in 2010. [Though I have written down that they had 1 million visits in 2010, which is a lot less than a lot of the national museums in the UK though obviously they've had longer to establish a brand and a large percentage of their stats are probably in the 'visit us' areas rather than collections areas.]

Europeana is a super-aggregator, but doesn't show the role of the national or thematic aggregators or portals as providers/collections of content. They're looking to get away from a one-way model to the point where they can get data back out into different places (via APIs etc). They want to move away from being a single destination site to putting information where the user is, to continue their work on advocacy, open source code etc.

Jill discussed various trends, including the idea of an increased understanding that access to culture is the foundation for a creative economy. She mentioned a Kenneth Galbraith [?] quote on spending more on culture in recession as that's where creative solutions come from [does anyone know the reference?]. Also, in a time of increasing nationalism, Europeana provided an example of trans-European cooperation and culture to counter it. Finally, customer needs are changing as visitors move from passive recipients to active participants in online culture.

Europeana [or the talk?] will follow four paths – aggregation, distribution, facilitation, engagement.

Aggregation – build the trusted source for European digital cultural material. Source curated content, linked data, data enrichment, multilinguality, persistent identifiers. 13 million objects but 18-20thC dominance; only 2% of material is audio-visual [?]. Looking towards publishing metadata as linked open data, to make Europeana and cultural heritage work on the web, e.g. of tagging content with controlled vocabularies – Vikings as tagged by Irish and Norwegian people – from 'pillagers' to 'loving fathers'. They can map between these vocabularies with linked data.
Distribution – make the material available to the user wherever they are, whenever they want it. Portals, APIs, widgets, partnerships, getting information into existing school systems.
Facilitate innovation in cultural heritage. Knowledge sharing (linked data), IPR business models, policy – advocacy and public domain, data provider agreements. If you write code based on their open sourced applications, they'd love you to commit any code back into Europeana. Also, look at Europeana labs.
Engagement – create dialogue and participation. [These slides went quickly, I couldn't keep up]. Examples of the Great War Archive into Europe [?]. Showing the European connection – Art Nouveau works across Europe.

The next talk was Liam Wyatt on 'Peace love and metadata', based in part on his experience at the British Museum, where he volunteered for a month to coordinate the relationship between Wikipedia as representative of the open web [might have mistyped that, it seems quite a mantle to claim] and the BM as representative of [missed it]. The goal was to build a proactive relationship of mutual benefit without requiring change in policies or practices of either. [A nice bit of realism because IMO both sides of the museum/Wikipedia relationship are resistant to change and attached firmly to parts of their current models that are in conflict with the other conglomeration.]

The project resulted in 100 new Wikipedia articles, mostly based on the BM/BBC A History of the World in 100 Objects project (AHOW). [Would love to know how many articles were improved as a result too]. They also ran a 'backstage pass' day where Wikipedians come on site, meet with curators, backstage tour, then they sit down and create/update entries. There were also one-on-one collaborators – hooking up Wikipedians and curators/museums with e.g. photos of objects requested.

It's all about improving content, focussing on personal relationships, leveraging the communities; it didn't focus on residents (his own work), none of them are content donation projects, every institution has different needs but can do some version of this.

[I'm curious about why it's about bringing Wikipedians into museums and not turning museum people into Wikipedians but I guess that's a whole different project and may result from the personal relationships anyway.]

Unknown risks are accounted for and overestimated. Unknown rewards are not accounted for and underestimated. [Quoted for truth, and I think this struck a chord with the audience.]

Reasons he's heard for restricting digital access… Most common 'preserving the integrity of the collection' but sounds like need to approve content so can approve of usages. As a result he's seen convoluted copyright claims – it's easy tool to use to retain control.

Derivative works. Commercial use. Different types of free – freedom to use, freedom to study and apply knowledge gained; freedom to make and redistribute copies; [something else].

There are only three applicable licences for Wikipedia. Wikipedia is a non-commercial organisation, but doesn't accept any non-commercially licenced content as 'it would restrict the freedom of people downstream to re-use the content in innovative ways'. [But this rules out much museum content, whether rightly or not, and with varying sources from legal requirements to preference. Licence wars (see the open source movement) are boring, but the public would have access to more museum content on Wikipedia if that restriction was negotiable. Whether that would outweigh the possible 'downstream' benefit is an interesting question.]

Liam asked the audience, do you have a volunteer project in your institution? do you have an e-volunteer program? Well, you do already, you just don't know it. It's a matter of whether you want to engage with them back. You don't have to, and it might be messy.

Wikipedia is not a social network. It is a social construction – it requires a community to exist but socialising is not the goal. Wikipedia is not user generated content. Wikipedia is community curated works. Curated, not only generated. Things can be edited or deleted as well as added [which is always a difficulty for museums thinking about relying on Wikipedia content in the long term, especially as the 'significance' of various objects can be a contested issue.]

Happy datasets are all alike; every unhappy dataset is unhappy in its own way. A good test of data is that it works well with others – technically or legally.

According to Liam, Europeana is the 21st century version of the gallery painting – it's a thumbnail gallery but it could be so much more if the content was technically and legally able to be re-used, integrated.
Data has enough restrictions already e.g. copyright, donor restrictions. But if it comes without restrictions, it's a shame to add them. 'Leave the gate as you found it'.

'We're doing the same thing for the same reason for the same people in the same medium, let's do it together.'

The next sessions were 'tasters' of the three thematic tracks of the second part of the day – linked data, user-generated content, and risks and rewards. This was a great idea because I felt like I wasn't totally missing out on the other sessions.

Ian Davis from Talis talked about 'linked open culture' as a preview of the linked data track. How to take practices learned from linked data and apply them to open culture sector. We're always looking for ways to exchange info, communicate more effectively. We're no longer limited by the physicality of information. 'The semantic web fundamentally changes how information, machines and people are connected together'. The semantic web and its powerful network effects are enabling a radical transformation away from islands of data. One question is, does preservation require protection, isolation, or to copy it as widely as possible?

Conjecture 1 – data outlasts code. MARC stays forever, code changes. This implies that open data is more important than open source.
Conjecture 2 – structured data is more valuable than unstructured. Therefore we should seek to structure our data well.
Conjecture 3 – most of the value in our data will be unexpected and unintended. Therefore we should engineer for serendipity.

'Provide and enable' – UK National Archives phrase. Provide things you're good at – use unique expertise and knowledge [missed bits]… enable as many people as possible to use it – licence data for re-use, give important things identifiers, link widely.

'The goal is not to build a web of data. The goal is to enrich lives through access to information.'
[I think this is my new motto – it sums it up so perfectly. Yes, we carry on about the technology, but only so we can get it built – it's the means to an end, not the end itself. It's not about applying acronyms to content, it's about making content more meaningful, retaining its connection to its source and original context, making the terms of use clear and accessible, making it easy to re-use, encouraging people to make applications and websites with it, blah blah blah – but it's all so that more people can have more meaningful relationships with their contemporary and historical worlds.]

Kevin Sumption from the National Maritime Museum presented on the user-generated content track. A look ahead – the cultural sector and new models… User-generated content (UGC) is a broad description for content created by end users rather than traditional publishers. Museums have been active in photo-sharing, social tagging, Wikipedia editing.

Crowdsourcing e.g. – reCAPTCHA [digitising books, one registration form at a time]. His team was inspired by the approach, created a project called 'Old Weather' – people review logs of WWI British ships to transcribe the content, especially meteorological data. This fills in a gap in the meteorological dataset for 1914 – 1918, allows weather in the period to be modelled, contributes to understanding of global weather patterns.

Also working with Oxford Uni, Rutherford Institute, Zooniverse – solar stormwatch – solar weather forecast. The museum is working with research institutions to provide data to solve real-world problems. [Museums can bring audiences to these projects, re-ignite interest in science, you can sit at home or on the train and make real contributions to on-going research – how cool is that?]

Community collecting, e.g. the Mass Observation project (1937) – relaunched now and you can train to become an observer. You get a brief e.g. families on holidays.

BBC WW2 People's War – archive of WWII memories. [check it out]

RunCoCO – tools for people to set up community-led, generated projects.

Community-led research – a bit more contentious – e.g. Guardian and MPs expenses. Putting data in hands of public, trusting them to generate content. [Though if you're just getting people to help filter up interesting content for review by trusted sources, it's not that risky].

The final thematic track preview was by Charles Oppenheim from Loughborough University, on the risks and rewards of placing metadata and content on the web. Legal context – authorisation of copyright holder is required for [various acts including putting it on the web] unless… it's out of copyright, have explicit permission from rights holder (not implied licence just cos it's online), permission has been granted under licensing scheme, work has been created by a member of staff or under contract with IP assigned.

Issues with cultural objects – media rich content – multiple layers of rights, multiple rights holders, multiple permissions often required. Who owns what rights? Different media industries have different traditions about giving permission. Orphan works.

Possible non-legal ramifications of IPR infringements – loss of trust with rights holders/creators; loss of trust with public; damage to reputation/bad press; breach of contract (funding bodies or licensors); additional fees/costs; takedown of content or entire service.

Help is at hand – Strategic Content Alliance toolkit [online].

Risks beyond IPR – defamation; liability for provision of inaccurate information; illegal materials e.g. pornography, pro-terrorism, violent materials, racist materials, Holocaust denial; data protection/privacy breaches; accidental disclosure of confidential information.

High risk – anything you make money from; copying anything that is in copyright and is commercially available.
Low risk – orphan works of low commercial value – letters, diaries, amateur photographs, films, recordings made by lesser-known people.
Zero risk stuff.
Risks on the other side of the coin [aka excuses for not putting stuff up]

Christian Heilmann on Yahoo!'s YQL, open data tables, APIs

My notes from Christian Heilmann's talk on 'Reaching those web folk' with Yahoo!'s new-ish YQL, open data tables and APIs at the National Maritime Museum [his slides]. My notes are a bit random, but might be useful for people, especially the idea of using YQL as an easy way to prototype APIs (or implement APIs without too much work on your part).

For him it's about data on the web, not just technology.

Number of users is a crap metric, [should consider the user experience].

Stats should be what you use to discover where the problems are, not to pat yourself on the back.

People with BlackBerries have no JavaScript, no CSS. Don't have front-loading navigation they have to scroll through – cos they won't.

If you think of your site as content, then visitors can become 'broadcasting stations' and relay your message. Information flows between readers and content. They're passing it on through distribution channels you're not even aware of.

Content on the web is validated with links and quotes from other sources e.g. Wikipedia. People mix your information with other sources to prove a point or validate it. eg. photos on maps.

How can you be part of it?
Make it easy to access. Structure your websites in a semantic manner (plain old semantic HTML). Title is important, etc. Add more semantic richness with RDF and microformats. Provide data feeds or RSS. Consider the Rolls Royce of distribution – an API. Help other machines make sense of your content – search engines will love you too.
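A rough sketch of why plain semantic HTML helps other machines: with a sensible title and headings, a few lines of standard-library Python can pull out a usable outline of a page. The sample markup and the names `OutlineParser` and `outline` are made up for illustration; this is not how any particular search engine works.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect the <title> and heading text from an HTML page."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # (tag, text) pairs in document order
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace so the text is clean.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


def outline(html_text):
    """Return (title, headings) for a page of HTML."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.title, parser.headings


# Hypothetical page markup, for illustration only.
SAMPLE_PAGE = """
<html><head><title>Ship's chronometer - Example Museum</title></head>
<body>
  <h1>Ship's chronometer</h1>
  <p>Made in London.</p>
  <h2>Object details</h2>
</body></html>
"""

title, headings = outline(SAMPLE_PAGE)
```

If the same content were marked up as anonymous nested divs, the parser would have nothing to hold on to, which is roughly the point about clean HTML.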

Yahoo index via BOSS API – Yahoo do it because they know 'search engines are dying'. Catch-all search engines are stupid. Apples are not the same apples for everyone. Build a cleverer web search.

http://ask-boss.appspot.com/ – NLP analysis of search results. Try 'who is batman in the dark knight' – amazing.

BOSS provides mainstream channel for semantic web and microformats. Microformats are chicken and egg problem. Using searchmonkey technology, BOSS lists this information in the results. BOSS can return all known information about a page, structured.

Key terms parameter in BOSS – what did people enter to find a site/page? http://keywordfinder.org/ – what successful websites have for a given keyword.

Clean HTML is the most important thing, semantic and microformats are good.

If your data is interesting enough, people will try to get to it and remix it.

[Curl has grown up since I last used it! Can be any browser, do cookies, etc.]

Now the web looks like an RSS reader.

Include RSS in your stats.

Guardian – any of their content websites put out RSS through CMS. They then provided an API so end users can filter down to the data they need.

Programmable Web – excellent resource but can be overwhelming.

The more data sources you use, the more time you spend reading API documentation, cos every API is different. Terms, formats, etc. The more sources you connect to, the more chances of error. The more stuff you pull in, the slower the performance of your website.
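To make the 'more sources, more chances of error' point concrete, here is a small sketch of pulling from several sources so that one failure doesn't take down the whole page. The source names and the `aggregate` function are hypothetical, and the 'sources' are stand-in functions rather than real API calls.

```python
def aggregate(sources):
    """Call each source in turn and return (items, errors).

    `sources` maps a name to a zero-argument callable that returns
    a list of dicts. A failing source is recorded, not re-raised.
    """
    items, errors = [], {}
    for name, fetch in sources.items():
        try:
            for record in fetch():
                # Tag each record with where it came from.
                items.append({"source": name, **record})
        except Exception as exc:
            errors[name] = str(exc)
    return items, errors


# Stand-ins for real API calls, for illustration only.
def working_source():
    return [{"title": "Sextant"}]


def broken_source():
    raise TimeoutError("timed out")


items, errors = aggregate({"museum_a": working_source, "museum_b": broken_source})
```

Every extra entry in `sources` is another format to normalise and another place for something to go wrong, which is the housekeeping an aggregation service is meant to take off your hands.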

So you need systems to aggregate sources painlessly. Yahoo Pipes. A visual interface, changes have to be made by hand.

You can't quickly use a pipe in your code and change it on the fly. e.g. change a parameter for one implementation. No version control.

So that's one of the reasons for YQL: Yahoo Query Language. SQL style interface to all Yahoo data (all Yahoo APIs) and the web. Yahoo build things with APIs cos it's the only way to scale. Book: 'scalable websites', all about APIs.

Build queries to Yahoo APIs, try them out in YQL console. Provides diagnostics – which URLs, how long it took, any problems encountered. Allows nesting of API calls.

Outputs XML or JSON, consistent format so you know how to use that information.
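A minimal sketch of what a YQL request looks like from code, assuming the public endpoint and the built-in `rss` table as I understand them from the documentation of the time (both are assumptions here, as is the placeholder feed URL). It only builds the request URL and sends nothing over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the public YQL endpoint as documented around the time of the talk.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Return a request URL for a YQL statement; fmt is 'json' or 'xml'."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    # urlencode takes care of escaping the SQL-style statement.
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


# Hypothetical query: the feed URL is a placeholder, not a real museum feed.
statement = 'select title, link from rss where url="http://example.org/collections/feed.rss"'
url = build_yql_url(statement)

# Round-trip the query string to show the statement survives encoding.
params = parse_qs(urlsplit(url).query)
```

The appeal is that the same pattern applies whatever the underlying source is: one SQL-style statement, one choice of output format.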

YQL also helped internally because of varying APIs between departments.

Gives access to all Yahoo services, any data sources on the web, including html and microformats, and can scrape any website.

Open tables
Easy way to add your own information to YQL. Tell Yahoo the end point where it can get the info.

Jim wanted to allow people to access data without building an API. All it needed was a simple XML file.

[Though you do need RSS results from a search engine to point to – I'm going to see what we can output from our Google Mini and will share any code – or would appreciate some time-saving pointers if anyone has any. Yes, hello, lazyweb, that's my coat, thanks.]

Basically it's a way of providing an API without having to develop one.

Concluding: you can piggyback on people's social connections with other people by making data shareable. [Then your data is shared, yay. Assuming your institution is down with that, and no copyrights or puppies were hurt in the process.]

APIs are a commitment – have to be available all the time, lot of traffic, but hard to measure traffic and benefits. Making APIs scale is a pain and have to be clever to do it. Pointing a YQL open data table at the search engine on your site also works.

Saves documenting API? [??]

YQL handles the interface, caching and data conversion for you. Also limits the access to sensible levels – 10,000 hits/hour.
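To show the kind of work that 'caching and data conversion' saves you, here is a small sketch of doing it yourself: turn RSS search results into JSON and remember the answer for a few minutes. This illustrates the idea only and is not how YQL is implemented; `rss_to_items`, `CachedFeed` and the sample feed are all made up for the example.

```python
import json
import time
import xml.etree.ElementTree as ET


def rss_to_items(rss_text):
    """Convert RSS 2.0 XML into a list of plain dicts."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        }
        for item in root.iter("item")
    ]


class CachedFeed:
    """Serve a feed as JSON, refetching only when the cache is stale."""

    def __init__(self, fetch, max_age_seconds=300, clock=time.monotonic):
        self._fetch = fetch            # zero-argument callable returning RSS text
        self._max_age = max_age_seconds
        self._clock = clock
        self._cached_json = None
        self._fetched_at = None

    def as_json(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at > self._max_age
        if stale:
            self._cached_json = json.dumps(rss_to_items(self._fetch()))
            self._fetched_at = now
        return self._cached_json


# Hypothetical search results, standing in for a real site search feed.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Search results</title>
<item><title>Sextant</title><link>http://example.org/objects/1</link></item>
<item><title>Chronometer</title><link>http://example.org/objects/2</link></item>
</channel></rss>"""

calls = []  # records each time the underlying feed is actually fetched


def fake_fetch():
    calls.append(1)
    return SAMPLE_RSS


feed = CachedFeed(fake_fetch)
first = feed.as_json()
second = feed.as_json()  # served from the cache, no second fetch
```

A real service would also need rate limiting and error handling, which is where 'limits the access to sensible levels' starts to look attractive.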

Jim – 'images from collection' displayed on page as badge thing with YQL as RSS browser. Can just create an RSS feed for an exhibition, then can have a new badge for each new exhibition.

Using YQL protects against injection attacks.

Comment from audience – YQL as meta-API.

Registering is basically making the XML file. You need a Yahoo ID to use the console. [The console is cool, basically like a SQL 'enterprise' system console, with errors and transaction processing costs.]

We had questions about adding in metrics, stats, to use both for reporting and keeping funders/bosses happy and for diagnostics – to e.g. find out which areas of the collection are being queried, what people are finding interesting.

github repository as place to register open tables to make them discoverable.

There's a YQL blog.

[So, that's it – it's probably worth a play, and while your organisation might not want to use it in production without checking out how long the service is likely to be around, etc, it seems like an easy way of playing with API-able data. It'd be really interesting to see what happened if a few museums with some overlap in their collections coverage all made their data available as an open table.]

Happy developers + happy museums = happy punters (my JISC dev8D talk)

This is a rough transcript of my lightning talk 'Happy developers, happy museums' at JISC's dev8D 'developer happiness' days last week. The slides are downloadable or embedded below. The reason I'm posting this is because I'd still love to hear comments, ideas, suggestions, particularly from developers outside the museum sector – there's a contact form on my website, or leave a comment here.

"In this talk I want to show you where museums are in terms of data and hear from you on how we can be more useful.

If you're interested in updates I use my blog to [crap on a bit, ahem] talk about development at work, and also to call for comment on various ideas and prototypes. I'm interested in making the architecture and development process transparent, in being responsive to not only traditional museum visitors as end users, but also to developers. If you think of APIs as a UI for developers, we want ours to be both usable and useful.

I really like museums, I've worked in three museums (or families of museums) now over ten years. I think they can do really good things. Museums should be about delight, serendipity and answers that provoke more questions.

A recent book, 'How does one become a scientist? : survey on the birth of a Vocation' states that '60% of scientists over 30 and 40% of scientists under 30 claim, without prompting, that the Palais de la Découverte [a science museum in Paris] triggered their vocation'.

Museums can really have an impact on how people think about the world, how they think about the possibilities of their lives. I think museums also have a big responsibility – we should be curating collections for current and future audiences, but also trying to provide access to the collections that aren't on display. We should be committed to accessibility, transparency, curation, respecting and enabling expertise.

So today I'm here because we want to share our stuff – we are already – but we want to share better.

We do a lot of audience research and know a lot about some of our users, including our specialist users, but we don't know so much about how people might use our data, it's a relatively new thing for us. We're used to saying 'here are objects in a case, interpretation in label', we're not used to saying 'here's unmediated access, access through the back door'.

Some of the challenges for museums: technology isn't that much of a challenge for us on the whole, except that there are pockets of excellence, people doing amazing things on small budgets with limited resources, but there are also a lot of old-fashioned monolithic project designs with big overheads that take a long time to deliver. Lots of people mean well but don't know what's possible – I want to spread the news about lightweight, more manageable and responsive ways of developing things that make sense and deliver results.

We have a lot of data, but a lot of it's crap. Some of what we have is wrong. Some of it was written 100 years ago, so it doesn't match how we'd describe things now.

We face big institutional challenges. Some curators – (though it does depend on the museum) – fear loss of control, fear intellectual vandalism, that mistakes in user-generated content published on museum sites will cause people to lose trust in museums. We have fears of getting the IT wrong (because for a while we did). Funding and metrics are a big issue – we are paid by how many people come through our door or come to our websites. If we're doing a mashup, how do we measure the usage of that? Are we going to cost our organisations money if we can't measure visits and charge back to the government? [This is particularly an issue for free museums in the UK, an interesting by-product of funding structures.]

Copyright is a huge issue. We might not even own an object that appears in our collections, we might not own the rights to the image of our object, or to the reproductions of an image. We might not have asked for copyright clearance at the time when an object was donated, and the cost of tracing it might be too high, so we can't use that object online. Until we come up with a reliable model that reduces the risk to an institution of saying 'copyright unknown', we're stuck.

The following are some ways I can think of for dealing with these challenges…
Limited resources – we can't build an interface to meet every need for every user, but we can provide the content that they'd use. Some of the semantic web talks here have discussed a 'thin layer' of application over data, and that's kind of where we want to go as well.

Real examples to reduce institutional fear and to provide real examples of working agile projects. [I didn't mean strictly 'agile' methodology but generally projects that deliver early and often and can respond to the changing technical and social environment]

Finding ways for the sector to reward intelligent failure. Some museums will never ever admit to making a mistake. I've heard over the past few days that universities can be the same. Projects that are hyped up suddenly aren't mentioned, and presumably they've failed, but no-one [from the project] ever talks about why so we don't learn from those mistakes. 'Fail faster, succeed sooner'.
I'd like to hear suggestions from you on how we could deal with those challenges.

What are museums known for? Big buildings, full of stuff; experts; we make visitors come to us; we're known for being fun; or for being boring.

Museum websites traditionally appear to be about where we are, when we're open, what's on, is there a cafe on site. Which is useful, but we can do a lot more.

Traditionally we've done pretty exhibition microsites, which are nice – they provide an experience of the exhibition before or after your visit. They're quite marketing-led, they don't necessarily provide an equivalent experience and they don't really let you engage with the content beyond the fact that you're viewing it.

We're doing lots of collections online projects, some of these have ended up being silos – sometimes to the extent that if we want to get data out of them, we have to screen-scrape our own data. These sites often aren't as pretty, they don't always have the same design and usability budgets (if any).

I think we should stick to what we're really good at – understanding the data (collections), understanding how to mediate it, how to interpret it, how to select things that are appropriate for publication, and maybe open it up to other people to do the shiny pretty things. [Sounds almost like I'm advocating doing myself out of a job!]

So we have lots of objects, images, lots of metadata; our collections databases also include people, events, dates, places, businesses and organisations, lots of qualified information around things like dates, they're not necessarily simple fields but that means they can convey a lot more meaning. I've included that because people don't always realise we have information beyond objects and object metadata. This slide [11 below] is an example of one of the challenges – this box of objects might not be catalogued as individual instruments, it might just be catalogued as a 'box of stuff', which doesn't help you find the interesting objects in the box. Lots of good stuff is hidden in this way.

We're slowly getting there. We're opening up access. We're using APIs internally to share data between gallery interactives and the web, we're releasing them as data points, we're using them to provide direct access to collections. At the moment it still tends to be quite mediated access, so you're getting a lot of interpretation and a fewer number of objects because of the resources required to create really nice records and the information around them.

'Read access' is relatively easy, 'write access' is harder because that's when we hit those institutional issues around authority, authorship. Some curators are vaguely horrified that they might have to listen to what the public have to say and actually take some of it back into their collections databases. But they also have to understand that they can't know everything about their collections, and there are some specialist users who will know everything there is to know about a particular widget on a particular kind of train. We'd like to capture that knowledge. [London Transport Museum have had a good go at that.]

Some random URLs of cool stuff happening in museums [http://dashboard.imamuseum.org/, http://www.powerhousemuseum.com/collection/database/menu.php, http://www.brooklynmuseum.org/opencollection/collections/, http://objectwiki.sciencemuseum.org.uk/] – it's still very much in small pockets, it's still difficult for museum staff to convince people to take what seems like a leap of faith and try these non-traditional things out.

We're taking our content to where people hang out. We're exploring things like Flickr Commons, asking people to tag and comment. Some museums have been updating collections records with information added by the public as a result. People are geo-tagging photos for us, which means you can do 'then and now' mashups without a big metadata enhancement budget.

I'd like to see an end to silos. We are kinda getting there but there's not a serious commitment to the idea that we need to let things go, that we need to make sure that collections online are shareable, that they're interoperable, that they can mesh with other things.

Particularly for an education audience, we want to help researchers help themselves, to help developers help others. What else do we have that people might find useful?

What we can do depends on who you are. I could hope that things like enquiry-based learning, mashups, linked data, semantic web technologies, cross-collections searches, faceted browsing to make complex searches easy would be useful, that the concept of museums as a place where information lives – a happy home for metadata mapped around objects and authority records – is useful for people here but I wouldn't want to put words into your mouths.

There's a lot we can do with the technology, but if we're investing resources we need to make sure that they're useful. I can try things in my own time because it's fun, but if we're going to spend limited resources on interfaces for developers then we need to know that it's actually going to help some group of people out there.

The philosophy that I'm working with is 'we've got really cool things, but we can have even cooler things if we can share what we have with everyone else'. "The coolest thing to do with your data will be thought of by someone else". [This quote turns out to be on the event t-shirts, via CRIG!] So that said… any ideas, comments, suggestions?"

And that, thankfully, is where I stopped blathering on. I'll summarise the discussion and post back when I've checked that people are ok with me blogging their comments.

[If the slide show below has a brown face on a black background, it's the right one – slideshare's embed seems to have had a hiccup. If it's not that, try viewing it online directly.]


[My slide images include the Easter Egg museum in Kolomyya, Ukraine and 'Laughter in Odd Places' event at the Museum of London.]

This is a quick dump of some of the text from an interview I did at the event, cos I managed to cover some stuff I didn't quite articulate in my talk:

[On challenges for museums:] We need to change institutional priorities to acknowledge the size of the online audience and the different levels of engagement that are possible with the online experience. Having talked to people here, museums also need to do a bit of a sell job in letting people know that we've changed and we're not just great big imposing buildings full of stuff.

[What are the most exciting developments in the museum sector, online?] For digital collections, going outside the walls of the museum using geo-location to place objects in their original context is amazing. It means you can overlay the streets of the city with past events and lives. Outsourcing curation and negotiating new models of expertise is exciting. Overcoming the fear of the digital surrogate as a competitor for museum visits and understanding that everything we do builds audiences, whether digital or physical.

Project label	Information type	Amount of information	Proportion of collection
Rockets	Subjective	Contextual history ('background, events, processes and influences')	Tiny minority
Lockets	Mostly objective, may be contextual to collection purpose	Catalogued (some description)	Minority
Sprockets	Objective	Registered (minimal)	Majority