
Open Objects

'Every age has its orthodoxy and no orthodoxy is ever right.'

Recent Posts

  • Cultures of collective work and crowdsourcing?
  • Notes from the Museum Data Service launch
  • 57 Varieties of Digital History? Towards the future of looking at the past
  • Links for a talk on crowdsourcing at UCL
  • Three prompts for ‘AI in libraries’ from SCONUL in May 2022

Recent Comments

  • Cultures of collective work and crowdsourcing? – Open Objects on National approaches to crowdsourcing / citizen science?
  • Cultures of collective work and crowdsourcing? – Open Objects on Crowdsourcing the world's heritage
  • Rétrospective 2024: la liste du père-Noël | Convergence AAQ on 57 Varieties of Digital History? Towards the future of looking at the past
  • 2011: an overview - mia ridge on Notes on current issues in Digital Humanities
  • 2006: an overview - mia ridge on Catalhoyuk diaries: Settling in

Crowdsourcing Our Cultural Heritage

Curious about crowdsourcing in cultural heritage? My introduction to Crowdsourcing our Cultural Heritage is free online (for now): Crowdsourcing Our Cultural Heritage: Introduction

About this blog

Posts from a cultural heritage technologist on digital humanities, heritage and history, and user experience research and design. A bit of wishful thinking about organisational change is thrown in, along with a few questions and challenges to the cultural heritage sector on audience research, museum interpretation, interactives and collections online.

Open Objects Archives

  • 2025: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2024: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2023: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2022: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2021: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2020: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2019: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2018: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2017: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2016: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2015: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2014: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2013: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2012: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2011: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2010: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2009: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2008: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2007: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  • 2006: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Tag: metadata

Reflections on teaching Neatline

I've called this post 'Reflections on teaching Neatline' but I could also have called it 'when new digital humanists meet new software'. Or perhaps even 'growing pains in the digital humanities?'.

A few months ago, Anouk Lang at the University of Strathclyde asked me to lead a workshop on Neatline, software from the Scholars' Lab that plots 'archives, objects, and concepts in space and time'. It's a really exciting project, designed especially for humanists – the interfaces and processes are built to express complexity and nuance through handcrafted exhibits that link historical materials, maps and timelines.

The workshop was on Thursday, and judging by the evaluation forms, most people found it useful, but a few really struggled, and teaching it was also slightly tough going. I've been thinking a lot about the possible reasons for that, and I'm sharing them here both as a request for others to share their experiences of similar circumstances and in the hope that they'll be useful to someone else.

The basic outline of the workshop was an intros round (who I am, who they are and what they want to learn); information on what Neatline is and what it can do; time to explore Neatline and explore what the software can and can't do (e.g. login, follow the steps at neatline.org/plugins/neatline to create an item based on a series of correspondence Anouk had been working on, deciding whether you want to transcribe or describe the letter, tweaking its appearance or linking it to other items); and a short period for reflection and discussion (e.g. 'What kinds of interpretive decisions did you find yourself making? What delighted you? What frustrated you?') to finish. If you're curious, you can follow along with my slides and notes or try out the Neatline sandbox site.

The first half was fine, but some people really struggled with the hands-on section. Some of that was down to the software itself – the workshop doubled as a brilliant usability test of Neatline's admin interfaces for audiences outside the original set of users. Neatline was only launched in July this year and isn't even at version 2 yet, so it's entirely understandable that it has a few functional or UX bugs. The documentation isn't integrated into the interface yet (and sometimes lacks information that is probably part of the shared tacit knowledge of the people working on the project), but there is a very comprehensive page about working with Neatline items. Overall, the process of handcrafting timelines and maps for a Neatline exhibit is still closer to 'first, catch your rabbit' than to making a batch of ready-mix cupcakes. Neatline is also designed for a particular view of the world, and it's built on top of other software (Omeka) with another very particular view of the world (and hello, Dublin Core); the result is a strong underlying mental model behind the content-creation process that is foreign to many of its potential users, including some at the workshop.

But it was also partly because I set the bar too high for the exercises and didn't provide enough structure for some of the group. If I'd designed it so they created a simple Neatline item by closely following detailed instructions (as I have done for other, more consciously tech-for-beginners workshops), at least everyone would have achieved a nice quick win and had something they could admire on the screen. From there, some could have tried customising the appearance of their items in small ways, and the more adventurous could have tried a few of the potential ways to present the sample correspondence they were working with, to explore the effects of their digitisation decisions. An even more pragmatic but potentially divisive solution might have been to start with the background and demonstration as I did, but then run the hands-on activity with a smaller group of people who were up for exploring uncharted waters. On a purely practical level, I should also have uploaded the images of the letters used in the exercise to my own host, so that participants didn't have to faff with Dropbox and Omeka records to get an online version of the image to use in Neatline.

And finally, it was also because the group had really mixed ICT skills. Most were fine (bar the occasional bug), but some were not. It's always hard teaching technical subjects when participants have varying levels of skill and aptitude, but when does it go beyond aptitude into attitude – how you feel about being pushed out of your comfort zone? I'd warned everyone at the start that it was new software, but if you haven't used beta software before, I guess you don't have the context to understand what that actually means.

I should make it clear here that I think the participants' achievements outshine any shortcomings – Neatline is a great tool for people working with messy humanities data who want to go beyond plonking markers on Google Maps, and I think everyone got that; most people enjoyed the chance to play with it.

But more generally, I also wonder if it has to do with changing demographics in the digital humanities – increasingly, not everyone interested in DH is an early, or even a late adopter, and someone interested in DH for the funding possibilities and cool factor might not naturally enjoy unstructured exploration of new software, or be intrigued by trying out different combinations of content and functionality just 'to see what happens'.

Practically, more information for people thinking of attending would help – something like 'if you know x already, you'll be fine; if you know y already, you'll be bored'. Describing an event as 'if you like trying new software, this is for you' would probably help too, but it looks like the digital humanities might now be attracting people who don't particularly like working things out as they go along – are they to be excluded? If using software like this is the onboarding experience for people new to the digital humanities, they're not getting the best first impression, but how do you balance the need for fast-moving, innovative work-in-progress to be a bit hacky and untidy around the edges with the desires of a wider group of digital humanities-curious scholars? Is it ok to say 'here be dragons, enter at your own risk'?

Posted on 12 November 2012 (updated 5 February 2015). Categories: Digital humanities; Museums, libraries and archives; Uncategorised. Tags: academia, collections, content management systems, data visualisation, digital history, digital humanities, experimental, geospatial, GLAM, mapping, mental models, metadata, spatial history, user experience, UX. 5 comments.

On releasing museum data and the importance of licenses

I've been preparing for the workshop on 'Hacking and mash-ups for beginners' I'm running at the Museum Computer Network conference (MCN2011) this year, which, as always, means poking around the GLAM APIs, linked and open data services page for some nice datasets to use in exercises. Meanwhile, people have been using NMSI data at Culture Hack North this weekend, and a question from that event made me realise I never blogged here about the collections data released by NMSI (i.e. the UK Science Museum, National Media Museum and National Railway Museum) back in March 2011.

There's more in the post I wrote on the museum developers blog at the time, Collections data published, but in summary:

We’ve released the files [218,822 object records, 40,596 media records and 173 event records] as a lightweight experiment – we’d like to understand whether, and if so, how, people would use our data. We’d also like to explore the benefits for the museum and for programmers using our data – your feedback will inform decisions about future investment in more structured data as well as helping shape our understanding of the requirements of those users. The files are in CSV format – because it’s a really simple format, viewable in a text editor, we hope that it will be usable by most people.

And since someone asked for some background on how I dealt with the organisational issues, the short answer is – I was pragmatic, figured any reasonable data was better than none, and kept it simple.  Or, as I wrote at the time in Update on collections data and geocoded NRM data:

A few people have commented on the licence (Creative Commons Attribution-NonCommercial-ShareAlike, CC BY-NC-SA) and on the format (CSV).  As tomorrow is my last day, I can’t really speak for the museum but the intention is to learn from how people use the data – the things they make, the barriers they face, etc – and iterate (as resources allow) until we get to an optimal solution (or solutions). So please get in touch if you’ve got requests or think you can help clear up some of the issues these kinds of projects face, because there’s a good chance you’ll help make a difference.

The licence is a pragmatic solution – it’s clarification of existing terms rather than a change to our terms, because this avoided a need for legal advice, policy review, etc, that would have added several months to the process.

And yes, I know CSV is quick and dirty, but it’s effective. The museum sector is still working out how to match the resources available with the needs of mash-up type developers who work best with JSON and those who are aiming for linked open data; my hope is that your feedback on this will help museums figure out how to support people using open data in various forms. A simple solution like this also means it’s easy for the museum to re-run the export to update the data as time goes on, and that anyone, geek or not, can open the files without being startled by angle brackets and acronyms. Also, did I mention it was quick?
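
If anyone wants a sense of just how low the barrier is, here's a minimal sketch of peeking at one of the files in Python – the filename and column headings below are placeholders for illustration, not the actual NMSI export:

    import csv

    # A minimal sketch: open a collections CSV export and peek at the first few rows.
    # 'objects.csv' is an assumed filename, not the actual NMSI file.
    with open('objects.csv', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        print(reader.fieldnames)      # the column headings, straight from the file
        for i, row in enumerate(reader):
            print(row)
            if i == 4:                # just the first five records
                break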

In some ways, 2011 has been the year I really understood how much of a barrier a 'non-commercial' license is to re-use ('Wired releases images via Creative Commons, but reopens a debate on what "noncommercial" means' is quite a useful article for understanding the confusion, though the LOD-LAM Summit was really where it came together for me).  Even I've struggled with questions like 'does a non-commercial license mean I can or can't upload the data to Google Fusion Tables to clean it?', let alone 'can a widget made with non-commercial data be displayed on an ad-supported blog site?'.

Most people who want to play with heritage data want to do the right thing, so an ambiguous 'non-commercial' license effectively prevents them using it (people who want to do bad things with it would probably just scrape the data anyway).  I get the sense that museums (and other GLAM orgs) are strongly loss averse, so a full 'commercial use ok' statement might be a bit much, but maybe we can do more to define exactly what's reasonable 'commercial' use and what's not?  The Wired article provides some useful starting questions, as does Europeana's discussion of their Data Exchange Agreement. Maybe 2012 will be the year we start to provide answers…

Update, January 2013: I've been writing a piece on open cultural data in museums so have been coming across more material on confusion about 'non-commercial'.  The Danger of Using Creative Commons Flickr Photos in Presentations discusses one case where the owner of a photograph was confused about whether it was being used commercially or not.  While that may turn out to be a case of mistaken identity, one commenter, Michael, says:

'Commercial and non-commercial are very difficult to determine. As such, I make a point of never using photos that have a non-commercial license. Too much hassle. (I also now do not use photos with a share-alike provision. Same reason, too much hassle.)'

A post on the Creative Commons blog, Library catalog metadata: Open licensing or public domain? discusses the case for and against requesting vs requiring attribution.

Posted on 13 November 2011 (updated 24 January 2016). Categories: Museums; Museums, libraries and archives; Uncategorised. Tags: access, APIs, collections, copyright, cultural content, GLAM, licensing, LODLAM, metadata, museums, open data. 5 comments.

What would Phar Lap do? AKA, what happens when Facebook and museum URIs meet a dead horse?

Phar Lap was a famous race horse.  After he died (in film-worthy suspicious circumstances), bits of Phar Lap ended up in three different museums – his skin is at Melbourne Museum, his skeleton is at Te Papa in Wellington, NZ, and his heart is in Canberra at the National Museum of Australia.

I've always been fascinated by the way the public respond to Phar Lap – when I worked at Museum Victoria, the outreach team would regularly get emails written to Phar Lap by people who had seen the film or somehow come across his story.  (I was also never quite sure why they thought emailing a dead horse would work).  So when I first heard that Phar Lap was on Facebook, I was curious to see which museum would have 'claimed' Phar Lap.  Does possession of the most charismatic object (the hide) make it easier for Melbourne Museum to step up as the presence of Phar Lap on social media, or were they just the first to be in that space?  The issues around 'ownership' and right to speak for an iconic object like Phar Lap make a brilliant case study for how museums represent their collections online.

And today, when I came across three posts (Responses to "Progress on Museum URIs", Progress on Museum URIs by @sebastianheath, Identifing Objects in Museum Collections by @ekansa) on movements towards stable museum URIs that problematised the "politics of naming and identifying cultural heritage" and the concept of the "exclusive right of museums to identify their objects", I thought of Phar Lap.  (Which is nice, cos 80 years and one day ago he won the Melbourne Cup).

Of the three museums that own bits of the dead horse, which gets to publish the canonical digital record about Phar Lap? I hope the question sounds silly enough to highlight the challenges and opportunities in translating physical models to the digital realm. Of course each museum can publish a record (specifically, mint a URI) about Phar Lap (and I hope they do) but none of the museums could prevent the others from publishing (and hopefully they wouldn't want to).

Or as the various blog posts said, "many agents can assert an identity for an object, with those identities together forming a distributed and diverse commentary on the human past", and museums need to play their part: "a common identifier promoted by and discoverable at the holding institution will ease the process of recognizing that two or more identifiers refer to the 'same thing'".
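
To make that concrete, here's a rough sketch (using rdflib) of how three institutions might each mint their own identifier and still point at each other's records – every URI below is invented for illustration, and skos:relatedMatch is just one plausible way of expressing the link, not a recommendation from the posts above:

    from rdflib import Graph, URIRef
    from rdflib.namespace import SKOS

    # Hypothetical identifiers, minted independently by each holding institution.
    hide = URIRef('http://example-museumvictoria.com.au/object/pharlap-hide')
    skeleton = URIRef('http://example-tepapa.govt.nz/object/pharlap-skeleton')
    heart = URIRef('http://example-nma.gov.au/object/pharlap-heart')

    g = Graph()
    # Each museum can assert that its record relates to the others' records,
    # without any single institution owning the 'canonical' digital Phar Lap.
    g.add((hide, SKOS.relatedMatch, skeleton))
    g.add((hide, SKOS.relatedMatch, heart))
    g.add((skeleton, SKOS.relatedMatch, heart))

    print(g.serialize(format='turtle'))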

Of course it's not that simple, and if you're interested in the questions the museum sector (by which I hopefully don't only mean me) is grappling with, the museums and the machine-processable web page on Permanent IDs has links to discussions on the MCG list, and I've wrestled a bit with how URIs might look at the Science Museum/NMSI (and I need to go back and review the comments left by various generous people).  I'd love to know what other museums are planning, and what consumers of the data might need, so that we can come up with a robust common model for museum URIs.

And to reward you for getting this far, here is a picture of Phar Lap on Facebook as his skin and bones are about to be re-united:

Posted on 3 November 2010 (updated 3 December 2023). Categories: Museums; Technology. Tags: APIs, best practice, collections, cultural heritage sector, digital heritage, digital objects, linked data, metadata, object identifiers, open data, semantic web. 1 comment.

UK Culture Grid wants to know what developers need – get in!

Neil Smith from Knowledge Integration dropped by the Museums and the machine-processable web wiki to ask what users (developers) need to get data in and out of the Culture Grid:

To support the ambitious targets for increasing the number of item records in Culture Grid, we thought now would be a good time to review the venerable old application profile we use for importing metadata into the Grid. I've added a discussion page reviewing options at http://museum-api.pbworks.com/w/page/Culture-Grid-Profile.

We really want the community to be involved in helping ensure that whatever profile (or profiles) we support will meet the needs of users – not only for getting things into the grid but also for getting things out in a format that is useful to them. Although the paper focusses mainly on XML representations of metadata, we're also interested in your views on whether non-XML representations (e.g. RDF or JSON) need to be supported.

So whether you work in a museum or are an external developer who'd like to use museum data, I'd encourage you to think about the four options Neil outlines, and to comment, ask questions, share sample data, vote for your favourite option, whatever, on the Culture Grid Profile page.  One of the options is to develop a new model – definitely more time-consuming, but a great opportunity to make your needs known.
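
To make the formats question a little more concrete, here's a purely illustrative sketch of the same minimal item record serialised as JSON and as XML – the field names are invented, not the Culture Grid application profile:

    import json
    import xml.etree.ElementTree as ET

    # An invented, minimal item record - not the actual Culture Grid profile.
    record = {
        'identifier': 'example-museum/object/1234',
        'title': 'Brass ship chronometer',
        'thumbnail': 'http://example.org/images/1234-thumb.jpg',
    }

    # JSON: one call with the standard library.
    print(json.dumps(record, indent=2))

    # XML: the same record as elements.
    item = ET.Element('item')
    for field, value in record.items():
        ET.SubElement(item, field).text = value
    print(ET.tostring(item, encoding='unicode'))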

As an indication of the type of content that's available through the Culture Grid, I've copied this text from some of their about pages: "It contains over 1 million records from over 50 UK collections, covering a huge range of topics and periods.  Records mostly refer to images but also text, audio and video resources and are mostly about museum objects with library, archive and other kinds of collections also included."  So, that's:

  • "information about items in collections (referencing the images, video, audio or other material you offer online about the things in your collections)
  • information about collections as a whole (their scope, significance and access details)
  • information about collecting organisations (contact and access details)"

There's a lot of cultural heritage and tech jargon involved on the Culture Grid Profile discussion page – don't hold back on asking for clarifications where needed.  I'm certainly not an expert on the various schemas and it's a very long time since I helped work out the Exploring 20th Century London extensions for the original PNDS, but I've given it a go.

If you've read this far, you might also be interested in the first ever Culture Grid Hack Day in Newcastle Upon Tyne on December 3, 2010.

Posted on 24 October 2010 (updated 3 December 2023). Categories: Museums; Museums, libraries and archives; Technology. Tags: aggregation, APIs, collections, consultation, cultural heritage sector, digital heritage, digital objects, Europeana, metadata, museums, OAI, object identifiers, open data, open formats, open standards.

Pre-tagging content for sharing on Twitter

The Age newspaper has implemented an interesting social media widget on their article pages. Their 'Join the conversation' widget shows how many people are reading the same article, links to discussion of the page on Twitter, allows the reader to easily add their own comment to Twitter, lists other articles that people who read this article read, and where appropriate, adds a 'Related Coverage' section above the other standard nav links.

I've included a screenshot because I assume attention for each article is fairly transitory, so there may not be other readers or Twitter discussions on the sample article by the time you read it:

I'm particularly interested in their approach to Twitter. They've used automatically generated hash tags to group together discussion of each article on Twitter. For example, in this article, 'First climate refugees start move to new island home' the hash tag is '#fd-e06x'. If you use their 'Comment on Twitter' link it automatically sends a status to the Twitter site (if you're logged in) with the article URL and '#fd-e06x'.

The 'Read tweets' link takes you to the Twitter search page with the pre-populated search term '#fd-e06x'. Of course, the search won't show any discussion about the issues in the article or about the article itself that haven't used their hash tag.

The system also seems to generate a new hash tag for each article, even those that are updates on previous breaking news stories. So these two articles (Woman hit by train fights for life, Woman fights for life after train station accident) about the same incident have different hash tags – perhaps this can be fixed in a later iteration.
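
I don't know how The Age actually generates its tags, but the kind of fix I mean – deriving a short, stable tag from a canonical story identifier rather than from each individual article – might look something like this sketch:

    import hashlib

    def story_hashtag(story_id: str) -> str:
        """Derive a short, stable hashtag from a canonical story id (not an article URL),
        so follow-up articles about the same incident share one tag."""
        digest = hashlib.sha1(story_id.encode('utf-8')).hexdigest()
        return '#fd-' + digest[:4]

    # Two articles updating the same breaking story share an id, so they share a tag.
    print(story_hashtag('2009-07-train-station-accident'))
    print(story_hashtag('2009-07-train-station-accident'))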

I wonder if it would be possible to harvest topical tags from other tweets about the article (checking all the various URL shortening services) to suggest more human-friendly tags related to an article? It's probably not worth it for The Age, but it might be for content with a longer life. Or would an organisation be at risk of appearing to endorse those labels?

Interestingly, the pages don't mention the hash tags, so the process is invisible to the user. Would explaining it lead to greater uptake? I use a Twitter client (partly because I can easily shorten long links, while The Age doesn't pre-shorten their links) so would take the URL directly from the location bar, missing out their hash tag.

Selfishly, because I'm thinking about it for work, I'd love to know how they generate their 'other people who read this' and 'Related Coverage' links. I assume the latter is manually generated, either as direct links or based on article or section metadata.

Also selfishly, I'd like to know their motives, and whether they have any metrics for the success of the project.

Posted on 29 July 2009 (updated 4 February 2015). Categories: Uncategorised. Tags: hashtags, metadata, recommenders, tagging, twitter. 4 comments.

Christian Heilmann on Yahoo!'s YQL, open data tables, APIs

My notes from Christian Heilmann's talk on 'Reaching those web folk' with Yahoo!'s new-ish YQL, open data tables and APIs at the National Maritime Museum [his slides]. My notes are a bit random, but might be useful for people, especially the idea of using YQL as an easy way to prototype APIs (or implement APIs without too much work on your part).

For him it's about data on the web, not just technology.

Number of users is a crap metric, [should consider the user experience].

Stats should be what you use to discover areas where are the problems, not to pat yourself on the back.

People with BlackBerries have no JavaScript, no CSS. Don't give them front-loaded navigation they have to scroll through – cos they won't.

If you think of your site as content, then visitors can become 'broadcasting stations' and relay your message. Information flows between readers and content. They're passing it on through distribution channels you're not even aware of.

Content on the web is validated with links and quotes from other sources e.g. Wikipedia. People mix your information with other sources to prove a point or validate it. eg. photos on maps.

How can you be part of it?
Make it easy to access. Structure your websites in a semantic manner (plain old semantic HTML). Title is important, etc. Add more semantic richness with RDF and microformats. Provide data feeds or RSS. Consider the Rolls Royce of distribution – an API. Help other machines make sense of your content – search engines will love you too.

Yahoo index via BOSS API – Yahoo do it because they know 'search engines are dying'. Catch-all search engines are stupid. Apples are not the same apples for everyone. Build a cleverer web search.

http://ask-boss.appspot.com/ – nlp analysis of search results. Try 'who is batman in the dark knight' – amazing.

BOSS provides mainstream channel for semantic web and microformats. Microformats are chicken and egg problem. Using searchmonkey technology, BOSS lists this information in the results. BOSS can return all known information about a page, structured.

Key terms parameter in BOSS – what did people enter to find a site/page? http://keywordfinder.org/ – what successful websites have for a given keyword.

Clean HTML is the most important thing, semantic and microformats are good.

If your data is interesting enough, people will try to get to it and remix it.

[Curl has grown up since I last used it! It can pretend to be any browser, handle cookies, etc.]

Now the web looks like an RSS reader.

Include RSS in your stats.

Guardian – any of their content websites put out RSS through CMS. They then provided an API so end users can filter down to the data they need.

Programmable Web – excellent resource but can be overwhelming.

The more data sources you use, the more time you spend reading API documentation, as every API is different. Terms, formats, etc. The more sources you connect to, the more chances of error. The more stuff you pull in, the slower the performance of your website.

So you need systems to aggregate sources painlessly. Yahoo Pipes. A visual interface, changes have to be made by hand.

You can't quickly use a pipe in your code and change it on the fly. e.g. change a parameter for one implementation. No version control.

So that's one of the reasons for YQL: Yahoo Query Language. SQL style interface to all yahoo data (all Yahoo APIs) and the web. Yahoo build things with APIs cos it's the only way to scale. Book: 'scalable websites', all about APIs.

Build queries to Yahoo APIs, try them out in YQL console. Provides diagnostics – which URLs, how long it took, any problems encountered. Allows nesting of API calls.

Outputs XML or JSON, consistent format so you know how to use that information.
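
[As a rough illustration of what this looks like from your own code – treat the endpoint, query syntax and response structure below as approximations from memory rather than gospel:]

    import requests

    # Sketch of calling YQL's public endpoint to run a query over an arbitrary
    # RSS feed and get JSON back. The endpoint and feed URL are illustrative.
    endpoint = 'https://query.yahooapis.com/v1/public/yql'
    query = 'select title, link from rss where url="http://example.org/exhibitions.rss"'
    response = requests.get(endpoint, params={'q': query, 'format': 'json'})

    # YQL wraps results in a 'query' envelope; a single result may be a dict rather than a list.
    items = response.json()['query']['results']['item']
    for item in items:
        print(item['title'], item['link'])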

YQL also helped internally because of varying APIs between departments.

Gives access to all Yahoo services, any data sources on the web, including html and microformats, and can scrape any website.

Open tables
Easy way to add own information to YQL. Tell Yahoo end point where can get the info.

Jim wanted to allow people to access data without building an API. All it needed was a simple XML file.

[Though you do need RSS results from a search engine to point to – I'm going to see what we can output from our Google Mini and will share any code – or would appreciate some time-saving pointers if anyone has any. Yes, hello, lazyweb, that's my coat, thanks.]

Basically it's a way of providing an API without having to develop one.

Concluding: you can piggyback on people's social connections with other people by making data shareable. [Then your data is shared, yay. Assuming your institution is down with that, and no copyrights or puppies were hurt in the process.]

APIs are a commitment – they have to be available all the time, there's a lot of traffic, but it's hard to measure traffic and benefits. Making APIs scale is a pain and you have to be clever to do it. Pointing a YQL open data table at the search engine on your site also works.

Saves documenting API? [??]

YQL handles the interface, caching and data conversion for you. Also limits the access to sensible levels – 10,000 hits/hour.

Jim – 'images from collection' displayed on a page as a badge, with YQL as the RSS browser. Can just create an RSS feed for an exhibition, then make a new badge for a new exhibition.

Using YQL protects against injection attacks.

Comment from audience – YQL as meta-API.

Registering is basically making the XML file. You need a Yahoo ID to use the console. [The console is cool, basically like a SQL 'enterprise' system console, with errors and transaction processing costs.]

We had questions about adding in metrics, stats, to use both for reporting and keeping funders/bosses happy and for diagnostics – to e.g. find out which areas of the collection are being queried, what people are finding interesting.

github repository as place to register open tables to make them discoverable.

There's a YQL blog.

[So, that's it – it's probably worth a play, and while your organisation might not want to use it in production without checking out how long the service is likely to be around, etc, it seems like an easy way of playing with API-able data. It'd be really interesting to see what happened if a few museums with some overlap in their collections coverage all made their data available as an open table.]

Posted on 28 April 2009 (updated 3 December 2023). Categories: Technology. Tags: APIs, cultural content, development models, digital heritage, mash-ups, metadata, metrics, microformats, open data, pipes, YQL. 2 comments.

Notes from 'UK Museums on the Web Conference 2008'

I'm back in London after UK Museums on the Web Conference 2008 and the mashed museum day.

In the interests of getting my notes up quickly I'm putting them up pretty much 'as is', so they're still rough around the edges. I'll add links to the speaker slides when they are all online. Some photos from the two days are online – a general search for ukmw08 on Flickr will find some. I have some in a set online now, others are still to come, including some photos of slides so I'll update this as I check the text from the slides. These are my notes from the first session.

The keynote speech was given by Tom Loosemore of Ofcom on the Future of Public Service Content.

[For context, Ofcom is the 'independent regulator and competition authority for the UK communications industries', and their recent second review of public service broadcasting, 'The Digital Opportunity', caused a stir in the digital cultural heritage world for its assessment of the extent to which public sector websites delivered on 'public service purposes and characteristics'. You can read the summary or download the full report.]

'How many of you are on the main board of your institution?'

Leadership doesn't have the vision in place to take advantage of the internet.

Sees the internet as platform for public service, [most importantly] enlightenment. He's here today to enlist our help.

We view the internet through the lens of expectations from the past, definitely in public service broadcasting – 'let's get our programs on the internet'.

What is value for money?

Would that other sectors did the same soul searching

[On the Ofcom review:] 'You can't really review the web, it's bonkers'

Public service characteristics to create a report card. Of the public service characteristics in the online market (high quality, original, innovative, challenging, engaging, discoverable and accessible), 'challenging' is the hardest.

Museums and cultural sector have amazing potential. What are the barriers between the people here who get it and being able to take that opportunity and redefine public service broadcasting?

It's not skills. Maybe ten years ago, not today. And it's not technology. The crucial missing link is leadership and vision, the lack of recognition by people who govern direction of institutions of the huge potential.

[Which does translate into 'more resources', eventually, but perhaps the missing gap right now is curatorial/interpretative resources? Every online project we do generates more enquiries, stretching these people further, and they don't have time to proactively create content for ad hoc projects as it is, especially as their time tends to be allocated a long time in advance.]

What's behind that reluctance, what can you do to help people on your board understand the opportunities? We can ask 'what business are we in? what's the purpose of our institution?'.

Tate recognise they're not just in the business of getting people to go to the Tate venues, they're in the business of informing people about art. Compare that to the Royal Shakespeare Company which is using its online site purely to get bums on seats.

Next opportunity… how do you take the opportunity to digitise your collections and reach a whole new audience? How can you make better use of cultural objects that were previously constrained by physicality?

What opportunities are native to the internet, can only happen there? How can it help your institution to deliver its purpose?

Recognise that you are in the (public service) media business.

How do you measure enlightenment? You could be changing the way people see the world, etc. but you need to measure it to make a case, to know whether you're succeeding. Metrics really really matter in public service arena.

BBC used to look at page views, but developers gamed the system. Then the metric was 'time online', but it stopped people thinking externally. Metric as proxy for quality.

Value = reach x quality. What kind of experience did they have?

Quality is the really hard part. As defined by BBC: quality is in the eye of the beholder. Did the user have an excellent experience?

BBC measure 'net promoter' – how likely are you to recommend this to a friend or colleague, on a scale of 0 – 10?

[But for our sector, what if you don't have any friends with the same interest in x? Would people extrapolate from their specific page on a Roman buckle to recommend the site generally?]

Throw away the 'soggy British middle' – the 7, 8s (out of ten).

Group them as Promoters (9–10/10), Passives (7–8/10) and Detractors (0–6/10). The key measure is the difference between how many Promoters and how many Detractors. This was 'fabulously useful' at the BBC. 30% is a good benchmark.
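
[A quick sketch of the arithmetic, with made-up survey scores:]

    def net_promoter_score(scores):
        """Net promoter score as a percentage: % promoters (9-10) minus % detractors (0-6)."""
        promoters = sum(1 for s in scores if s >= 9)
        detractors = sum(1 for s in scores if s <= 6)
        return 100 * (promoters - detractors) / len(scores)

    # Made-up responses to 'how likely are you to recommend this to a friend?'
    survey = [10, 9, 9, 8, 7, 7, 6, 5, 9, 10, 3, 8]
    print(round(net_promoter_score(survey)))  # 17 - below the 30% benchmark mentioned above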

They mapped the whole BBC portfolio against 'net promoter' percentage and reach, with bubble size showing cost.

It's not necessarily about reaching mass audiences. But when producing for niche audiences – they must love it, and it shouldn't cost that much.

He's telling us this because it's the language of funders, of KPIs – this is hard evidence from real people. You might use a different measure of quality, but you can't talk about opportunities in the abstract; you must have numbers behind them.

Suggested the BBC's 15 Web Principles, including 'fall forward, fast'.

A measure of personal success for him would be that in x years, when he asks 'who here is on the main board of your institution?', at least x should put their hands up.

[I really liked this keynote speech as a kick up the arse in case we started to get too complacent about having figured out what matters to us, as museum geeks. It doesn't count unless we can get through our organisations and get that content out to audiences in ways they can use (and re-use).]

In linking the sessions, Ross Parry mused about the legacy of 18th, 19th century ideas of how to build a museum, how would they be different if museums were created today?

Lee Iverson, How does the web connect content? "Semantic Pragmatics"
'Profoundly disagreed' with some of the things Tom was talking about, wants to have a dialogue.
He asked how many know the background to semantic web stuff? Quite a few hands were raised.

Talking about how the web works now and where it's going. Museums have significant opportunity to push things forward, but must understand possibilities and limitations.

Changing classic relationship – museum websites as face of institution to users. Huge opportunity for federating and aggregating content (between museums) – an order of magnitude better.

He's working with 13 museums, with north-west Native American artefacts. Communities are co-developers, virtually repatriating their (land).

Possibility to connect outside the museum. Powerhouse Museum as an excellent example of why (and how) you should connect.

Becoming connected:
Expose own data from behind presentation layers
Find other data
Integrate – creating a cohesive (situation)
Engage with users

Access to data is core business, curatorial stuff.

RDFa
Pragmatics of standards – get a sense of what it is you're doing [and start, don't try and create the system of everything first], it'll never work. Use existing standards if possible, grab chunks if you can. Standardise only what you minimally need to do to get the utility you need at the moment. Then extend, layers, version 2. A standard is an agreement between a minimum of two people [and doesn't have to be more complicated than that].

"Just do it" – make agreements, get it to work, then engage in the standardisation process.

Relationship between this and semantic web? Semantic web as 'data web'. Competing definitions.

Slide on Tim Berners-Lee on the semantic web in 1999.

Why hasn't it appeared? It's vapourware, you can't make effective standards for it.

Syntax – capability of being interpreted. Semantic – ability to interpret, and to connect interpretations.

Finding data – how much easier would it be if we could just grab the data we want directly from where we want it?

Key is relating what you're doing to what they're doing.

XML vs RDF
Semantic web built on RDF, it's designed for representing metadata. It's substantially different to XML. Lots of reaction against RDF has been reaction against XML encoding, syntactic resistance.

RDF is designed to be manipulated as data, XML is about annotating text. In XML, syntax is the thing, with RDF the data is the thing.

You have to grab an entire XML doc before you can figure out how to smoosh them together. RDF works by reference, so you can just build on it.

RDFa. A way of embedding RDF content directly in XHTML; it relies on the same strategies as microformats. It will be ignored by presentation-oriented systems but is readable by RDF parsers.

[RDF triples vs machine tags? RDF vs microformats? How RDF-like is OAI PMH?]

You can talk about things you don't have a representation for e.g. people.

Ignore the term 'ontology' – it's just a way of talking about a vocabulary.

Four steps for widespread adoption:
Promote practical applications
Develop applications now
[and the slide was gone and I missed the last two steps!]

There was also some material on the limitations of lightweight approaches, hermetically sealed museum data, and user experiences. Also a bit on 'give away structured data', but with a good awareness of the need to keep some data private – object location and value, for example.

Ross – we've had the media context and technical context, now for the sector context.

Paul Marty, Engaging Audiences by connecting to collections online.
Vital connections…

What does it mean to say x% of your collection is online? For whom is it useful?

How to engage audiences around your collections? Not just presenting information.

Goes beyond providing access to data. Research shows audiences want engagement. Surveyed 1200 museum visitors about their requirements. [I would love to see the research] Virtuous circle between museum visits and website visits.

Build on interest, give experience that grabs people.

Romans in Sussex website – multiple museums offering collections for multiple audiences. Re-presenting same content in different ways on the fly.

Audiences
Don't just give general public a list of stuff. Give them a way to engage.

"Engaging a community around a collection is harder than providing access to data about a collection"

Photo of the week – says "What do you know about this photo? Please share your thoughts with us", but there's no link or instructions on how to do it. At least they're trying…

Discussion – Tom, Lee and Paul.

"Why do you digitise collections before had need in mind?" [Because the driver is internal, not external, needs, would be the generous answer; because they could get funding to do it would be my ungenerous answer].

Tom on RDF – how seriously engaged with it to build audiences, tell stories.

BBC licence terms – couldn't re-use data for commercial purposes/at all.

Leadership need to understand opportunities because otherwise they won't support geek stuff.

Qu: terms of engagement – how is it defined?

Paul – US has made same mistakes re digitisation of collections and websites that don't have reusable data.

Participants must be involved in process from the beginning, need input at start from intended users on how it can engage them.

Fiona: why not use existing resources, go to existing sites with established audiences?

Lee: how did YouTube succeed – people were brought by embedded content. [This issue of using 'wrappers' around your content to help it go viral by being embeddable elsewhere was raised in another session too.]

Tom: letting go is how you win, but it's a profound challenge to institutions and their desire to maintain authority.

Posted on 19 June 2008 (updated 26 December 2023). Categories: Uncategorised. Tags: collections, connected collections, cultural heritage sector, digital heritage, metadata, metrics, RDF, RDFa, semantic web, ukmw08, user-generated content, XML. 1 comment.

BBC on microformats, abbr (and why the machine-readable web is good)

This is a good summary of why content that has meaning to other computers (is machine-readable in an intelligent sense) is useful: Microformats and accessibility – a request for help:

The web is a wonderful place for humans but it's a less friendly place for machines. When we read a web page we bring along our own learning, mental models and opinions. The combination of what we read and what we know brings meaning. Machines are less bright.

Given a typical TV schedule page we can easily understand that Eastenders is on at 7:30 on the 15th May 2008. But computers can't parse text the way we can. If we want machines to be able to understand the web (and there are many reasons we might want to) we have to be more explicit about our meaning.

Which is where microformats come in. They're a relatively new technology that allow publishers to add semantic meaning to web pages. These might be events, contact details, personal relationships, geographic locations etc. With this additional machine friendly data you can add events from a web page directly to your calendar, contacts to your address book etc. In theory it's a great combination of a web for people and a web for machines. But it has some potential problems.

One potential problem is microformats' use of something called the abbreviation design pattern.

Basically, if you have a screen reader and have abbreviation expansion turned on, they'd like to hear from you.

This overloading of the abbreviation tag also has implications for people using abbr correctly. It's a nice inline way to help explain jargon, but if browsers and screen readers change the way they parse and present the content, we'll lose that functionality.
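
For anyone who hasn't met the pattern, here's a small sketch – the hCalendar-style markup is my own illustrative example, parsed with BeautifulSoup to show what a machine (or a screen reader expanding abbreviations) gets versus what the rest of us read:

    from bs4 import BeautifulSoup

    # Illustrative hCalendar-style markup: humans read the element text,
    # machines read the ISO date tucked into the abbr title attribute.
    html = '''
    <div class="vevent">
      <abbr class="dtstart" title="2008-05-15T19:30:00+01:00">7.30pm, 15 May 2008</abbr>
      <span class="summary">Eastenders</span>
    </div>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    event_start = soup.find(class_='dtstart')
    print(event_start['title'])      # what a parser (or an expanding screen reader) gets
    print(event_start.get_text())    # what sighted readers see on the page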

The BBC guys also have a very interesting post on 'Helping machines play with programmes'.

Posted on 21 May 2008 (updated 4 February 2015). Categories: Uncategorised. Tags: accessibility, best practice, metadata, microformats, ontologies, semantic web. 2 comments.

Notes from 'Aggregating Museum Data – Use Issues' at MW2008

These are my notes from the session 'Aggregating Museum Data – Use Issues' at Museums and the Web, Montreal, April 2008.

These notes are pretty rough so apologies for any mistakes; I hope they're a bit useful to people, even though it's so late after the event. I've tried to include most of what was covered but it's taken me a while to catch up on some of my notes and recollection is fading. Any comments or corrections are welcome, and the comments in [square brackets] below are me. All the Museums and the Web conference papers and notes I've blogged have been tagged with 'MW2008'.

This session was introduced by David Bearman, and included two papers:
Exploring museum collections online: the quantitative method by Frankie Roberto and Uniting the shanty towns – data combining across multiple institutions by Seb Chan.

David Bearman: the intentionality of the data production process is interesting, i.e. the data Frankie and Seb used wasn't designed for integration.

Frankie Roberto, Exploring museum collections online: the quantitative method (slides)
He didn't give a crap about the quality of the data; it was all about numbers – get as much as possible to see what he could do with it.

The project wasn't entirely authorised or part of his daily routine. It came in part from debates after the museum mash-up day.

Three problems with mashing museum data: getting it, (getting the right) structure, (dealing with) dodgy data

Traditional solutions:
Getting it – APIs
Structure – metadata standards
Dodgy data – hard work (get curators to fix it)

But it doesn't have to be perfect, it just has to be "good enough". Or "assez bon" (and he hopes that translation is good enough).

Options for getting it – screen scrapers, or Freedom of Information (FOI) requests.

FOI request – simple set of fields in machine-readable format.

Structure – some logic in the mapping into simple format.

Dodgy data – go for 'good enough'.

Presenting objects online: existing model – doesn't give you a sense of the archive, the collection, as it's about the individual pages.

So what was he hoping for?
Who, what, where, when, how. ['Why' is the other traditional journalists' question, but it's too difficult in structured information]

And what did he get?
Who: hoping for collection/curator – no data.
What: hoping for 'this is an x'. Instead got categories (based on museum internal structures).
Where: lots of variation – 1496 unique strings. The specificity of terms varies on geographic and historical dimensions.
When: lots of variation
How: hoping for donation/purchase/loan. Got a long list of varied stuff.
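
[To give a sense of the kind of collection-level counting involved, a rough sketch might look like the lines below – the filename and the 'where' column are assumptions for illustration, not Frankie's actual data.]

    from collections import Counter
    import csv

    # Collection-level counting in the spirit of the 'quantitative method':
    # tally the messy free-text 'where' field rather than cleaning it first.
    # 'objects.csv' and the 'where' column are assumed names for illustration.
    with open('objects.csv', newline='', encoding='utf-8') as f:
        places = Counter(row['where'].strip().lower()
                         for row in csv.DictReader(f) if row.get('where'))

    print(len(places), 'unique place strings')
    for place, count in places.most_common(20):
        print(count, place)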

[There were lots of bits about whacking the data together that made people around me (and me, at times) wince. But it took me a while to realise it was a collection-level view, not an individual object view – I guess that's just a reflection of how I think about digital collections – so that doesn't matter as much as if you were reading actual object records. And I'm a bit daft cos the clue ('quantitative') was in the title.

A big part of the museum publication process is making crappy date, location and classification data correct, pretty and human-readable, so the variation Frankie found in the data isn't surprising. Catalogues are designed for managing collections, not for publication (though might curators also over-state the case, because they'd always rather everything was tidied than published in a possibly incorrect or messy state?).

It would have been interesting to hear how the chosen fields related to the intended audience, but it might also have been just a reasonable place to start – somewhere 'good enough' – I'm sure Frankie will correct me if I'm wrong.]

It will be on museum-collections.org. Frankie showed some stuff with Google graph APIs.

Prior art – Pitt Rivers Museum – analysis of collections, 'a picture of Englishness'.

Lessons from politics: theyworkforyou for curators.

Issues: visualisations count all objects equally. e.g. lots of coins vs bigger objects. [Probably just as well no natural history collections then. Damn ants!]

Interactions – present user comments/data back to museums?

Whose role is it anyway, to analyse collections data? And what about private collections?

Sebastian Chan, Uniting the shanty towns – data combining across multiple institutions (slides)
[A paraphrase from the introduction: Seb's team are artists who are also nerds (?)]

Paper is about dealing with the reality of mixing data.

Mess is good, but… mess makes smooshing things together hard. Trying to agree on standards takes a long time, you'll never get anything built.

Combination of methods – scraping + trust-o-meter to mediate 'risk' of taking in data from multiple sources.

Semantic web in practice – dbpedia.

Open Calais – bought out from ClearForest by Reuters. Dynamically generated metadata tags about 'entities', e.g. possible authority records. There are problems with automatically generated data, e.g. guesses at people, organisations, whatever might not be right. 'But it's good enough'. Can then build onto it so users can browse by people, then link to other sites with more information about them in other datasets.

[But can museums generally cope with 'good enough'? What does that do to ideas of 'authority'? If it's machine-generated because there's not enough time for a person in the museum to do it, is there enough time for a person in the museum to clean it? OTOH, the Powerhouse model shows you can crowdsource the cleaning of tags so why not entities. And imagine if we could connect Powerhouse objects in Sydney with data about locations or people in London held at the Museum of London – authority versus utility?

Do we need to critically examine and change the environment in which catalogue data is viewed so that the reputation of our curators/finds specialists in some of the more critical (bitchy) or competitive fields isn't affected by this kind of exposure? I know it's a problem in archaeology too.]

They've published an OpenSearch feed as GeoRSS.
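
[A rough sketch of consuming that kind of feed – the URL is a placeholder rather than the real Powerhouse endpoint, and I'm assuming simple georss:point markup:]

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder URL - not the actual Powerhouse OpenSearch endpoint.
    FEED = 'http://example.org/collection/opensearch?q=sydney&format=georss'
    NS = {'georss': 'http://www.georss.org/georss',
          'atom': 'http://www.w3.org/2005/Atom'}

    with urllib.request.urlopen(FEED) as response:
        tree = ET.parse(response)

    # Pull a title and lat/long pair out of any entry carrying a georss:point.
    for element in tree.iter():
        point = element.find('georss:point', NS)
        if point is None:
            continue
        title = element.findtext('atom:title', namespaces=NS) or element.findtext('title', default='')
        lat, lon = point.text.split()
        print(title, lat, lon)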

Fire Eagle, a Yahoo beta product. Link it to other data sets so you can see what's near you. [If you can get on the beta.]

I think that was the end, and the next bits were questions and discussion.

David Bearman: regarding linked authority files… if we wait until everything is perfect before getting it out there, then "all curators have to die before we can put anything on the web", "just bloody experiment".

Nate (Walker): is 'good enough' good enough? What about involving museums in creating better data and correcting it? [I think, correct me if not]
Seb: no reason why a museum community shouldn't create an OpenCalais equivalent. David: Calais knows what Reuters knows about data. [So we should get together as a sector, nationally or internationally, or as art, science and history museums, and teach it about museum data.]

David – almost saying 'make the uncertainty an opportunity' in museum data – open it up to the public as you may find the answers. Crowdsource the data quality processes in cataloguing! "we find out more by admitting we know less".

Seb – geo-location is critical to allowing communities to engage with this material.

Frankie – doing a big database dump every few months could be enough of an API.

Location sensitive devices are going to be huge.

Seb – we think of search in a very particular way, but we don't know how people want to search i.e. what they want to search for, how they find stuff. [This is one of the sessions that made me think about faceted browsing.]

"Selling a virtual museum to a director is easier than saying 'put all our stuff there and let people take it'".

Tim Hart (Museum Victoria) – is the data from the public going back into the collection management system? Seb – yep. There's no field in EMu for some of the stuff that OpenCalais has, but the use of it from OpenCalais makes a really good business case for putting it into EMu.

Seb – we need tools to create metadata for us, we don't and won't have resources to do it with humans.

Seb – Commons on Flickr is good experiment in giving stuff away. Freebase – not sure if go to that level.

Overall, this was a great session – lots of ideas for small and large things museums can do with digital collections, and it generated lots of interesting and engaged discussion.

[It's interesting, we opened up the dataset from Çatalhöyük for download so that people could make their own interpretations and/or remix the data, but we never got around to implementing interfaces so people could contribute or upload the knowledge they created back to the project, or how to use the queries they'd run.]

Posted on 15 May 2008 (updated 9 January 2021). Categories: Uncategorised. Tags: APIs, collections, cultural content, data visualisation, experimental, geo-tagging, metadata, MW2008, semantic web.

Another model for connecting repositories

Dr Klaus Werner has been working with Intelligent Cultural Resources Information Management (ICRIM) on connecting repositories or information silos from "different cultural heritage organizations – museums, superintendencies, environmental and architectural heritage organizations" to make "information resources accessible, searchable, re-usable and interchangeable via the internet".

You can read more on these CAA07 conference slides: ICRIM: Interconnectivity of information resources across a network of federated repositories (pdf download), and the abstract from the CAA07 paper might also provide some useful context:

The HyperRecord system, used by the Capitoline Museums (Rome) and the Bibliotheca Hertziana (Max-Planck Institute, Rome) and developed as Culture2000 project, is a framework for the inter-connectivity of information resources from museums, archives and cultural institutes.
…
The repositories offer both the usual human interface for research (fulltext, title, etc.) and a smart REST API with a powerful behind-the-scenes direct machine-to-machine facility for querying and retrieving data.
…
The different information resources use digital object identifiers in the form of URNs (up to now, mostly for museum objects) for identification and direct-access. These allow easy aggregation of contents (data, records, documents) not only inside a repository but also across boundaries using the REST API for serving XML over a plain HTTP connection, in fact creating a loosely coupled network of repositories.
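
A minimal sketch of that loosely coupled pattern – fetching a record by its identifier over plain HTTP and parsing the XML – might look like this, with an entirely hypothetical endpoint and response structure:

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical repository endpoint and URN - purely illustrative.
    urn = 'urn:example:capitoline:object:12345'
    url = 'http://example.org/repository/resolve?urn=' + urllib.parse.quote(urn)

    with urllib.request.urlopen(url) as response:
        record = ET.parse(response).getroot()

    # Also hypothetical: pull a couple of fields out of the returned XML record.
    print(record.findtext('title'))
    print(record.findtext('repository'))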

Thanks to Leif Isaksen for putting Dr Werner in contact with me after seeing his paper at CAA07.

Posted on 15 May 2008 (updated 24 February 2015). Categories: Linked open cultural data; Museums, libraries and archives. Tags: APIs, archaeology, collections, connected collections, cultural content, metadata, repositories, search.
