It's here! Crowdsourcing our Cultural Heritage is now available

My edited volume, Crowdsourcing our Cultural Heritage, is now available! My introduction (Crowdsourcing our cultural heritage: Introduction), which provides an overview of the field and outlines the contribution of the 12 chapters, is online at Ashgate's site, along with the table of contents and index. There's a 10% discount if you order online.

If you're in London on the evening of Thursday 20th November, we're celebrating with a book launch party at the UCL Centre for Digital Humanities. Register at http://crowdsourcingculturalheritage.eventbrite.co.uk.

Here's the back page blurb: "Crowdsourcing, or asking the general public to help contribute to shared goals, is increasingly popular in memory institutions as a tool for digitising or computing vast amounts of data. This book brings together for the first time the collected wisdom of international leaders in the theory and practice of crowdsourcing in cultural heritage. It features eight accessible case studies of groundbreaking projects from leading cultural heritage and academic institutions, and four thought-provoking essays that reflect on the wider implications of this engagement for participants and on the institutions themselves.

Crowdsourcing in cultural heritage is more than a framework for creating content: as a form of mutually beneficial engagement with the collections and research of museums, libraries, archives and academia, it benefits both audiences and institutions. However, successful crowdsourcing projects reflect a commitment to developing effective interface and technical designs. This book will help practitioners who wish to create their own crowdsourcing projects understand how other institutions devised the right combination of source material and the tasks for their ‘crowd’. The authors provide theoretically informed, actionable insights on crowdsourcing in cultural heritage, outlining the context in which their projects were created, the challenges and opportunities that informed decisions during implementation, and reflecting on the results.

This book will be essential reading for information and cultural management professionals, students and researchers in universities, corporate, public or academic libraries, museums and archives."

Massive thanks to the following authors of chapters for their intellectual generosity and their patience with up to five rounds of edits, plus proofing, indexing and more…

  1. Crowdsourcing in Brooklyn, Shelley Bernstein;
  2. Old Weather: approaching collections from a different angle, Lucinda Blaser;
  3. ‘Many hands make light work. Many hands together make merry work’: Transcribe Bentham and crowdsourcing manuscript collections, Tim Causer and Melissa Terras;
  4. Build, analyse and generalise: community transcription of the Papers of the War Department and the development of Scripto, Sharon M. Leon;
  5. What's on the menu?: crowdsourcing at the New York Public Library, Michael Lascarides and Ben Vershbow;
  6. What’s Welsh for ‘crowdsourcing’? Citizen science and community engagement at the National Library of Wales, Lyn Lewis Dafis, Lorna M. Hughes and Rhian James;
  7. Waisda?: making videos findable through crowdsourced annotations, Johan Oomen, Riste Gligorov and Michiel Hildebrand;
  8. Your Paintings Tagger: crowdsourcing descriptive metadata for a national virtual collection, Kathryn Eccles and Andrew Greg;
  9. Crowdsourcing: Crowding out the archivist? Locating crowdsourcing within the broader landscape of participatory archives, Alexandra Eveleigh;
  10. How the crowd can surprise us: humanities crowdsourcing and the creation of knowledge, Stuart Dunn and Mark Hedges;
  11. The role of open authority in a collaborative web, Lori Byrd Phillips;
  12. Making crowdsourcing compatible with the missions and values of cultural heritage organisations, Trevor Owens.

Defining the scope: week one as a CENDARI Fellow

I'm coming to the end of my first week as a Transnational Access Fellow with the CENDARI project at the Trinity College Dublin Long Room Hub. CENDARI 'aims to leverage innovative technologies to provide historians with the tools by which to contextualise, customise and share their research', which dovetails with my PhD research incredibly well. This Fellowship gives me an opportunity to extend my ideas about 'Enriching cultural heritage collections through a Participatory Commons' without trying to squish them into a history thesis, and is probably perfectly timed in giving me a break from writing up.

View over Trinity College Dublin

There are two parts to my CENDARI project 'Bridging collections with a participatory Commons: a pilot with World War One archives'. The first involves working on the technical, data and cultural context/requirements for the 'participatory history commons' as an infrastructure; the second is a demonstrator based on that infrastructure. I'll be working out how official records and 'shoebox archives' can be mined and indexed to help provide what I'm calling 'computationally-generated context' for people researching lives touched by World War One.

This week I've read metadata schema (MODS extended with TEI and a local schema, if you're interested) and ontology guidelines, attended some lively seminars on Irish history, gotten my head around CENDARI's work packages and the structure of the British army during WWI. I've started a list of nearby local history societies with active research projects to see if I can find some working on WWI history – I'd love to work with people who have sources they want to digitise and generally do more with, and people who are actively doing research on First World War lives. I've started to read sample primary materials and collect machine-readable sources so I can test out approaches by manually marking-up and linking different repositories of records. I'm going to spend the rest of the day tidying up my list of outcomes and deliverables and sketching out how all the different aspects of my project fit together. And tonight I'm going to check out some of the events at Discover Research Dublin. Nerd joy!

'The cooperative archive'?

Finally, I've dealt with something I'd put off for ages. 'Commons' is one of those tricky words that's less resonant than it could be, so I looked for a better name than the 'participatory history commons'. I doodled around words like collation, congeries, cluster, demos, assemblage, sources, commons, active, engaged, participatory, opus, archive, digital, posse, mob, cahoots and phrases like collaborative collections, collaborative history, history cooperative, but eventually settled on 'cooperative archive'. This appeals because 'cooperative' encompasses attitudes or values around working together for a common purpose, and it includes those who share records and those who actively work to enhance and contextualise them. 'Archive' suggests primary sources, and can be applied to informal collections of 'shoebox archives' and the official holdings of museums, libraries and archives.

What do you think – does 'cooperative archive' work for you? Does your first reaction to the name evoke anything like my thoughts above?

Update, October 11: following some market testing on Facebook, it seems 'collaborative collections' best describes my vision.

These are a few of my favourite (audience research) things

On Friday I popped into London to give a talk at the Art of Digital meetup at the Photographers' Gallery. It's a great series of events organised by Caroline Heron and Jo Healy, so go along sometime if you can. I talked about different ways of doing audience research. (And when I wrote the line 'getting to know you' it gave me an earworm and a 'lessons from musicals' theme.) It was a talk of two halves. In the first, I outlined different ways of thinking about audience research, then went into a little more detail about a few of my favourite (audience research) things.

There are lots of different ways to understand the contexts and needs different audiences bring to your offerings. You probably also want to test to see if what you're making works for them and to get a sense of what they're currently doing with your websites, apps or venues. It can help to think of research methods along scales of time, distance, numbers, 'density' and intimacy. (Or you could think of it as a journey from 'somewhere out there' to 'dancing cheek to cheek'…)

'Time' refers to both how much time a method asks from the audience and how much time it takes to analyse the results. There's no getting around the fact that nearly all methods require time to plan, prepare and pilot, sorry! You can run 5 second tests that ask remote visitors a single question, or spend months embedded in a workplace shadowing people (and more time afterwards analysing the results). On the distance scale, you can work with remote testers located anywhere across the world, ask people visiting your museum to look at a few prototype screens, or physically locate yourself in someone's office for an interview or observation.

Numbers and 'density' (or the richness of communication and the resulting data) tend to be inversely linked. Analytics or log files let you gather data from millions of website or app users, and one-question surveys can garner thousands of responses, while you can interview dozens of people or test prototypes with 5-8 users each time. However, the conversations you'll have in a semi-structured interview are much richer than the responses you'll get to a multiple-choice questionnaire. This is partly because it's a two-way dialogue, and partly because in-person interviews convey more information, including tone of voice, physical gestures, impressions of a location and possibly even physical artefacts or demonstrations. Generally, methods that can reach millions of remote people produce lots of point data, while more intimate methods that involve spending lots of time with just a few people produce small datasets of really rich data.

So here are a few of my favourite things: analytics, one-question surveys, 5 second tests, lightweight usability tests, semi-structured interviews, and on-site observations. Ultimately, the methods you use are a balance of time and distance, the richness of the data required, and whether you want to understand the requirements for, or measure the performance of, a site or tool.

Analytics are great for understanding how people found you, what they're doing on your site, and how this changes over time. Analytics can help you work out which bits of a website need tweaking, and measure the impact of changes. But that only gets you so far – how do you know which trends are meaningful and which are just noise? To understand why people are doing what they do, you need other forms of research to flesh out the numbers.
One-question surveys are a great way of finding out why people are on your site, and whether they've succeeded in achieving their goals for being there. We linked survey answers to analytics for the last Let's Get Real project so we could see how people who were there for different reasons behaved on the site, but you don't need to go that far – any information about why people are on your site is better than none!
5 second tests and lightweight usability tests are both ways to find out how well a design works for its intended audiences. 5 second tests show people an interface for 5 seconds, then ask them what they remember about it, or where they'd click to do a particular task. They're a good way to make sure your text and design are clear. Usability tests take from a few minutes to an hour, and are usually done in person. One of my favourite lightweight tests involves grabbing a sketch, an iPad or laptop and asking people in a café or other space if they'd help by testing a site for a few minutes. You can gather lots of feedback really quickly, and report back with a prioritised list of fixes by the end of the day. 
Semi-structured interviews use the same set of questions each time to ensure some consistency between interviews, but they're flexible enough to let you delve into detail and follow any interesting diversions that arise during the conversation. Interviews and observations can be even more informative if they're done in the space where the activities you're interested in take place. 'Contextual inquiry' goes a step further by including observations of the tasks you're interested in being performed. If you can 'apprentice' yourself to someone, it's a great way to have them explain to you why things are done the way they are. However, it's obviously a lot more difficult to find someone willing and able to let you observe them in this way, it's not appropriate for every task or research question, and the data that results can be so rich and dense with information that it takes a long time to review and analyse. 
And one final titbit of wisdom from a musical – always look on the bright side of life! Any knowledge is better than none, so if you manage to get any audience research or usability testing done then you're already better off than you were before.

[Update: a comment on twitter reminded me of another favourite research thing: if you don't yet have a site/app/campaign/whatever, test a competitor's!]

Does citizen science invite sabotage?

Q: Does citizen science invite sabotage?

A: No.

Ok, you may want a longer version. There's a paper on crowdsourcing competitions that has lost some important context in doing the rounds of media outlets. For example, on Australia's ABC, 'Citizen science invites sabotage':

'a study published in the Journal of the Royal Society Interface is urging caution at this time of unprecedented reliance on citizen science. It's found crowdsourced research is vulnerable to sabotage. […] MANUEL CEBRIAN: Money doesn't really matter, what matters is that you can actually get something – whether that's recognition, whether that's getting a contract, whether that's actually positioning an idea, for instance in the pro and anti-climate change debate – whenever you can actually get ahead.'

The fact that the research is studying crowdsourcing competitions, which are fundamentally different to other forms of crowdsourcing that do not have a 'winner takes all' dynamic, is not mentioned. It also does not mention the years of practical and theoretical work on task validation which makes it quite difficult for someone to get enough data past various controls to significantly alter the results of crowdsourced or citizen science projects.
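To make 'task validation' a little more concrete, here's a minimal sketch of one common control: asking several people to do the same task and only accepting an answer once enough of them agree. The function name and the thresholds (`min_votes`, `min_agreement`) are my own illustrative assumptions, not the rules of any particular project, and real projects often layer further checks on top.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_agreement=0.6):
    """Return the agreed answer for one task, or None if there is no consensus.

    The thresholds are hypothetical: at least `min_votes` independent
    contributions are needed, and the most common answer must account for
    at least `min_agreement` of them.
    """
    if len(answers) < min_votes:
        return None  # not enough independent contributions yet
    value, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return value
    return None  # contributors disagree, so the task goes back in the queue


# Four people transcribe the same date and one person enters junk.
transcriptions = ["1916", "1916", "1916", "1916", "1066"]
accepted = consensus(transcriptions)
```

A single bad entry is simply outvoted here, so someone trying to alter the results would need to control most of the contributions to each task they wanted to change.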

You can read the full paper for free, but even the title, Crowdsourcing contest dilemma, and the abstract make the very specific scope of their study clear:

Crowdsourcing offers unprecedented potential for solving tasks efficiently by tapping into the skills of large groups of people. A salient feature of crowdsourcing—its openness of entry—makes it vulnerable to malicious behaviour. Such behaviour took place in a number of recent popular crowdsourcing competitions. We provide game-theoretic analysis of a fundamental trade-off between the potential for increased productivity and the possibility of being set back by malicious behaviour. Our results show that in crowdsourcing competitions malicious behaviour is the norm, not the anomaly—a result contrary to the conventional wisdom in the area. Counterintuitively, making the attacks more costly does not deter them but leads to a less desirable outcome. These findings have cautionary implications for the design of crowdsourcing competitions.

And from the paper itself:

'We study a non-cooperative situation where two players (or firms) compete to obtain a better solution to a given task. […] The salient feature is that there is only one winner in the competition. […] In scenarios of ‘competitive’ crowdsourcing, where there is an inherent desire to hurt the opponent, attacks on crowdsourcing strategies are essentially unavoidable.'
From Crowdsourcing contest dilemma by Victor Naroditskiy, Nicholas R. Jennings, Pascal Van Hentenryck and Manuel Cebrian. Published 20 August 2014 doi: 10.1098/rsif.2014.0532 J. R. Soc. Interface 6 October 2014 vol. 11 no. 99 20140532

I don't know about you, but 'an inherent desire to hurt the opponent' doesn't sound like the kinds of cooperative crowdsourcing projects we tend to see in citizen science or cultural heritage crowdsourcing. The study is interesting, but it is not generalisable to 'crowdsourcing' as a whole.

If you're interested in crowdsourcing competitions, you may also be interested in: On the trickiness of crowdsourcing competitions: some lessons from Sydney Design from May 2013. 

Helping us fly? Machine learning and crowdsourcing

Image of a man in a flying contraption powered by birds
Moon Machine by Bernard Brussel-Smith via Serendip-o-matic

Over the past few years we've seen an increasing number of projects that take the phrase 'human-computer interaction' literally (perhaps turning 'HCI' into human-computer integration), organising tasks done by people and by computers into a unified system. One of the most obvious benefits of crowdsourcing on digital platforms has been the ability to coordinate the distribution and validation of tasks. Increasingly, data manually classified through crowdsourcing is being fed into computers to improve machine learning so that computers can learn to recognise images or words almost as well as we do. I've outlined a few projects putting this approach to work below.

This creates new challenges for the future: if fun, easy tasks like image tagging and text transcription can be done by computers, what are the implications for cultural heritage and digital humanities crowdsourcing projects that used simple tasks as the first step in public engagement? After all, Fast Company reported that 'at least one Zooniverse project, Galaxy Zoo Supernova, has already automated itself out of existence'. What impact will this have on citizen science and history communities? How might machine learning free us to fly further, taking on more interesting tasks with cultural heritage collections?

The Public Catalogue Foundation has taken tags created through Your Paintings Tagger and achieved impressive results in the art of computer image recognition: 'Using the 3.5 million or so tags provided by taggers, the research team at Oxford 'educated' image-recognition software to recognise the top tagged terms'. All paintings tagged with a particular subject (e.g. 'horse') were fed into feature extraction processes to build an 'object model' of a horse (a set of characteristics that would indicate that a horse is depicted), then tested to see whether the system could correctly tag horses.
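As a rough illustration of the general idea of learning from crowdsourced tags (and emphatically not the Oxford team's actual method, which I haven't seen), here's a toy sketch. It assumes each painting has already been reduced to a short list of numbers by some feature extraction step, averages the vectors for each tag into a crude 'object model', then suggests the tag whose model is closest to a new painting. The function names and the made-up feature values are all hypothetical.

```python
from collections import defaultdict
from math import dist


def build_object_models(tagged_features):
    """Average the feature vectors for each tag into one 'model' per tag.

    `tagged_features` is an iterable of (tag, feature_vector) pairs, where
    the vectors are assumed to come from some earlier feature extraction.
    """
    grouped = defaultdict(list)
    for tag, vector in tagged_features:
        grouped[tag].append(vector)
    return {
        tag: [sum(column) / len(vectors) for column in zip(*vectors)]
        for tag, vectors in grouped.items()
    }


def suggest_tag(models, vector):
    """Suggest the tag whose model is nearest to the new painting's vector."""
    return min(models, key=lambda tag: dist(models[tag], vector))


# Made-up two-number 'features' for paintings that taggers have labelled.
crowd_tags = [
    ("horse", [0.9, 0.1]),
    ("horse", [0.8, 0.2]),
    ("boat", [0.1, 0.9]),
    ("boat", [0.2, 0.7]),
]
models = build_object_models(crowd_tags)
suggestion = suggest_tag(models, [0.7, 0.3])
```

Real image recognition uses far richer features and models, but the loop is the same: tags from people become training data, and the results are tested against paintings people have already tagged.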

The BBC World Service archive used an 'open-source speech recognition toolkit to listen to every programme and convert it to text' and keywords, then asked people to check the correctness of the data created (Algorithms and Crowd-Sourcing for Digital Archives, see also What we learnt by crowdsourcing the World Service archive).

The CUbRIK project combines 'machine, human and social computation for multimedia search' in their technical demonstrator, HistoGraph. The SOCIAM: The Theory and Practice of Social Machines project is looking at 'a new kind of emergent, collective problem solving', including 'citizen science social machines'.

And of course the Zooniverse is working on this, most recently with Galaxy Zoo. A paper summarised on their Milky Way project blog outlines the powerful synergy between citizen scientists, professional scientists, and machine learning: 'citizens can identify patterns that machines cannot detect without training, machine learning algorithms can use citizen science projects as input training sets, creating amazing new opportunities to speed-up the pace of discovery', addressing the weakness of each approach if deployed alone.

Further reading: an early discussion of human input into machine learning is in Quinn and Bederson's 2011 Human Computation: A Survey and Taxonomy of a Growing Field. You can get a sense of the state of the field from various conference papers, including ICML ’13 Workshop: Machine Learning Meets Crowdsourcing and ICML ’14 Workshop: Crowdsourcing and Human Computing. There's also a mega-list of academic crowdsourcing conferences and workshops, though it doesn't include much on the tiny corner of the world that is crowdsourcing in cultural heritage.

Last update: March 2015. This post collects my thoughts on machine learning and human-computer integration as I finish my thesis. Do you know of examples I've missed, or implications we should consider?