Does citizen science invite sabotage?

Q: Does citizen science invite sabotage?

A: No.

Ok, you may want a longer version. There's a paper on crowdsourcing competitions that has lost some important context while doing the rounds of media outlets. For example, on Australia's ABC, 'Citizen science invites sabotage':

'a study published in the Journal of the Royal Society Interface is urging caution at this time of unprecedented reliance on citizen science. It's found crowdsourced research is vulnerable to sabotage. […] MANUEL CEBRIAN: Money doesn't really matter, what matters is that you can actually get something – whether that's recognition, whether that's getting a contract, whether that's actually positioning an idea, for instance in the pro and anti-climate change debate – whenever you can actually get ahead.'.

The coverage doesn't mention that the research studies crowdsourcing competitions, which are fundamentally different to other forms of crowdsourcing that do not have a 'winner takes all' dynamic. Nor does it mention the years of practical and theoretical work on task validation, which makes it quite difficult for someone to get enough data past the various controls to significantly alter the results of crowdsourced or citizen science projects.

You can read the full paper for free, but even the title, Crowdsourcing contest dilemma, and the abstract make the very specific scope of their study clear:

Crowdsourcing offers unprecedented potential for solving tasks efficiently by tapping into the skills of large groups of people. A salient feature of crowdsourcing—its openness of entry—makes it vulnerable to malicious behaviour. Such behaviour took place in a number of recent popular crowdsourcing competitions. We provide game-theoretic analysis of a fundamental trade-off between the potential for increased productivity and the possibility of being set back by malicious behaviour. Our results show that in crowdsourcing competitions malicious behaviour is the norm, not the anomaly—a result contrary to the conventional wisdom in the area. Counterintuitively, making the attacks more costly does not deter them but leads to a less desirable outcome. These findings have cautionary implications for the design of crowdsourcing competitions.

And from the paper itself:

'We study a non-cooperative situation where two players (or firms) compete to obtain a better solution to a given task. […] The salient feature is that there is only one winner in the competition. […] In scenarios of ‘competitive’ crowdsourcing, where there is an inherent desire to hurt the opponent, attacks on crowdsourcing strategies are essentially unavoidable.'
From 'Crowdsourcing contest dilemma' by Victor Naroditskiy, Nicholas R. Jennings, Pascal Van Hentenryck and Manuel Cebrian. J. R. Soc. Interface vol. 11 no. 99, 20140532, 6 October 2014 (published online 20 August 2014), doi:10.1098/rsif.2014.0532

I don't know about you, but 'an inherent desire to hurt the opponent' doesn't sound like the kinds of cooperative crowdsourcing projects we tend to see in citizen science or cultural heritage crowdsourcing. The study is interesting, but it is not generalisable to 'crowdsourcing' as a whole.

If you're interested in crowdsourcing competitions, you may also be interested in: On the trickiness of crowdsourcing competitions: some lessons from Sydney Design from May 2013. 

Helping us fly? Machine learning and crowdsourcing

Image of a man in a flying contraption powered by birds
Moon Machine by Bernard Brussel-Smith via Serendip-o-matic

Over the past few years we've seen an increasing number of projects that take the phrase 'human-computer interaction' literally (perhaps turning 'HCI' into human-computer integration), organising tasks done by people and by computers into a unified system. One of the most obvious benefits of crowdsourcing on digital platforms has been the ability to coordinate the distribution and validation of tasks. Increasingly, data manually classified through crowdsourcing is being fed into computers to improve machine learning so that computers can learn to recognise images or words almost as well as we do. I've outlined a few projects putting this approach to work below.

This creates new challenges for the future: if fun, easy tasks like image tagging and text transcription can be done by computers, what are the implications for cultural heritage and digital humanities crowdsourcing projects that used simple tasks as the first step in public engagement? After all, Fast Company reported that 'at least one Zooniverse project, Galaxy Zoo Supernova, has already automated itself out of existence'. What impact will this have on citizen science and history communities? How might machine learning free us to fly further, taking on more interesting tasks with cultural heritage collections?

The Public Catalogue Foundation has taken tags created through Your Paintings Tagger and achieved impressive results in the art of computer image recognition: 'Using the 3.5 million or so tags provided by taggers, the research team at Oxford "educated" image-recognition software to recognise the top tagged terms'. All paintings tagged with a particular subject (e.g. 'horse') were fed into feature extraction processes to build an 'object model' of a horse (a set of characteristics that would indicate that a horse is depicted), then tested to see whether the system could correctly tag horses.
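To make the mechanics concrete, here's a toy sketch of that general pattern – crowdsourced tags used as training labels for a classifier over image features. It is emphatically not the Oxford team's pipeline: the random numbers below stand in for whatever a real feature-extraction step would produce, and the model is the simplest one that illustrates the idea.

```python
# Toy sketch: crowdsourced tags become training labels, image feature
# vectors become inputs, and a classifier learns an 'object model'.
# (Illustrative only; not the Oxford team's actual pipeline.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))        # 200 paintings, 64 features each
tagged_horse = rng.integers(0, 2, size=200)  # 1 = taggers applied 'horse'

model = LogisticRegression(max_iter=1000).fit(features, tagged_horse)

# The trained model can now suggest the 'horse' tag for untagged
# paintings, for human taggers to confirm or reject.
new_painting = rng.normal(size=(1, 64))
print(model.predict_proba(new_painting))  # [[p(no horse), p(horse)]]
```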

The BBC World Service archive used an 'open-source speech recognition toolkit to listen to every programme and convert it to text' and keywords, then asked people to check the correctness of the data created (Algorithms and Crowd-Sourcing for Digital Archives; see also What we learnt by crowdsourcing the World Service archive).
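The checking step follows a familiar pattern: machine output is only accepted once enough people have confirmed it. A minimal sketch of that pattern (my illustration, not the BBC's actual system):

```python
# Sketch of the 'people check the machine's output' pattern: keep a
# machine-generated keyword only once enough reviewers confirm it.
from collections import defaultdict

votes = defaultdict(lambda: {"yes": 0, "no": 0})

def record_vote(keyword, is_correct):
    votes[keyword]["yes" if is_correct else "no"] += 1

def accepted_keywords(min_votes=3, min_ratio=0.8):
    """Keywords with at least min_votes reviews, mostly positive."""
    kept = []
    for keyword, tally in votes.items():
        total = tally["yes"] + tally["no"]
        if total >= min_votes and tally["yes"] / total >= min_ratio:
            kept.append(keyword)
    return kept

for keyword, ok in [("cricket", True), ("cricket", True), ("cricket", True),
                    ("crickets", True), ("crickets", False), ("crickets", False)]:
    record_vote(keyword, ok)
print(accepted_keywords())  # ['cricket']
```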

The CUbRIK project combines 'machine, human and social computation for multimedia search' in their technical demonstrator, HistoGraph. The SOCIAM: The Theory and Practice of Social Machines project is looking at 'a new kind of emergent, collective problem solving', including 'citizen science social machines'.

And of course the Zooniverse is working on this, most recently with Galaxy Zoo. A paper summarised on their Milky Way Project blog outlines the powerful synergy between citizen scientists, professional scientists and machine learning: 'citizens can identify patterns that machines cannot detect without training, machine learning algorithms can use citizen science projects as input training sets, creating amazing new opportunities to speed-up the pace of discovery', addressing the weakness of each approach if deployed alone.

Further reading: an early discussion of human input into machine learning is in Quinn and Bederson's 2011 Human Computation: A Survey and Taxonomy of a Growing Field. You can get a sense of the state of the field from various conference papers, including ICML ’13 Workshop: Machine Learning Meets Crowdsourcing and ICML ’14 Workshop: Crowdsourcing and Human Computing. There's also a mega-list of academic crowdsourcing conferences and workshops, though it doesn't include much on the tiny corner of the world that is crowdsourcing in cultural heritage.

Last update: March 2015. This post collects my thoughts on machine learning and human-computer integration as I finish my thesis. Do you know of examples I've missed, or implications we should consider?

Who loves your stuff? How to collect links to your site

If you've ever wondered who's using content from your site or what people find interesting, here are some ways to find out, using the Design Museum's URL as an example.

'Links to your site' via Google Webmaster Tools https://support.google.com/webmasters/answer/55281

Reddit – plug your URL in after /domain/
http://www.reddit.com/domain/designmuseum.org

Wikipedia – plug your URL in after target=
http://en.wikipedia.org/w/index.php?title=Special%3ALinkSearch&target=*.designmuseum.org
Depending on your topic coverage you may want to look at other language Wikipedias.

Pinterest – plug your URL in after /source/
http://www.pinterest.com/source/designmuseum.org/

Twitter – search for the URL with quotes around it e.g. "designmuseum.org"

If you can see one particular page shooting up in your web stats, you could try a reverse image search on TinEye to see where it's being referenced.
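If you check more than one site, these lookups are easy to script. A quick sketch that builds the inspection URLs described above for any domain:

```python
# Build 'who links to my site' lookup URLs for a given domain,
# following the URL patterns listed above.
from urllib.parse import quote

def link_check_urls(domain):
    quoted_phrase = quote(f'"{domain}"')  # Twitter search works best quoted
    return {
        "Reddit": f"https://www.reddit.com/domain/{domain}",
        "Wikipedia": ("https://en.wikipedia.org/w/index.php"
                      f"?title=Special:LinkSearch&target=*.{domain}"),
        "Pinterest": f"https://www.pinterest.com/source/{domain}/",
        "Twitter": f"https://twitter.com/search?q={quoted_phrase}",
    }

for site, url in link_check_urls("designmuseum.org").items():
    print(f"{site}: {url}")
```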

What am I missing? I'd love to hear about similar links and methods for other sites – tell me in the comments or on twitter @mia_out.

Update: in a similar vein, Tim Sherratt @wragge launched a new experiment called Trove Traces the same day, to 'explore how Trove newspapers are used' by listing pages that link to articles: http://trovespace.webfactional.com/traces/

Update 2: Desi Gonzalez @desigonz tried out some of these techniques and put together a great post on 'Thoughts on what museums can learn from Reddit, Yelp, and what @briandroitcour calls vernacular criticism'
You might also be interested in: Can you capture visitors with a steampunk arm?

The sounds of silence

I've been reading World War One diaries and letters (getting distracted by sources is an occupational hazard in my research) as I look for sample primary sources for teaching crowdsourcing at the HILT summer school in Maryland next week and for my CENDARI fellowship later this year.

I noticed one line in the Diary of William Henry Winter WWI 1915 that manages to convey a lot without directly giving any information about his opinions or relationship with this person:

'Major Saunders is supposed to be on his way back here as well but I don't know as he is coming back to our Coy, I hope not any way. We have got a good man now.'

There's nothing in the rest of the entries online that provides any further background. It may be that sections of this correspondence didn't survive, weren't held by the same person, or were edited before deposit with the library or during transcription (it's particularly hard to judge as the site doesn't have images of the original document), so this particular silence may not have been intentional.

Whatever the case, it's a good reminder that there are silences behind every piece of content. While it's an amazing time to research the lives of those caught up in WWI as more and more private and public material is digitised and shared, silences can be created in many ways – official archives privilege some voices over others, personal collections can be censored or remain tucked away in a shoebox, and large parts of people's experiences simply went unrecorded. Content hidden behind paywalls or inaccessible to search engines (whether inadvertently hidden behind a search box or through lack of text transcription or description) is effectively hushed, if not exactly silenced. Sources and information about WWI collected via community groups on Facebook may be lost the next time they change their terms and conditions, or only partially shared. Our challenge is to make the gaps and questions about what was collected visible (audible?) while also being careful not to render the undigitised or unsearchable invisible in our rush to privilege the easily-accessible.

[Update: I've just realised that Winter might not have needed to provide further context, as it seems many men in his unit were from the same region as him, and therefore his relationship with the Major may have pre-dated the war. Tacit knowledge is of course another example of the unrecorded, and one perhaps more familiar to us now than the unsayable.]

How did 'play' shape the design and experience of creating Serendip-o-matic?

Here are my notes from the Digital Humanities 2014 paper on 'Play as Process and Product' I did with Brian Croxall, Scott Kleinman and Amy Papaelias based on the work of the 2013 One Week One Tool team.

Scott has blogged his notes about the first part of our talk, Brian's notes are posted as '“If hippos be the Dude of Love…”: Serendip-o-matic at Digital Humanities 2014' and you'll see Amy's work adding serendip-o-magic design to our slides throughout our three posts.

I'm Mia, I was dev/design team lead on Serendip-o-matic, and I'll be talking about how play shaped both what you see on the front end and the process of making it.

How did play shape the process?

The playful interface was a purposeful act of user advocacy – we pushed against the academic habit of telling, not showing, which you see in some form here. We wanted to entice people to try Serendip-o-matic as soon as they saw it, so the page text, graphic design and 1 – 2 – 3 step instructions you see at the top of the front page were all designed to illustrate the ethos of the product while showing you how to get started.

How can a project based around boring things like APIs and panic be playful? Technical decision-making is usually a long, painful process in which we juggle many complex criteria. But here we had to practice 'rapid trust' in people, in languages/frameworks, in APIs, and this turned out to be a very freeing experience compared to everyday work.
[Screenshot: Serendip-o-matic, 'Let Your Sources Surprise You']
First, two definitions as background for our work…

Just in case anyone here isn't familiar with APIs: they're sets of computational functions that machines use to talk to each other. Like the bank in Monopoly, they usually have quite specific functions, like taking requests and giving out information (or taking or giving money) in response to those requests. We used APIs from major cultural heritage repositories – we gave them specific questions like 'what objects do you have related to these keywords?' and they gave us back lists of related objects.
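For the curious, here's roughly what asking that question looks like in code. This is a minimal sketch using the Digital Public Library of America's public items endpoint, not Serendip-o-matic's actual code, and 'YOUR_API_KEY' is a placeholder for a real DPLA key:

```python
# Ask a cultural heritage API 'what objects do you have related to
# these keywords?' using the DPLA items endpoint. (Minimal sketch.)
import requests

def search_dpla(keywords, api_key):
    """Return titles of DPLA items matching the given keywords."""
    response = requests.get(
        "https://api.dp.la/v2/items",
        params={"q": " ".join(keywords), "api_key": api_key},
        timeout=10,
    )
    response.raise_for_status()
    docs = response.json().get("docs", [])
    # Each 'doc' describes one object; titles may be strings or lists.
    return [doc.get("sourceResource", {}).get("title") for doc in docs]

# titles = search_dpla(["horse", "carriage"], api_key="YOUR_API_KEY")
```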
The term 'UX' is another piece of jargon. It stands for 'user experience design', which is the combination of graphical, interface and interaction design aimed at making products both easy and enjoyable to use. Here you see the beginnings of the graphic design being applied (by team member Amy) to the underlying UX related to the 1-2-3 step explanation for Serendip-o-matic.

Feed.

The 'feed' part of Serendip-o-matic parsed the text given in the front page form into simple text 'tokens' and looked for recognisable entities like people, places or dates. There's nothing inherently playful in this except that we called the system that took in and transformed the text the 'magic moustache box', for reasons lost to time (and hysteria).
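As a rough illustration of what that parsing step involves, here's a crude heuristic sketch – capitalised phrases as candidate names or places, four-digit numbers as candidate years – far simpler than real entity extraction, and not the actual 'magic moustache box' code:

```python
# Crude sketch of the 'feed' step: tokenise free text and pick out
# candidate entities (names/places and years) for later queries.
import re

def extract_query_terms(text):
    years = re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)
    names = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)
    words = re.findall(r"[a-z]{4,}", text.lower())
    return {"years": years, "names": names, "keywords": words[:10]}

print(extract_query_terms(
    "Sarah kept a diary of her travels from London to Cairo in 1898."
))
# {'years': ['1898'], 'names': ['Sarah', 'London', 'Cairo'], 'keywords': [...]}
```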

Whirl.

These terms were then mixed into database-style queries that we sent to different APIs. We focused on primary sources from museums, libraries and archives available through big cultural aggregators. Europeana and the Digital Public Library of America have similar APIs so we could get a long way quite quickly. We added Flickr Commons to the list because it has high-quality, interesting images and brought in more international content. [It also turns out this made it more useful for my own favourite use for Serendip-o-matic: finding slide or blog post images.] The results are then whirled up so there's a good mix of sources and types of results. This is the heart of the magic moustache.
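The 'good mix' can be as simple as round-robin interleaving, so no single source dominates the top of the page. A sketch of that idea (illustrative only, not Serendip-o-matic's actual code):

```python
# Sketch of the 'whirl' step: interleave per-source result lists
# round-robin so the mix isn't dominated by any one provider.
from itertools import chain, zip_longest

def whirl(results_by_source):
    interleaved = chain.from_iterable(zip_longest(*results_by_source.values()))
    return [item for item in interleaved if item is not None]

print(whirl({
    "dpla": ["dpla-1", "dpla-2", "dpla-3"],
    "europeana": ["eu-1", "eu-2"],
    "flickr_commons": ["flickr-1"],
}))
# ['dpla-1', 'eu-1', 'flickr-1', 'dpla-2', 'eu-2', 'dpla-3']
```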

Marvel.

User-focused design was key to making something complicated feel playful. Amy's designs and the Outreach team's work were a huge part of it, but UX also encompasses micro-copy (all the tiny bits of text on the page), interactions (what happened when you did anything on the site), plus loading screens, error messages and user documentation.

We knew lots of people would be looking at whatever we made because of OWOT publicity; you don't get a second shot at this so it had to make sense at a glance to cut through social media noise. (This also meant testing it for mobiles and finding time to do accessibility testing – we wanted every single one of our users to have a chance to be playful.)


Without all this work on the graphic design – the look and feel that reflected the ethos of the product – the underlying playfulness would have been invisible. This user focus also meant removing internal references and in-jokes that could confuse people, so there are no references to the 'magic moustache machine'. Instead, 'Serendhippo' emerged as a character who guided the user through the site.

But how does a magic moustache make a process playful?

The moustache was a visible signifier of play. It appeared in the first technical architecture diagram – a refusal to take our situation too seriously was embedded at the heart of the project. This sketch also shows the value of having a shared physical or visual reference – outlining the core technical structure gave people a shared sense of how different aspects of their work would contribute to the whole. After all, without structure or rules, it isn't a game.

This playfulness meant that writing code (in a new language, under pressure) could then be about making the machine more magic, not about ticking off functions on a specification document. The framing of the week as a challenge and as a learning experience allowed a lack of knowledge or the need to learn new skills to be a challenge, rather than a barrier. My role was to provide just enough structure to let the development team concentrate on the task at hand.

In a way, I performed the role of old-fashioned games master, defining the technical constraints and boundaries much as someone would police the rules of a game. Previous experience with cultural heritage APIs meant I was able to make decisions quickly rather than letting indecision or doubt become a barrier to progress. Just as games often reduce complex situations to smaller, simpler versions, reducing the complexity of problems created a game-like environment.

UX matters


Ultimately, a focus on the end user experience drove all the decisions about the backend functionality, the graphic design and micro-copy and how the site responded to the user.

It's easy to forget that every pixel, line of code or text is there either through positive decisions or decisions not consciously taken. User experience design processes usually involve lots of conversation, questions, analysis, more questions, but at OWOT we didn't have that time, so the trust we placed in each other to make good decisions and in the playful vision for Serendip-o-matic created space for us to focus on creating a good user experience. The whole team worked hard to make sure every aspect of the design helps people on the site understand our vision so they can get on with exploring and enjoying Serendip-o-matic.

Some possible real-life lessons I didn't include in the paper

One Week One Tool was an artificial environment, but here are some thoughts on lessons that could be applied to other projects:

  • Conversations trump specifications and showing trumps telling; use any means you can to make sure you're all talking about the same thing. Find ways to create a shared vision for your project, whether on mood boards, technical diagrams, user stories, imaginary product boxes. 
  • Find ways to remind yourself of the real users your product will delight and let empathy for them guide your decisions. It doesn't matter how much you love your content or project, you're only doing right by it if other people encounter it in ways that make sense to them so they can love it too (there's a lot of UXy work on 'on-boarding' out there to help with this). User-centred design means understanding where users are coming from, not designing based on popular opinion. You can use tools like customer journey maps to understand the whole cycle of people finding their way to and using your site (I guess I did this and various other UXy methods without articulating them at the time). 
  • Document decisions and take screenshots as you go so that you've got a history of your project – some of this can be done by archiving task lists and user stories. 
  • Having someone who really understands the types of audiences, tools and materials you're working with helps – if you can't get that on your team, find others to ask for feedback – they may be able to save you lots of time and pain.
  • Design and UX resources really do make a difference, and it's even better if those skills are available throughout the agile development process.

What are the hidden costs when you attend an event?

I think quite hard about how to make Museums Computer Group events as inclusive as possible, from the diversity of the speakers on stage, to setting dates and times as early as possible to allow cheaper pre-booked travel, to keeping event costs down and more, but there's always more to learn.

I've been thinking about the 'shadow' or hidden costs accrued when people attend events. For me, it's the cost of getting to London (up to £50 if it's at short notice) and the time it takes (up to 3 hours each way if I'm unlucky). For others, accessibility requirements add to the cost of events, whether that's sign language translators, taxis to accessible train stations, or someone else's time as an aide. For parents or people with other caring responsibilities, childcare costs may add to the expense of attending an event. This in turn affects our ability to put together a broad range of speakers for an event. So –

Hello parents in the UK! I'm thinking about hidden costs for speakers turning up to an event. How much does a day's childcare cost you?
— Mia (@mia_out) June 15, 2014

I'm asking parents in the UK for a rough estimate of childcare costs for a day. You can share yours by tweeting @mia_out or share anonymously via this form if 140 characters won't allow you to mention things like your location, number and age of kids: What are the hidden costs when you attend an event?* The second question on the form is more general, so if your costs have nothing to do with parenting, go for it! I'll share the answers so that other event organisers have a sense of the costs too.

Here are some responses to get you started – with thanks to those who've already shared their costs:

@otfrom @mia_out @JeniT £12 / hr for 2. A full day is usually 100 plus. Quite difficult to justify often esp w/ travel.
— Yodit Stanton (@yoditstanton) June 15, 2014

about £40 a day in the West Midlands.
— Andrew Fray (@tenpn) June 15, 2014

@mia_out childcare is £3/hr for a 5 year old. Was 330/mth as a child for 5 hrs a day.
— Mick Brennan (@lightzenton) June 15, 2014

We're a volunteer Committee rather than professional events organisers, and there's a humbling amount to learn from people out there. What hidden costs have I missed? Are there factors apart from cost that we should consider? We've got a Call for Papers for November 7's UKMW14: Museums Beyond the Web open at the moment (until June 30, 2014) – is there any language on that CfP or our Guidance for Speakers we should look at?

Update – more responses below.

@mia_out assuming you can actually find a reliable (qualified?) babysitter for a whole day or two, then min. wage c£6 per hour at very least
— Internet Archaeology (@IntarchEditor) June 15, 2014

@mia_out nursery (under 5s) costs might be between £30-50 per day but of course they won't do weekends
— Internet Archaeology (@IntarchEditor) June 15, 2014

@mia_out day would cost us £90 or so, but not always poss; last conf my wife presented at I took leave and came along to look after babies.
— Jakob Whitfield (@thrustvector) June 15, 2014

@mia_out @otfrom Nurseries in East Dulwich vary, but since they're all oversubscribed 1500/month isn't outrageous.
— JulianBirch (@JulianBirch) June 15, 2014

@mia_out That's 8-6 including meals. Costs typically go down after age 2.But plenty of people can't find childcare at all and give up work.
— JulianBirch (@JulianBirch) June 15, 2014

@mia_out About £50 a day. We’re lucky in that our nursery will usually have space if we need an extra day.
— suzicatherine (@suzicatherine) June 15, 2014

The daily childcare costs (£40 pc,pd) are usually factored into a working day @mia_out The early drop off and/or late pick up can be tricky!
— Kathryn Eccles (@KathrynEccles) June 15, 2014

I'd been thinking of single-day events and the impact on speaker availability, but I was reminded of the impact of childcare and other responsibilities for people wanting to attend residential programmes or longer events (or day events that require an overnight stay to fit the travel in). For example:

@mia_out can't bring pets to conf., and dog walkers are not cheap #hiddencost
— Scott (@moltude) June 16, 2014

The ability to attend residential events for career or research fellowships is obviously going to have an impact on the types of people we see in leadership positions in later years, so thinking about things like childcare (which might be as simple as providing space for someone who already helps look after the family) now would make a positive difference later. On the positive side, many fellowships provide honorariums, which could help cover the hidden costs many of you have shared with me.

* I'm experimenting with typeform but already I'm concerned that their forms don't seem accessible – how are they for you?

Piloting a Participatory History Commons

I've been awarded a CENDARI Visiting Research Fellowship at Trinity College Dublin for a project called 'Bridging collections with a participatory Commons: a pilot with World War One archives'. I've posted my proposal at the link above, and when I start in September I'll post about my progress here. CENDARI have now published the list of all 2014 Fellows and a neat summary of the programme: 'The CENDARI Visiting Research Fellowships are intended to support and stimulate historical research in the two pilot areas of medieval European culture and the First World War, by facilitating access to key archives, specialist knowledge and collections in CENDARI host institutions'.

As I said in my post, 'it's an ambitious project which requires tackling community building, user experience design, historical materials and programming, and I'll be drawing on the expertise of many people'. I'll post as I go – but first, I'd best get back to finishing up my PhD thesis!

In the meantime, here's a small collection of things I've written as I think through what a participatory commons is and how it might work: my poster and talk notes for the Herrenhausen conference and my keynote for Sharing is Caring, 'Enriching cultural heritage collections through a Participatory Commons platform: a provocation about collaborating with users'.

How can we connect museum technologists with their history?

A quick post triggered by an article on the role of domain knowledge (knowledge of a field) in critical thinking, Deep in thought:

Domain knowledge is so important because of the way our memories work. When we think, we use both working memory and long-term memory. Working memory is the space where we take in new information from our environment; everything we are consciously thinking about is held there. Long-term memory is the store of knowledge that we can call up into working memory when we need it. Working memory is limited, whereas long-term memory is vast. Sometimes we look as if we are using working memory to reason, when actually we are using long-term memory to recall. Even incredibly complex tasks that seem as if they must involve working memory can depend largely on long-term memory.

When we are using working memory to progress through a new problem, the knowledge stored in long-term memory will make that process far more efficient and successful. … The more parts of the problem that we can automate and store in long-term memory, the more space we will have available in working memory to deal with the new parts of the problem.

A few years ago I defined a 'museum technologist' as 'someone who can appropriately apply a range of digital solutions to help meet the goals of a particular museum project', and deep domain knowledge clearly has a role to play in this (also in the kinds of critical thinking that will save technologists from being unthinking cheerleaders for the newest buzzword or geek toy). 

There's a long history of hard-won wisdom, design patterns and knowledge (whether about ways not to tender for or specify software, reasons why proposed standards may or may not work, translating digital methods and timelines for departments raised on print, etc – I'm sure you all have examples) contained in the individual and collective memory of individual technologists and teams. Some of it is represented in museum technology mailing lists, blogs or conference proceedings, but the lessons learnt in the past aren't always easily discoverable by people encountering digital heritage issues for the first time. And then there's the issue of working out which knowledge relates to specific, outdated technologies and which still holds while not quashing the enthusiasm of new people with a curt 'we tried that before'…

Something in the juxtaposition of the 20th anniversary of BritPop and the annual wave of enthusiasm and discovery from the international Museums and the Web (#MW2014) conference prompted me to look at what the Museums Computer Group (MCG) and Museum Computer Network (MCN) lists were talking about in April five and ten years ago (i.e. in easily-accessible archives):

Five years ago in #musetech – open web, content distribution, virtualisation, wifi https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind0904&L=mcg&X=498A43516F310B2193 http://mcn.edu/pipermail/mcn-l/2009-April/date.html

Ten years ago in #musetech people were talking about knowledge organisation and video links with schools https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind04&L=mcg&F=&S=&X=498A43516F310B2193

Some of the conversations from that random sample are still highly relevant today, and more focused dives into various archives would probably find approaches and information that'd help people tackling current issues.

So how can we help people new to the sector find those previous conversations and get some of this long-term memory into their own working memory? Pointing people to search forms for the MCG and MCN lists is easy, some of the conference proceedings are a bit trickier (e.g. search within museumsandtheweb.com) and there's no central list of museum technology blogs that I know of. Maybe people could nominate blog posts they think stand the test of time, mindful of the risk of it turning into a popularity/recency thing?

If you're new(ish) to digital heritage, how did you find your feet? Which sites or communities helped you, and how did you find them? Or if you have a new team member, how do you help them get up to speed with museum technology? Or looking further afield, which resources would you send to someone from academia or related heritage fields who wanted to learn about building heritage resources for or with specialists and the public?

Sharing is caring keynote 'Enriching cultural heritage collections through a Participatory Commons'

Enriching cultural heritage collections through a Participatory Commons platform: a provocation about collaborating with users

Mia Ridge, Open University Contact me: @mia_out or https://miaridge.com/

[I was invited to Copenhagen to talk about my research on crowdsourcing in cultural heritage at the 3rd international Sharing is Caring seminar on April 1. I'm sharing my notes in advance to make life easier for those awesome people following along in a second or third language, particularly since I'm delivering my talk via video.]

Today I'd like to present both a proposal for something called the 'Participatory Commons', and a provocation (or conversation starter): there's a paradox in our hopes for deeper audience engagement through crowdsourcing: projects that don't grow with their participants will lose them as they develop new skills and interests and move on. This talk presents some options for dealing with this paradox and suggests a Participatory Commons provides a way to take a sector-wide view of active engagement with heritage content and redefine our sense of what it means when everybody wins.

I'd love to hear your thoughts about this – I'll be following the hashtag during the session and my contact details are above.

Before diving in, I wanted to reflect on some lessons from my work in museums on public engagement and participation.

My philosophy for crowdsourcing in cultural heritage (aka what I've learnt from making crowdsourcing games)

One thing I learnt over the past years: museums can be intimidating places. When we ask for help with things like tagging or describing our collections, people want to help but they worry about getting it wrong and looking stupid or about harming the museum.

The best technology in the world won't solve a single problem unless it's empathically designed and accompanied by social solutions. This isn't a talk about technology, it's a talk about people – what they want, what they're afraid of, how we can overcome all that to collaborate and work together.

Dora's Lost Data

So a few years ago I explored the potential of crowdsourcing games to make helping a museum less scary and more fun. In this game, 'Dora's Lost Data', players meet a junior curator who asks them to tag objects so they'll be findable in Google. Games aren't the answer to everything, but identifying barriers to participation is always important. You have to understand your audiences – their motivations for starting and continuing to participate; the fears, anxieties and uncertainties that prevent them participating. [My games were hacked together outside of work hours; more information is available at My MSc dissertation: crowdsourcing games for museums. If you'd like to see more polished metadata games, check out Tiltfactor's http://www.metadatagames.org/#games]

Mutual wins – everybody's happy

My definition of crowdsourcing: cultural heritage crowdsourcing projects ask the public to undertake tasks that cannot be done automatically, in an environment where the activities, goals (or both) provide inherent rewards for participation, and where their participation contributes to a shared, significant goal or research area.

It helps to think of crowdsourcing in cultural heritage as a form of volunteering. Participation has to be rewarding for everyone involved. That sounds simple, but focusing on the audiences' needs can be difficult when there are so many organisational needs competing for priority and limited resources for polishing the user experience. Further, as many projects discover, participant needs change over time…

What is a Participatory Commons and why would we want one?

First, I have to introduce you to some people. These are composite stories (personas) based on my research…

Two archival historians, Simone and Andre. Simone travels to archives in her semester breaks to stock up on research material, taking photos of most documents 'in case they're useful later' and transcribing key text from others. Andre is often at the next table, also looking for material for his research. The documents he collected for his last research project would be useful for Simone's current book, but they've never met and he has no way of sharing that part of his 'personal research collection' with her. Currently, each of these highly skilled researchers takes their cumulative knowledge away with them at the end of the day, leaving no trace of their work in the archive itself. Next…

Two people from a nearby village, Martha and Bob. They joined their local history society when they retired and moved to the village. They're helping find out what happened to children from the village school's class of 1898 in the lead-up to and during World War I. They are using census returns and other online documents to add records to a database the society's secretary set up in Excel. Meanwhile…

A family historian, Daniel. He has a classic 'shoebox archive' – a box containing his grandmother Sarah's letters and diary, describing her travels and everyday life at the turn of the century. He's transcribing them and wants to put them online to share with his extended family. One day he wants to make a map for his kids that shows all the places their great-grandmother lived and visited. Finally, there's…

Crowdsourcer Nisha. She has two young kids and works for a local authority. She enjoys playing games like Candy Crush on her mobile, and after the kids have gone to bed she transcribes ship logs on the Old Weather website while watching TV with her husband. She finds it relaxing, feels good about contributing to science and enjoys the glimpses of life at sea. Sites like Old Weather use 'microtasks' – tiny, easily accomplished tasks – and crowdsourcing to digitise large amounts of text.

Helping each other?

None of our friends above know it, but they're all looking at material from roughly the same time and place. Andre and Simone could help each other by sharing the documents they've collected over the years. Sarah's diaries include the names of many children from her village that would help Martha and Bob's project, and Nisha could help everyone if she transcribed sections of Sarah's diary.

Connecting everyone's efforts for the greater good: Participatory Commons

This image shows the two main aspects of the Participatory Commons: the different sources for content, and the activities that people can do with that content.

The Participatory Commons (image: Mia Ridge)

The Participatory Commons is a platform where content from different sources can be aggregated. Access to shared resources underlies the idea of the 'Commons', particularly material that is not currently suitable for sites like Europeana, like 'shoebox archives' and historians' personal record collections. So if the 'Commons' part refers to shared resources, how is it participatory?

The Participatory Commons interface supports a range of activities: the types of tasks historians typically do, like assessing and contextualising documents; activities that specialists or the public can do, like identifying particular people, places, events or things in sources; and typical crowdsourcing tasks like full-text transcription or structured tagging.

By combining the energy of crowdsourcing with the knowledge historians create on a platform that can store or link to primary sources from museums, libraries and archives with 'shoebox archives', the Commons could help make our shared heritage more accessible to all. As a platform that makes material about ordinary people available alongside official archives and as an interface for enjoyable, meaningful participation in heritage work, the Commons could be a basis for 'open source history', redressing some of the absences in official archives while improving the quality of all records.

As a work in progress, this idea of the Participatory Heritage Commons has two roles: as an academic thought experiment to frame my research, and as a provocation for GLAMs (galleries, libraries, archives and museums) to think outside their individual walls. As a vision for 'open source history', it's inspired by community archives, public history, participant digitisation and history from below… This combination of a large underlying repository and more intimate interfaces could be quite powerful. Capturing some of the knowledge generated when scholars access collections would benefit both archives and other researchers.

'Niche projects' can be built on a Participatory Commons

As a platform for crowdsourcing, the Participatory Commons provides efficiencies of scale in the backend work for verifying and validating contributions, managing user accounts, forums, etc. But that doesn't mean that each user would experience the same front-end interface.

Niche projects build on the Participatory Commons
(quick and dirty image: Mia Ridge)

My research so far suggests that tightly-focused projects are better able to motivate participants and create a sense of community. These 'niche' projects may be related to a particular location, period or topic, or to a particular type of material. The success of the New York Public Library's What's on the Menu? project, designed around a collection of historic menus, and of the British Library's Georeferencer project, designed around their historic map collection, demonstrates the value of defining projects around niche topics.

The best crowdsourcing projects use carefully designed interactions tailored to the specific content, audience and data requirements of a given project. For example, Zooniverse projects use much of the same underlying software, but each project is designed around specific tasks on specific types of material, whether classifying simple galaxy types, plankton or animals on the Serengeti, or transcribing ship logs or military diaries.

The Participatory Commons is not only a collection of content, it also allows 'niche' projects to be layered on top, presenting more focused sets of content, and specialist interfaces designed around the content, audience and purpose.

Barriers

But there are still many barriers to consider, including copyright and technical issues and important cultural issues around authority, reliability, trust, academic credit and authorship. [There's more background on this at my earlier post on historians and the Participatory Commons and Early PhD findings: Exploring historians' resistance to crowdsourced resources.]

Now I want to set the idea of the Participatory Commons aside for a moment, and return to crowdsourcing in cultural heritage. I've been looking for factors in the success or otherwise of crowdsourcing projects, from grassroots, community-led projects to big, glamorous, institutionally-led sites.

I mentioned that Nisha found transcribing text relaxing. Like many people who start transcribing text, she found herself getting interested in the events, people and places mentioned in the text. Forums or other methods for participants to discuss their questions seem to help keep participants motivated, and they also provide somewhere for a spark of curiosity to grow (as in this forum post). We know that some people on crowdsourcing projects like Old Weather get interested in history, and even start their own research projects.

Crowdsourcing as gateway to further activity

You can see that happening on other crowdsourcing projects too. For example, Herbaria@Home aims to document historical herbarium collections within museums based on photographs of specimen cards. So far participants have documented over 130,000 historic specimens. In the process, some participants also found themselves becoming interested in the people whose specimens they were documenting.

As a result, the project has expanded to include biographies of the original specimen collectors. It was able to accommodate this new interest through a project wiki, which has a combination of free text and structured data linking records between the transcribed specimen cards and individual biographies.

'Levels of Engagement' in citizen science

There's a consistent enough pattern in science crowdsourcing projects that there's a model from 'citizen science' that outlines different stages participants can move through, from undertaking simple tasks, joining in community discussion, through to 'working independently on self-identified research projects'.[1]

Is this 'mission accomplished'?

This is Nick Poole's word cloud based on 40 museum mission statements. With words like 'enjoyment', 'access' and 'learning' appearing in museum missions, doesn't this mean that turning transcribers into citizen historians while digitising and enhancing collections is a success? Well, yes, but…

Paths diverge; paradox ahead?

There's a tension between GLAMs' desire to invite people to 'go deeper', to find their own research interests, to begin to become citizen historians, and their desire to ask people to help with tasks set by GLAMs to support their own work. Heritage organisations can try to channel that impulse to start research into questions about their own collections, but sometimes it feels like we're asking people to do our homework for us. The scaffolds put in place to help make tasks easier may start to feel like a constraint.

Who has agency?

If people move beyond simple tasks into more complex tasks that require a greater investment of time and learning, then issues of agency – participants' ability to make choices about what they're working on and why – start to become more important. Would Wikipedia have succeeded if it dictated what contributors had to write about? We shouldn't mistake volunteers for a workforce just because they can be impressively dedicated contributors.

Participatory project models

Turning again to citizen science – this time public participation in scientific research – we have a model for participatory projects according to the amount of control participants have over the design of the project itself, or, to look at it another way, how much authority the organisation has ceded to the crowd. This model contains three categories: 'contributory', where the public contributes data to a project designed by the organisation; 'collaborative', where the public can help refine project design and analyse data in a project led by the organisation; and 'co-creative', where the public can take part in all or nearly all processes, and all parties design the project together.[2]

As you can imagine, truly co-creative projects are rare. It seems cultural organisations find it hard to truly collaborate with members of the public, for many understandable reasons. The level of transparency required and the investment of time for negotiating mutual interests, goals and capabilities increase as collaboration deepens. Institutional constraints and lack of time to engage in deep dialogue with participants make it difficult to find shared goals that work for all parties. It seems GLAMs sometimes try to take shortcuts and end up making decisions for the group, which means their 'co-creative' project is actually just 'collaborative'.

New challenges

When participants start to outgrow the tasks that originally got them hooked, projects face a choice. Some projects are experimenting with setting challenges for participants. Here you see 'mysteries' set by the UK's Museum of Design in Plastics and by San Francisco Public Library on Historypin. Finding the right match between the challenge set and the object can be difficult without some existing knowledge of the collection, and it can require a lot of on-going time to encourage participants. Putting the mystery under the nose of the person who has the knowledge or skills to solve it is another challenge that projects like this will have to tackle.

Working with existing communities of interest is a good start, but it also takes work to figure out where they hang out online (or in-person) and understand how they prefer to work. GLAMs sometimes fall into the trap of choosing the technology first, or trying something because it's trendy; it's better to start with the intersection between your content and the preferences of potential audiences.

But is it wishful thinking to hope that others will be interested in answering the questions GLAMs are asking?

A tension?

Should projects accept that some people will move on as they develop new interests, and concentrate on recruiting new participants to replace them? Do they try to find more interesting tasks or new responsibilities for participants, such as helping moderate discussions, or checking and validating other people's work? Or should they find ways for the project to grow as participants' skill and knowledge increase? It's important to make these decisions mindfully, as the default is otherwise to accept a level of turnover as participants move on.

To return to lessons from citizen science, possible areas for deeper involvement include choosing or defining questions for study, analysing or interpreting data and drawing conclusions, discussing results and asking new questions.[3] However, heritage organisations might have to accept that the questions people want to ask might not involve their collections, and that these citizen historians' new interests might not leave time for their previous crowdsourcing tasks.

Why is a critical mass of content in a Participatory Commons useful?

And now we return to the Participatory Commons and the question of why a critical mass of content would be useful.

Increasingly, the old divisions between museum, library and archive collections don't make sense. For most people, content is content, and they don't understand why a pamphlet about a village fete in 1898 would be described and accessed differently depending on whether it had ended up in a museum, library or archive catalogue.

Basing niche projects on a wider range of content creates opportunities for different types of tasks and levels of responsibility. Projects that provide a variety of tasks and roles can support a range of different levels and types of participant skills, availability, knowledge and experience.

A critical mass of material is also important for the discoverability of heritage content. Even the most sophisticated researcher turns to Google sometimes, and if your content doesn't come up in the first few results, many researchers will never know it exists. It's easy to say but less easy to make a reality: the easier it is to find your collections, the more likely it is that researchers will use them.

Commons as party?

More importantly, a critical mass of content in a Commons allows us to re-define 'winning'. If participation is narrowly defined as belonging to individual GLAMs, when a citizen historian moves on to a project that doesn't involve your collection it can seem like you've lost a collaborator. But the people who developed a new research interest through a project at one museum might find they end up using records from the archive down the road, and transcribing or enhancing its records during their investigation. If all the institutions in the region shared their records on the Commons or let researchers take and share photos while using their collections, the researcher has a critical mass of content for their research, and hopefully, as a side-effect, their activities will improve links between collections. If the Commons allows GLAMs to take a sector-wide view, then someone moving on to a different collection becomes a moment to celebrate, a form of graduation. In our wildest imagination, the Commons could be like a fabulous party where you never know what interesting people and things you'll discover…

To conclude – by designing platforms that allow people to collect and improve records as they work, we're helping everybody win.

Thank you! I'm looking forward to hearing your thoughts.


[1] M. Jordan Raddick et al., 'Citizen Science: Status and Research Directions for the Coming Decade', in astro2010: The Astronomy and Astrophysics Decadal Survey, vol. 2010, 2009, http://www8.nationalacademies.org/astro2010/DetailFileDisplay.aspx?id=454.

[2] Rick Bonney et al., Public Participation in Scientific Research: Defining the Field and Assessing Its Potential for Informal Science Education. A CAISE Inquiry Group Report (Washington D.C.: Center for Advancement of Informal Science Education (CAISE), July 2009), http://caise.insci.org/uploads/docs/PPSR%20report%20FINAL.pdf.

[3] Bonney et al., Public Participation in Scientific Research.


Image credits in order of appearance: Glider, Library of Congress; Great Hall, Library of Congress; Curzona Allport, Tasmanian Archive and Heritage Office; Hålanda Church, Västergötland, Sweden, Swedish National Heritage Board; Postmaster General James A. Farley During National Air Mail Week, 1938, Smithsonian Institution; Canterbury Bankstown Rugby League Football Club's third annual Ball, Powerhouse Museum.

Early PhD findings: Exploring historians' resistance to crowdsourced resources

I wrote up some early findings from my PhD research for conferences back in 2012 when I was working on questions around 'but will historians really use resources created by unknown members of the public?'. People keep asking me for copies of my notes (and I've noticed people citing an online video version which isn't ideal) and since they might be useful and any comments would help me write-up the final thesis, I thought I'd be brave and post my notes.

A million caveats apply – these were early findings; my research questions and focus have changed, and I've interviewed more historians and reviewed many more participative history projects since then; as a short paper it doesn't address methods etc.; and obviously it's only a tiny part of a huge topic… (If you're interested in crowdsourcing, you might be interested in other writing related to scholarly crowdsourcing and collaboration from my PhD, or my edited volume on 'Crowdsourcing our cultural heritage'.) So, with those health warnings out of the way, here it is. I'd love to hear from you, whether with critiques, suggestions, or just stories about how it relates to your experience. And obviously, if you use this, please cite it!

Exploring historians' resistance to crowdsourced resources

Scholarly crowdsourcing may be seen as a solution to the backlog of historical material to be digitised, but will historians really use resources created by unknown members of the public?

The Transcribe Bentham project describes crowdsourcing as 'the harnessing of online activity to aid in large scale projects that require human cognition' (Terras, 2010a). 'Scholarly crowdsourcing' is a related concept that generally seems to involve the collaborative creation of resources through collection, digitisation or transcription. Crowdsourcing projects often divide up large tasks (like digitising an archive) into smaller, more manageable tasks (like transcribing a name, a line, or a page); this method has helped digitise vast numbers of primary sources.

My doctoral research was inspired by a vision of 'participant digitisation', a form of scholarly crowdsourcing that seeks to capture the digital records and knowledge generated when researchers access primary materials, in order to openly share and re-use them. Unlike many crowdsourcing projects, which are designed around tasks performed specifically for the project, participant digitisation harnesses the transcription, metadata creation, image capture and other activities already undertaken during research and aggregates them to create re-usable collections of resources.

Research questions and concepts

When Howe clarified his original definition, stating that the 'crucial prerequisite' in crowdsourcing is 'the use of the open call format and the large network of potential laborers', a 'perfect meritocracy' based not on external qualifications but on 'the quality of the work itself', he created a challenge for traditional academic models of authority and credibility (Howe 2006, 2008). Furthermore, how does anonymity or pseudonymity (defined here as often long-standing false names chosen by users of websites) complicate the process of assessing the provenance of information on sites open to contributions from non-academics? An academic might choose to disguise their identity to mask their research activities from competing peers, from a desire to conduct early exploratory work in private or simply because their preferred username was unavailable; but when contributors are not using their real names they cannot derive any authority from their personal or institutional identity. Finally, which technical, social and scholarly contexts would encourage researchers to share (for example) their snippets of transcription created from archival documents, and to use content transcribed by others? What barriers exist to participation in crowdsourcing or prevent the use of crowdsourced content?

Methods

I interviewed academic and family/local historians about how they evaluate, use, and contribute to crowdsourced and traditional resources to investigate how a resource based on 'meritocracy' disrupts current notions of scholarly authority, reliability, trust, and authorship. These interviews aimed to understand current research practices and probe more deeply into how participants assess different types of resources, their feelings about resources created by crowdsourcing, and to discover when and how they would share research data and findings.

I sought historians investigating the same country and time period in order to have a group of participants who faced common issues with the availability and types of primary sources from early modern England. I focused on academic and 'amateur' family or local historians because I was interested in exploring the differences between them to discover which behaviours and attitudes are common to most researchers and which are particular to academics and the pressures of academia.

I recruited participants through personal networks and social media, and conducted interviews in person or over Skype. At the time of writing, 17 participants have been interviewed for up to two hours each. These results are therefore provisional: a snapshot of on-going research and analysis.

Early results

I soon discovered that citizen historians are perfect examples of Pro-Ams: 'knowledgeable, educated, committed, and networked' amateurs 'who work to professional standards' (Leadbeater and Miller, 2004; Terras, 2010b).

How do historians assess the quality of resources?

Participants often simply said they drew on their knowledge and experience when sniffing out unreliable documents or statements. When assessing secondary sources, their tacit knowledge of good research and publication practices was evident in common statements like '[I can tell because] it's the way it's written'. They also cited the presence and quality of footnotes, and the depth and accuracy of information, as important factors. Transcribed sources introduced another layer of quality assessment – researchers might assess a resource by checking for transcription errors that are often copied from one database to another. Most researchers used multiple sources to verify and document facts found in online or offline sources.
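As a toy illustration of that cross-checking, the sketch below compares the versions of one passage held by different databases. It is a deliberately simple, assumption-laden example (the database names and text are invented): unanimous agreement is consistent with accuracy, but also with an error copied between databases, which is why researchers still go back to the original.

```python
from collections import Counter

def compare_versions(versions: dict) -> list:
    """Group transcriptions of the same passage by their exact text.

    versions maps a database name to its transcription. If every database
    has identical text, that is consistent with accuracy, but also with an
    error silently copied from one database to another; disagreement is
    the signal that prompts a check against the original document.
    """
    counts = Counter(versions.values())
    return counts.most_common()

versions = {
    "database_a": "John Smyth, husbandman, of Great Tey",
    "database_b": "John Smyth, husbandman, of Great Tey",  # copied from A?
    "database_c": "John Smith, husbandman, of Great Tey",
}
for text, n in compare_versions(versions):
    print(n, text)
```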

When and how do historians share research data and findings?

It appears that, between accessing original records and publishing information, there are several key stages at which research data and findings might be shared. These stages include acquiring and transcribing records, producing visualisations like family trees and maps, publishing informal notes, and publishing synthesised content or analysis; whether a researcher passes through all of them depends on their motivation and audience. Information may change format between stages, and since many researchers claim not to share information that has not yet been sufficiently verified, some information drops out before each stage. It also appears that in the later stages the size of the potential audience increases while the level of trust required to share with them decreases.

For academics, there may be an additional, post-publication stage in which resources are regarded as 'depleted' – once they have published what they need from them, they would be happy to share them. Family historians, meanwhile, see some value in sharing versions of family trees online, or in posting the names of people they are researching to attract others looking for the same names.

Sharing is often negotiated through private channels and personal relationships. Methods of controlling sharing include showing people work in progress on a screen rather than sending it to them, and using email in preference to the sharing functionality supplied by websites – this targeted, localised sharing allows the researcher to retain a sense of control over early-stage data, and so this is one key area where identity matters. Information is often shared progressively, and access to more information depends on behaviour after the initial exchange – for example, crediting the provider in any further use of the data, or reciprocating with good data of your own.

When might historians resist sharing data?

Participants gave a range of reasons for their reluctance to share data. Being able to convey the context of creation and the qualities of the source materials is important for historians who may consider sharing their 'depleted' personal archives – not being able to provide this context means they are unlikely to share. Being able to convey information about data reliability is also important. Some information about the reliability of a piece of information is implicitly encoded in its format (for example, pencil notes in notebooks versus electronic records), in hedging phrases in the text, in the number of corroborating sources, or in a value judgement about those sources. If it is difficult to convey levels of 'certainty' about reliability when sharing data, people are less likely to share it – participants felt a sense of responsibility not to publish (even informally) information that had not been fully verified. This feeling was particularly strong among academics. Some participants confessed to sneaking forbidden photos of archival documents they had run out of time to transcribe in the archive; unsurprisingly, it is unlikely they would share those images.
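In principle, that 'certainty' could travel with the data. The following sketch shows one hypothetical way a shared record might carry an explicit reliability judgement alongside its corroborating sources, so the context is not lost in transit; none of these field names come from the research itself.

```python
from dataclasses import dataclass, field

@dataclass
class SharedFinding:
    """A piece of research data packaged with the context needed to judge it."""
    statement: str
    certainty: str  # e.g. 'verified', 'probable', 'unverified'
    corroborating_sources: list = field(default_factory=list)
    notes: str = ""  # the hedging that would otherwise be lost in sharing

finding = SharedFinding(
    statement="Alice Barton baptised 12 May 1603, St Mary's",
    certainty="probable",
    corroborating_sources=["parish register (damaged)", "bishop's transcript"],
    notes="Surname partly illegible; reading based on bishop's transcript.",
)

# A recipient can filter out anything below their own threshold:
if finding.certainty != "verified":
    print("Treat as a lead, not a publishable fact:", finding.statement)
```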

Overall, if historians do not feel they would get information of equal value back in exchange, they seem less likely to share. Professional researchers do not want to give away intellectual property, and feel that sharing data online is risky while the protocols of citation and fair use remain uncertain. Finally, researchers did not always see the point of sharing their data. Family history content was seen as too specific and personal to have value for others; academics may recognise the value of their data within their own tightly-defined circles but not realise that their records may hold information for other biographical researchers (i.e. people searching by name) or for other forms of history.

Which concerns are particular to academic historians?

Reputational risk is an issue for some academics who might otherwise share data. One researcher said: 'we are wary of others trawling through our research looking for errors or inconsistencies. […] Obviously we were trying to get things right, but if we have made mistakes we don't want to have them used against us. In some ways, the less you make available the better!'. Scholarly territoriality can be an issue – if there is another academic working on the same resources, their attitude may affect how much others share. It is also unclear how academic historians would be credited for their work if it was performed under a pseudonym that does not match the name they use in academia.

What may cause crowdsourced resources to be under-used?

In this research, 'amateur' and academic historians shared many of the same concerns about authority, reliability, and trust. The main reported cause of under-use (for all resources) is failing to provide access to the original documents alongside the transcriptions. Researchers will use almost any information as pointers or leads to further sources, but they will not publish findings based on that data unless the original documents are available or the source has been peer-reviewed. Checking transcriptions against the original is seen as 'good practice', part of a sense of responsibility 'to the world's knowledge'.

Overall, the identity of the data creator is less important than expected – for digitised versions of primary sources, reliability is vested not in the identity of the digitiser but in the source itself. Content found online is tested against a set of finely-tuned ideas about the normal range of documents, rather than against the authority of the digitiser.

Cite as:

Ridge, Mia. “Early PhD Findings: Exploring Historians’ Resistance to Crowdsourced Resources.” Open Objects, March 19, 2014. https://www.openobjects.org.uk/2014/03/early-phd-findings-exploring-historians-resistance-to-crowdsourced-resources/.

References

Howe, J. (undated). Crowdsourcing: A Definition. http://crowdsourcing.typepad.com

Howe, J. (2006). Crowdsourcing: A Definition. http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html

Howe, J. (2008). Join the Crowd: Why Do Multinationals Use Amateurs to Solve Scientific and Technical Problems? The Independent. http://www.independent.co.uk/life-style/gadgets-and-tech/features/join-the-crowd-why-do-multinationals-use-amateurs-to-solve-scientific-and-technical-problems-915658.html

Leadbeater, C. and Miller, P. (2004). The Pro-Am Revolution: How Enthusiasts Are Changing Our Economy and Society. London: Demos. http://www.demos.co.uk/files/proamrevolutionfinal.pdf

Terras, M. (2010a). Crowdsourcing Cultural Heritage: UCL's Transcribe Bentham Project. Presented at Seeing Is Believing: New Technologies for Cultural Heritage, International Society for Knowledge Organization, UCL (University College London). http://eprints.ucl.ac.uk/20157/

Terras, M. (2010b). Digital Curiosities: Resource Creation via Amateur Digitization. Literary and Linguistic Computing 25(4): 425–438. http://llc.oxfordjournals.org/cgi/doi/10.1093/llc/fqq019