digital humanities – Open Objects

From piles of material to patchwork: How do we embed the production of usable collections data into library work?

These notes were prepared for a panel discussion at the 'Always Already Computational: Collections as Data' (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I've also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called 'From libraries as patchwork to datasets as assemblages?' but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I'm focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I've condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it's not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. 'Production' is there because infrastructure and workflow is unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL's collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there's treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institution deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the 'gold standard' cataloguing already in progress. If they're made widely available, these other methods might help bridge the resourcing gaps that mean it's easier to find items from richer institutions and countries than from poorer ones.

You probably already all know this, but it's worth remembering: our collections aren't even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don't yet have digital catalogue records. Some items can't be digitised because they're too big, small or fragile for scanning or photography; others can't be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, 'data' may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements or any combination or these and more)

Image credit: https://www.flickr.com/photos/teen_s/6251107713/

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it's often as great big lumps of things. It's hard for normal scholars to process (or just unzip) 4gb of data.

Currently, datasets are often created outside normal processes, and over time they become 'stale' as they're not updated when source collections records change. And when they manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.

The BL's metadata team have experimented with 'researcher format' CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don't mean SPARQL but rather faceted browsing or structured search forms).

Image credit: screenshot from http://data.bl.uk/

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out or text and images. Sometimes it's operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it's a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they're finally recorded in a bespoke database.

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we're to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by catalogues, which by software, by external groups, etc.

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There's a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.

A lot of this work has been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should turn to turning local references into URIs and making the most of that investment.

Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can't easily record possible attributions, and there's understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we're not to be held back by the need for records to be perfectly complete before they're published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don't overly rely on our descriptions, making assumptions about the presence or absence of information when it's not appropriate.

Image credit: http://europeana.eu/portal/record/2021648/0180_N_31601.html

A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL's systems are designed around the 'deliverable unit' – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.

(I'd love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren't yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357
My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We're making a start and there's a lot of thoughtful work behind the scenes, but maybe a bit more investment is needed from research libraries to become as comfortable with data users as they are with the readers who pass through their physical doors.

Looking for (crowdsourcing) love in all the right places

One of the most important exercises in the crowdsourcing workshops I run is the 'speed dating' session. The idea is to spend some time looking at a bunch of crowdsourcing projects until you find a project you love. Finding a project you enjoy gives you a deeper insight into why other people participate in crowdsourcing, and will see you through the work required to get a crowdsourcing project going. I think making a personal connection like this helps reduce some of the cynicism I occasionally encounter about why people would volunteer their time to help cultural heritage collections. Trying lots of projects also gives you a much better sense of the types of barriers projects can accidentally put in the way of participation. It's also a good reminder that everyone is a nerd about something, and that there's a community of passion for every topic you can think of.

If you want to learn more about designing history or cultural heritage crowdsourcing projects, trying out lots of project is a great place to start. The more time you can spend on this the better – an hour is ideal – but trying just one or two projects is better than nothing. In a workshop I get people to note how a project made them feel – what they liked most and least about a project, and who they'd recommend it to. You can also note the input and output types to help build your mental database of relevant crowdsourcing projects.

The list of projects I suggest varies according to the background of workshop participants, and I'll often throw in suggestions tailored to specific interests, but here's a generic list to get you started.

10 Most Wanted http://10most.org.uk/	Research object histories
Ancient Lives http://ancientlives.org/	Humanities, language, text transcription
British Library Georeferencer http://www.bl.uk/maps/	Locating and georeferencing maps (warning: if it's running, only hard maps may be left!)
Children of the Lodz Ghetto http://online.ushmm.org/lodzchildren/	Citizen history, research
Describe Me http://describeme.museumvictoria.com.au/	Describe objects
DIY History http://diyhistory.lib.uiowa.edu/	Transcribe historical letters, recipes, diaries
Family History Transcription Project http://www.flickr.com/photos/statelibrarync/collections/	Document transcription (Flickr/Yahoo login required to comment)
Herbaria@home http://herbariaunited.org/atHome/ (for bonus points, compare it with Notes from Nature https://www.zooniverse.org/project/notes_from_nature)	Transcribing specimen sheets (or biographical research)
HistoryPin Year of the Bay 'Mysteries' https://www.historypin.org/attach/project/22-yearofthebay/mysteries/index/	Help find dates, locations, titles for historic photographs; overlay images on StreetView
iSpot http://www.ispotnature.org/	Help 'identify wildlife and share nature'
Letters of 1916 http://dh.tcd.ie/letters1916/	Transcribe letters and/or contribute letters
London Street Views 1840 http://crowd.museumoflondon.org.uk/lsv1840/	Help transcribe London business directories
Micropasts http://crowdsourced.micropasts.org/app/photomasking/newtask	Photo-masking to help produce 3D objects; also structured transcription
Museum Metadata Games: Dora http://museumgam.es/dora/	Tagging game with cultural heritage objects (my prototype from 2010)
NYPL Building Inspector http://buildinginspector.nypl.org/	A range of tasks, including checking building footprints, entering addresses
Operation War Diary http://operationwardiary.org/	Structured transcription of WWI unit diaries
Papers of the War Department http://wardepartmentpapers.org/	Document transcription
Planet Hunters http://planethunters.org/	Citizen science; review visualised data
Powerhouse Museum Collection Search http://www.powerhousemuseum.com/collection/database/menu.php	Tagging objects
Reading Experience Database http://www.open.ac.uk/Arts/RED/	Text selection, transcription, description.
Smithsonian Digital Volunteers: Transcription Center https://transcription.si.edu/	Text transcription
Tiltfactor Metadata Games http://www.metadatagames.org/	Games with cultural heritage images
Transcribe Bentham http://www.transcribe-bentham.da.ulcc.ac.uk/	History; text transcription
Trove http://trove.nla.gov.au/newspaper?q=	Correct OCR errors, transcribe text, tag or describe documents
US National Archives http://www.amara.org/en/teams/national-archives/	Transcribing videos
What's the Score at the Bodleian http://www.whats-the-score.org/	Music and text transcription, description
What's on the menu http://menus.nypl.org/	Structured transcription of restaurant menus
What's on the menu? Geotagger http://menusgeo.herokuapp.com/	Geolocating historic restaurant menus
Wikisource – random item link http://en.wikisource.org/wiki/Special:Random/Index	Transcribing texts
Worm Watch http://www.wormwatchlab.org	Citizen science; video
Your Paintings Tagger http://tagger.thepcf.org.uk/	Paintings; free-text or structured tagging

NB: crowdsourcing is a dynamic field, some sites may be temporarily out of content or have otherwise settled in transit. Some sites require registration, so you may need to find another site to explore while you're waiting for your registration email.

It's here! Crowdsourcing our Cultural Heritage is now available

My edited volume, Crowdsourcing our Cultural Heritage, is now available! My introduction (Crowdsourcing our cultural heritage: Introduction), which provides an overview of the field and outlines the contribution of the 12 chapters, is online at Ashgate's site, along with the table of contents and index. There's a 10% discount if you order online.

If you're in London on the evening of Thursday 20th November, we're celebrating with a book launch party at the UCL Centre for Digital Humanities. Register at http://crowdsourcingculturalheritage.eventbrite.co.uk.

Here's the back page blurb: "Crowdsourcing, or asking the general public to help contribute to shared goals, is increasingly popular in memory institutions as a tool for digitising or computing vast amounts of data. This book brings together for the first time the collected wisdom of international leaders in the theory and practice of crowdsourcing in cultural heritage. It features eight accessible case studies of groundbreaking projects from leading cultural heritage and academic institutions, and four thought-provoking essays that reflect on the wider implications of this engagement for participants and on the institutions themselves.

Crowdsourcing in cultural heritage is more than a framework for creating content: as a form of mutually beneficial engagement with the collections and research of museums, libraries, archives and academia, it benefits both audiences and institutions. However, successful crowdsourcing projects reflect a commitment to developing effective interface and technical designs. This book will help practitioners who wish to create their own crowdsourcing projects understand how other institutions devised the right combination of source material and the tasks for their ‘crowd’. The authors provide theoretically informed, actionable insights on crowdsourcing in cultural heritage, outlining the context in which their projects were created, the challenges and opportunities that informed decisions during implementation, and reflecting on the results.

This book will be essential reading for information and cultural management professionals, students and researchers in universities, corporate, public or academic libraries, museums and archives."

Massive thanks to the following authors of chapters for their intellectual generosity and their patience with up to five rounds of edits, plus proofing, indexing and more…

Crowdsourcing in Brooklyn, Shelley Bernstein;
Old Weather: approaching collections from a different angle, Lucinda Blaser;
‘Many hands make light work. Many hands together make merry work’: Transcribe Bentham and crowdsourcing manuscript collections, Tim Causer and Melissa Terras;
Build, analyse and generalise: community transcription of the Papers of the War Department and the development of Scripto, Sharon M. Leon;
What's on the menu?: crowdsourcing at the New York Public Library, Michael Lascarides and Ben Vershbow;
What’s Welsh for ‘crowdsourcing’? Citizen science and community engagement at the National Library of Wales, Lyn Lewis Dafis, Lorna M. Hughes and Rhian James;
Waisda?: making videos findable through crowdsourced annotations, Johan Oomen, Riste Gligorov and Michiel Hildebrand;
Your Paintings Tagger: crowdsourcing descriptive metadata for a national virtual collection, Kathryn Eccles and Andrew Greg.
Crowdsourcing: Crowding out the archivist? Locating crowdsourcing within the broader landscape of participatory archives, Alexandra Eveleigh;
How the crowd can surprise us: humanities crowdsourcing and the creation of knowledge, Stuart Dunn and Mark Hedges;
The role of open authority in a collaborative web, Lori Byrd Phillips;
Making crowdsourcing compatible with the missions and values of cultural heritage organisations, Trevor Owens.

How did 'play' shape the design and experience of creating Serendip-o-matic?

Here are my notes from the Digital Humanities 2014 paper on 'Play as Process and Product' I did with Brian Croxall, Scott Kleinman and Amy Papaelias based on the work of the 2013 One Week One Tool team.

Scott has blogged his notes about the first part of our talk, Brian's notes are posted as '“If hippos be the Dude of Love…”: Serendip-o-matic at Digital Humanities 2014' and you'll see Amy's work adding serendip-o-magic design to our slides throughout our three posts.

I'm Mia, I was dev/design team lead on Serendipomatic, and I'll be talking about how play shaped both what you see on the front end and the process of making it.

How did play shape the process?

The playful interface was a purposeful act of user advocacy – we pushed against the academic habit of telling, not showing, which you see in some form here. We wanted to entice people to try Serendipomatic as soon as they saw it, so the page text, graphic design, 1 – 2 – 3 step instructions you see at the top of the front page were all designed to illustrate the ethos of the product while showing you how to get started.

How can a project based around boring things like APIs ~~and panic~~ be playful? Technical decision-making is usually a long, painful process in which we juggle many complex criteria. But here we had to practice 'rapid trust' in people, in languages/frameworks, in APIs, and this turned out to be a very freeing experience compared to everyday work.

First, two definitions as background for our work…

Just in case anyone here isn't familiar with APIs, APIs are a set of computational functions that machines use to talk to each other. Like the bank in Monopoly, they usually have quite specific functions, like taking requests and giving out information (or taking or giving money) in response to those requests. We used APIs from major cultural heritage repositories – we gave them specific questions like 'what objects do you have related to these keywords?' and they gave us back lists of related objects.

The term 'UX' is another piece of jargon. It stands for 'user experience design', which is the combination of graphical, interface and interaction design aimed at making products both easy and enjoyable to use. Here you see the beginnings of the graphic design being applied (by team member Amy) to the underlying UX related to the 1-2-3 step explanation for Serendipomatic.

Feed.

The 'feed' part of Serendipomatic parsed text given in the front page form into simple text 'tokens' and looked for recognisable entities like people, places or dates. There's nothing inherently playful in this except that we called the system that took in and transformed the text the 'magic moustache box', for reasons lost to time (and hysteria).

Whirl.

These terms were then mixed into database-style queries that we sent to different APIs. We focused on primary sources from museums, libraries, archives available through big cultural aggregators. Europeana and the Digital Public Library of America have similar APIs so we could get a long way quite quickly. We added Flickr Commons into the list because it has high-quality, interesting images and brought in more international content. [It also turns out this made it more useful for my own favourite use for Serendipomatic, finding slide or blog post images.] The results are then whirled up so there's a good mix of sources and types of results. This is the heart of the magic moustache.

Marvel.

User-focused design was key to making something complicated feel playful. Amy's designs and the Outreach team work was a huge part of it, but UX also encompasses micro-copy (all the tiny bits of text on the page), interactions (what happened when you did anything on the site), plus loading screens, error messages, user documentation.

We knew lots of people would be looking at whatever we made because of OWOT publicity; you don't get a second shot at this so it had to make sense at a glance to cut through social media noise. (This also meant testing it for mobiles and finding time to do accessibility testing – we wanted every single one of our users to have a chance to be playful.)

Without all this work on the graphic design – the look and feel that reflected the ethos of the product – the underlying playfulness would have been invisible. This user focus also meant removing internal references and in-jokes that could confuse people, so there are no references to the 'magic moustache machine'. Instead, 'Serendhippo' emerged as a character who guided the user through the site.

But how does a magic moustache make a process playful?

The moustache was a visible signifier of play. It appeared in the first technical architecture diagram – a refusal to take our situation too seriously was embedded at the heart of the project. This sketch also shows the value of having a shared physical or visual reference – outlining the core technical structure gave people a shared sense of how different aspects of their work would contribute to the whole. After all, if there aren't any structure or rules, it isn't a game.

This playfulness meant that writing code (in a new language, under pressure) could then be about making the machine more magic, not about ticking off functions on a specification document. The framing of the week as a challenge and as a learning experience allowed a lack of knowledge or the need to learn new skills to be a challenge, rather than a barrier. My role was to provide just enough structure to let the development team concentrate on the task at hand.

In a way, I performed the role of old-fashioned games master, defining the technical constraints and boundaries much as someone would police the rules of a game. Previous experience with cultural heritage APIs meant I was able to make decisions quickly rather than letting indecision or doubt become a barrier to progress. Just as games often reduce complex situations to smaller, simpler versions, reducing the complexity of problems created a game-like environment.

UX matters

Ultimately, a focus on the end user experience drove all the decisions about the backend functionality, the graphic design and micro-copy and how the site responded to the user.

It's easy to forget that every pixel, line of code or text is there either through positive decisions or decisions not consciously taken. User experience design processes usually involve lots of conversation, questions, analysis, more questions, but at OWOT we didn't have that time, so the trust we placed in each other to make good decisions and in the playful vision for Serendipomatic created space for us to focus on creating a good user experience. The whole team worked hard to make sure every aspect of the design helps people on the site understand our vision so they can get with exploring and enjoying Serendipomatic.

Some possible real-life lessons I didn't include in the paper

One Week One Tool was an artificial environment, but here are some thoughts on lessons that could be applied to other projects:

Conversations trump specifications and showing trumps telling; use any means you can to make sure you're all talking about the same thing. Find ways to create a shared vision for your project, whether on mood boards, technical diagrams, user stories, imaginary product boxes.
Find ways to remind yourself of the real users your product will delight and let empathy for them guide your decisions. It doesn't matter how much you love your content or project, you're only doing right by it if other people encounter it in ways that make sense to them so they can love it too (there's a lot of UXy work on 'on-boarding' out there to help with this). User-centred design means understanding where users are coming from, not designing based on popular opinion.you can use tools like customer journey maps to understand the whole cycle of people finding their way to and using your site (I guess I did this and various other UXy methods without articulating them at the time).
Document decisions and take screenshots as you go so that you've got a history of your project – some of this can be done by archiving task lists and user stories.
Having someone who really understands the types of audiences, tools and materials you're working with helps – if you can't get that on your team, find others to ask for feedback – they may be able to save you lots of time and pain.
Design and UX resources really do make a difference, and it's even better if those skills are available throughout the agile development process.

How can we connect museum technologists with their history?

A quick post triggered by an article on the role of domain knowledge (knowledge of a field) in critical thinking, Deep in thought:

Domain knowledge is so important because of the way our memories work. When we think, we use both working memory and long-term memory. Working memory is the space where we take in new information from our environment; everything we are consciously thinking about is held there. Long-term memory is the store of knowledge that we can call up into working memory when we need it. Working memory is limited, whereas long-term memory is vast. Sometimes we look as if we are using working memory to reason, when actually we are using long-term memory to recall. Even incredibly complex tasks that seem as if they must involve working memory can depend largely on long-term memory.

When we are using working memory to progress through a new problem, the knowledge stored in long-term memory will make that process far more efficient and successful. … The more parts of the problem that we can automate and store in long-term memory, the more space we will have available in working memory to deal with the new parts of the problem.

A few years ago I defined a 'museum technologist' as 'someone who can appropriately apply a range of digital solutions to help meet the goals of a particular museum project', and deep domain knowledge clearly has a role to play in this (also in the kinds of critical thinking that will save technologists from being unthinking cheerleaders for the newest buzzword or geek toy).

There's a long history of hard-won wisdom, design patterns and knowledge (whether about ways not to tender for or specify software, reasons why proposed standards may or may not work, translating digital methods and timelines for departments raised on print, etc – I'm sure you all have examples) contained in the individual and collective memory of individual technologists and teams. Some of it is represented in museum technology mailing lists, blogs or conference proceedings, but the lessons learnt in the past aren't always easily discoverable by people encountering digital heritage issues for the first time. And then there's the issue of working out which knowledge relates to specific, outdated technologies and which still holds while not quashing the enthusiasm of new people with a curt 'we tried that before'…

Something in the juxtaposition of the 20th anniversary of BritPop and the annual wave of enthusiasm and discovery from the international Museums and the Web (#MW2014) conference prompted me to look at what the Museums Computer Group (MCG) and Museum Computer Network (MCN) lists were talking about in April five and ten years ago (i.e. in easily-accessible archives):

Five years ago in #musetech – open web, content distribution, virtualisation, wifi https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind0904&L=mcg&X=498A43516F310B2193 http://mcn.edu/pipermail/mcn-l/2009-April/date.html

Ten years ago in #musetech people were talking about knowledge organisation and video links with schools https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind04&L=mcg&F=&S=&X=498A43516F310B2193

Some of the conversations from that random sample are still highly relevant today, and more focused dives into various archives would probably find approaches and information that'd help people tackling current issues.

So how can we help people new to the sector find those previous conversations and get some of this long-term memory into their own working memory? Pointing people to search forms for the MCG and MCN lists is easy, some of the conference proceedings are a bit trickier (e.g. search within the museumsandtheweb.com) and there's no central list of museum technology blogs that I know of. Maybe people could nominate blog posts they think stand the test of time, mindful of the risk of it turning into a popularity/recency thing?

If you're new(ish) to digital heritage, how did you find your feet? Which sites or communities helped you, and how did you find them? Or if you have a new team member, how do you help them get up to speed with museum technology? Or looking further afield, which resources would you send to someone from academia or related heritage fields who wanted to learn about building heritage resources for or with specialists and the public?

Early PhD findings: Exploring historians' resistance to crowdsourced resources

I wrote up some early findings from my PhD research for conferences back in 2012 when I was working on questions around 'but will historians really use resources created by unknown members of the public?'. People keep asking me for copies of my notes (and I've noticed people citing an online video version which isn't ideal) and since they might be useful and any comments would help me write-up the final thesis, I thought I'd be brave and post my notes.

A million caveats apply – these were early findings, my research questions and focus have changed and I've interviewed more historians and reviewed many more participative history projects since then; as a short paper I don't address methods etc; and obviously it's only a huge part of a tiny topic… (If you're interested in crowdsourcing, you might be interested in other writing related to scholarly crowdsourcing and collaboration from my PhD, or my edited volume on 'Crowdsourcing our cultural heritage'.) So, with those health warnings out of the way, here it is. I'd love to hear from you, whether with critiques, suggestions, or just stories about how it relates to your experience. And obviously, if you use this, please cite it!

Exploring historians' resistance to crowdsourced resources

Scholarly crowdsourcing may be seen as a solution to the backlog of historical material to be digitised, but will historians really use resources created by unknown members of the public?

The Transcribe Bentham project describes crowdsourcing as 'the harnessing of online activity to aid in large scale projects that require human cognition' (Terras, 2010a). 'Scholarly crowdsourcing' is a related concept that generally seems to involve the collaborative creation of resources through collection, digitisation or transcription. Crowdsourcing projects often divide up large tasks (like digitising an archive) into smaller, more manageable tasks (like transcribing a name, a line, or a page); this method has helped digitise vast numbers of primary sources.

My doctoral research was inspired by a vision of 'participant digitization', a form of scholarly crowdsourcing that seeks to capture the digital records and knowledge generated when researchers access primary materials in order to openly share and re-use them. Unlike many crowdsourcing projects which are designed for tasks performed specifically for the project, participant digitization harnesses the transcription, metadata creation, image capture and other activities already undertaken during research and aggregates them to create re-usable collections of resources.

Research questions and concepts

When Howe clarified his original definition, stating that the 'crucial prerequisite' in crowdsourcing is 'the use of the open call format and the large network of potential laborers', a 'perfect meritocracy' based not on external qualifications but on 'the quality of the work itself', he created a challenge for traditional academic models of authority and credibility (Howe 2006, 2008). Furthermore, how does anonymity or pseudonymity (defined here as often long-standing false names chosen by users of websites) complicate the process of assessing the provenance of information on sites open to contributions from non-academics? An academic might choose to disguise their identity to mask their research activities from competing peers, from a desire to conduct early exploratory work in private or simply because their preferred username was unavailable; but when contributors are not using their real names they cannot derive any authority from their personal or institutional identity. Finally, which technical, social and scholarly contexts would encourage researchers to share (for example) their snippets of transcription created from archival documents, and to use content transcribed by others? What barriers exist to participation in crowdsourcing or prevent the use of crowdsourced content?

Methods

I interviewed academic and family/local historians about how they evaluate, use, and contribute to crowdsourced and traditional resources to investigate how a resource based on 'meritocracy' disrupts current notions of scholarly authority, reliability, trust, and authorship. These interviews aimed to understand current research practices and probe more deeply into how participants assess different types of resources, their feelings about resources created by crowdsourcing, and to discover when and how they would share research data and findings.

I sought historians investigating the same country and time period in order to have a group of participants who faced common issues with the availability and types of primary sources from early modern England. I focused on academic and 'amateur' family or local historians because I was interested in exploring the differences between them to discover which behaviours and attitudes are common to most researchers and which are particular to academics and the pressures of academia.

I recruited participants through personal networks and social media, and conducted interviews in person or on Skype. At the time of writing, 17 participants have been interviewed for up to 2 hours each. It should be noted that these results are of a provisional nature and represent a snapshot of on-going research and analysis.

Early results

I soon discovered that citizen historians are perfect examples of Pro-Ams: 'knowledgeable, educated, committed, and networked' amateurs 'who work to professional standards' (Leadbeater and Miller, 2004; Terras, 2010b).

How do historians assess the quality of resources?

Participants often simply said they drew on their knowledge and experience when sniffing out unreliable documents or statements. When assessing secondary sources, their tacit knowledge of good research and publication practices was evident in common statements like '[I can tell from] it's the way it's written'. They also cited the presence and quality of footnotes, and the depth and accuracy of information as important factors. Transcribed sources introduced another layer of quality assessment – researchers might assess a resource by checking for transcription errors that are often copied from one database to another. Most researchers used multiple sources to verify and document facts found in online or offline sources.

When and how do historians share research data and findings?

It appears that between accessing original records and publishing information, there are several key stages where research data and findings might be shared. Stages include acquiring and transcribing records, producing visualisations like family trees and maps, publishing informal notes and publishing synthesised content or analysis; whether a researcher passes through all the stages depends on their motivation and audience. Information may change formats between stages, and since many claim not to share information that has not yet been sufficiently verified, some information would drop out before each stage. It also appears that in later stages of the research process the size of the potential audience increases and the level of trust required to share with them decreases.

For academics, there may be an additional, post-publication stage when resources are regarded as 'depleted' – once they have published what they need from them, they would be happy to share them. Family historians meanwhile see some value in sharing versions of family trees online, or in posting names of people they are researching to attract others looking for the same names.

Sharing is often negotiated through private channels and personal relationships. Methods of controlling sharing include showing people work in progress on a screen rather than sending it to them and using email in preference to sharing functionality supplied by websites – this targeted, localised sharing allows the researcher to retain a sense of control over early stage data, and so this is one key area where identity matters. Information is often shared progressively, and getting access to more information depends on your behaviour after the initial exchange – for example, crediting the provider in any further use of the data, or reciprocating with good data of your own.

When might historians resist sharing data?

Participants gave a range of reasons for their reluctance to share data. Being able to convey the context of creation and the qualities of the source materials is important for historians who may consider sharing their 'depleted' personal archives – not being able to provide this means they are unlikely to share. Being able to convey information about data reliability is also important. Some information about the reliability of a piece of information is implicitly encoded in its format (for example, in pencil in notebooks versus electronic records), hedging phrases in text, in the number of corroborating sources, or a value judgement about those sources. If it is difficult to convey levels of 'certainty' about reliability when sharing data, it is less likely that people will share it – participants felt a sense of responsibility about not publishing (even informally) information that hasn't been fully verified. This was particularly strong in academics. Some participants confessed to sneaking forbidden photos of archival documents they ran out of time to transcribe in the archive; unsurprisingly it is unlikely they would share those images.

Overall, if historians do not feel they would get information of equal value back in exchange, they seem less likely to share. Professional researchers do not want to give away intellectual property, and feel sharing data online is risky because the protocols of citation and fair use are presently uncertain. Finally, researchers did not always see a point in sharing their data. Family history content was seen as too specific and personal to have value for others; academics may realise the value of their data within their own tightly-defined circles but not realise that their records may have information for other biographical researchers (i.e. people searching by name) or other forms of history.

Which concerns are particular to academic historians?

Reputational risk is an issue for some academics who might otherwise share data. One researcher said: 'we are wary of others trawling through our research looking for errors or inconsistencies. […] Obviously we were trying to get things right, but if we have made mistakes we don't want to have them used against us. In some ways, the less you make available the better!'. Scholarly territoriality can be an issue – if there is another academic working on the same resources, their attitude may affect how much others share. It is also unclear how academic historians would be credited for their work if it was performed under a pseudonym that does not match the name they use in academia.

What may cause crowdsourced resources to be under-used?

In this research, 'amateur' and academic historians shared many of the same concerns for authority, reliability, and trust. The main reported cause of under-use (for all resources) is not providing access to original documents as well as transcriptions. Researchers will use almost any information as pointers or leads to further sources, but they will not publish findings based on that data unless the original documents are available or the source has been peer-reviewed. Checking the transcriptions against the original is seen as 'good practice', part of a sense of responsibility 'to the world's knowledge'.

Overall, the identity of the data creator is less important than expected – for digitised versions of primary sources, reliability is not vested in the identity of the digitiser but in the source itself. Content found on online sites is tested against a set of finely-tuned ideas about the normal range of documents rather than the authority of the digitiser.

Cite as:

Ridge, Mia. “Early PhD Findings: Exploring Historians’ Resistance to Crowdsourced Resources.” Open Objects, March 19, 2014. https://www.openobjects.org.uk/2014/03/early-phd-findings-exploring-historians-resistance-to-crowdsourced-resources/.

References

Howe, J. (undated). Crowdsourcing: A Definition http://crowdsourcing.typepad.com

Howe, J. (2006). Crowdsourcing: A Definition. http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html

Howe, J. (2008). Join the crowd: Why do multinationals use amateurs to solve scientific and technical problems? The Independent. http://www.independent.co.uk/life-style/gadgets-and-tech/features/join-the-crowd-why-do-multinationals-use-amateurs-to-solve-scientific-and-technical-problems-915658.html

Leadbeater, C., and Miller, P. (2004). The Pro-Am Revolution: How Enthusiasts Are Changing Our Economy and Society. Demos, London, 2004. http://www.demos.co.uk/files/proamrevolutionfinal.pdf

Terras, M. (2010a) Crowdsourcing cultural heritage: UCL's Transcribe Bentham project. Presented at: Seeing Is Believing: New Technologies For Cultural Heritage. International Society for Knowledge Organization, UCL (University College London). http://eprints.ucl.ac.uk/20157/

Terras, M. (2010b). “Digital Curiosities: Resource Creation via Amateur Digitization.” Literary and Linguistic Computing 25, no. 4 (October 14, 2010): 425–438. http://llc.oxfordjournals.org/cgi/doi/10.1093/llc/fqq019

2013 in review: crowdsourcing, digital history, visualisation, and lots and lots of words

A quick and incomplete summary of my 2013 for those days when I wonder where the year went… My PhD was my main priority throughout the year, but the slow increase in word count across my thesis is probably only of interest to me and my supervisors (except where I've turned down invitations to concentrate on my PhD). Various other projects have spanned the years: my edited volume on 'Crowdsourcing our Cultural Heritage', working as a consultant on the 'Let's Get Real' project with Culture24, and I've continued to work with the Open University Digital Humanities Steering Group, ACH and to chair the Museums Computer Group.

In January (and April/June) I taught all-day workshops on 'Data Visualisation for Analysis in Scholarly Research' and 'Crowdsourcing in Libraries, Museums and Cultural Heritage Institutions' for the British Library's Digital Scholarship Training Programme.

In February I was invited to give a keynote on 'Crowd-sourcing as participation' at iSay: Visitor-Generated Content in Heritage Institutions in Leicester (my event notes). This was an opportunity to think through the impact of the 'close reading' people do while transcribing text or describing images, crowdsourcing as a form of deeper engagement with cultural heritage, and the potential for 'citizen history' this creates (also finally bringing together my museum work and my PhD research). This later became an article for Curator journal, From Tagging to Theorizing: Deepening Engagement with Cultural Heritage through Crowdsourcing (proof copy available at http://oro.open.ac.uk/39117). I also ran a workshop on 'Data visualisation for humanities researchers' with Dr. Elton Barker (one of my PhD supervisors) for the CHASE 'Going Digital' doctoral training programme.

In March I was in the US for THATCamp Feminisms in Claremont, California (my notes), to do a workshop on Data visualisation as a gateway to programming and I gave a paper on 'New Challenges in Digital History: Sharing Women's History on Wikipedia' at the Women's History in the Digital World' conference at Bryn Mawr, Philadelphia (posted as 'New challenges in digital history: sharing women's history on Wikipedia – my draft talk notes'). I also wrote an article for Museum Identity magazine, Where next for open cultural data in museums?.

In April I gave a paper, 'A thousand readers are wanted, and confidently asked for': public participation as engagement in the arts and humanities, on my PhD research at Digital Impacts: Crowdsourcing in the Arts and Humanities (see also my notes from the event), and a keynote on 'A Brief History of Open Cultural Data' at GLAM-WIKI 2013.

In May I gave an online seminar on crowdsourcing (with a focus on how it might be used in teaching undergraduates wider skills) for the NITLE Shared Academics series. I gave a short paper on 'Digital participation and public engagement' at the London Museums Group's 'Museums and Social Media' at Tate Britain on May 24, and was in Belfast for the Museums Computer Group's Spring meeting, 'Engaging Visitors Through Play' then whipped across to Venice for a quick keynote on 'Participatory Practices: Inclusion, Dialogue and Trust' (with Helen Weinstein) for the We Curate kick-off seminar at the start of June.

In June the Collections Trust and MCG organised a Museum Informatics event in York and we organised a 'Failure Swapshop' the evening before. I also went to Zooniverse's ZooCon (my notes on the citizen science talks) and to Canterbury Cathedral Archives for a CHASE event on 'Opening up the archives: Digitization and user communities'.

In July I chaired a session on Digital Transformations at the Open Culture 2013 conference in London on July 2, gave an invited lightning talk at the Digital Humanities Oxford Summer School 2013, ran a half-day workshop on 'Designing successful digital humanities crowdsourcing projects' at the Digital Humanities 2013 conference in Nebraska, and had an amazing time making what turned out to be Serendip-o-matic at the Roy Rosenzweig Center for History and New Media at George Mason University's One Week, One Tool in Fairfax, Virginia (my posts on the process), with a museumy road trip via Amtrak and Greyhound to Chicago, Cleveland, Pittsburg inbetween the two events.

In August I tidied up some talk notes for publication as 'Tips for digital participation, engagement and crowdsourcing in museums' on the London Museums Group blog.

October saw the publication of my Curator article and Creating Deep Maps and Spatial Narratives through Design with Don Lafreniere and Scott Nesbit for the International Journal of Humanities and Arts Computing, based on our work at the Summer 2012 NEH Advanced Institute on Spatial Narrative and Deep Maps: Explorations in the Spatial Humanities. (I also saw my family in Australia and finally went to MONA).

In November I presented on 'Messy understandings in code' at Speaking in Code at UVA's Scholars' Lab, Charlottesville, Virginia, gave a half-day workshop on 'Data Visualizations as an Introduction to Computational Thinking' at the University of Manchester and spoke at the Digital Humanities at Manchester conference the next day. Then it was down to London for the MCG's annual conference, Museums on the Web 2013 at Tate Modern. Later than month I gave a talk on 'Sustaining Collaboration from Afar' at Sustainable History: Ensuring today's digital history survives.

In December I went to Hannover, Germany for the Herrenhausen Conference: "(Digital) Humanities Revisited – Challenges and Opportunities in the Digital Age" where I presented on 'Creating a Digital History Commons through crowdsourcing and participant digitisation' (my lightning talk notes and poster are probably the best representation of how my PhD research on public engagement through crowdsourcing and historians' contributions to scholarly resources through participant digitisation are coming together). In final days of 2013, I went back to my old museum metadata games, and updated them to include images from the British Library and took a first pass at making them responsive for mobile and tablet devices.

Why we need to save the material experience of software objects

Conversations at last month's Sustainable History: Ensuring today's digital history survives event [my slides] (and at the pub afterwards) touched on saving the data underlying websites as a potential solution for archiving them. This is definitely better than nothing, but as a human-computer interaction researcher and advocate for material culture in historical research, I don't think it's enough.

Just as people rue the loss of the information and experiential data conveyed by the material form of objects when they're converted to digital representations – size, paper and print/production quality, marks from wear through use and manufacture, access to its affordances, to name a few – future researchers will rue the information lost if we don't regard digital interfaces and user experiences as vital information about the material form of digital content and record them alongside the data they present.

Can you accurately describe the difference between using MySpace and Facebook in their various incarnations? There's no perfect way to record the experience of using Facebook in December 2013 so it could be compared with the experience of using MySpace in 2005, but usability techniques like screen-recording software linked to eyetracking or think-aloud tests would help preserve some of the tacit knowledge and context users bring to sites alongside the look-and-feel, algorithms and treatments of data the sites present to us. It's not a perfect solution, but a recording of the interactions and designs from both sites for common tasks like finding and adding a friend would tell future researchers infinitely more about changes to social media sites over eight years than simple screenshots or static webpages. But in this case we're still missing the notifications on other people's screens, the emails and algorithmic categorisations that fan out from simple interactions like these…

Even if you don't care about history, anyone studying software – whether websites, mobile apps, digital archives, instrument panels or procedural instructions embedded in hardware – still needs solid methods for capturing the dynamic and subjective experience of using digital technologies. As Lev Manovich says in The Algorithms of Our Lives, when we use software we're "engaging with the dynamic outputs of computation; studying software culture requires us to "record and analyze interactive experiences, following individual users as they navigate a website or play a video game … to watch visitors of an interactive installation as they explore the possibilities defined by the designer—possibilities that become actual events only when the visitors act on them".

The Internet Archive does a great job, but in researching the last twenty years of internet history I'm constantly hitting the limits of their ability to capture dynamic content, let alone the nuance of interfaces. The paradox is that as more of our experiences are mediated through online spaces and the software contained within small boxy devices, we risk leaving fewer traces of our experiences than past generations.

Lighting beacons: research software engineers event and related topics

I've realised that it could be useful to share my reading at the intersection of research software engineers/cultural heritage technologist/digital humanities, so at the end I've posted some links to current discussions or useful reference points and work to provide pointers to interesting work.

But first; notes from last week's workshop for research software engineers, an event for people who 'not only develop the software, they also understand the research that it makes possible'. The organisers did a great job with the structure (and provided clear instructions on running a breakout session) – each unconference-style session had to appoint a scribe and report back to a plenary session as well as posting their notes to the group's discussion list so there's an instant archive of the event.

Discussions included:

How do you manage quality and standards in training – how do you make sure people are doing their work properly, and what are the core competencies and practices of an RSE?
How should the research community recognise the work of RSEs?
Sharing Research Software
Routes into research software development – why did you choose to be an RSE?
Do we need a RSE community?
and the closing report from the Steering Committee and group discussion on what an RSE community might be or do.

I ended up in the 'How should the research community recognise the work of RSES?' session. I like the definition we came up with: 'research software engineers span the role of researchers and software engineers. They have the domain knowledge of researchers and the development skills to be able to represent this knowledge in code'. On the other hand, if you only work as directed, you're not an RSE. This isn't about whether you make stuff, it's about how much you're shaping what you're making. The discussion also teased out different definitions of 'recognition' and how they related to people's goals and personal interests; the impact of 'short-termism' and project funding on stable careers, software quality, training and knowledge sharing. Should people cite the software they use in their research in the methods section of any publications? How do you work out and acknowledge someone's contribution to on-going or collaborative projects – and how do you account for double-domain expertise when recognising contributions made in code?

I'd written about the event before I went (in Beyond code monkeys: recognising technologists' intellectual contributions, which relates it to digital humanities and cultural heritage work) but until I was there I hadn't realised the extra challenges RSEs in science face – unlike museum technologists, science RSEs are deeply embedded in a huge variety of disciplines and can't easily swap between them.

The event was a great chance to meet people facing similar issues in their work and careers, and showed how incredibly useful the right label can be for building a community. If you work with science+software in the UK and want to help work out what a research software engineer community might be, join in the RSE discussion.

If you're reading this post, you might also be interested in:

Paul Walk's 2011 posts on The Strategic Developer: How Local Developers Can Make A Difference and The Value of Local Developers raised many of the issues discussed at the RSE event and the DevCSI ('Developer Community Supporting Innovation') initiative. There's also a Developer Contact Google Group.
On Ant-Lions and Scholar-Programmers by Doug Reside – the emphasis is on scholars rather than programmers and it's also interesting for the differences between academia in the US and UK (though it doesn't directly address them)
Alan Liu's "Why I'm In It" x 2 – Antiphonal Response to Stephan Ramsay on Digital Humanities and Cultural Criticism and Stephen Ramsey's 'Why I'm In It' – if you haven't been keeping up with the digital humanities, dive into the deep end here.
Michael Widner's Towards a Front Page for the Digital Humanities #dhthis, because it's one of many posts discussing DHThis and because I forget that not everyone lives and breaths this: 'software is a form of writing/making with its own assumptions and affordances, we have a responsibility to critique the software'

In ye olden days, beacon fires were lit on hills to send signals between distant locations. These days we have blogs.

Beyond code monkeys: recognising technologists' intellectual contributions

Two upcoming events suggest that academia is starting to recognise that specialist technologists – AKA 'research software engineers' or 'digital humanities software developers' – make intellectual contributions to research software, and further, that it is starting to realise the cost of not recognising them. In the UK, there's a 'workshop for research software engineers' on September 11; in the US there's Speaking in Code in November (which offers travel bursaries and is with ace people, so do consider applying).

But first, who are these specialist technologists, and why does it matter? The UK Software Sustainability Institute's 'workshop for research software engineers' says 'research software engineers … not only develop the software, they also understand the research that it makes possible'. In an earlier post, The Craftsperson and the Scholar, UCL's James Hetherington says a 'good scientific coder combines two characters: the scholar and the craftsperson'. Research software needs people who are both scholar – 'the archetypical researcher who is driven by a desire to understand things to their fullest capability' and craftsperson who 'desires to create and leave behind an artefact which reifies their efforts in a field': 'if you get your kicks from understanding the complex and then making a robust, clear and efficient tool, you should consider becoming a research software engineer'. A supporting piece in the Times Higher Education, 'Save your work – give software engineers a career track' points out that good developers can leave for more rewarding industries, and raises one of the key issues for engineers: not everyone wants to publish academic papers on their development work, but if they don't publish, academia doesn't know how to judge the quality of their work.

Over in the US, and with a focus on the humanities rather than science, the Scholar's Lab is running the 'Speaking in Code' symposium to highlight 'what is almost always tacitly expressed in our work: expert knowledge about the intellectual and interpretive dimensions of DH code-craft, and unspoken understandings about the relation of that work to ethics, scholarly method, and humanities theory'. In a related article, Devising New Roles for Scholars Who Can Code, Bethany Nowviskie of the Scholar's Lab discussed some of the difficulties in helping developers have their work recognised as scholarship rather than 'service work' or just 'building the plumbing':

"I have spent so much of my career working with software developers who are attached to humanities projects," she says. "Most have higher degrees in their disciplines." Unlike their professorial peers, though, they aren't trained to "unpack" their thinking in seminars and scholarly papers. "I've spent enough time working with them to understand that a lot of the intellectual codework goes unspoken," she says.

Women at work on C-47 Douglas cargo transport.
LOC image via Serendip-o-matic

Digital humanists spend a lot of time thinking about the role of 'making things' in the digital humanities but, to cross over to my other domain of interest, I think the international Museums and the Web conference's requirement for full written papers for all presentations has helped more museum technologists translate some of their tacit knowledge into written form. Everyone who wants to present their work has to find a way to write up their work, even if it's painful at the time – but once it's done, they're published as open access papers well before the conference. Museum technologists also tend to blog and discuss their work on mailing lists, which provides more opportunities to tease out tacit knowledge while creating a visible community of practice.

I wasn't at Museums and the Web 2013 but one of the sessions I was most interested in was Rich Cherry and Rob Stein's 'What’s a Museum Technologist today?' as they were going to report on the results of a survey they ran earlier this year to come up with 'a more current and useful description of our profession'. (If you're interested in the topic, my earlier posts on museum technologists include On 'cultural heritage technologists', Confluence on digital channels; technologists and organisational change?, Museum technologists redux: it's not about us, Survey results: issues facing museum technologists.) Rob's posted their slides at What is a Museum Technologist Anyway? and I'd definitely recommend you go check them out. Looking through the responses, the term 'museum technologist' seems to have broadened as more museum jobs involve creating content for or publishing on digital channels (whether web sites, mobile apps, ebooks or social media), but to me, a museum technologist isn't just someone who uses technology or social media – rather, there's a level of expertise or 'domain knowledge' across both museums and technology – and the articles above have reinforced my view that there's something unique in working so deeply across two or more disciplines. (Just to be clear: this isn't a diss for people who use social media rather than build things – there's also a world of expertise in creating content for the web and social media). Or to paraphrase James Hetherington, "if you get your kicks from understanding the complex and then making a robust, clear and efficient tool, you should consider becoming a museum technologist'.

To further complicate things, not everyone needs their work to reflect all their interests – some programmers and tech staff are happy to leave their other interests outside the office door, and leave engineering behind at the end of the day – and my recent experiences at One Week | One Tool reminded me that promiscuous interdisciplinarity can be tricky. Even when you revel in it, it's hard to remember that people wear multiple hats and can swap from production-mode to critically reflecting on the product through their other disciplinary lenses, so I have some sympathy for academics who wonder why their engineer expects their views on the relevant research topic to be heard. That said, hopefully events like these will help the research community work out appropriate ways of recognising and rewarding the contributions of researcher developers.

[Update, September 2013: I've posted brief notes and links to session reports from the research software engineers event at Lighting signals: research software engineers event and related topics.]