Crowdsourcing as connection: a constant star over a sea of change / Établir des connexions: un invariant des projets de crowdsourcing

As I'm speaking today at an event that's mostly in French, I'm sharing my slide outline so it can be viewed at leisure, or copied and pasted into a translation tool like Google Translate.

Closing conference of the Testaments de Poilus project, Les Archives nationales de France, 25 November 2022

Crowdsourcing as connection: a constant star over a sea of change, Mia Ridge, British Library

GLAM values as a guiding star

(Or, how will AI change crowdsourcing?) My argument is that technology is changing rapidly around us, but our skills in connecting people and collections are as relevant as ever:

  • Crowdsourcing connects people and collections
  • AI is changing GLAM work
  • But the values we express through crowdsourcing can light the way forward

(GLAM – galleries, libraries, archives and museums)

A sea of change

AI-based tools can now do many crowdsourced tasks:

  • Transcribe audio; typed and handwritten text
  • Classify / label images and text – objects, concepts, 'emotions'

AI-based tools can also generate new images, text

  • Deep fakes, emerging formats – collecting and preservation challenges

AI is still work-in-progress

Automatic transcription, translation failure from this morning: 'the encephalogram is no longer the mother of weeks'

  • Results have many biases; cannot be used alone
  • White, Western, 21st century view
  • Carbon footprint
  • Expertise and resources required
  • Not easily integrated with GLAM workflows

Why bother with crowdsourcing if AI will soon be 'good enough'?

The elephant in the room; been on my mind for a couple of years now

The rise of AI means we have to think about the role of crowdsourcing in cultural heritage. Why bother if software can do it all?

Crowdsourcing brings collections to life

  • Close, engaged attention to 'obscure' collection items
  • Opportunities for lifelong learning; historical and scientific literacy
  • Gathers diverse perspectives, knowledge

Crowdsourcing as connection

Crowdsourcing in GLAMs is valuable in part because it creates connections around people and collections

  • Between volunteers and staff
  • Between people and collections
  • Between collections

Examples from the British Library

In the Spotlight: designing for productivity and engagement

Living with Machines: designing crowdsourcing projects in collaboration with data scientists that attempt to both engage the public with our research and generate research datasets. Participant comments and questions inspired new tasks, shaped our work.

How do we follow the star?

Bringing 'crowdsourcing as connection' into work with AI

Valuing 'crowdsourcing as connection'

  • Efficiency isn't everything. Participation is part of our mission
  • Help technologists and researchers understand the value in connecting people with collections
  • Develop mutual understanding of different types of data – editions, enhancement, transcription, annotation
  • Perfection isn't everything – help GLAM staff define 'data quality' in different contexts
  • Where is imperfect AI data at scale more useful than perfect but limited data?
  • 'réinjectée' – when, where, and how?
  • How do crowdsourcing and AI change work for staff?
  • How do we integrate data from different sources (AI, crowdsourcing, cataloguers), at different scales, into coherent systems?
  • How do interfaces show data provenance, confidence?

Transforming access, discovery, use

  • A single digitised item can be infinitely linked to places, people, concepts – how does this change 'discovery'?
  • What other user needs can we meet through a combination of AI, better data systems and public participation?

Thank you for your attention!

To find out more: https://bl.uk/digital https://livingwithmachines.ac.uk

Try our crowdsourcing activity: http://bit.ly/LivingWithMachines

We look forward to your questions: digitalresearch@bl.uk

Screenshot of images generated by AI, showing variations on dark blue or green seas and shining stars
Versions of image generation for the text 'a bright star over the sea'
Presenting at Les Archives nationales de France, Paris, from home

Experimenting with Mastodon

I'd signed up to mastodon.cloud during an earlier twitter kerfuffle in 2017, then with ausglam.space in January last year, and glammr.us on a whim. [Edit to add, I've taken the plunge and migrated to hcommons.social/@mia as my main account].

2008-era Nokia phone with a tweet on the screen: @miaridge 'those twitters on screen are really distracting me at #mw2008'
Tweet from Museums and the Web 2008 'complaining' about being distracted by a twitterfall (remember that?) screen

This week I've gone back and taken another look. (So that's me, me and me). The energy that's poured in must be quite disconcerting for long-term users, but making new connections and thinking differently about how I want to post on social media has been quite exhilarating. It's also been a chance to think about what twitter's meant for me in the nearly 15 years I've been posting.

I've realised how constrained my tweeting has become over time, and in particular how a sense of surveillance has sucked the joy out of posting. The idea that an employer's HR, a tabloid journalist, or someone on the lookout to take offence could seize on something and blow it up – the uncertainty about how things could be taken out of context and take on a life of their own – had a chilling effect.

[Edited to add, also I've never stopped being annoyed about the way Twitter turned 'stars' into 'likes' or hearts, then shared them into timelines, as described well in this guide to Mastodon. I also acted defensively against the worst changes in twitter – my location is set to Jordan so that trending topics are in Arabic and therefore unreadable to me (except when BTS fans take over); I use the 'latest' view if I have to use Twitter's own client; and I normally use clients that only show things that people I follow have consciously tweeted, not random 'likes'.]

15 years is a long time, and I've also had to be more thoughtful about what I post as my job titles and institutions have changed. Lots of us have grown up while on the site, and benefited hugely from the conversations, friendships, provocations and more we've found there.

Twitter completely transformed events for me – you could find like-minded folk in a crowd as talks were live tweeted. Some of those conversations have continued for years. I have fond memories of making good trouble at events like Museums and the Web (and of course the Museums Computer Group's events) with people I met via their tweets.

I'll also miss the sheer size of Twitter that made random searches so interesting. You could search on any word you liked and get so many glimpses into other lives and ways of being in the world. I've never understood the 'town square' thing but it was a brilliant coffee shop. [Edit to add: that ability to search out very specific terms is also part of the surveillance vibe – it's easy to search for terms to get upset about, or to find a tweet posted to a few hundred people and pull it out of context. Mastodon apparently only allows searches on hashtagged terms, as explained in this post, so the original poster has to consciously make a word publicly searchable.]

Over time, we've lost many voices as some people found twitter too toxic, or too time-consuming. Post-2016, it's been much harder to love a platform so full of harmful misinformation. At the moment this definitely feels like the last days of twitter, though I'm sure lots of us will keep our accounts, even if we don't go there as much.

If twitter doesn't last, thanks to everyone who's kept me entertained, changed how I think about things, commiserated, cheered me up, shared wins and losses over the years.

My IFPH panel notes, 'shared authority as work in progress'

I'm in Berlin for the International Federation for Public History 2022 #IFPH2022 conference, where I'm on a panel on 'Revisiting A Shared Authority in the Age of Digital Public History'. It's part of a working group with Thomas Cauvin (Luxembourg), Michael Frisch (United States), Serge Noiret (Italy), Mark Tebeau (United States), Mia Ridge (United Kingdom), Sharon Leon (United States), Rebecca Wingo (United States), Dominique Santana (Luxembourg), Violeta Tsenova (Luxembourg). My panel notes will make more sense in that wider context, but I'm sharing them here for reference.

Shared authority as work in progress

What does 'shared authority' mean to cultural heritage institutions? (Or GLAMs – galleries, libraries, archives and museums). The view will really depend on many factors, possibly including whether GLAM staff feel the need to do any professional gatekeeping, reserving 'library' or 'archive' professional status for themselves, much as some historians do more gatekeeping than others around who's allowed to say they're 'doing history'.

Thinking about ephemerality and what’s left of the processes of sharing authority a few years after it happens…

[Visual metaphor – think of the layers around the core of an onion. At the heart are collections, then catalogue metadata about those collections, often an additional layer of related metadata that doesn’t fit into the catalogue but is required for GLAM business, then public programmes including outreach and education, then there’s the unmediated access to collections and knowledge via social media and galleries]

I think GLAMs are getting comfortable with sharing, and shared authority. Crowdsourcing, in its many forms, is relatively common in GLAMs. Collaboration with Wikipedians of various sorts is widespread. There's a body of knowledge about co-curating exhibitions, community collecting and more, shared over conferences and publications and praxis. Texts and metadata and AV of all sorts have been created – usually *by* the public, *for* institutions.

Collaboration with other GLAMs on information standards and shared cataloguing has a long history, and those practices have moved online. [And now we’re sharing authority by putting records on Wikidata, where they can be updated by anyone]

There's something interesting in the idea of the 'catalogue' as a source of authority. GLAM cataloguing practices are shaped by the needs of organisations – keeping track of their collections, adding information from structured vocabularies, perhaps adding extensive notes and bibliographies – for internal use and for their readers (particularly for libraries and archives), and by the commercial vendors that produce the cataloguing platforms. 

Cataloguing platforms often lag behind the needs of GLAMs, and have been slow to respond to requests to include sources of information outside the organisation. That may be because some of this work in sharing authority happens outside cataloguing and registrar teams, or because there's not one single, clear way in which cataloguing systems should change to include information from the community about collection items.

Some GLAMs are more challenged than others by thinking generously about where 'authority' resides. Researchers in reading rooms or open collection stores are clearly, visibly engaged with specialist research. Their discussions with reference staff will often reveal the depths of their knowledge about specific parts of a collection. Authority is already shared between readers and staff. However, the expertise (or authority) of the same readers is not visible when they use online collections – all online visits and searches look the same in Google Analytics unless you really delve into the reports. Similarly, a crowdsourcing participant transcribing text or tagging images might be entirely new to the source materials, or have a deep familiarity with them. Their questions and comments might reveal something of this, but the data recorded by a crowdsourcing platform lacks the social cues that might be present in an in-person conversation.

In the UK, generations of funding cuts have reduced the number of specialist curators in GLAMs. These days, curators are more likely to be generalists, selected for their ability to speak eloquently about collections and grasp the shape, significance and history of a collection quickly. Looking externally for authoritative information – whether the lived experience of communities who used or still care for similar items, or specialist academic and other researchers – is common.

It's important to remember that 'crowdsourcing' is a broad term that includes 'type what you see' tasks such as transcription or correction, tasks such as free-text tagging or providing information that rely on knowledge and experience, and more involved co-creative tasks such as organising projects or analysing results. But an important part of my definition is that each task contributes towards a shared, significant goal – if data isn't recorded somewhere, it's just 'user generated content'.

For me, the value of crowdsourcing in cultural heritage is the intimate access it gives members of the public to collection items they would otherwise never encounter. As long as a project offers some way for participants to share things they've noticed, ask questions and mark items for their own use – in short, a way of reflecting on historical items – I consider that even 'simple' transcription tasks have the potential to be citizen history (or citizen science). 

The questions participants ask on my projects shape my own practice, and influence the development of new tasks and features – and in the last year helped shape an exhibition I co-curated with another museum curator. The same exhibition featured 'community comments', responses from people I or the museum have worked with over some time. Some of these comments were reflections from crowdsourcing volunteers on how their participation in the project changed how they thought about mechanisation in the 1800s (the subject of the exhibition).

Attitudes have shifted; data hasn't

However, years after folksonomies and web 2.0 were big news, the data the public creates through crowdsourcing is still difficult to integrate with existing catalogues. Flickr Commons, Omeka, Wikidata, Zooniverse and other platforms might hold information that would make collections more discoverable online, but it’s not easy to link data from those platforms to internal systems. That is in part because GLAM catalogues struggle with the granularity of digitised items – catalogues can help you order a book or archive box to a reading room, but they can't as easily store tags or research notes about what's on a particular page of that item. It's also in part because data nearly always needs reviewing and transforming before ingest. 

But is it also because GLAMs don't take shared authority seriously enough to advocate and pay for changes to their cataloguing systems to support them recording material from the public alongside internal data? Data that isn't in 'strategic' systems is more easily left behind when platforms migrate and staff move on.

This lack of flexibility in recording information from the public also plays out in ‘traditional’ volunteering, where spreadsheets and mini-databases might be used to supplement the main catalogue. The need for import and export processes to manage volunteer data can intentionally or unintentionally create a barrier to more closely integrating different sources of authoritative information.

So authority might be shared – but when it counts, whose information is regarded as vital, as 'core', and integrated into long-term systems, and whose is left out?

I realised that for me, at heart it’s about digital preservation. If content isn't in an organisation’s digital preservation plans, or is with an organisation that isn't supported in having a digital preservation plan, is it really valued? And if content isn't valued, is authority really shared?

Diagram showing an 'onion' of data from 'core metadata' at the centre to 'additional metadata' (with arrows marked 'community content' and 'algorithmic content' pointing to it), to 'public programmes' to 'unmediated public access'

Talk notes for #AIUK on the British Library and crowdsourcing

I had a strict five minute slot for my talk in the panel on 'Reimagining the past with AI' at Turing's AI UK event today, so wrote out my notes and thought I might as well share them…

The panel blurb was 'The past shapes the present and influences the future, but the historical record isn’t straightforward, and neither are its digital representations. Join the AHRC project Living with Machines and friends on their journey to reimagine the past through AI and data science and the challenges and opportunities within.' It was a delight to chat with Dave Beavan, Mariona Coll Ardanuy, Melodee Wood and Tim Hitchcock.

My prepared talk: A bit about the British Library for those who aren't familiar with it. It's one of the two biggest libraries in the world, and it’s the national library for the UK. 
 
Its collections are vast – somewhere between 180 and 200 million collection items, including 14 million books; hundreds of terabytes of archived websites; over 600,000 bound volumes of historical newspapers (of which about 60 million pages have been digitised with partners FindMyPast so far)…
 
We've been working with crowdsourcing – which we defined as working with the public on tasks that contribute to a shared, significant goal related to cultural heritage collections or knowledge – for about a decade now. We've collected local sounds and accents around Britain, georeferenced gorgeous historical maps, matched card catalogue records in Urdu and Chinese to digital catalogue records, and brought the history of theatre across the UK to life via old playbills. 
 
Some of our crowdsourcing work is designed to help improve the discoverability of cultural heritage collections, and some, like our work with Living with Machines, is designed to build datasets to help answer wider research questions. 
 
In all cases, our work with crowdsourcing is closely aligned with the BL's mission: it helps make our shared intellectual heritage available for research, inspiration and enjoyment. 
 
We think of crowdsourcing activities as a form of digital volunteering, where participation in the task is rewarding in its own right. Our crowdsourcing projects are a platform for privileged access and deeper engagement with our digitised collections. They're an avenue for people who wouldn't normally encounter historical records close up to work with them, while helping make those items easier for others to access.
 
Through Living with Machines, we've worked out how to design tasks that fit into computational linguistic research questions and timelines… 
 
So that's all great – but… the scale of our collections is hard to ignore. At that scale, making items more accessible through individual crowdsourcing tasks such as transcribing or classifying is beyond the capacity of even the keenest crowd. Enter machine learning, human computation, human in the loop…
 
While we're keen to start building systems that combine machine learning and human input to help scale up our work, we don't want to buy into terms like 'crowdworkers' or ‘gig work’ that we see in some academic and commercial work. If crowdsourcing is a form of public engagement, as well as a productive platform for tasks, we can't think of our volunteers as 'cogs' in a system. 
 
We think that it's important to help shape the future of 'human computation' systems; to ensure that work on machine learning / AI is in alignment with Library values. We look to work that peers at the Library of Congress are doing to create human-in-the-loop systems that 'cultivate responsible practices'.
 
We want to retain the opportunities for the public to get started with simpler tasks based on historical collections, while also being careful not to 'waste clicks' by having people do tasks that computers can do faster. 
 
With Living with Machines, we've built tasks that provide opportunities for participants to think about how their classifications form training datasets for machine learning. 
 
So my questions for the next year are: how can we design human computation systems that help participants acquire new literacies and skills, while scaling up and amplifying their work?

Screenshot of Zoom view from the conference stage with a large green clock and red countdown timer
The conference 'backstage' view on Zoom