We have published this first version of our collaborative text to provide early access to our work, and to invite comment and discussion from anyone interested in crowdsourcing, citizen science, citizen history, digital / online volunteer projects, programmes, tools or platforms with cultural heritage collections.
I'm curious to see how much of a difference this period of open comment makes. The comments so far have been quite specific and useful, but I'd like to know where we *really* got it right, and where we could include other examples. You need a PubPub account to comment, but after that it's pretty straightforward – select text and add a comment, or comment on an entire chapter.
Having some distance from the original writing period has been useful for me – not least, the realisation that the title should have been 'perspectives on crowdsourcing in cultural heritage and digital humanities'.
With people self-isolating to slow the spread of the COVID-19 pandemic, parents and educators (as well as people looking for an art or history fix) may be looking to replace in-person trips to galleries, libraries, archives and museums* with online access to images of artefacts and information about them. GLAMs have spent decades getting some of their collections digitised and online so that you can view items and information from home.
* Collectively known as 'GLAMs' because it's a mouthful to say each time
Search a bunch of GLAM portals at once
I've made a quick 'custom search engine' so you can search most of the sites above with one Google search box. Search a range of portals that collect digitised objects, texts and media from galleries, libraries, archives and museums internationally:
Various platforms have large collections of objects from different institutions, in formats ranging from 'virtual exhibitions' or 'tours' to 'deep zooms' to catalogue-style pages about objects. I've focused on sites that include collections from multiple institutions, but this also means some of them are huge and you'll have to explore a bit to find relevant content. Try:
Things are moving fast, so let me know about other sets of links to collections, stories and tours online that'll help people staying home get their fix of history and culture, and I'll update this post. Comment below, email me, or find @mia_out on Twitter.
I came to Liverpool for the 'Festival of Maintenance', a celebration of maintainers. I'm blogging my talk notes so that I'm not just preaching to the converted in the room. As they say:
'Maintenance and repair are just as important as innovation, but sometimes these ideas seem left behind. Amidst the rapid pace of innovation, have we missed opportunities to design things so that they can be fixed?'.
Microsites and collections online: innovation and maintenance in digital
My talk was about different narratives about 'digital' in cultural heritage organisations and how they can make maintenance harder or easier to support and resource. If last year's innovation is this year's maintenance task, how do we innovate to meet changing needs while making good decisions about what to maintain? At one museum job I calculated that c.85% of my time was spent on legacy systems, leaving less than a day a week for new work, so it's a subject close to my heart.
I began with an introduction to 'What does a cultural heritage technologist do?'. I might be a digital curator now but my roots lie in creating and maintaining systems for managing and sharing collections information and interpretative knowledge. This includes making digitised items available as individual items or computationally-ready datasets. There was also a gratuitous reference to Abba to illustrate the GLAM (galleries, libraries, archives and museums) acronym.
What do galleries, libraries, archives and museums have to maintain?
Exhibition apps and audio guides. Research software. Microsites by departments including marketing, education, fundraising. Catalogues. More catalogues. Secret spreadsheets. Digital asset management systems. Collections online pulled from the catalogue. Collections online from a random database. Student projects. Glueware. Ticketing. Ecommerce. APIs. Content on social media sites, other 3rd party sites and aggregators. CMS. CRM. DRM. VR, AR, MR.
Stories considered harmful
These stories mean GLAMs aren't making the best decisions about maintaining digital resources:
It's fine for social media content to be ephemeral
'Digital' is just marketing, no-one expects it to be kept
We have limited resources, and if we spend them all maintaining things then how will we build the new cool things the Director wants?
We're a museum / gallery / library / archive, not a software development company, what do you mean we have to maintain things?
What do you mean, software decays over time? People don't necessarily know that digital products are embedded in a network of software dependencies. User expectations about performance and design also change over time.
'Digital' is just like an exhibition; once it's launched you're done. You work really hard in the lead-up to the opening, but after the opening night you're free to move onto the next thing
That person left, it doesn't matter anymore. But people outside won't know that – you can't just let things drop.
Why do these stories matter?
If you don't make conscious choices about what to maintain, you're leaving it to fate.
Today's ephemera is tomorrow's history. Organisations need to be able to tell their own history. They also need to collect digital ephemera so that we can tell the history of wider society. (Social media companies aren't archives for your photos, events and stories.)
Better stories for the future
You can't save everything: make the hard choices. Make conscious decisions about what to maintain and how you'll close the things you can't maintain. Assess the likely lifetime of a digital product before you start work and build it into the roadmap.
Plan for a graceful exit – for all stakeholders. What lessons need to be documented and shared? Do you need to let any collaborators, funders, users or fans know? Can you make it web archive ready? How can you export and document the data? How can you document the interfaces and contextual reasons for algorithmic logic?
Refresh little and often, where possible. It's a pain, but it means projects stay in institutional memory.
Build on standards, work with communities. Every collection is a special butterfly, but if you work on shared software and standards, someone else might help you maintain it. IIIF is a great example of this (see the sketch after this list).
Check whether your websites are archive-ready with archiveready.com (and nominate UK websites for the UK Web Archive)
Support GLAMs with the legislative, rights and technical challenges of collecting digital ephemera. It's hard to collect social media, websites, podcasts, games, emerging formats, but if we don't, how will we tell the story of 'now' in the future?
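Since IIIF came up: part of its appeal for maintainers is that every compliant image server answers the same URL pattern and serves plain JSON, so tooling is genuinely shareable. Here's a minimal illustrative sketch (not production code) of reading a IIIF Presentation v2 manifest in Python; the manifest URL is a placeholder, and real manifests vary in detail:

```python
import requests

# Placeholder URL - substitute any institution's IIIF Presentation v2 manifest.
MANIFEST_URL = "https://example.org/iiif/manuscript-123/manifest.json"

manifest = requests.get(MANIFEST_URL).json()

# A v2 manifest lists canvases (pages or views); each usually carries an
# image service whose URL follows the standard IIIF Image API pattern:
# {id}/{region}/{size}/{rotation}/{quality}.{format}
for canvas in manifest["sequences"][0]["canvases"]:
    label = canvas.get("label", "untitled")
    service_id = canvas["images"][0]["resource"]["service"]["@id"]
    print(label, f"{service_id}/full/!200,200/0/default.jpg")
```

Because the structure is standard, a script like this works against any IIIF-compliant collection – which is exactly how shared standards turn maintenance into a shared burden.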
And it's been on my mind a lot lately, but I didn't include it: consider the carbon footprint of cloud computing and machine learning, because we also need to maintain the planet.
In closing, I'd slightly adapt the Festival's line: 'design things so that they can be fixed or shut down when their job is done'. I'm sure I've missed some better stories that cultural institutions could tell themselves – let me know what you think!
I’ve just spent Monday and Tuesday in New York for a workshop on ‘Museums + AI’. Funded by the AHRC and led by Oonagh Murphy and Elena Villaespesa, this was the second workshop in the year-long project.
As there’s so much interest in artificial intelligence /
machine learning / data science right now, I thought I’d revive the lost art of
event blogging and share my notes. These notes are inevitably patchy, so keep
an eye out for more formal reports from the team. I’ve used ‘museum’
throughout, as in the title of the event, but many of these issues are relevant
to other collecting institutions (libraries, archives) and public venues. I’m
writing this on the Amtrak to DC so I’ve been lazy about embedding links in
text – sorry!
After a welcome from Pratt (check out their student blog https://museumsdigitalculture.prattsi.org/), Elena’s opening remarks introduced the two themes of the workshop: AI + visitor data and AI + collections data. Questions about visitor data include whether museums have the necessary data governance and processes in place; whether current ethical codes and regulations are adequate for AI; and what skills staff might need to gain visitor insights with AI. Questions about collections data include how museums can minimise algorithmic biases when interpreting collections; whether the lack of diversity in both museum and AI staff would be reflected in the results; and the implications of museums engaging with big tech companies.
Achim Koh’s talk raised many questions I’ve had as we’ve
thought about AI / machine learning in the library, including how staff
traditionally invested with the authority to talk about collections (curators,
cataloguers) would feel about machines taking on some of that work. I think
we’ve broadly moved past that at the library if we can assume that we’d work
within systems that can distinguish between ‘gold standard’ records created by
trained staff and those created by software (with crowdsourced data somewhere
in between, depending on the project).
John Stack and Jamie Unwin from the (UK) Science Museum shared some of the challenges of using pre-built commercial
models (AWS Rekognition and Comprehend) on museum collections – anything long and thin is marked as a
'weapon' – and demonstrated a nice tool for seeing 'what the machine saw' https://johnstack.github.io/what-the-machine-saw/.
They don’t currently show machine-generated tags to users, but they’re used
behind-the-scenes for discoverability. Do we need more transparency about how
search results were generated – but will machine tags ever be completely safe
to show people without vetting, even if confidence scores and software versions
are included with the tags?
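To make that concrete, here's a minimal sketch of the kind of tagging pipeline being described – my illustration, not the Science Museum's actual code. It assumes the boto3 library, configured AWS credentials, and a hypothetical image file:

```python
import boto3

# Sketch: label a digitised collection image with AWS Rekognition, keeping the
# confidence score with each machine-generated tag so tags can be vetted (or
# kept behind the scenes for discoverability) rather than shown to users raw.
rekognition = boto3.client("rekognition", region_name="us-east-1")

with open("collection_image.jpg", "rb") as f:  # hypothetical image file
    response = rekognition.detect_labels(
        Image={"Bytes": f.read()},
        MaxLabels=10,
        MinConfidence=50,  # drop very low-confidence guesses up front
    )

for label in response["Labels"]:
    # Store provenance with every tag: name, confidence, generating service.
    print(f"{label['Name']}\t{label['Confidence']:.1f}%\tAWS Rekognition")
```

Recording the confidence score and software version alongside each tag is what makes the vetting question answerable later.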
Andrew Lih talked about image classification work with the Metropolitan Museum and Wikidata which picked up on the issue of questionable tags. Wikidata has a game-based workflow for tagging items, which in addition to tools for managing vandalism or miscreants allows them to trust the ‘crowd’ and make edits live immediately. Being able to sift incorrect from correct tags is vital – but this in turn raises questions of ‘round tripping’ – should a cultural institution ingest the corrections? (I noticed this issue coming up a few times because it’s something we’ve been thinking about as we work with a volunteer creating Wikidata that will later be editable by anyone.) Andrew said that the Met project put AI more firmly into the Wikimedia ecosystem, and that more is likely to come. He closed by demonstrating how the data created could put collections in the centre of networks of information (http://w.wiki/6Bf). Keep an eye out for the Wiki Art Depiction Explorer (https://docs.google.com/presentation/d/1H87K5yjlNNivv44vHedk9xAWwyp9CF9-s0lojta5Us4/edit#slide=id.g34b27a5b18_0_435).
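If you want to poke at depiction data yourself, Wikidata's public SPARQL endpoint makes it queryable. A sketch of mine (not Andrew's tooling), using the property and item IDs as I understand them – P195 'collection', Q160236 the Met, P180 'depicts' – worth double-checking before relying on them:

```python
import requests

# Sketch: list 'depicts' (P180) statements for items in the Met's collection
# (P195 = Q160236) via the Wikidata Query Service. IDs assumed, not verified.
QUERY = """
SELECT ?itemLabel ?depictsLabel WHERE {
  ?item wdt:P195 wd:Q160236 ;   # held in the Met's collection
        wdt:P180 ?depicts .     # depicts something
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "depicts-demo/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "->", row["depictsLabel"]["value"])
```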
Jeff Steward from Harvard Art Museums gave a thoughtful talk
about how different image tagging and captioning tools (Google Vision, Imagga,
Clarifai, Microsoft Cognitive Services) saw the collections, e.g. Imagga might
talk about how fruit depicted in a painting tastes: sweet, juicy; how a bowl is
used: breakfast, celebration. Microsoft's tagging and captioning tools take different views of the same image and don't draw on each other.
Chris Alen Sula led a great session on ‘Ethical
Considerations for AI’.
That evening, we went to an event at the Cooper Hewitt for more discussion of #MuseumsAI (https://twitter.com/hashtag/MuseumsAI) and the launch of their Interaction Lab (https://www.cooperhewitt.org/interaction-lab/). Andrea Lipps and Harrison Pim’s talks reminded me of earlier discussion about holding cultural institutions to account for the decisions they make about AI, surveillance capitalism and more. Workshops like this (and the resulting frameworks) can provide the questions but senior staff must actually ask them, and pay attention to the answers. Karen Palmer’s talk got me thinking about what ‘democratising AI’ really means, and whether it’s possible to democratise something that relies on training data and access to computing power. Democratising knowledge about AI is a definite good, but should we also think about alternatives to AI that don’t involve classifications, and aren’t so closely linked to surveillance capitalism and ad tech?
The next day began with an inspiring talk from Effie Kapsalis on the Smithsonian Institution’s American Women’s History Initiative (https://womenshistory.si.edu/). They’re thinking about machine learning and collections as data to develop ethical guidelines for AI and gender, analysing representations of women in multidisciplinary collections, enhancing data at scale and infusing the web with semantic data on historical women.
Shannon Darrough, MoMA, talked about a machine learning
project with Google Arts and Culture to identify artworks in 30,000
installation photos, based on 70,000 collection images (https://moma.org/calendar/exhibitions/history/identifying-art).
It was great at 2D works, less so at 3D, installation, moving image or
performance art works. The project worked because they identified a clear
problem that machine learning could solve. His talk led to discussion about
sharing training models (i.e. once software is trained to specialise in
particular subjects, others can re-use the ‘models’ that are created), and the
alignment between tech companies’ goals (generally, shorter-term,
self-contained) and museums’ (longer-term, feeding into core systems).
I have fewer notes from talks by Lawrence Swiader (American Battlefield Trust) with good advice on human-centred processes, Juhee Park (V&A) on frameworks for thinking about AI and museums, Matthew Cock (VocalEyes) on chat bots for venue accessibility information, and Carolyn Royston and Rachel Ginsberg (on the Cooper Hewitt’s Interaction Lab), but they added to the richness of the day. My talk was on ‘operationalising AI at a national library’; my slides are online (https://www.slideshare.net/miaridge/operationalising-ai-at-a-national-library). The final activity was on ‘managing AI’, a subject that’s become close to my heart.
Notions of complexity and incompleteness are familiar in Africa; Africans frown on attempts to simplify.
How do notions of incompleteness provide food for thought in digital humanities?
Nyamnjoh decries the sense of superiority inspired by zero sum games. 'Humans are incomplete, nature is incomplete. Religious bit. No one can escape incompleteness.' (Phew! This is something of a mantra when you work with collections at scale – working in cultural institutions comes with a daily sense that the work is so large it will continue after you're just a memory. Let's embrace rather than apologise for it)
References books by Amos Tutuola
Nyamnjoh on hidden persuaders, activators. Juju as a technology of self-extension. With juju, you can extend your presence; rise beyond ordinary ways of being. But it can also be spyware. (Timely, on the day that Zoom was found to allow access to your laptop camera – this has positives and negatives)
Nyamnjoh: DH as the compositeness of being; being incomplete is something to celebrate. Proposes a scholarship of conviviality that takes in practices from different academic disciplines to make itself better.
Nyamnjoh in response to Micki K's question about history as a zero-sum game in which people argue whether something did or didn't happen: create archives that can tell multiple stories, complexify the stories that exist
How to combine new media production with DH methodologies to create kit for recording and locating in the field.
Why georeference? Situate context, comparison old and new maps, feature extraction, or exploring map complexity.
Maps Re-imagined: Digital, Informational, and Perceptional
Experimentations in Progress by Tyng-Ruey Chuang, Chih-Chuan Hsu,
Huang-Sin Syu used OpenStreetMap with historical Taiwanese maps.
Interesting base map options including ukiyo-e style https://bcfuture.github.io/tileserver/Switch.html
Oceanic Exchanges: Transnational Textual Migration And Viral Culture
Challenges: imperfect comparability of corpora – data is provided in
different ways by each data provider; no unifying ontology between
archives (no generic identification of specific items); legal
restrictions; TEI and other work hasn't been suitable for newspapers.
Limited ability to conduct research across repositories. Deep
semantic multilingual text mining remains a challenge. Political
(national) and practical organisation of archives currently determines
questions that can be asked, privileges certain kinds of enquiry.
Oceanic Exchanges project includes over 100 million pages. Corpus
exploration tool needed to support: exploring data (metadata and text);
other things that went by too quickly.
The Past, Present and Future of Digital Scholarship with Newspaper Collections
@RossiAtanassova Laurel Brake: A researcher's wish list for digitised newspaper journals pic.twitter.com/rNmuuBOFb8
@giovanni1085 @printjournalism list of existing (and very much
felt) problems/challenges for digital media history. But, there is hope
and we persevere #DH2019 pic.twitter.com/LSilbMi9vg
@juliannenyhan Crucial points by @Ajprescott about necessity of
developing critical frameworks for scholarship with digital newspapers
that assist in helping us understand how & why digital newspaper
collections take form they do & how e.g. power, bias & absence
act on and through them #dh2019 pic.twitter.com/WSfjC2aq2t
Working with historical text (digitised newspapers, books, whatever)
collections at scale has some interesting challenges and rewards.
Inspired by all the newspaper sessions? Join an emerging community of
practitioners, researchers and critical friends via this document from a
'DH2019 Lunch session – Researchers & Libraries working together on
improving digitised newspapers' https://docs.google.com/document/d/1JJJOjasuos4yJULpquXt8pzpktwlYpOKrRBrCds8r2g/edit
Another panel where I enjoyed listening and learning about a field I
haven't explored in depth. Tweet from the Q&A: 'Love the 'XR in DH:
Extended Reality in the Digital Humanities' panel responses to a
question about training students only for them to go off and get jobs in
industry: good! Industry needs diversity, PhDs need to support multiple
career paths beyond academia'
Data Science & Digital Humanities: new collaborations, new opportunities and new complexities
(me) 'I'm wondering about this dichotomy between 'new' or novel, and
'useful' or applied – is there actually a sweet spot where data
scientists can work with DH / GLAMs or should we just apply data science
methods and also offer collections for novel data science research?
Thinking of it as a scale of different aspects of 'new to applied
research' rather than a simple either/or'.
SP-19: Cultural Heritage, Art/ifacts and Institutions
“Un Manuscrit Naturellement”: Rescuing a library buried in digital sand
1979, agreement with Ministry of Culture and IRHT to digitise all
manuscripts stored in French public libraries. (Began with microfilm,
not digital). Safe, but not usable. Financial cost of preserving 40TB of
data was prohibitive, but BnF started converting TIFFs to JP2 which
made storage financially feasible. Huge investment by France in data
preservation for digitised manuscripts.
Big data cleaning and deduplication process, got rid of 1 million
files. Discovered errors in TIFF when converting to JP2. Found
inconsistencies with metadata between databases and files. 3 years to do
the prep work and clean the data!
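As a toy illustration of that kind of prep work (nothing like the BnF's actual pipeline – the directory and compression ratio are invented): hashing files catches exact duplicates, and attempting the TIFF-to-JPEG 2000 conversion is itself one way broken files surface. Pillow needs OpenJPEG support to write .jp2:

```python
import hashlib
from pathlib import Path

from PIL import Image  # Pillow; JPEG 2000 output requires OpenJPEG support

seen = {}  # content hash -> first path seen, for exact-duplicate detection

for tiff_path in Path("masters").rglob("*.tif"):  # hypothetical directory
    digest = hashlib.sha256(tiff_path.read_bytes()).hexdigest()
    if digest in seen:
        print(f"duplicate: {tiff_path} == {seen[digest]}")
        continue
    seen[digest] = tiff_path
    try:
        # The conversion doubles as a validity check: corrupt TIFFs raise
        # here, much as errors were only discovered when converting to JP2.
        with Image.open(tiff_path) as img:
            img.save(tiff_path.with_suffix(".jp2"),
                     quality_mode="rates", quality_layers=[20])  # ~20:1 lossy
    except OSError as err:
        print(f"broken TIFF: {tiff_path} ({err})")
```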
‘A project which lasts for 40 years produces a lot of
variabilities’. Needed a team, access to proper infrastructure; the
person with memory of the project was key.
A Database of Islamic Scientific Manuscripts — Challenges of Past and Future
(Following on from the last paper, digital preservation takes
continuous effort). Moving to RDF model based on CIDOC-CRM, standard
triple store database, standard ResearchSpace/Metaphactory front end.
Trying to separate the data from the software to make maintenance easier.
Analytical Edition Detection In Bibliographic Metadata; The Emerging Paradigm of Bibliographic Data Science
Tweet: Two solid papers on a database for Islamic Scientific
Manuscripts and data science work with the ESTC (English Short Title
Catalogue) plus reflections on the need for continuous investment in
digital preservation. Back on familiar curatorial / #MuseTech ground!
Lahti – Reconciling / data harmonisation for early modern books is
so complex that there are different researchers working on editions,
authors, publishers, places
Syriac Persons, Events, and Relations: A Linked Open Factoid-based Prosopography
Prosopography and factoids. His project relies heavily on authority files that http://syriaca.org/ produces. Modelling factoids in TEI; usually it’s done in relational databases.
Prosopography used to be published as snippets of narrative text about people about whom enough information was available.
Factoid – a discrete piece of prosopographical information asserted in a primary source text and sourced to that text.
Person, event and relation factoids. Researcher attribution at the
factoid level. Using TEI because (as markup around the text) it stays
close to the primary source material; can link out to controlled vocabularies.
Srophe app – an open source platform for cultural heritage data used to present their prosopographical data https://srophe.app/
Harold Short says how pleased he is to hear a project like that
taking the approach they have; TEI wasn’t available as an option when
they did the original work (seriously beautiful moment)
Why SNAP? ‘FOAF isn’t really good at describing relationships that have come about as a result of slave ownership’
Torsten Roeder: Tracing debate about a particular work through German
music magazines and daily newspapers. OCR and mass digitisation made it
easier to compose representative text corpora about specific subjects.
Authorship information isn’t available so don’t know their backgrounds
etc, means a different form of analysis. ‘Horizontal reading’ as a
metaphor for his approach. Topic modelling didn't work for what he was looking for.
Roeder's requirements: accessible digital copies of newspapers;
reliable metadata; high quality OCR or transcriptions; article borders;
some kind of segmentation; deep semantic annotation – ‘but who does
what?’ What should collection holders / access providers do, and what
should researchers do? (e.g. who should identify entities and concepts
within texts? This question was picked up in other discussion in the
session, on twitter and at an impromptu lunchtime meetup)
Zef Segal. The Periodical as a Geographical Space. Relation between
the two isn’t unidirectional. Imagined space constructed by the text and
its layout. Periodicals construct an imaginary space that refers back
to the real. Headlines, paratext, regular text. Divisions between
articles. His case study for exploring the issues: HaZefirah. (sample
slide image https://twitter.com/mia_out/status/1149581497680052224)
Nanette Rißler-Pipka, Historical Periodicals Research, Opportunities
and Limitations. The limitations she encounters as a researcher.
Building a corpus of historical periodicals for a research question
often means using sources from more than one provider of digitised
texts. Different searches, rights, structure. (The need for multiple
forms of interoperability, again)
Wants article / ad / genre classifications. For metadata, wants
bibliographical data about the title (issue, date); extractable data
(dates, names, tables of contents), provenance data (who digitised,
when?). When you download individual articles, you lose the metadata
which would be so useful for research. Open access is vital;
interoperability is important; the ability to create individual
collections across individual libraries is a wonderful dream.
Working on: expanding a query – find neighbouring terms and frequent
OCR errors. Overview of query: where and when is it? Whole corpus has
been processed with topic modelling.
Complex queries: help me find the mention of places, countries,
person in a particular thematic context. Can save to collection or
export for further processing.
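A cheap way to approximate 'neighbouring terms and frequent OCR errors' (not necessarily how this project does it) is to train word embeddings on the OCR'd corpus itself: nearest neighbours of a query term tend to mix related vocabulary with its recurring OCR mangles. A gensim sketch, with placeholder sentences standing in for real tokenised newspaper text:

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, an iterable of tokenised sentences drawn
# from the OCR'd newspaper text itself.
sentences = [
    ["cotton", "famine", "lancashire", "mills"],
    ["cotton", "fam1ne", "relief", "fund"],  # OCR error kept deliberately
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

# Neighbours of a query term suggest both related words and frequent OCR
# variants, so they can seed an expanded query across the corpus.
for term, score in model.wv.most_similar("cotton", topn=10):
    print(f"{term}\t{score:.2f}")
```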
See the unsearchable: missing issues, failure to digitise issues, failure to OCRise, corrupt files
Transparency helps researchers discover novel opportunities and make informed decisions about sources.
Clifford Wulfman – how to support transcriptions, linked open data
that allows exploration of notions of periodicity, notions of the
periodical. My tweet: Clifford Wulfman acknowledging that libraries
don't have the resources to support special 'snowflake' projects because
they're working to meet the most common needs. IME this question/need
doesn't go away so how best to tackle and support it?
Q&A comment: what if we just put all newspapers on Impresso?
Discussion of standardisation, working jointly, collaborating
Melodee Beals comments: libraries aren’t there just to support
academic researchers, academics could look to supporting the work of
creative industries, journalists and others to make it easier for
libraries to support them.
Subject librarian from Leiden University points out that copyright
limits their ability to share newspapers after 1880. (Innovating is hard
when you can't even share the data)
Nanette Rißler-Pipka says researchers don't need fancy interfaces, just
access to the data (which probably contradicts the need for 'special
snowflake' systems and explains why libraries can never ever make all their users happy).
LP-34: Cultural Heritage, Art/ifacts and Institutions
Mark Hill, early modern (1500-1800 but 18thC in particular)
definitions of ‘authorship’. How does authorship interact with
structural aspects of publishing? Shift of authorship from gentlemanly
to professional occupation.
Using the ESTC. Has about 1m actors, 400k documents with actors
attached to them. Actors include authors, editors, publishers, printers,
translators, dedicatees. Early modern print trade was ‘trade on a human
scale’. People knew each other: ‘hand-operated printing press required
individual actors and relationships’.
As time goes on, printers work with fewer, publishers work with more people, authors work with about the same number of people.
They manually created a network of people associated with Bernard
Mandeville and compared it with a network automatically generated from the ESTC data.
Looking at a work network for Edmond Hoyle’s Short Treatise on the
Game of Whist. (Today I learned that Hoyle's Rules, determiner of
victory in family card games and of 'according to Hoyle' fame, dates
back to a book on whist in the 18thC)
(Really nice use of social network analysis to highlight changes in
publisher and authorship networks.) Eigenvector very good at finding
important actors. In the English Civil War, who you know does matter
when it comes to publishing. By 18thC publishers really matter. See http://ceur-ws.org/Vol-2364/19_paper.pdf for more.
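For anyone wanting to try the method, eigenvector centrality is a one-liner in networkx. A minimal sketch with invented actors standing in for real ESTC people:

```python
import networkx as nx

# Toy imprint network: edges link actors who appear on the same title page.
# Names are invented stand-ins, not real ESTC actors.
G = nx.Graph()
G.add_edges_from([
    ("author_a", "printer_x"), ("author_a", "publisher_y"),
    ("author_b", "publisher_y"), ("printer_x", "publisher_y"),
    ("author_c", "printer_x"),
])

# Eigenvector centrality rewards actors connected to other well-connected
# actors - hence its knack for surfacing important figures in the trade.
centrality = nx.eigenvector_centrality(G)
for actor, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{actor}\t{score:.3f}")
```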
Richard Freedman, David Fiala, Andrew Janco et al
What is a musical quotation? Borrowing, allusion, parody, commonplace, contrafact, cover, plagiat, sampling, signifying.
Tweet: Freedman et al.'s slides for 'Citations: The Renaissance
Imitation Mass (CRIM) and The Quotable Musical Text in a Digital Age' https://bit.ly/CRIM_Utrecht are a rich introduction to applications of #DigitalMusicology encoding and markup
I spend so much time in text worlds that it's really refreshing to
hear from musicologists who play music to explain their work and place
so much value on listening while also exploiting digital processing
tools to the max.
Digging Into Pattern Usage Within Jazz Improvisation (Pattern History Explorer, Pattern Search and Similarity Search)
Frank Höger, Klaus Frieler, Martin Pfleiderer
?? Space as a container for understanding, organising information. Chorography, the writing of the region.
Tweet: In the spatial humanities panel where a speaker mentions
chorography, which along with prosopography is my favourite
Daniel Alves. Is #SpatialDH revolutionary? Do history and literature researchers feel the need to incorporate spatial analysis in their work? A large number who do don’t use GIS. Most of them don’t believe in it (!). The rest are so tired that they prefer theorising (!!)
His goal, ref last night's keynote, is not to build models, tools, the
next great algorithm; it’s to advance knowledge in his specific field.
Tweet: @DanielAlvesFCSH close reading is still essential to take in
the inner subjectivity of historical / literary sources with a partial
and biased conception of space and place
Tien Danniau, Ghent Centre for Digital Humanities – deep maps. How is the concept working for them?
Tweet: Deep maps! A slide showing some of the findings from the 2012
NEH Advanced Institute on spatial narratives and deep mapping, which is
where I met many awesome DH and spatial history people
Katie McDonough, Spatial history between maps and texts: lessons
from the 18thC. Refers to Richard White’s spatial history essay in her
abstract. Rethinking geographic information extraction. Embedded
entities, spatial relations, other stuff.
Tweet #DH2019 Preserving deep maps? I'd talk to folk in web
archiving for a sense of which issues re recording complex,
multi-format, dynamic items are tricky and which are more solvable
Closing keynote: Digital Humanities — Complexities of Sustainability, Johanna Drucker
(By this point my laptop and mental batteries were drained so I just
listened and tweeted. I was also taking part in a conversation about the
environmental sustainability of travel for conferences, issues with
access to visas and funding, etc, that might be alleviated by better
incorporating talks from remote presenters, or even having everyone attend remotely.)
Finally, the DH2020 conference is calling for reviewers. Reviewing is
an excellent way to give something back to the DH community while
learning about the latest work as it appears in proposals, and perhaps
more importantly, learning how to write a good proposal yourself. Find
out more: http://dh2020.adho.org/cfps/reviewers/
This paper explores some of the challenges and paradoxes in the application of data
science methods to cultural heritage collections. It is drawn from long
experience in the cultural heritage sector, predating but broadly aligned to
the 'OpenGLAM' and 'Collections as Data' movements. Experiences that have
shaped this thinking include providing open cultural data for computational
use; creating APIs for catalogue and interpretive records, running hackathons,
and helping cultural organisations think through the preparation of
'collections as data'; and supervising undergraduate and MSc projects for
students of computer science.
The opportunities are many. Cultural heritage institutions (aka GLAMs – galleries,
libraries, archives and museums) hold diverse historical, scientific and
creative works – images, printed and manuscript works, objects, audio or video
– that could be turned into some form of digital 'data' for use in data science
and digital humanities research. GLAM staff have expert knowledge about the
collections and their value to researchers. Data scientists bring a rigour,
specialist expertise and skills, and a fresh perspective to the study of
cultural heritage collections.
While the quest to publish cultural heritage records and digital surrogates for use in
data science is relatively new, the barriers within cultural organisations to
creating suitable infrastructure with others are historically numerous. They
include different expectations about the pace and urgency of work, different
levels of technical expertise, resourcing and infrastructure, and different
goals. They may even include different expectations about what 'data' is –
metadata drawn from GLAM catalogues is the most readily available and shared
data, but not only is this rarely complete, often untidy and inconsistent
(being the work of decades or centuries and many hands over that time), it is
also a far cry from datasets rich with images or transcribed text that data
scientists might expect.
Copyright, data protection and commercial licensing can limit access to digitised
materials (though this varies greatly). 'Orphaned works', where the rights
holder cannot be traced in order to licence the use of in-copyright works, mean
that up to 40% of some collections, particularly sound or video collections,
are unavailable for risk-free use (2012).
GLAMs have experimented with APIs, downloadable datasets and SPARQL endpoints,
they rarely have the resources or institutional will to maintain and refresh
these indefinitely. Records may be available through multi-national aggregators
such as Europeana, DPLA, or national aggregators, but as aggregation often
requires that metadata is mapped to the lowest common denominator, their value
for research may be limited.
The area of overlap between 'computationally interesting problems' and 'solutions useful for GLAMs' may be smaller than expected to date, but collaboration between cultural institutions and data scientists on shared projects in the 'sweet spot' – where new data science methods are explored to enhance the discoverability of collections – may provide a way forward. Sector-wide collaborations like the International Image Interoperability Framework (IIIF, https://iiif.io/) provide modern models for lightweight but powerful standards. Pilot projects with students or others can help test the usability of collection data and infrastructure while exploring the applicability of emerging technologies and methods. It is early days for these collaborations, but the future is bright.
An excerpt from the longer panel description by David Beavan and Barbara McGillivray.
This panel highlights the emerging collaborations and opportunities between the fields of Digital Humanities (DH), Data Science (DS) and Artificial Intelligence (AI). It charts the enthusiastic progress of the Alan Turing Institute, the UK national institute for data science and artificial intelligence, as it engages with cultural heritage institutions and academics from arts, humanities and social sciences disciplines. We discuss the exciting work and learnings from various new activities, across a number of high-profile institutions. As these initiatives push the intellectual and computational boundaries, the panel considers both the gains, benefits, and complexities encountered. The panel latterly turns towards the future of such interdisciplinary working, considering how DS & DH collaborations can grow, with a view towards a manifesto. As Data Science grows globally, this panel session will stimulate new discussion and direction, to help ensure the fields grow together and arts & humanities remain a strong focus of DS & AI. Also so DH methods and practices continue to benefit from new developments in DS which will enable future research avenues and questions.
It's not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I've shared mine here. The panel was designed to bring together a range of interdisciplinary newspaper-based digital humanities and/or data science projects, with 'provocations' from two senior scholars who would provide context for current ambitions, and to start conversations among practitioners.
Short Paper: Living with Machines
Paper authors: Mia Ridge, Giovanni Colavizza, with Ruth Ahnert, Claire Austin, David Beavan, Kaspar Beelens, Mariona Coll Ardanuy, Adam Farquhar, Emma Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray, André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.
Living with Machines is a five-year interdisciplinary research project, whose ambition is to blend data science with historical enquiry to study the human impact of the industrial revolution. Set to be one of the biggest and most ambitious digital humanities research initiatives ever to launch in the UK, Living with Machines is developing a large-scale infrastructure to perform data analyses on a variety of historical sources, and in so doing provide vital insights into the debates and discussions taking place in response to today’s digital industrial revolution.
Seeking to make the most of a self-described
'radical collaboration', the project will iteratively develop research
questions as computational linguists, historians, library curators and data
scientists work on a shared corpus of digitised newspapers, books and
biographical data (census, birth, death, marriage, etc. records). For example,
in the process of answering historical research questions, the project could
take advantage of access to expertise in computational linguistics to overcome
issues with choosing unambiguous and temporally stable keywords for analysis,
previously reported by others (Lansdall-Welfare
et al., 2017). A key methodological objective of the project is to
'translate' history research questions into data models, in order to inspect
and integrate them into historical narratives. In order to enable this process,
a digital infrastructure is being collaboratively designed and developed, whose
purpose is to marshal and interlink a variety of historical datasets, including
newspapers, and allow for historians and data scientists to engage with them.
In this paper we will present our vision for Living with Machines, focusing on how we
plan to approach it, and the ways in which digital infrastructure enables this
multidisciplinary exchange. We will also showcase preliminary results from the
different research 'laboratories', and detail the historical sources we plan to
use within the project.
The Past, Present and Future of
Digital Scholarship with Newspaper Collections
Mia Ridge (British
Library), Giovanni Colavizza (Alan Turing Institute)
Historical newspapers are of interest to many humanities scholars, valued as sources of information and language closely tied to a particular time, social context and place. Following library and commercial microfilming and, more recently, digitisation projects, newspapers have been an accessible and valued source for researchers. The ability to use keyword searches through more data than ever before via digitised newspapers has transformed the work of researchers.
Digitised historic newspapers are also of interest to many researchers who seek large bodies of relatively easily computationally-transcribed text on which they can try new methods and tools. Intensive digitisation over the past two decades has seen smaller-scale or repository-focused projects flourish in the Anglophone and European world (Holley, 2009; King, 2005; Neudecker et al., 2014). However, just as earlier scholarship was potentially over-reliant on The Times of London and other metropolitan dailies, this has been replicated and reinforced by digitisation projects (for a Canadian example, see Milligan 2013).
In recent years, several large consortia projects proposing to apply data science and computational methods to historical newspapers at scale have emerged, including NewsEye, impresso, Oceanic Exchanges and Living with Machines. This panel has been convened by some consortia members to cast a critical view on past and ongoing digital scholarship with newspaper collections, and to inform its future.
Digitisation can involve both complexities and simplifications. Knowledge about the imperfections of digitisation, cataloguing, corpus construction, text transcription and mining is rarely shared outside cultural institutions or projects. How can these imperfections and absences be made visible to users of digital repositories? Furthermore, how does the over-representation of some aspects of society through the successive winnowing and remediation of potential sources – from creation to collection, microfilming, preservation, licensing and digitisation – affect scholarship based on digitised newspapers? How can computational methods address some of these issues?
The panel proposes the following format: short papers will be delivered by existing projects working on large collections of historical newspapers, presenting their vision and results to date. Each project is at a different stage of development and will discuss their choice to work with newspapers, and reflect on what they have learnt to date on practical, methodological and user-focused aspects of this digital humanities work. The panel is additionally an opportunity to consider important questions of interoperability and legacy beyond the life of the project. Two further papers will follow, given by scholars with significant experience using these collections for research, in order to provide the panel with critical reflections. The floor will then open for debate and discussion.
This panel is a unique opportunity to bring senior scholars with a long perspective on the uses of newspapers in scholarship together with projects at formative stages. More broadly, convening this panel is an opportunity for the DH2019 community to ask their own questions of newspaper-based projects, and for researchers to map methodological similarities between projects. Our hope is that this panel will foster a community of practice around the topic and encourage discussions of the methodological and pedagogical implications of digital scholarship with newspapers.
For an overview of the impact of keyword search on historical research, see Putnam (2016) and Bingham (2010).
Before we start: in the spirit of the mid-2000s, I thought I'd have a go at blogging about events again. I've realised I miss the way that blogging and reading other people's posts from events made me feel part of a distributed community of fellow travellers. Journal articles don't have the same effect (they're too long and jargony for leisure readers, assuming they're accessible outside universities at all), and tweets are great for connecting with people, but they're very ephemeral. Here goes…
On September 3 I was at BBC Broadcasting House for 'AI, Society & the Media: How can we Flourish in the Age of AI?', organised by the BBC, LCFI and The Alan Turing Institute. Artificial intelligence is a hot topic, so it was a sell-out event. My notes are very partial (in both senses of the word), and please do let me know if there are errors. The event hashtag will provide more coverage: https://twitter.com/hashtag/howcanweflourish.
The first session was 'AI – What you need to know!'. Matthew Postgate began by providing context for the BBC's interest in AI. 'We need a plurality of business models for AI – not just ad-funded' – yes! The need for different models for AI (and related subjects like machine learning) was a theme that recurred throughout the day (and at other events I was at this week).
Adrian Weller spoke on the limitations of AI. It's data hungry, compute intensive, poor at representing uncertainty, easily fooled by adversarial examples (and more that I missed). We need sensible measures of trustworthiness including robustness, fairness, protection of privacy, transparency.
Been Kim shared Google's AI principles (https://ai.google/principles). She's focused on interpretability – goals are to ensure that our values are aligned and our knowledge is reflected. She emphasised the need to understand your data (another theme across the day and other events this week). You can build an inherently interpretable machine model (so it can explain its reasoning) or you can build an interpreter, enabling conversations between humans and machines. You can then uncover bias using the interpreter, asking what weight it gave to different aspects in making decisions.
Jonnie Penn (who won me over with an early shout-out to the work of Jon Agar) asked, from where does AI draw its authority? AI is feeding a monopoly of Google-Amazon-Facebook, who control the majority of internet traffic and advertising spend. Power lies in choosing what to optimise for, and choosing what not to do (a tragically poor paraphrase of his example of advertising to children, but you get the idea). We need 'bureaucratic biodiversity' – lots of models of diverse systems to avoid calcification.
Kate Coughlan – only 10% of people feel they can influence AI. They looked at media narratives about AI on axes of time (ease vs obsolescence), power (domination vs uprising), desire (gratification vs alienation), life (immortality vs inhumanity). Their survey found that each aspect was equally disempowering. Passivity drives negative outcomes in feelings about change and tech – but if people have agency, it's different. We need to empower citizens to have an active role in shaping AI.
The next session was 'Fake News, Real Problems: How AI both builds and destroys trust in news'. Ryan Fox spoke on 'manufactured consensus' – we're hardwired to agree with our community, so you can manipulate opinion by making it look like everyone else thinks a certain way. Manipulating consensus is currently legal, though against social networks' T&Cs. 'Viral false narratives can jeopardise brand trust and integrity in an instant'. Manufactured outrage campaigns etc. They're working on detecting inorganic behaviour through the noise – it's rapid, repetitive, sticky, emotional (missed some).
One of the panel questions was: would AI replace journalists? No, it's more like having lots of interns – you wouldn't have them write articles. AI is good for tasks you can explain to a smart 16-year-old in the office for a day. The problematic ad-based model came up again – who is the arbiter of truth (e.g. of fake news on Facebook)? Who's paying for those services and what power does it give them?
This panel made me think about discussions about machine learning and AI at work. There are so many technical, contextual and ethical challenges for collecting institutions in AI, from capturing the output of an interactive voice experience with Alexa, to understanding and recording the difference between Russia Today as a broadcast news channel and as a manipulator of YouTube rankings.
Next was a panel on 'AI as a Creative Enabler'. Cassian Harrison spoke about 'Made By Machine', an experiment with AI and archive programming. They used scene detection, subtitle analysis, visual 'energy', machine learning on the BBC's Redux archive of programmes. Programmes were ranked by how BBC4 they were; split into sections then edited down to create mini BBC4 programmes.
Kanta Dihal and Stephen Cave asked why AI fascinates us in a thoughtful presentation. It's between dead and alive, uncanny (and lots more but clearly my post-lunch notetaking isn't the best).
Anna Ridler and Amy Cutler have created an AI-scripted nature documentary (trained on and re-purposing a range of tropes and footage from romance novels and nature documentaries) and gave a brilliant presentation about AI as a medium and as a process. Anna calls herself a dataset artist, rather than a machine learning artist. You need to get to know the dataset, look out for biases and mistakes, understand the humanness of decisions about what was included or excluded. Machines enact distorted versions of language.
I don't have notes from 'Next Gen AI: How can the next generation flourish in the age of AI?' but it was great to hear about hackathons where teenagers could try applying AI. The final session was 'The Conditions for Flourishing: How to increase citizen agency and social value'. Hannah Fry – once something is dressed up as an algorithm it gains an authority that's hard to question. Diane Coyle talked about 'general purpose technologies', which transform one industry and then others: printing, steam, electricity, the internal combustion engine, digital computing, AI. Her 'lessons for the era of AI' were that all technology is social; all technologies are disruptive and have unpredictable consequences; and all successful technologies enhance human freedoms. Accordingly, she suggested we 'think in systems; plan for change; be optimistic'.
Konstantinos Karachalios called for a show of hands re who feels they have control over their data and what's done with it? Very few hands were raised. 'If we don't act now we'll lose our agency'.
I'm going to give the final word to Terah Lyons as the key takeaway from the day: 'technology is not destiny'.
I didn't hear a solution to the problems of 'fake news' that doesn't require work from all of us. If we don't want technology to be destiny, we all need to pay attention to the applications of AI in our lives, and be prepared to demand better governance and accountability from private and government agents.
(A bonus 'question I didn't ask' for those who've read this far: how do the BBC's aims for ethical AI relate to the introduction of compulsory registration to access TV and radio? If I turn on the radio in my kitchen, my listening habits aren't tracked; if I listen via the app, they're linked to my personal ID.)
I've been posting on the work blog far more frequently than I have here. Launching and running In the Spotlight, crowdsourcing the transcription of the British Library's historic playbills collection, was a focus in 2017-18. Some blog posts:
'If you follow @BL_DigiSchol or #DigitalHumanities hashtags on twitter, you might have seen a burst of data science, history and digital humanities jobs being advertised. In this post, Dr Mia Ridge of the Library's Digital Scholarship team provides some background to contextualise the jobs advertised with the 'Living with Machines' project.'
We are seeking to appoint several new roles who will collaborate on an exciting new project developed by the British Library and The Alan Turing Institute, the national centre for data science and artificial intelligence.
You may have noticed that the British Library is also currently advertising for a Curator, Newspaper Data (closes Sept 9). This isn’t related to Living with Machines, but with an approach of applying data-driven journalism and visualisation techniques to historical collections, it should have some lovely synergies and opportunities to share work in progress with the project team. There's also a Research Software Engineer advertised that will work closely with many of the same British Library teams.