What big topics in Digital Humanities should a reading group discuss in 2021?

This is a thrown-together post to capture responses to a question I asked on twitter last week. The Digital Scholarship Reading Group I run at the British Library will spend the first meeting of 2021 collaboratively planning topics to discuss in the rest of the year, so to broaden my understanding of what might be discussed, I posted, ‘A question for people interested / working in Digital Humanities – what do you think are the big topics for 2021? Or what’s not, but should be a focus? … New publications or conference papers welcome!’.

And since I was asking people for suggestions, it seemed like the right time to share something we’d been thinking about for a while: ‘we’ve decided to open our discussions to people outside the British Library / Turing Institute! We’ll alternate between 11am-12pm and 3-4pm meeting times on the first Tuesday of each month’. I haven’t sorted the logistics for signing up – should it be on a session by session basis, or should we just add people’s email address to the generic meeting request so they get the updates? (Will they get the updates, given how defensive and awful email is for collaboration these days?)

I also posted links: ‘For context, here’s what we read up to early 2018, What do deep learning, community archives, Livy and the politics of artefacts have in common?, and a themed summary, Readings at the intersection of digital scholarship and anti-racism’.

Responses to date are below. I didn’t want to faff about with embedded tweets because they’re more likely to break over time, so I’ve just indented replies with the username at the start.

Claire Boardman @boardman_claire The environmental impact of DH? Conversational AI and collections?

Jajwalya Karajgikar @JajRK Large language models, and computational text analysis overall?

                @mia_out As in models that use very large amounts of training data? And yes, we should do more on CTA, I think we could probably get broader coverage of methods, thanks for the prompt!

                Jajwalya Karajgikar @JajRK Models that use deep learning for language prediction; GPT-3 I think someone mentioned on the thread already?

Thomas Padilla @thomasgpadilla Social justice and DH – though all work that frames current strife as a new thing vs. a longstanding pervasive reality should be tossed into an abyss to make way for others

                @mia_out I won’t ask you to name and shame bad pieces, but let me know if you have any favs that do it well!

Thomas Padilla @thomasgpadilla Ha! On the collections side @dorothyjberry has a piece or two brewing. @ess_ell_zee work here is good too I think https://journal.code4lib.org/articles/14667

                @artepublico peeps like @gbaezaventura and @rayenchil and the @MellonFdn supported Latinx DH program are good places to look

                Same goes for @profgabrielle and all the @CCP_org work is fantastic

Wilhelmina Randtke @randtke Long term sustained funding. Acknowledging, and even compiling a list of, projects that have had resources eliminated or been completely discontinued since March.

                Jenny Fewster @Fewster Absolutely! This is a problem internationally. Dig hums projects set up with one off funding that then aren’t sustained. Unfortunately digital projects are not a “set and forget” prospect. It’s a colossal waste of time, effort, knowledge and money

Matthew Hannah @TinkeringHuman I think we need/will see more work about the limits of neoliberal capitalism, the academy, and DH, applications of critical university studies and Marxist theory. Esp as higher ed continues to implode.

                @mia_out Sounds very timely! I don’t suppose you have any papers or presentations in mind?

                Matthew Hannah @TinkeringHuman Claire Potter’s piece in Radical Teacher is also an inspiration: https://pdfs.semanticscholar.org/c3c0/b0f853710a56b13b0d232b3b435a19bf59a7.pdf

                But we need more engagement I think around the question of precarity and economics imo

                See also: https://jimmcgrath.us/blog/new-publication-precarious-labor-in-the-digital-humanities-american-quarterly-70-3/

Johan Oomen @johanoomen Detecting polyvocality in heritage collections and navigating this underexplored dimension to investigate shifting viewpoints over time. Could also be a great opportunity for crowdsourcing projects, to encourage contemporary users to voice their opinions on contentious topics.

                @mia_out Ooh, that’s a really juicy one – lots of potential and lots of pitfalls

Erik Champion @nzerik The influence of social media on politics? The failure of social media apps, webchat etc to compensate for lockdown distancing? Govt and corp control on personal data? Big companies controlling VR devices and personal +physiological data?

                @mia_out As seen recently when people were annoyed they had to do a Google Recaptcha on a COVID test site

                Erik Champion @nzerik Bots need vaccines too! (Equality for bots trojans and spam machines #101)

Alexander Doria @Dorialexander On the technical side, optical manuscript recognition and layout analysis (especially for newspaper archives): mature tools are just emerging and that can change a lot in terms of corpus availability, research directions and digitization choices.

                @mia_out There is so much interesting work on newspapers right now! It feels like scholarship is going to have a quite different starting point in just a few years. Periodicals less so, maybe because they’re more specialised and less (family history) name rich?

                Alexander Doria @Dorialexander Yes that’s true. Perhaps also because they are less challenging both technically and intellectually (it’s not that much of a stretch to go from book studies to the periodicals).

Alexander Doria @Dorialexander (On the social side I would say there is a long overdue uncomfortable discussion about the reliance of the field to diverse forms of digital labors: from the production of digitized archives in developing countries to the large use of students as a cheap/unpaid labor force)

                @mia_out That ties in with ideas from @TinkeringHuman

Max Kemman @MaxKemman I think we’ll be seeing more about Computational Humanities and how it relates to Digital Humanities, for which a good starting point will be the @CompHumResearch conference proceedings http://ceur-ws.org/Vol-2723/

                @mia_out Ooh, we could have a debate or discussion about the difference!

                Lauren Tilton @nolauren And intersection/ difference from Data Science

                @mia_out Good call, the lines are becoming increasingly blurred, hopefully in more good ways than bad

Gabriel Hankins @GabrielHankins GPT-3 and algorithmic composition. Interested in the conversation if you open it!

And finally, one reason I collected these responses was:

Michael Lascarides @mlascarides A feature I wish Twitter had: When I see someone influential in a domain I’m interested in ask a really great question, I want to bookmark that question to return to in a couple of days once the responses have come in. It’s a use case a bit more specific than a “like”.

                Michael Lascarides @mlascarides Inspired most recently by [my] Q, but it comes up about once a week for me

Notes from Digital Humanities 2019 (DH2019 Utrecht)

My rough notes from the Digital Humanities 2019 conference in Utrecht. All the usual warnings about partial attention / tendency for distraction apply. My comments are usually in brackets.

I found the most useful reference for the conference programme to be https://www.conftool.pro/dh2019/index.php?page=browseSessions&path=adminSessions&print=export&presentations=show but it doesn’t show the titles or abstracts for papers within panels.

Some places my colleagues and I were during the conference: https://blogs.bl.uk/digital-scholarship/2019/07/british-library-digital-scholarship-at-digital-humanities-2019-.html and http://livingwithmachines.ac.uk/living-with-machines-at-digital-humanities-2019/

DH2019 Keynote by Francis B. Nyamnjoh, ‘African Inspiration for Understanding the Compositeness of Being Human through Digital Technology’


  • Notion of complexity, and incompleteness familiar to Africa. Africans frown on attempts to simplify
  • How do notions of incompleteness provide food for thought in digital humanities?
  • Nyamnjoh decries the sense of superiority inspired by zero sum games. ‘Humans are incomplete, nature is incomplete. Religious bit. No one can escape incompleteness.’ (Phew! This is something of a mantra when you work with collections at scale – working in cultural institutions comes with a daily sense that the work is so large it will continue after you’re just a memory. Let’s embrace rather than apologise for it)
  • References books by Amos Tutuola
  • Nyamnjoh on hidden persuaders, activators. Juju as a technology of self-extension. With juju, you can extend your presence; rise beyond ordinary ways of being. But it can also be spyware. (Timely, on the day that Zoom was found to allow access to your laptop camera – this has positives and negatives)
  • Nyamnjoh: DH as the compositeness of being; being incomplete is something to celebrate. Proposes a scholarship of conviviality that takes in practices from different academic disciplines to make itself better.
  • Nyamnjoh in response to Micki K’s question about history as a zero-sum game in which people argue whether something did or didn’t happen: create archives that can tell multiple stories, complexify the stories that exist

DH2019 Day 1, July 10

LP-03: Space Territory GeoHumanities

https://www.conftool.pro/dh2019/index.php?page=browseSessions&path=adminSessions&print=export&ismobile=false&form_session=455&presentations=show Locating Absence with Narrative Digital Maps

How to combine new media production with DH methodologies to create kit for recording and locating in the field.

Why georeference? Situate context, comparison old and new maps, feature extraction, or exploring map complexity.

Maps Re-imagined: Digital, Informational, and Perceptional Experimentations in Progress by Tyng-Ruey Chuang, Chih-Chuan Hsu, Huang-Sin Syu used OpenStreetMap with historical Taiwanese maps. Interesting base map options inc ukiyo style https://bcfuture.github.io/tileserver/Switch.html

Oceanic Exchanges: Transnational Textual Migration And Viral Culture

https://www.conftool.pro/dh2019/index.php?page=browseSessions&path=adminSessions&print=export&ismobile=false&form_session=477&presentations=show Oceanic Exchanges studies the flow of information, searching for historical-literary connections between newspapers around the world; seeks to push the boundaries of research w newspapers

  • Challenges: imperfect comparability of corpora – data is provided in different ways by each data provider; no unifying ontology between archives (no generic identification of specific items); legal restrictions; TEI and other work hasn’t been suitable for newspaper research
  • Limited ability to conduct research across repositories. Deep semantic multilingual text mining remains a challenge. Political (national) and practical organisation of archives currently determines questions that can be asked, privileges certain kinds of enquiry.
  • Oceanic Exchanges project includes over 100 million pages. Corpus exploration tool needed to support: exploring data (metadata and text); other things that went by too quickly.

The Past, Present and Future of Digital Scholarship with Newspaper Collections


I was on this panel so I tweeted a bit but have no notes myself.

Working with historical text (digitised newspapers, books, whatever) collections at scale has some interesting challenges and rewards. Inspired by all the newspaper sessions? Join an emerging community of practitioners, researchers and critical friends via this document from a ‘DH2019 Lunch session – Researchers & Libraries working together on improving digitised newspapers’ https://docs.google.com/document/d/1JJJOjasuos4yJULpquXt8pzpktwlYpOKrRBrCds8r2g/edit

Complexities, Explainability and Method

https://www.conftool.pro/dh2019/index.php?page=browseSessions&path=adminSessions&print=export&ismobile=false&form_session=486&presentations=show I enjoyed listening to this panel which is so far removed from my everyday DH practice.

Other stuff

Tweet: If you ask a library professional about digitisating (new word alert!) a specific collection and they appear to go quiet, this is actually what they’re doing – digitisation takes shedloads of time and paperwork https://twitter.com/CamDigLib/status/1148888628405395456


@LibsDH ADHO Lib & DH SIG meetup

There was a lunchtime meeting for ‘Libraries and Digital Humanities: an ADHO Special Interest Group’, which was a lovely chance to talk libraries / GLAMs and DH. You can join the group via https://docs.google.com/forms/d/e/1FAIpQLSfswiaEnmS_mBTfL3Bc8fJsY5zxhY7xw0auYMCGY_2R0MT06w/viewform or the mailing list at http://lists.digitalhumanities.org/mailman/listinfo/libdh-sig

DH2019 Day 2, July 11

XR in DH: Extended Reality in the Digital Humanities


Another panel where I enjoyed listening and learning about a field I haven’t explored in depth. Tweet from the Q&A: ‘Love the ‘XR in DH: Extended Reality in the Digital Humanities’ panel responses to a question about training students only for them to go off and get jobs in industry: good! Industry needs diversity, PhDs need to support multiple career paths beyond academia’

Data Science & Digital Humanities: new collaborations, new opportunities and new complexities

https://www.conftool.pro/dh2019/index.php?page=browseSessions&path=adminSessions&print=export&ismobile=false&form_session=532&presentations=show Beatrice Alex, Anne Alexander, David Beavan, Eirini Goudarouli, Leonardo Impett, Barbara McGillivray, Nora McGregor, Mia Ridge

My work with open cultural data has led to me asking ‘how can GLAMs and data scientists collaborate to produce outcomes that are useful for both?’. Following this, I presented a short paper, more info at https://www.openobjects.org.uk/2019/07/in-search-of-the-sweet-spot-infrastructure-at-the-intersection-of-cultural-heritage-and-data-science/ https://www.slideshare.net/miaridge/in-search-of-the-sweet-spot-infrastructure-at-the-intersection-of-cultural-heritage-and-data-science.

As summarised in tweets:

  • https://twitter.com/semames1/status/1149250799232540672, ‘data science can provide new routes into library collections; libraries can provide new challenging sources of information (scale, untidy data) for data scientists’;
  • https://twitter.com/sp_meta/status/1149251010025656321 ‘library staff are often assessed by strict metrics of performance – items catalog, speed of delivery to reading room – that isn’t well-matched to messy, experimental collaborations with data scientists’;
  • https://twitter.com/melissaterras/status/1149251480576303109 ‘Copyright issues are inescapable… they are the background noise to what we do’;
  • https://twitter.com/sp_meta/status/1149251656720289792 ‘How can library infrastructure change to enable collaboration with data scientists, encouraging use of collections as data and prompting researchers to share their data and interpretations back?’;
  • (me) ‘I’m wondering about this dichotomy between ‘new’ or novel, and ‘useful’ or applied – is there actually a sweet spot where data scientists can work with DH / GLAMs or should we just apply data science methods and also offer collections for novel data science research? Thinking of it as a scale of different aspects of ‘new to applied research’ rather than a simple either/or’.

SP-19: Cultural Heritage, Art/ifacts and Institutions


“Un Manuscrit Naturellement ” Rescuing a library buried in digital sand

  • 1979, agreement with Ministry of Culture and IRHT to digitise all manuscripts stored in French public libraries. (Began with microfilm, not digital). Safe, but not usable. Financial cost of preserving 40TB of data was prohibitive, but BnF started converting TIFFs to JP2 which made storage financially feasible. Huge investment by France in data preservation for digitised manuscripts.
  • Big data cleaning and deduplication process, got rid of 1 million files. Discovered errors in TIFF when converting to JP2. Found inconsistencies with metadata between databases and files. 3 years to do the prep work and clean the data!
  • ‘A project which lasts for 40 years produces a lot of variabilities’. Needed a team, access to proper infrastructure; the person with memory of the project was key.

A Database of Islamic Scientific Manuscripts — Challenges of Past and Future

  • (Following on from the last paper, digital preservation takes continuous effort). Moving to RDF model based on CIDOC-CRM, standard triple store database, standard ResearchSpace/Metaphactory front end. Trying to separate the data from the software to make maintenance easier.

Analytical Edition Detection In Bibliographic Metadata; The Emerging Paradigm of Bibliographic Data Science

  • Tweet: Two solid papers on a database for Islamic Scientific Manuscripts and data science work with the ESTC (English Short Title Catalogue) plus reflections on the need for continuous investment in digital preservation. Back on familiar curatorial / #MuseTech ground!
  • Lahti – Reconciling / data harmonisation for early modern books is so complex that there are different researchers working on editions, authors, publishers, places

Syriac Persons, Events, and Relations: A Linked Open Factoid-based Prosopography

  • Prosopography and factoids. His project relies heavily on authority files that http://syriaca.org/ produces. Modelling factoids in TEI; usually it’s done in relational databases.
  • Prosopography used to be published as snippets of narrative text about people for whom enough information was available
  • Factoid – a discrete piece of prosopographical information asserted in a primary source text and sourced to that text.
  • Person, event and relation factoids. Researcher attribution at the factoid level. Using TEI because (as markup around the text) it stays close to the primary source material; can link out to controlled vocabulary
  • Srophe app – an open source platform for cultural heritage data used to present their prosopographical data https://srophe.app/
  • Harold Short says how pleased he is to hear a project like that taking the approach they have; TEI wasn’t available as an option when they did the original work (seriously beautiful moment)
  • Why SNAP? ‘FOAF isn’t really good at describing relationships that have come about as a result of slave ownership’
  • More on factoid prosopography via Arianna Ciula https://factoid-dighum.kcl.ac.uk/
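
(To make the factoid idea concrete for myself, here is a minimal sketch of the definition above as a data structure. The project itself models factoids in TEI, not Python; the field names, identifiers and example assertions below are all hypothetical and only illustrate 'a discrete piece of information, sourced to a text, with researcher attribution at the factoid level'.)

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Factoid:
    """One assertion made in a primary source, with attribution."""
    kind: Literal["person", "event", "relation"]
    assertion: str                   # what the source asserts
    source: str                      # the primary source text it is sourced to
    recorded_by: str                 # researcher credited at the factoid level
    subjects: tuple[str, ...] = ()   # hypothetical identifiers for the people involved


def factoids_about(factoids, person_id):
    """Return every factoid that mentions the given person identifier."""
    return [f for f in factoids if person_id in f.subjects]


# Invented example data, not taken from any real prosopography.
factoids = [
    Factoid("person", "A was a bishop", "Chronicle X", "Researcher 1", ("person/A",)),
    Factoid("relation", "A was the teacher of B", "Letter Y", "Researcher 2",
            ("person/A", "person/B")),
    Factoid("event", "B attended a synod", "Chronicle X", "Researcher 1", ("person/B",)),
]

about_a = factoids_about(factoids, "person/A")
sources_for_a = sorted({f.source for f in about_a})
```

Because each factoid carries its own source, two sources that disagree about the same person can sit side by side without one overwriting the other.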

Day 3, July 12

Complexities in the Use, Analysis, and Representation of Historical Digital Periodicals


  • Torsten Roeder: Tracing debate about a particular work through German music magazines and daily newspapers. OCR and mass digitisation made it easier to compose representative text corpora about specific subjects. Authorship information isn’t available so don’t know their backgrounds etc, means a different form of analysis. ‘Horizontal reading’ as a metaphor for his approach. Topic modelling didn’t work for looking for music criticism.
  • Roeder’s requirements: accessible digital copies of newspapers; reliable metadata; high quality OCR or transcriptions; article borders; some kind of segmentation; deep semantic annotation – ‘but who does what?’ What should collection holders / access providers do, and what should researchers do? (e.g. who should identify entities and concepts within texts? This question was picked up in other discussion in the session, on twitter and at an impromptu lunchtime meetup)
  • Zef Segal. The Periodical as a Geographical Space. Relation between the two isn’t unidirectional. Imagined space constructed by the text and its layout. Periodicals construct an imaginary space that refers back to the real. Headlines, paratext, regular text. Divisions between articles. His case study for exploring the issues: HaZefirah. (sample slide image https://twitter.com/mia_out/status/1149581497680052224)
  • Nanette Rißler-Pipka, Historical Periodicals Research, Opportunities and Limitations. The limitations she encounters as a researcher. Building a corpus of historical periodicals for a research question often means using sources from more than one provider of digitised texts. Different searches, rights, structure. (The need for multiple forms of interoperability, again)
  • Wants article / ad / genre classifications. For metadata, wants bibliographical data about the title (issue, date); extractable data (dates, names, tables of contents); provenance data (who digitised, when?). When you download individual articles, you lose the metadata which would be so useful for research. Open access is vital; interoperability is important; the ability to create individual collections across individual libraries is a wonderful dream
  • Estelle Bunout. Impresso providing exploration tools (integrate and decomplexify NLP tools in current historical research workflows). https://impresso-project.ch/app/#/
  • Working on: expanding a query – find neighbouring terms and frequent OCR errors. Overview of query: where and when is it? Whole corpus has been processed with topic modelling.
  • Complex queries: help me find the mention of places, countries, person in a particular thematic context. Can save to collection or export for further processing.
  • See the unsearchable: missing issues, failure to digitise issues, failure to OCRise, corrupt files
  • Transparency helps researchers discover novel opportunities and make informed decisions about sources.
  • Clifford Wulfman – how to support transcriptions, linked open data that allows exploration of notions of periodicity, notions of the periodical. My tweet: Clifford Wulfman acknowledging that libraries don’t have the resources to support special ‘snowflake’ projects because they’re working to meet the most common needs. IME this question/need doesn’t go away so how best to tackle and support it?
  • Q&A comment: what if we just put all newspapers on Impresso? Discussion of standardisation, working jointly, collaborating internationally
  • Melodee Beals comments: libraries aren’t there just to support academic researchers, academics could look to supporting the work of creative industries, journalists and others to make it easier for libraries to support them.
  • Subject librarian from Leiden University points out that copyright limits their ability to share newspapers after 1880. (Innovating is hard when you can’t even share the data)
  • Nanette Rißler-Pipka says researchers don’t need fancy interfaces, just access to the data (which probably contradicts the need for ‘special snowflake’ systems and explains why libraries can never ever make all users happy)
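
(A rough sketch of what 'expanding a query' to cover frequent OCR errors, as mentioned in the Impresso notes above, might look like. This is not how Impresso does it; the confusion table and function name are my own assumptions, chosen only to show the idea of searching for likely misreadings alongside the original term.)

```python
from itertools import product

# Hypothetical confusion table: each key is a character that OCR
# sometimes misreads as one of the listed strings.
OCR_CONFUSIONS = {
    "m": ["rn"],
    "s": ["f"],   # e.g. the long s in older typefaces
    "l": ["1"],
}


def ocr_variants(term, confusions=OCR_CONFUSIONS, max_variants=50):
    """Return the term followed by spellings with confusable characters swapped."""
    # For each character, the original comes first, then its possible misreadings.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:  # cap the combinatorial growth
            break
    return variants


variants = ocr_variants("mast")
```

The resulting list could be OR-ed together in a search, at the cost of more false positives, which is presumably why showing researchers the expanded terms matters.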

LP-34: Cultural Heritage, Art/ifacts and Institutions


(I was chairing so notes are sketchier)

  • Mark Hill, early modern (1500-1800 but 18thC in particular) definitions of ‘authorship’. How does authorship interact with structural aspects of publishing? Shift of authorship from gentlemanly to professional occupation.
  • Using the ESTC. Has about 1m actors, 400k documents with actors attached to them. Actors include authors, editors, publishers, printers, translators, dedicatees. Early modern print trade was ‘trade on a human scale’. People knew each other ‘hand-operated printing press required individual actors and relationships’.
  • As time goes on, printers work with fewer people, publishers work with more, and authors work with about the same number.
  • They manually created a network of people associated with Bernard Mandeville and compared it with a network automatically generated from ESTC.
  • Looking at a work network for Edmond Hoyle’s Short Treatise on the Game of Whist. (Today I learned that Hoyle’s Rules, determiner of victory in family card games and of ‘according to Hoyle’ fame, dates back to a book on whist in the 18thC)
  • (Really nice use of social network analysis to highlight changes in publisher and authorship networks.) Eigenvector very good at finding important actors. In the English Civil War, who you know does matter when it comes to publishing. By 18thC publishers really matter. See http://ceur-ws.org/Vol-2364/19_paper.pdf for more.
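
(For anyone else who spends more time in text worlds: a minimal sketch of eigenvector centrality, computed by power iteration on a tiny, invented network. The names and edges are hypothetical, not from the ESTC, and the paper's actual pipeline is not described in my notes. The idea is that an actor scores highly when connected to other high-scoring actors.)

```python
def eigenvector_centrality(edges, iterations=100):
    """Approximate eigenvector centrality for an undirected graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbours' scores.
        # Keeping the node's own score leaves the ranking unchanged but
        # stops the values oscillating on some graph shapes.
        updated = {
            node: scores[node] + sum(scores[other] for other in neighbours[node])
            for node in neighbours
        }
        norm = sum(value * value for value in updated.values()) ** 0.5
        scores = {node: value / norm for node, value in updated.items()}
    return scores


# Invented example: one publisher linked to two authors and a printer.
edges = [
    ("Publisher", "Author A"),
    ("Publisher", "Author B"),
    ("Publisher", "Printer"),
    ("Printer", "Author A"),
]
scores = eigenvector_centrality(edges)
most_central = max(scores, key=scores.get)
```

In this toy network the publisher comes out on top, and Author A outranks Author B because of the extra link to the printer.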

Richard Freedman, David Fiala, Andrew Janco et al

  • What is a musical quotation? Borrowing, allusion, parody, commonplace, contrafact, cover, plagiat, sampling, signifying.
  • Tweet: Freedman et al.’s slides for ‘Citations: The Renaissance Imitation Mass (CRIM) and The Quotable Musical Text in a Digital Age’ https://bit.ly/CRIM_Utrecht are a rich introduction to applications of #DigitalMusicology encoding and markup
  • I spend so much time in text worlds that it’s really refreshing to hear from musicologists who play music to explain their work and place so much value on listening while also exploiting digital processing tools to the max

Digging Into Pattern Usage Within Jazz Improvisation (Pattern History Explorer, Pattern Search and Similarity Search) Frank Höger, Klaus Frieler, Martin Pfleiderer

Impromptu meetup to discuss issues raised around digitised newspapers research and infrastructure

See notes about DH2019 Lunch session – Researchers & Libraries working together on improving digitised newspapers. 20 or more people joined us for a discussion of the wonderful challenges and wish lists from speakers, thinking about how we can collaborate to improve the provision of digitised newspapers / periodicals for researchers.

Theorising the Spatial Humanities panel


  • ?? Space as a container for understanding, organising information. Chorography, the writing of the region.
  • Tweet: In the spatial humanities panel where a speaker mentions chorography, which along with prosopography is my favourite digital-history-enabled-but-also-old concept
  • Daniel Alves. Do history and literature researchers feel the need to incorporate spatial analysis in their work? A large number who do don’t use GIS. Most of them don’t believe in it (!). The rest are so tired that they prefer theorising (!!) His goal, ref last night’s keynote, is not to build models, tools, the next great algorithm; it’s to advance knowledge in his specific field.
  • Tweet: @DanielAlvesFCSH Is #SpatialDH revolutionary? Do history and literature researchers feel the need to incorporate spatial analysis in their work? A large number who do don’t use GIS. Most of them don’t believe in it(!). The rest are so tired that they prefer theorising(!!)
  • Tweet: @DanielAlvesFCSH close reading is still essential to take in the inner subjectivity of historical / literary sources with a partial and biased conception of space and place
  • Tien Danniau, Ghent Centre for Digital Humanities – deep maps. How is the concept working for them?
  • Tweet: Deep maps! A slide showing some of the findings from the 2012 NEH Advanced Institute on spatial narratives and deep mapping, which is where I met many awesome DH and spatial history people #DH2019 pic.twitter.com/JiQepz7kH5
  • Katie McDonough, Spatial history between maps and texts: lessons from the 18thC. Refers to Richard White’s spatial history essay in her abstract. Rethinking geographic information extraction. Embedded entities, spatial relations, other stuff.
  • Tweet: @khetiwe24 references work discussed in https://www.tandfonline.com/doi/abs/10.1080/13658816.2019.1620235?journalCode=tgis20 noting how the process of annotating texts requires close reading that changes your understanding of place in the text (echoing @DanielAlvesFCSH’s earlier point)
  • Tweet: Final #spatialDH talk ‘towards spatial linguistics’ #DH2019 https://twitter.com/mia_out/status/1149666605258829824
  • Tweet: #DH2019 Preserving deep maps? I’d talk to folk in web archiving for a sense of which issues re recording complex, multi-format, dynamic items are tricky and which are more solvable

Closing keynote: Digital Humanities — Complexities of Sustainability, Johanna Drucker

(By this point my laptop and mental batteries were drained so I just listened and tweeted. I was also taking part in a conversation about the environmental sustainability of travel for conferences, issues with access to visas and funding, etc, that might be alleviated by better incorporating talks from remote presenters, or even having everyone present online.)

Finally, the DH2020 conference is calling for reviewers. Reviewing is an excellent way to give something back to the DH community while learning about the latest work as it appears in proposals, and perhaps more importantly, learning how to write a good proposal yourself. Find out more: http://dh2020.adho.org/cfps/reviewers/

‘In search of the sweet spot: infrastructure at the intersection of cultural heritage and data science’

It’s not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I’ve shared mine here.

In search of the sweet spot: infrastructure at the intersection of cultural heritage and data science

Mia Ridge, British Library

My slides: https://www.slideshare.net/miaridge/in-search-of-the-sweet-spot-infrastructure-at-the-intersection-of-cultural-heritage-and-data-science

This paper explores some of the challenges and paradoxes in the application of data science methods to cultural heritage collections. It is drawn from long experience in the cultural heritage sector, predating but broadly aligned to the ‘OpenGLAM’ and ‘Collections as Data’ movements. Experiences that have shaped this thinking include providing open cultural data for computational use; creating APIs for catalogue and interpretive records, running hackathons, and helping cultural organisations think through the preparation of ‘collections as data’; and supervising undergraduate and MSc projects for students of computer science.

The opportunities are many. Cultural heritage institutions (aka GLAMS – galleries, libraries, archives and museums) hold diverse historical, scientific and creative works – images, printed and manuscript works, objects, audio or video – that could be turned into some form of digital ‘data’ for use in data science and digital humanities research. GLAM staff have expert knowledge about the collections and their value to researchers. Data scientists bring a rigour, specialist expertise and skills, and a fresh perspective to the study of cultural heritage collections.

While the quest to publish cultural heritage records and digital surrogates for use in data science is relatively new, the barriers within cultural organisations to creating suitable infrastructure with others are historically numerous. They include different expectations about the pace and urgency of work, different levels of technical expertise, resourcing and infrastructure, and different goals. They may even include different expectations about what ‘data’ is – metadata drawn from GLAM catalogues is the most readily available and shared data, but not only is this rarely complete, often untidy and inconsistent (being the work of decades or centuries and many hands over that time), it is also a far cry from datasets rich with images or transcribed text that data scientists might expect.

Copyright, data protection and commercial licensing can limit access to digitised materials (though this varies greatly). ‘Orphaned works’, where the rights holder cannot be traced in order to license the use of in-copyright works, mean that up to 40% of some collections, particularly sound or video collections, are unavailable for risk-free use. (2012)

While GLAMs have experimented with APIs, downloadable datasets and SPARQL endpoints, they rarely have the resources or institutional will to maintain and refresh these indefinitely. Records may be available through multi-national aggregators such as Europeana, DPLA, or national aggregators, but as aggregation often requires that metadata is mapped to the lowest common denominator, their value for research may be limited.

The area of overlap between ‘computationally interesting problems’ and ‘solutions useful for GLAMs’ may be smaller than expected to date, but collaboration between cultural institutions and data scientists on shared projects in the ‘sweet spot’ – where new data science methods are explored to enhance the discoverability of collections – may provide a way forward. Sector-wide collaborations like the International Image Interoperability Framework (IIIF, https://iiif.io/) provide modern models for lightweight but powerful standards. Pilot projects with students or others can help test the usability of collection data and infrastructure while exploring the applicability of emerging technologies and methods. It is early days for these collaborations, but the future is bright.

Panel overview

An excerpt from the longer panel description by David Beavan and Barbara McGillivray.

This panel highlights the emerging collaborations and opportunities between the fields of Digital Humanities (DH), Data Science (DS) and Artificial Intelligence (AI). It charts the enthusiastic progress of the Alan Turing Institute, the UK national institute for data science and artificial intelligence, as it engages with cultural heritage institutions and academics from arts, humanities and social sciences disciplines. We discuss the exciting work and learnings from various new activities, across a number of high-profile institutions. As these initiatives push the intellectual and computational boundaries, the panel considers both the gains, benefits, and complexities encountered. The panel latterly turns towards the future of such interdisciplinary working, considering how DS & DH collaborations can grow, with a view towards a manifesto. As Data Science grows globally, this panel session will stimulate new discussion and direction, to help ensure the fields grow together and arts & humanities remain a strong focus of DS & AI. Also so DH methods and practices continue to benefit from new developments in DS which will enable future research avenues and questions.

‘The Past, Present and Future of Digital Scholarship with Newspaper Collections’

It’s not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I’ve shared mine here. The panel was designed to bring together a range of interdisciplinary newspaper-based digital humanities and/or data science projects, with ‘provocations’ from two senior scholars who will provide context for current ambitions, and to start conversations among practitioners.

Short Paper: Living with Machines

Paper authors: Mia Ridge, Giovanni Colavizza with Ruth Ahnert, Claire Austin, David Beavan, Kaspar Beelens, Mariona Coll Ardanuy, Adam Farquhar, Emma Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray, André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.

My slides: https://www.slideshare.net/miaridge/living-with-machines-at-the-past-present-and-future-of-digital-scholarship-with-newspaper-collections-154700888

Living with Machines is a five-year interdisciplinary research project, whose ambition is to blend data science with historical enquiry to study the human impact of the industrial revolution. Set to be one of the biggest and most ambitious digital humanities research initiatives ever to launch in the UK, Living with Machines is developing a large-scale infrastructure to perform data analyses on a variety of historical sources, and in so doing provide vital insights into the debates and discussions taking place in response to today’s digital industrial revolution.

Seeking to make the most of a self-described ‘radical collaboration’, the project will iteratively develop research questions as computational linguists, historians, library curators and data scientists work on a shared corpus of digitised newspapers, books and biographical data (census, birth, death, marriage, etc. records). For example, in the process of answering historical research questions, the project could take advantage of access to expertise in computational linguistics to overcome issues with choosing unambiguous and temporally stable keywords for analysis, previously reported by others (Lansdall-Welfare et al., 2017). A key methodological objective of the project is to ‘translate’ history research questions into data models, in order to inspect and integrate them into historical narratives. In order to enable this process, a digital infrastructure is being collaboratively designed and developed, whose purpose is to marshal and interlink a variety of historical datasets, including newspapers, and allow for historians and data scientists to engage with them.

In this paper we will present our vision for Living with Machines, focusing on how we plan to approach it, and the ways in which digital infrastructure enables this multidisciplinary exchange. We will also showcase preliminary results from the different research ‘laboratories’, and detail the historical sources we plan to use within the project.

The Past, Present and Future of Digital Scholarship with Newspaper Collections

Mia Ridge (British Library), Giovanni Colavizza (Alan Turing Institute)

Historical newspapers are of interest to many humanities scholars, valued as sources of information and language closely tied to a particular time, social context and place. Following library and commercial microfilming and, more recently, digitisation projects, newspapers have been an accessible and valued source for researchers. The ability to use keyword searches through more data than ever before via digitised newspapers has transformed the work of researchers.[1]

Digitised historic newspapers are also of interest to many researchers who seek large bodies of relatively easily computationally-transcribed text on which they can try new methods and tools. Intensive digitisation over the past two decades has seen smaller-scale or repository-focused projects flourish in the Anglophone and European world (Holley, 2009; King, 2005; Neudecker et al., 2014). However, just as earlier scholarship was potentially over-reliant on The Times of London and other metropolitan dailies, this has been replicated and reinforced by digitisation projects (for a Canadian example, see Milligan 2013).

In recent years, several large consortia projects proposing to apply data science and computational methods to historical newspapers at scale have emerged, including NewsEye, impresso, Oceanic Exchanges and Living with Machines. This panel has been convened by some consortia members to cast a critical view on past and ongoing digital scholarship with newspaper collections, and to inform its future.

Digitisation can involve both complexities and simplifications. Knowledge about the imperfections of digitisation, cataloguing, corpus construction, text transcription and mining is rarely shared outside cultural institutions or projects. How can these imperfections and absences be made visible to users of digital repositories? Furthermore, how does the over-representation of some aspects of society through the successive winnowing and remediation of potential sources – from creation to collection, microfilming, preservation, licensing and digitisation – affect scholarship based on digitised newspapers? How can computational methods address some of these issues?

The panel proposes the following format: short papers will be delivered by existing projects working on large collections of historical newspapers, presenting their vision and results to date. Each project is at a different stage of development and will discuss its choice to work with newspapers, and reflect on what it has learnt to date on practical, methodological and user-focused aspects of this digital humanities work. The panel is additionally an opportunity to consider important questions of interoperability and legacy beyond the life of the project. Two further papers will follow, given by scholars with significant experience using these collections for research, in order to provide the panel with critical reflections. The floor will then open for debate and discussion.

This panel is a unique opportunity to bring senior scholars with a long perspective on the uses of newspapers in scholarship together with projects at formative stages. More broadly, convening this panel is an opportunity for the DH2019 community to ask their own questions of newspaper-based projects, and for researchers to map methodological similarities between projects. Our hope is that this panel will foster a community of practice around the topic and encourage discussions of the methodological and pedagogical implications of digital scholarship with newspapers.

[1] For an overview of the impact of keyword search on historical research see Putnam (2016) and Bingham (2010).

Updates from Digital Scholarship at the British Library

I’ve been posting on the work blog far more frequently than I have here. Launching and running In the Spotlight, crowdsourcing the transcription of the British Library’s historic playbills collection, was a focus in 2017-18. Some blog posts:

And a press release and newsletters:

Other updates from work, including a new project, information about the Digital Scholarship Reading Group I started, student projects, and an open data project I shepherded:

Cross-post: Seeking researchers to work on an ambitious data science and digital humanities project

I rarely post here at the moment, in part because I post on the work blog. Here’s a cross-post to help spread the word about some exciting opportunities currently available: Seeking researchers to work on an ambitious data science and digital humanities project at the British Library and Alan Turing Institute (London)

‘If you follow @BL_DigiSchol or #DigitalHumanities hashtags on twitter, you might have seen a burst of data science, history and digital humanities jobs being advertised. In this post, Dr Mia Ridge of the Library’s Digital Scholarship team provides some background to contextualise the jobs advertised with the ‘Living with Machines’ project.

We are seeking to appoint several new roles who will collaborate on an exciting new project developed by the British Library and The Alan Turing Institute, the national centre for data science and artificial intelligence.

Jobs currently advertised:

The British Library jobs are now advertised, closing September 21:

You may have noticed that the British Library is also currently advertising for a Curator, Newspaper Data (closes Sept 9). This isn’t related to Living with Machines, but with an approach of applying data-driven journalism and visualisation techniques to historical collections, it should have some lovely synergies and opportunities to share work in progress with the project team. There’s also a Research Software Engineer advertised that will work closely with many of the same British Library teams.

If you’re applying for these posts, you may want to check out the Library’s visions and values on the refreshed ‘Careers’ website.’

From piles of material to patchwork: How do we embed the production of usable collections data into library work?

These notes were prepared for a panel discussion at the ‘Always Already Computational: Collections as Data’ (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I’ve also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called ‘From libraries as patchwork to datasets as assemblages?’ but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I’m focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I’ve condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it’s not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. ‘Production’ is there because infrastructure and workflow are unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL’s collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there’s treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institutions deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the ‘gold standard’ cataloguing already in progress. If they’re made widely available, these other methods might help bridge the resourcing gaps that mean it’s easier to find items from richer institutions and countries than from poorer ones.

Photo of piles of material

You probably already all know this, but it’s worth remembering: our collections aren’t even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don’t yet have digital catalogue records. Some items can’t be digitised because they’re too big, small or fragile for scanning or photography; others can’t be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, ‘data’ may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements or any combination of these and more.)

Image credit: https://www.flickr.com/photos/teen_s/6251107713/

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it’s often as great big lumps of things. It’s hard for normal scholars to process (or just unzip) 4GB of data.

Currently, datasets are often created outside normal processes, and over time they become ‘stale’ as they’re not updated when source collections records change. And when scholars do manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.

The BL’s metadata team have experimented with ‘researcher format’ CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don’t mean SPARQL but rather faceted browsing or structured search forms).
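As a rough illustration of the kind of self-service query described above, here is a minimal Python sketch that filters catalogue records by structured fields and writes the results out as a small CSV dataset. The field names (`title`, `date`, `place`), the sample records and the `query` and `to_csv` helpers are all hypothetical, invented for this example rather than taken from any BL export.

```python
import csv
import io

# Hypothetical catalogue records; the field names are illustrative only.
RECORDS = [
    {"title": "Playbill for Theatre Royal", "date": "1823", "place": "Bristol"},
    {"title": "Map of the county", "date": "1790", "place": "York"},
    {"title": "Playbill for New Theatre", "date": "1841", "place": "Bristol"},
]


def query(records, **facets):
    """Return records whose fields match every facet (case-insensitive)."""
    return [
        record
        for record in records
        if all(
            record.get(field, "").lower() == value.lower()
            for field, value in facets.items()
        )
    ]


def to_csv(records, fieldnames):
    """Serialise records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# A researcher picks a facet value and gets their own dataset back.
bristol = query(RECORDS, place="bristol")
dataset = to_csv(bristol, ["title", "date", "place"])
print(dataset)
```

The point of the sketch is the shape of the interaction: a structured search form or faceted browser would supply the facet values, so nobody needs to write a query language by hand.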

Image credit: screenshot from http://data.bl.uk/

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out of text and images. Sometimes it’s operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it’s a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they’re finally recorded in a bespoke database.

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we’re to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by cataloguers, which by software, by external groups, etc.

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There’s a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.

A lot of this work has been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should shift to turning local references into URIs and making the most of that investment.

Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can’t easily record possible attributions, and there’s understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we’re not to be held back by the need for records to be perfectly complete before they’re published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don’t overly rely on our descriptions, making assumptions about the presence or absence of information when it’s not appropriate.
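To make the idea of qualified descriptors a little more concrete, here is a minimal Python sketch of a record structure that keeps several candidate links for one name, each with a qualifier and a note of who or what asserted it. The class names, the qualifier vocabulary (‘confirmed’, ‘probable’, ‘possible’) and the example.org URIs are all assumptions made for illustration; they don’t describe any existing cataloguing system.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """A possible link from a catalogue string to an external identifier."""
    uri: str        # identifier in an external authority (placeholder values below)
    qualifier: str  # hypothetical vocabulary: "confirmed", "probable", "possible"
    source: str     # who or what asserted it: "cataloguer", "software", "volunteer"


@dataclass
class NameReference:
    """A name as it appears in a record, plus any candidate links."""
    text: str
    links: list = field(default_factory=list)

    def confirmed(self):
        """Only the links that someone has marked as confirmed."""
        return [link for link in self.links if link.qualifier == "confirmed"]


# An ambiguous name can carry several candidate links without forcing a choice.
ref = NameReference("J. Smith")
ref.links.append(Link("https://example.org/authority/123", "possible", "software"))
ref.links.append(Link("https://example.org/authority/456", "probable", "cataloguer"))

for link in ref.links:
    print(f"{ref.text}: {link.qualifier} match {link.uri} (asserted by {link.source})")
```

Keeping the qualifier and the source alongside each link means a dataset can be published before every attribution is settled, and users can choose to filter on confirmed links only.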

Image credit: http://europeana.eu/portal/record/2021648/0180_N_31601.html

Photo of pipes over a building

A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL’s systems are designed around the ‘deliverable unit’ – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.

(I’d love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren’t yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357

My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We’re making a start and there’s a lot of thoughtful work behind the scenes, but maybe a bit more investment is needed from research libraries to become as comfortable with data users as they are with the readers who pass through their physical doors.

Trying computational data generation and entity extraction

I’ve developed this exercise on computational data generation and entity extraction for various information/data visualisation workshops I’ve been teaching lately. As these methods have become more accessible, my dataviz workshops have included more discussion of computational methods for generating data to be visualised. There are two versions of the exercise – the first works with images, the second with text.

In teaching I’ve found that services that describe images were more accessible and generated richer discussion in class than text-based sites, but it’s handy to have the option for people who work with text. If you try something like this in your classes I’d love to hear from you.

It’s also a chance to talk about the uses of these technologies in categorising and labelling our posts on social media. We can tell people that their social media posts are analysed for personality traits and mentions of brands, but seeing it in action is much more powerful.

Image exercise: trying computational data generation and entity extraction

Time: c. 5 – 10 minutes plus discussion.

Goal: explore methods for extracting information from text or an image and reflect on what the results tell you about the algorithms

1. Find a sample image

Find an image (e.g. from a news site or digitised text) you can download and drag into the window. It may be most convenient to save a copy to your desktop. Many sites let you load images from a URL, so right- or control-clicking to copy an image location for pasting into the site can be useful.

2. Work in your browser

It’s probably easiest to open each of these links in a new browser window. It’s best to use Firefox or Chrome, if you can. Safari and Internet Explorer may behave slightly differently on some sites. You should not need to register to use these sites – please read the tips below or ask for help if you get stuck.

  • Clarifai https://www.clarifai.com/demo – you can drag and drop, open the file explorer to find an image, or load one from a URL via the large ‘+’ in the bottom right-hand corner. You can adjust settings via the ‘Configure’ tab.
  • Google Cloud Vision API https://cloud.google.com/vision/ – don’t sign up, scroll down to the ‘Try the API’ box. Drag and drop your image on the box or click the box to open the file finder. You may need to go through the ‘I am not a robot’ process.
  • Microsoft Computer Vision API https://www.microsoft.com/cognitive-services/en-us/computer-vision-api – scroll to ‘Analyze an image’. You can use one of their sample images, paste a URL and hit ‘Submit’, or click on the ‘browse’ button to upload your own image.
  • IBM Watson Visual Recognition https://visual-recognition-demo.mybluemix.net/ – scroll to ‘Try the service’. Drag an image onto the grey box or click in the grey box to open the file finder. You can also load an image directly from a URL. (You can no longer try this without signing up so it doesn’t work for a quick exercise).
  • Blippar https://developer.blippar.com/portal/vs-api/index/#demoSection – scroll to the ‘Analyze any image’ section – the upload and URL options are below the sample images and tags
  • Caffe http://demo.caffe.berkeleyvision.org/ – provide a URL or upload an image

3. Review the outputs

Make notes, or discuss with your neighbour. Be prepared to report back to the group.

  • What attributes does each tool report on?
  • Which attributes, if any, were unique to a service?
  • Based on this, what do companies like Clarifai, Google, IBM and Microsoft seem to think is important to them (or to their users)?
  • How many of the possible entities (concepts, people, places, events, references to time or dates, etc) did it pick up?
  • Is any of the information presented useful?
  • Did it label anything incorrectly?
  • What options for exporting or saving the results did the demo offer? What about the underlying service or software?
  • For tools with configuration options – what could you configure? What difference did changing classifiers or other parameters make?
  • If you tried it with a few images, did it do better with some than others? Why might that be?

Text exercise: trying computational data generation and entity extraction

Time: c. 5 minutes plus discussion
Goal: explore the impact of source data and algorithms on input text

1. Grab some text

You will need some text for this exercise. The more ‘entities’ – people, places, dates, concepts – discussed, the better. If you have some text you’re working on handy, you can use that. If you’re stuck for inspiration, pick a front page story from an online news site. Keep the page open so you can copy a section of text to paste into the websites.

2. Compare text entity labelling websites

  • Open the links below in separate browser tabs or windows so you can easily compare the results.
  • Go to DBpedia Spotlight https://dbpedia-spotlight.github.io/demo/. Paste your copied text into the box, or keep the sample text in the box. Hit ‘Annotate’.
  • Go to Ontotext http://tag.ontotext.com/. You may need to click through the opening screen. Paste your copied text into the box. Hit ‘annotate’.
  • Finally, go to Stanford Named Entity Tagger http://nlp.stanford.edu:8080/ner/. Paste your text into the box. Hit ‘Submit query’.

3. Review the outputs

  • How many possible entities (concepts, people, places, events, references to time or dates, etc) did each tool pick up? Is any of the other information presented useful?
  • Did it label anything incorrectly?
  • What if you change classifiers or other parameters?
  • Does it do better with different source material?
  • What differences did you find between the tools? What do you think caused those differences?
  • How much can you find out about the tools and the algorithms they use to create labels?
  • Where does the data underlying the process come from?
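If you want to take the comparison beyond eyeballing browser tabs, the review questions above can be sketched in a few lines of Python. This assumes you’ve copied each tool’s output by hand into a set of (entity text, label) pairs. The tool names, entities and labels below are made up for illustration, and real tools use different label vocabularies that you’d need to map onto each other first.

```python
# Hypothetical results: each tool's output reduced to (entity text, label) pairs.
results = {
    "tool_a": {("London", "PLACE"), ("British Library", "ORGANISATION"), ("1851", "DATE")},
    "tool_b": {("London", "PLACE"), ("British Library", "PLACE")},
    "tool_c": {("London", "PLACE"), ("1851", "DATE")},
}


def entity_texts(pairs):
    """Drop the labels and keep just the entity strings."""
    return {text for text, _label in pairs}


# Entities that every tool found, regardless of how they were labelled.
found_by_all = set.intersection(*(entity_texts(pairs) for pairs in results.values()))

# Entities that at least two tools labelled differently.
labels = {}
for pairs in results.values():
    for text, label in pairs:
        labels.setdefault(text, set()).add(label)
disagreements = {text: kinds for text, kinds in labels.items() if len(kinds) > 1}

print("Found by all tools:", sorted(found_by_all))
print("Labelled differently:", {text: sorted(kinds) for text, kinds in disagreements.items()})
```

The disagreements are usually the most productive starting point for discussion, because they point back to differences in training data and classification schemes.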

Spoiler alert!

Clarifai’s image recognition tool with a historical image

Crowdsourcing workshop at DH2016 – session overview

A quick signal boost for the collaborative notes taken at the DH2016 Expert Workshop: Beyond The Basics: What Next For Crowdsourcing? (held in Kraków, Poland, on 12 July as part of the Digital Humanities 2016 conference, abstract below). We’d emphasised the need to document the unconference-style sessions (see FAQ) so that future projects could benefit from the collective experiences of participants. Since it can be impossible to find Google Docs or past tweets, I’ve copied the session overview below. The text is a summary of key takeaways or topics discussed in each session, created in a plenary session at the end of the workshop.

Participant introductions and interests – live notes
Ethics, Labour, sensitive material

Key takeaway – questions for projects to ask at the start; don’t impose your own ethics on a project, discussing them is start of designing the project.

Where to start
Engaging volunteers, tips including online communities, being open to levels of contribution, being flexible, setting up standards, quality
Workflow, lifecycle, platforms
What people were up to, the problems with hacking systems together, iiif.io, flexibility and workflows
Public expertise, education, what’s unique to humanities crowdsourcing
The humanities are contestable! Responsibility to give the public back the results of the process in re-usable
Options, schemas and goals for text encoding
Encoding systems will depend on your goals; full-text transcription always has some form of encoding, data models – who decides what it is, and when? Then how are people guided to use it? Trying to avoid short-term solutions
UX, flow, motivation
Making tasks as small as possible; creating a sense of contribution; creating a space for volunteers to communicate; potential rewards, issues like badgefication and individual preferences. Supporting unexpected contributions; larger-scale tasks
Project scale – thinking ahead to ending projects technically, and in terms of community – where can life continue after your project ends
Finding and engaging volunteers
Using social media, reliance on personal networks, super-transcribers, problematic individuals who took more time than they gave to the project. Successful strategies are very project-dependent. Something about beer (production of Itinera Nova beer with label containing info on the project and link to website).
Ecosystems and automatic transcription
Makes sense for some projects, but not all – value in having people engage with the text. Ecosystem – depending on goals, which parts work better? Also as publication – editions, corpora – credit, copyright, intellectual property
Plenary session, possible next steps – put information into a wiki. Based around project lifecycle, critical points? Publication in an online journal? Updateable, short-ish case studies. Could be categorised by different attributes. Flexible, allows for pace of change. Illustrate principles, various challenges.

Short-term action: post introductions, project updates and new blog posts, research, etc to https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CROWDSOURCING – a central place to send new conference papers, project blog posts, questions, meet-ups.

The workshop abstract:

Crowdsourcing – asking the public to help with inherently rewarding tasks that contribute to a shared, significant goal or research interest related to cultural heritage collections or knowledge – is reasonably well established in the humanities and cultural heritage sector. The success of projects such as Transcribe Bentham, Old Weather and the Smithsonian Transcription Center in processing content and engaging participants, and the subsequent development of crowdsourcing platforms that make launching a project easier, have increased interest in this area. While emerging best practices have been documented in a growing body of scholarship, including a recent report from the Crowd Consortium for Libraries and Archives symposium, this workshop looks to the next 5 – 10 years of crowdsourcing in the humanities, the sciences and in cultural heritage. The workshop will gather international experts and senior project staff to document the lessons to be learnt from projects to date and to discuss issues we expect to be important in the future.

Photo by Digital Humanities ‏@DH_Western

The workshop is organised by Mia Ridge (British Library), Meghan Ferriter (Smithsonian Transcription Center), Christy Henshaw (Wellcome Library) and Ben Brumfield (FromThePage).

If you’re new to crowdsourcing, here’s a reading list created for another event.


Network visualisations and the ‘so what?’ problem

This week I was in Luxembourg for a workshop on Network Visualisation in the Cultural Heritage Sector, organised by Marten Düring and held on the Belval campus of the University of Luxembourg.

In my presentation, I responded to some of the questions posed in the workshop outline:

In this workshop we want to explore how network visualisations and infrastructures will change the research and outreach activities of cultural heritage professionals and historians. Among the questions we seek to discuss during the workshop are for example: How do users benefit from graphs and their visualisation? Which skills do we expect from our users? What can we teach them? Are SNA [social network analysis] theories and methods relevant for public-facing applications? How do graph-based applications shape a user’s perception of the documents/objects which constitute the data? How can applications benefit from user engagement? How can applications expand and tap into other resources?

A rough version of my talk notes is below. The original slides are also online.



While I may show examples of individual network visualisations, this talk isn’t a critique of them in particular. There’s lots of good practice around, and these lessons probably aren’t needed for people in the room.

Fundamentally, I think network visualisations can be useful for research, but to make them more effective tools for outreach, some challenges should be addressed.


I’m a Digital Curator at the British Library, mostly working with pre-1900 collections of manuscripts, printed material, maps, etc. Part of my job is to help people get access to our digital collections. Visualisations are a great way to firstly help people get a sense of what’s available, and then to understand the collections in more depth.

I’ve been teaching versions of an ‘information visualisation 101’ course at the BL and digital humanities workshops since 2013. Much of what I’m saying now is based on comments and feedback I get when presenting network visualisations to academics and cultural heritage staff (who should be a key audience for social network analyses).

Provocation: digital humanists love network visualisations, but ordinary people say, ‘so what’?

And this is a problem. We’re not conveying what we’re hoping to convey.

Network visualisation, via http://fredbenenson.com/

When teaching datavis, I give people time to explore examples like this, then ask questions like ‘Can you tell what is being measured or described? What do the relationships mean?’. After talking about the pros and cons of network visualisations, discussion often reaches a ‘yes, but so what?’ moment.

Here are some examples of problems ordinary people have with network visualisations…

Location matters

Spatial layout is often driven by the pragmatics of fitting something on the screen, using simulated physics (rules of attraction and repulsion), and that doesn’t match what people expect to see. It’s really hard for some to let go of the idea that spatial layout has meaning. The idea that location on a page has meaning of some kind is very deeply linked to their sense of what a visualisation is.

Animated physics is … pointless?

People sometimes like the sproinginess when a network visualisation resettles after a node has been dragged, but waiting for the animation to finish can also be slow and irritating. Does it convey meaning? If not, why is it there?

Size, weight, colour = meaning?

The relationship between size, colour and weight and what they represent isn’t always intuitive – people assume meaning where there might be none.

In general, network visualisations are more abstract than people expect a visualisation to be.

‘What does this tell me that I couldn’t learn as quickly from a sentence, list or table?’

Table of data, via http://fredbenenson.com/

Scroll down the page that contains the network graph above and you get other visualisations. Sometimes they’re much more positively received, particularly when people feel they learn more from them than from the network visualisation.

Onto other issues with ‘network visualisations as communication’…

Which algorithmic choices are significant?

screenshot of network graphs
Mike Bostock’s force-directed and curved line versions of character co-occurrence in Les Misérables

It’s hard for novices to know which algorithmic and data-cleaning choices are significant, and which have a more superficial impact.

Untethered images

Images travel extremely well on social media. When they do so, they often leave information behind and end up floating in space. Who created this, and why? What world view does it represent? What source material underlies it, how was it manipulated to produce the image? Can I trust it?

‘Can’t see the wood for the trees’

viral texts

When I showed this to a class recently, one participant was frustrated that they couldn’t ‘see the wood for the trees’. The visualisation gives a general impression of density, but it’s not easy to dive deeper into detail.

Stories vs hairballs

But when I started to explain what was being represented – the ways in which stories were copied from one newspaper to another – they were fascinated. They might have found their way there if they’d read the text but again, the visualisation is so abstract that it didn’t hint at what lay underneath. (Also I have only very, very rarely seen someone stop to read the text before playing with a visualisation.)

No sense of change over time

This flattening of time into one simultaneous moment matters more for historical networks than for literary ones, but even so, you might want to compare relationships between sections of a literary work.

No sense of texture, detail of sources

All network visualisations look similar, whether they’re about historical texts or cans of baked beans. Dots and lines mask texture, and don’t always hint at the depth of information they represent.


Node. Edge. Graph. Directed, undirected. Betweenness. Closeness. Eccentricity.

There’s a lot to take on to really understand what’s being expressed in a network graph.
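To make a few of those terms concrete, here is a minimal sketch in plain Python. The five-person network and its names are invented for illustration, and the functions assume a small, connected, undirected and unweighted graph; a real project would more likely use an established network analysis library.

```python
from collections import deque

# Hypothetical toy network: who corresponded with whom (names are invented).
EDGES = [("Ada", "Ben"), ("Ben", "Cy"), ("Cy", "Dee"), ("Ben", "Dee"), ("Dee", "Eve")]


def build_graph(edges):
    """Return an adjacency dict {node: set of neighbours} for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, start):
    """Breadth-first search: number of edges on the shortest path to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def eccentricity(graph, node):
    """Distance from this node to the node furthest away from it."""
    return max(distances_from(graph, node).values())


def closeness(graph, node):
    """(Number of other nodes) divided by the sum of distances to them."""
    dist = distances_from(graph, node)
    return (len(dist) - 1) / sum(dist.values())


def betweenness(graph):
    """Unnormalised betweenness: how many shortest paths pass through each node.

    Uses Brandes' algorithm for unweighted graphs.
    """
    score = {n: 0.0 for n in graph}
    for s in graph:
        order, dist = [], {s: 0}
        preds = {n: [] for n in graph}
        paths = {n: 0 for n in graph}
        paths[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is counted from both ends, so halve.
    return {n: total / 2 for n, total in score.items()}


graph = build_graph(EDGES)
for name in sorted(graph):
    print(name, eccentricity(graph, name), round(closeness(graph, name), 2),
          betweenness(graph)[name])
```

In this toy network Ben and Dee sit on every shortest path between the other people, which is what a high betweenness score is meant to express. Even so, the numbers only mean something once you know what an edge stands for.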

There is some hope…

Onto the positive bit!

Interactivity is engaging

People find the interactive movement and the ability to zoom and highlight links engaging, even if they have no idea what’s being expressed. In class, people started to come up with questions about the data as I told them more about what was represented. That moment of curiosity is an opportunity, if they can dive in and start to explore what’s going on and what the relationships mean.

…but different users have different interaction needs

For some, there’s the frustration expressed earlier that they ‘can’t get to see a particular tree’ in the dense woods of a network visualisation. People often want to get to the detail of an instance of a relationship – the lines of text, images of the original document – from a graph.

This mightn’t be how network visualisations are used in research, but it’s something to consider for public-facing visualisations. How can we connect abstract lines or dots to detail, provide more information about what a relationship means, or show the quantification expressed as people highlight or filter parts of a graph? A harder, but more interesting, task is hinting at the texture or detail of those relationships.
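One possible way to keep lines tethered to detail is to store each edge together with the source passages it was derived from, so that an interface can show them when someone highlights the line. The sketch below is only an illustration of that idea: the class names, fields and newspaper examples are all hypothetical, not taken from any particular project or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One source passage that supports a relationship (all fields hypothetical)."""
    source_title: str
    date: str
    excerpt: str


@dataclass
class Edge:
    """A relationship between two nodes, plus the documents it was derived from."""
    source: str
    target: str
    relation: str
    evidence: list = field(default_factory=list)

    @property
    def weight(self):
        # The line thickness a viewer sees is just a count of supporting passages.
        return len(self.evidence)

    def describe(self):
        """Plain-language summary to show when someone highlights this edge."""
        lines = [f"{self.source} {self.relation} {self.target} "
                 f"(supporting passages: {self.weight})"]
        for item in self.evidence:
            lines.append(f'  {item.date}, {item.source_title}: "{item.excerpt}"')
        return "\n".join(lines)


# Invented example data: two reprints linking two made-up newspapers.
edge = Edge("Newspaper B", "Newspaper A", "reprinted a story from")
edge.evidence.append(Evidence("Newspaper B, page 2", "1850-03-01", "opening line of the first story"))
edge.evidence.append(Evidence("Newspaper B, page 4", "1850-06-12", "opening line of the second story"))
print(edge.describe())
```

The design choice here is that the weight is never stored on its own: it is always derived from the list of passages, so the quantification and the detail behind it cannot drift apart.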

Proceed, with caution

One of the workshop questions was ‘Are social network analysis theories and methods relevant for public-facing applications?’ – and maybe the answer is a qualified yes. As a working tool, they’re great for generating hypotheses, but they need a lot more care before exposing them to the public.

[As an aside, I’d always taken the difference between visualisations as working tools for exploring data – part of the process of investigating a research question – and visualisation as an output – a product of the process, designed for explanation rather than exploration – as fundamental, but maybe we need to make that distinction more explicit.]

But first – who are your ‘users’?

During this workshop, at different points we may be talking about different ‘users’ – it’s useful to scope who we mean at any given point. In this presentation, I was talking about end users who encounter visualisations, not scholars who may be organising and visualising networks for analysis.

Sometimes a network visualisation isn’t the answer … even if it was part of the question.

As an outcome of an exploratory process, network visualisations are not necessarily the best way to present the final product. Be disciplined – make yourself justify the choice to use network visualisations.

No more untethered images

Include an extended caption – data source, tools and algorithms used. Provide a link to find out more – why this data, this form? What was interesting but not easily visualised? Let people download the dataset to explore themselves?
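As a rough sketch of what an extended caption might carry, the record below bundles the caption fields suggested above and renders them as a sentence that can travel with the image. The field names, example values and example.org links are placeholders I have made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VisualisationCaption:
    """Context that should travel with an image (field names are hypothetical)."""
    title: str
    created_by: str
    data_source: str
    tools: str
    layout_algorithm: str
    more_info_url: str
    dataset_url: str = ""  # optional: where people can download the data

    def render(self):
        """Build a single caption string from the fields that are filled in."""
        parts = [
            f"{self.title}.",
            f"Created by {self.created_by}.",
            f"Data: {self.data_source}.",
            f"Made with {self.tools}, using a {self.layout_algorithm} layout.",
            f"Find out more: {self.more_info_url}",
        ]
        if self.dataset_url:
            parts.append(f"Download the data: {self.dataset_url}")
        return " ".join(parts)


# Placeholder values only; none of these describe a real dataset or project.
caption = VisualisationCaption(
    title="Letters between correspondents",
    created_by="Example Project Team",
    data_source="a hypothetical catalogue of letters",
    tools="an example graph tool",
    layout_algorithm="force-directed",
    more_info_url="https://example.org/about",
    dataset_url="https://example.org/data.csv",
)
print(caption.render())
```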

Present visualisations as the tip of the data iceberg

Visualisations are the tip of the iceberg

Lots of interesting data doesn’t make it into a visualisation. Talking about what isn’t included and why it was left out is important context.

Talk about data that couldn’t exist

Beyond the (fuzzy, incomplete, messy) data that’s left out because it’s hard to visualise, data that never existed in the first place is also important:

‘because we’re only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.’
Scott Weingart, ‘Networks Demystified 8: When Networks are Inappropriate’

Help users learn the skills and knowledge they need to interpret network visualisations in context.

How? Good question! This is the point at which I hand over to you…