57 Varieties of Digital History? Towards the future of looking at the past

Back in November 2015, Tara Andrews invited me to give a guest lecture on 'digital history' for the Introduction to Digital Humanities course at the University of Bern, where she was then a professor. This is a slightly shortened version of my talk notes, finally posted in 2024 as I go back to thinking about what 'digital history' actually is.

Illustration of a tin of Heinz Cream of Tomato Soup, '57 varieties'

I called my talk '57 varieties of digital history' as a play on the number of activities and outputs called 'digital history'. While digital history and digital humanities are often linked and have many methods in common, digital history also draws on the use of computers for quantitative work, and digitisation projects undertaken in museums, libraries, archives and academia. Digital tools have enhanced many of the tasks in the research process (which itself has many stages – I find the University of Minnesota Libraries' model with stages of 'discovering', 'gathering', 'creating' and 'sharing' useful), but at the moment the underlying processes often remain the same.

So, what is digital history?

…using computers for writing, publishing

A historian on Twitter once told me about a colleague who said they were doing digital history because they were using PowerPoint. On reflection, I think they have a point. These simple tools might be linked to fairly traditional scholarship – writing journal articles or creating presentations – but text created in them is infinitely quotable, shareable and searchable, unlike the more inert paper equivalents. Many scholars use Word documents to keep bits of text they've transcribed from historical source materials, or to keep track of information from other articles or books. These become part of their personal research collections, which can build up over years into substantial resources in their own right. Even 'helper' applications like reference managers such as Zotero or EndNote can free up significant amounts of time that can then be devoted to research.

…the study of computers

When some people hear 'digital history', they imagine that it's the study of computers, rather than the use of digital methods by historians. While this isn't a serious definition of digital history, it's a reminder that viewing digital tools through a history of science and technology lens can be fruitful.

…using digitised material

Digitisation takes many forms, including creating or transcribing catalogue records about heritage collections, writing full descriptions of items, and making digital images of books, manuscripts, artworks etc. Metadata – information about the item, such as when and where it was made – is the minimum required to make collections discoverable. Increasingly, new forms of photography may be applied to particular types of objects to capture more information than the naked eye can see. Text may be transcribed, place names mapped, marginalia annotated and more.

The availability of free (or comparatively inexpensive) historical records through heritage institutions and related commercial or grassroots projects means we can access historical material without having to work around physical locations and opening hours, negotiate entry to archives (some of which require users to be 'bona fide scholars'), or navigate unknown etiquettes. Text transcription allows readers who lack the skills to read manuscript or hand-written documents to make use of these resources, as well as making the text searchable.

For some historians, this is about as digital as they want to get. They're very happy with being able to access more material more conveniently; their research methods and questions remain largely unchanged.

…creating digital repositories

Most digitised items live in some broader system that aggregates and presents material from a particular institution, or related to a particular topic. While some digital repositories are based on sub-sets of official institutional collections, most aren't traditional 'archives'. One archivist describes digital repositories as a 'purposeful collection of surrogates'.

Repositories aren't always created by big, funded projects. Personal research collections assembled over time are one form of ad hoc repository – they may contain material from many different archives collected by one researcher over a number of years.

Themed collections may be the result of large, scholarly projects with formal partners who've agreed to contribute material about a particular time, place, group in society or topic. They might also be the result of work by a local history society with volunteers who digitise material and share it online.

'Commons' projects (like Flickr or Wikimedia Commons) tend to be less focused – they might contain collections from specific institutions, but these specific collections are aggregated into the whole repository, where their identity (and the provenance of individual items) may be subsumed. While 'commons' platforms technically enable sharing, the cultural practices around sharing are yet to change, particularly for academic historians and many cultural institutions.

Repositories can provide different functionality. In some 'scholarly workbenches' you can collect and annotate material; in others you can bookmark records or download images. They also support different levels of access: some allow you to download and re-use material without restriction, some only allow non-commercial use, and some are behind paywalls.

…creating datasets

The Old Bailey Online project has digitised the proceedings of the Old Bailey, making court cases from 1674 to 1913 available online. They haven't just transcribed text from digital images; they've also added structure to the text. For example, the defendant's name, the crime they were accused of and the victim's name have all been tagged. The addition of this structure means that the material can be studied as text, or analysed statistically.
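
To make that concrete, here's a minimal Python sketch of why tagged structure matters. The element and attribute names are invented for illustration; this is not the Old Bailey Online's actual markup scheme.

```python
# Illustrative only: invented element names, not the Old Bailey Online schema.
import xml.etree.ElementTree as ET
from collections import Counter

record = """
<trial id="t16740117-1" year="1674">
  <defendant>Mary Jones</defendant>
  <offence category="theft">grand larceny</offence>
  <victim>Thomas Smith</victim>
  <verdict>guilty</verdict>
</trial>
"""

trial = ET.fromstring(record)

# Because roles are tagged, we can pull them out by name
# rather than guessing from word order in the prose.
print(trial.findtext("defendant"), "accused of", trial.find("offence").get("category"))

# With many such records parsed, simple statistics become possible.
offences = Counter()
for t in [trial]:  # imagine thousands of parsed trials here
    offences[t.find("offence").get("category")] += 1
print(offences.most_common())
```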

Adding structure to data can enable innovative research activities. If the markup is well-designed, it can support the exploration of questions that were not envisaged when the data was created. Adding structure to other datasets may become less resource-intensive as new computational techniques become available.

…creating visualisations and innovative interfaces

Some people or projects create specialist interfaces to help people explore their datasets. They might be maps or timelines that help people understand the scope of a collection in time and place, while others are more interpretive, presenting a scholarly argument through their arrangement of interface elements, the material they have assembled, the labels they use and the search or browse queries they support. Ideally, these interfaces should provide access to the original records underlying the visualisation so that scholars can investigate potential new research questions that arise from their use of the interface.

…creating linked data (going from strings to things)

As well as marking up records with information like 'this bit is a defendant's name', we can also link a particular person's name to other records about them online. One way to do this is to link their name to published authority lists, which provide stable identifiers. These identifiers mean that we can link any mention of a particular person in a text to the same online identifier, so that 'Captain Cook' and 'James Cook' are understood to be different strings about the same person.

A screenshot of structured data on the dbpedia site e.g. dbo:birthPlace = 1728-01-01
dbpedia page for 'James Cook', 2015

This also helps create a layer of semantic meaning about these strings of text. Software can learn that strings that represent people can have relationships with other things – in this case, historical voyages, other people, natural history and ethnographic collections, and historical events.
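
As an illustration, here's a small Python sketch that asks DBpedia's public SPARQL endpoint about the 'thing' behind the string 'James Cook'. It assumes the SPARQLWrapper package and DBpedia's ontology property names; treat it as a sketch rather than a recipe.

```python
# A sketch of 'strings to things': query DBpedia for facts about the resource
# dbr:James_Cook. Requires: pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?birthDate ?birthPlace WHERE {
      dbr:James_Cook dbo:birthDate ?birthDate ;
                     dbo:birthPlace ?birthPlace .
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birthDate"]["value"], row["birthPlace"]["value"])
```

Because the identifier is shared, any project that links 'Captain Cook' or 'James Cook' to the same URI can be connected to these facts.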

…applying computational methods and tools to digitised sources

So far some of what we've seen has been heavily reliant on manual processing – someone has had to sit at a desk and decide which bit of text is about the defendant and which about the victim in an Old Bailey case.

So people are developing software algorithms to find concepts – people, places, events, etc – within text. This is partly a response to the amount of digitised text now available, and partly a recognition of the power of structured data. Techniques like 'named entity recognition' help create structure from unstructured data. This allows data to be queried, contextualised and presented in more powerful ways.
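
To illustrate, the snippet below uses spaCy (one named entity recognition tool among many; the talk doesn't prescribe a tool) to pull people, places and dates out of a sentence.

```python
# A sketch of named entity recognition with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Captain James Cook sailed from Plymouth on the Endeavour in August 1768.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # labels such as PERSON, GPE (place), DATE

# Models trained on modern news text will mislabel some historical entities,
# which is the training-data problem discussed below.
```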

The named entity recognition software here [screenshot lost?] knows some things about the world – the names of places, people, dates, some organisations. It also gets lots of things wrong – it doesn't understand 'category five storm' as a concept, it mixes up people and organisations – but as a first pass, it has potential. Software can be trained to understand the kinds of concepts and things that occur in particular datasets. This also presents a problem for historians, who may have to use software trained for modern, commercial data.

This is part of a wider exploration of 'distant reading', methods for understanding what's in a corpus by processing the text en masse rather than by reading each individual novel or document. For example, it might be used to find linguistic differences between genres of literature, or between authors from different countries.

In this example [screenshot of topic modelling lost?], statistically unlikely combinations of words have been grouped together into 'topics'. This provides a form of summary of the contents of text files.
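
For a sense of how this works mechanically, here's a tiny, illustrative topic-modelling sketch using gensim's LDA implementation; a real corpus would need far more text and preprocessing.

```python
# Illustrative topic modelling with a toy corpus (already tokenised).
# Requires: pip install gensim
from gensim import corpora, models

texts = [
    ["ship", "voyage", "captain", "harbour", "cargo"],
    ["mill", "cotton", "factory", "steam", "worker"],
    ["ship", "captain", "storm", "harbour"],
    ["factory", "steam", "engine", "worker", "wages"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=1)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # statistically likely word groupings = 'topics'
```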

Image tagging – 'machine learning' techniques mean that software can learn how to do things rather than having to be precisely programmed in advance. This will have more impact on the future of digital history as these techniques become mainstream.

Audio tagging – software suggests tags, humans verify them. Quicker than doing them from scratch, but it's possible for software to miss significant moments that a person would spot (e.g. famous voices, cultural references, etc.).

Handwritten text recognition will transform manuscript sources as much as optical character recognition has transformed typed sources!

…studying born-digital material (web archives, social media corpora, etc.)

Important historical moments, such as the 'Arab spring', happened on social media platforms like Twitter, YouTube and Facebook. The British Library and the Internet Archive have various 'snapshots' of websites, but they can only hope to capture a part of online material. We've already lost significant chunks of web history – every time a social media platform is shut down without being archived, future historians lose valuable data. (Not to mention people's personal data losses.)

This also raises questions about how we should study 'digital material culture'. Websites like Facebook really only make sense when they're used in a social context. The interaction design of 'likes' and comments, the way a newsfeed is constructed in seconds based on a tiny part of everything done in your network – these are hard to study as a series of static screenshots or data dumps.

…sharing history online

Sharing research outputs is great. At some point it starts to intersect with public history. But questions remain about 'broadcast' vs 'discursive' modes of public history – could we do more than model old formats online? Websites and social media can be just as one-way broadcast as television unless they're designed for two-way participation.

What's missing?

Are there other research objects or questions that should be included under the heading 'digital history'? [A question to allow for discussion time]

Towards the future of looking at the past

To sum up what we've seen so far – we've seen the transformation of unorganised, unprocessed data into 'information' through research activities like 'classification, rearranging/sorting, aggregating, performing calculations, and selection'.

Historical material is being transformed from a 'page' to a 'dataset'. As some of this process is automated, it raises new questions – how do we balance the convenience of automatic processing with the responsibility to review and verify the results? How do we convey the processes that went into creating a dataset so that another researcher can understand its gaps, the mixture of algorithmic and expert processes applied to it? My work at the British Library has made the importance of versioning a dataset or corpus clear – if a historian bases an argument on one version of OCR text, and the next version is better, they should be able to link to the version they based their work on.

We've thought about how digital text and media allows for new forms of analysis, using methods such as data visualisation, topic modelling or data mining. These methods can yield new insights and provoke new research questions, but most are not yet accessible to the ordinary historian. While automated processes help, preparing data for digital history is still incredibly detailed, time-consuming work.

What are the pros and cons of the forms of digital history discussed?

Cons

The ability to locate records on consumer-facing services like Google Maps is valuable, but commercial, general use mapping tools are not always suitable for historical data, which is often fuzzy, messy, and of highly variable coverage and precision. For example, placing text or points on maps can suggest a degree of certainty not supported by the data. Locating historical addresses can be inherently uncertain in instances where street numbers were not yet in use, but most systems expect a location to be placed as a precise dot (point of interest) on a map; drawing a line to mark a location would at least allow the length of a street to be marked as a possible address.
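
One possible approach, sketched below with invented coordinates and property names, is to record such an address as a street-length line with an explicit statement of its uncertainty, rather than as a single point. The structure follows GeoJSON; everything else is illustrative.

```python
# A sketch of recording an uncertain historical address as a street-length
# line rather than a falsely precise point (coordinates are invented).
import json

feature = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",  # the whole street, not a single dot
        "coordinates": [[-0.1045, 51.5205], [-0.0998, 51.5213]],
    },
    "properties": {
        "label": "J. Smith, Cheapside (no street number recorded)",
        "date": "1842",
        "certainty": "somewhere along this street",
    },
}

print(json.dumps(feature, indent=2))
```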

There is an unmet need for everyday geospatial tools suitable for historians. For example, those with datasets containing historical locations would appreciate the ability to map addresses from specific periods on historical maps that are georeferenced and georectified, and displayable over a modern, copyright-free base map or the historical map itself. Similarly, biographical software, particularly when used for family history or for collaborative prosopographical and community history projects, would benefit from the ability to record the degree of certainty for potential-but-not-yet-proven relationships or identifications, and to link uncertain information to specific individuals.

The complexity of some software packages (or the combination of packages assembled to meet various needs) is a barrier for those short on time, unable to access dedicated support or training, or who do not feel capable of learning the specialist jargon and skills required to assess and procure software to meet their needs. The need for equipment and software licences can be a financial barrier; unclear licensing requirements and costs for purchasing high-resolution historical maps are another. Copyright and licensing are also complex issues.

Sensible historians worry about the sustainability of digital sites – their personal research collection might be around for 30 years or more, and they want to cite material that will still be findable later.

There are issues with representing historical data, particularly in modern tools that cannot represent uncertainty or contingency. Here [screenshot lost?] the curator's necessarily fuzzy label of 'early 17th century' has been assigned a falsely precise date. Many digital tools are not (yet) suitable for historical data: their abilities have been over-stated, or their limits not clearly communicated or understood.
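
A sketch of the alternative: store the curator's label alongside an explicit earliest/latest range, so that interfaces can draw a span rather than a falsely precise point. (The Extended Date/Time Format offers standard notation for this, e.g. '1600/1630'.) The class below is purely illustrative.

```python
# Keep the fuzziness explicit: a label plus an earliest/latest year range.
from dataclasses import dataclass

@dataclass
class FuzzyDate:
    label: str      # what the curator actually wrote
    earliest: int   # earliest plausible year
    latest: int     # latest plausible year

made = FuzzyDate(label="early 17th century", earliest=1600, latest=1630)

# A timeline can then draw a bar from 1600 to 1630 instead of a dot at 1600.
print(f"{made.label}: {made.earliest}-{made.latest}")
```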

Very few peer-reviewed journals are able to host formats other than articles, inhibiting historians' ability to explore emerging digital formats for presenting research.

Faculty historians might dream of creating digital projects tailored for the specific requirements of their historical dataset, research question and audience, but their peers may not be confident in their ability to evaluate the results and assign credit appropriately.

Pros

Material can be contextualised, recontextualised, transcluded and linked. The distance between a reference and the original item is reduced to just a link (unless a paywall or similar gets in the way). Material can be organised in multiple ways, independent of its physical location. Digital tools can represent multiple commentaries or assertions on a single image or document through linked annotations.

Computational techniques for processing data could reduce the gap between well-funded projects and others, thereby reducing the likelihood of digital history projects reinscribing the canon.

Digitised resources have made it easier to write histories of ordinary lives. You can search through multiple databases to quickly collate biographical information (births, deaths, marriages, etc.) and other instances where a person's existence might be documented. This isn't just a change in speed, but also in the accessibility of resources without travel or expense.

Screenshot of an IIIF viewer showing search results highlighted on a digitised historical text
Wellcome's IIIF viewer showing a highlighted search result

Search – any word in a digitised text can be a search result – we're not limited to keywords in a catalogue record. We can also discover some historical material via general search engines. Phonetic and fuzzy searches have also improved the ability to discover sources.
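
To illustrate, the snippet below uses the jellyfish library (my choice for the example; the post doesn't name a tool) to compare variant name spellings phonetically and by edit distance.

```python
# Phonetic and fuzzy matching of variant name spellings.
# Requires: pip install jellyfish
import jellyfish

names = ["Smith", "Smyth", "Smythe", "Schmidt"]
query = "Smith"

for name in names:
    same_sound = jellyfish.soundex(name) == jellyfish.soundex(query)
    distance = jellyfish.levenshtein_distance(name, query)
    print(f"{name:8} soundex match: {same_sound}  edit distance: {distance}")
```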

Historians like Professor Katrina Navickas have shown new models for the division of labour between people and software; previously most historical data collection and processing was painstakingly done by historians. She and others have shown how digital techniques can be applied to digitised sources in the pursuit of a historical research question.

Conclusion and questions: digital history, digital historiography?

The future is here, it's just not evenly distributed (this is the downer bit)

Academic historians might find it difficult to explore new forms of digital creation if they are hindered by the difficulties of collaborating on interdisciplinary digital projects and their need for credit and attribution when publishing data or research. More advanced forms of digital history also require access to technical expertise. While historians should know the basics of computational thinking, most will not be able to train as both a programmer and a historian – how much should we expect people to know about making software?

I've hinted at the impact of convenience in accessing digitised historical materials, and in those various stages of 'discovering', 'gathering', 'creating' and 'sharing'… We must also consider how experiences of digital technologies have influenced our understanding of what is possible in historical research, and the factors that limit the impact of digital technologies. The ease with which historians transform data from text notes to spreadsheets to maps to publications and presentations is almost taken for granted, but it shows the impact of digitality on enhancing everyday research practices.

So digital history has potential, is being demonstrated, but there's more to do…

Toddlers to teenagers: AI and libraries in 2023

A copy of my April 2023 position paper for the Collections as Data: State of the field and future directions summit held at the Internet Archive in Vancouver in April 2023. The full set of statements is available on Zenodo at Position Statements -> Collections as Data: State of the field and future directions. It'll be interesting to see how this post ages. I have a new favourite metaphor since I wrote this – the 'brilliant, hard-working — and occasionally hungover — [medical] intern'.

A light brown historical building with columns and steps. The building is small but grand. A modern skyscraper looms in the background.
The Internet Archive building in Vancouver

My favourite analogy for AI / machine learning-based tools[1] is that they’re like working with a child. They can spin a great story, but you wouldn’t bet your job on it being accurate. They can do tasks like sorting and labelling images, but as they absorb models of the world from the adults around them you’d want to check that they haven’t mistakenly learnt things like ‘nurses are women and doctors are men’.

Libraries and other GLAMs have been working with machine learning-based tools for a number of years, cumulatively gathering evidence for what works, what doesn’t, and what it might mean for our work. AI can scale up tasks like transcription, translation, classification, entity recognition and summarisation quickly – but it shouldn’t be used without supervision if the answer to the question ‘does it matter if the output is true?’ is ‘yes’.[2] Training a model and checking the results of an external model both require resources and expertise that may be scarce in GLAMs.

But the thing about toddlers is that they’re cute and fun to play with. By the start of 2023, ‘generative AI’ tools like the text-to-image tool DALL·E 2 and large language models (LLMs) like ChatGPT had captured the public imagination. You’ve probably heard examples of people using LLMs as everything from an oracle (‘give me arguments for and against remodelling our kitchen’) to a tutor (‘explain this concept to me’) to a creative spark for getting started with writing code or a piece of text. If you don’t have an AI strategy already, you’re going to need one soon.

The other thing about toddlers is that they grow up fast. GLAMs have an opportunity to help influence the types of teenagers then adults they become – but we need to be proactive if we want AI that produces trustworthy results and doesn’t create further biases. Improving AI literacy within the GLAM sector is an important part of being able to make good choices about the technologies we give our money and attention to. (The same is also true for our societies as a whole, of course).

Since the 2017 summit, I’ve found myself thinking about ‘collections as data’ in two ways.[3] One is the digitised collections records (from metadata through to full page or object scans) that we share with researchers interested in studying particular topics, formats or methods; the other is the data that GLAMs themselves could generate about their collections to make them more discoverable and better connected to other collections. The development of specialist methods within computer vision and natural language processing has promise for both sorts of ‘collections as data’,[4] but we still have much to learn about the logistical, legal, cultural and training challenges in aligning the needs of researchers and GLAMs.

The buzz around AI and the hunger for more material to feed into models has introduced a third – collections as training data. Libraries hold vast repositories of historical and contemporary collections that reflect both the best thinking and the worst biases of the society that produced them. What is their role in responsibly and ethically stewarding those collections into training data (or not)?

As we learn more about the different ‘modes of interaction’ with AI-based tools, from the ‘text-grounded’, ‘knowledge-seeking’ and ‘creative’,[5] and collect examples of researchers and institutions using tools like large language models to create structured data from text,[6] we’re better able to understand and advocate for the role that AI might play in library work. Through collaborations within the Living with Machines project, I’ve seen how we could combine crowdsourcing and machine learning to clear copyright for orphan works at scale; improve metadata and full text searches with word vectors that help people match keywords to concepts rather than literal strings; disambiguate historical place names and turn symbols on maps into computational information.
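
As a rough illustration of the word-vector idea (not the project’s actual pipeline), the snippet below loads small pre-trained GloVe vectors through gensim’s downloader and expands a keyword into related concepts.

```python
# Query expansion with pre-trained word vectors (illustrative choice of model).
# Requires: pip install gensim; the first run downloads the vectors (~65MB).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Expand a keyword search into related concepts rather than literal strings.
for word, score in vectors.most_similar("mill", topn=5):
    print(f"{word:12} {score:.2f}")
```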

Our challenge now is to work together with the Silicon Valley companies that shape so much of what AI ‘knows’ about the world, with the communities and individuals that created the collections we care for, and with the wider GLAM sector to ensure that we get the best AI tools possible.

[1] I’m going to use ‘AI’ as a shorthand for ‘AI and machine learning’ throughout, as machine learning models are the most practical applications of AI-type technologies at present. I’m excluding ‘artificial general intelligence’ for now.

[2] Tiulkanov, “Is It Safe to Use ChatGPT for Your Task?”

[3] Much of this thinking is informed by the Living with Machines project, a mere twinkle in the eye during the first summit. Launched in late 2018, the project aims to devise new methods, tools and software in data science and artificial intelligence that can be applied to historical resources. A key goal for the Library was to understand and develop some solutions for the practical, intellectual, logistical and copyright challenges in collaborative research with digitised collections at scale. As the project draws to an end five and a half years later, I’ve been reflecting on lessons learnt from our work with AI, and on the dramatic improvements in machine learning tools and methods since the project began.

[4] See for example Living with Machines work with data science and digital humanities methods documented at https://livingwithmachines.ac.uk/achievements

[5] Goldberg, “Reinforcement Learning for Language Models.” April 2023. https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81.

[6] For example, tools like Annif https://annif.org, and the work of librarian/developers like Matt Miller and genealogists.

Little, “AI Genealogy Use Cases, How-to Guides.” 2023. https://aigenealogyinsights.com/ai-genealogy-use-cases-how-to-guides/

Miller, “Using GPT on Library Collections.” March 30, 2023. https://thisismattmiller.com/post/using-gpt-on-library-collections/.

'Resonating with different frequencies' – notes for a talk on the Le Show archive

I met dr. rosa a. eberly, associate professor of rhetoric at Pennsylvania State University when she took my and Thomas Padilla's 'Collections as Data' course at the HILT summer school in 2018. When she got in touch to ask if I could contribute to a workshop on Harry Shearer's Le Show archive, of course I said yes! That event became the CAS 2023 Summer Symposium on Harry Shearer's "Le Show".

My slides for 'Resonating with different frequencies… Thoughts on public humanities through crowdsourcing in a ChatGPT world' are online at Zenodo. My planned talk notes are below.

Banner from Harry Shearer's Le Show archive, featuring a photo of Shearer. Text says: 'Vogue magazine describes Le Show as "wildly clever, iconoclastic stew of talk, music, political commentary, readings of inadvertently funny public documents or trade magazines and scripted skits."'

Opening – I’m sorry I can’t be in the room today, not least because the programme lists so many interesting talks.

Today I wanted to think about the different ways that public humanities work through crowdsourcing still has a place in an AI-obsessed world… what happens if we think about different ways of ‘listening’ to an audio archive like Le Show, by people, by machines, and by people and machines in combination?

What visions can we create for a future in which people and machines tune into different frequencies, each doing what they do best?

Overview

  • My work in crowdsourcing / data science in GLAMs
  • What can machines do?
  • The Le Show archive (as described by Rosa)
  • Why do we still need people listening to Le Show and other audio archives?

My current challenge is working out the role of crowdsourcing when 'AI can do it all'…

Of course AI can't, but we need to articulate what people and what machines can do so that we can set up systems that align with our values.

If we leave it to the commercial sector and pure software guys, there’s a risk that people are regarded as part of the machine; or are replaced by AI rather than aided by AI.

[Then I did a general 'crowdsourcing and data science in cultural heritage / British Library / Living with Machines' bit]

Given developments in 'AI' (machine learning)… What can AI/data science do for audio?

  • Transcribe speech for text-based search and analysis methods (see the sketch after this list)
  • Detect some concepts, entities, emotions –> metadata for findability
  • Support 'distant reading'

–Shifts, motifs, patterns over time

–Collapse hours, years – take time out of the equation

  • Machine listening?

–Use 'similarity' to find sonic (not text) matches?
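
A quick sketch of the transcription step mentioned above, using the open-source Whisper model as one possible tool (the talk doesn’t prescribe any particular software; the filename is invented):

```python
# Transcribe an audio file with Whisper, keeping segment timestamps
# so that specific moments become findable.
# Requires: pip install openai-whisper (plus ffmpeg on the system).
import whisper

model = whisper.load_model("base")
result = model.transcribe("le_show_episode.mp3")

print(result["text"])  # the full transcript, ready for text-based search
for segment in result["segments"]:
    print(f"{segment['start']:7.1f}s  {segment['text']}")
```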

[Description of the BBC World Archive experiments c 2012 combining crowdsourcing with early machine learning https://www.bbc.co.uk/blogs/researchanddevelopment/2012/11/the-world-service-archive-prot.shtml]

Le Show (as described by Rosa)

  • A massive 'portal' of 'conceptual and sonic hyperlinks to late-20th- and early-21st-century news and culture'
  • A 'polyphonic cornucopia of words and characters, lyrics and arguments, fact and folly'
  • 'resistant to datafication'
  • With koine topoi – issues of common or public concern 

'Harry Shearer is a portal: Learn one thing from Le Show, and you’ll quickly learn half a dozen more by logical consequence'

dr. rosa a. eberly

(Le Show reminds me of a time when news was designed to inform more than enrage.)

Why let machines have all the fun?

People can hear a richer range of emotions, topics and references, recognise impersonations and characters -> better metadata, findability

What can’t machines do? Software might be able to transcribe speech with pretty high accuracy, but it can't (reliably)… recognise humour, sarcasm, rhetorical flourishes, impersonations and characters – all the wonderful characteristics of the Le Show archive that Rosa described in her opening remarks yesterday. A lot of emotions aren’t covered in the ‘big 8’ that software tries to detect.

Software can recognise some subjects that e.g. have Wikipedia entries, but it’d also miss so much of what people can hear.

So, people can do a better job of telling us what's in the archive than computers can. Together, people and computers can help make specific moments more findable, creating metadata that could be used to visualise links between shows – by topic, by tone, by music and more.

Could access to history in the raw, 'koine topoi' be a super-power?

Individual learning via crowdsourcing contributes to an informed, literate society

It's not all about the data. Crowdsourcing creates a platform and a reason for engagement. Your work helps others, but it also helps you.

I've shown some of my work with objects from the history of astronomy, playbills for 19th-century British theatre performances and, most recently, newspaper articles from the long 19th century.

Through this work, I've come to believe that giving people access to original historical sources is one of the most important ways we can contribute to an informed, literate society.

A society that understands where we've come from, and what that means for where we're going.

A society that is less likely to fall for predictions of AI dooms or AI fantasies, because they've seen tech hype before.

A society that is less likely to believe that 'AI might take your job' because they know that the executives behind the curtain are the ones deciding whether AI helps workers or 'replaces' them.

I've worried about whether volunteers would be motivated to help transcribe audio or text, classify or tag images, when 'AI can do it'. But then I remembered that people still knit jumpers (sweaters) when they can buy them far more quickly and cheaply.

So, crowdsourcing still has a place. The trick is to find ways for 'AI' to aid people, not replace them – to figure out the boring bits and the bits that software is great at, so that people can spend more time on the fun bits.

Harry Shearer's ability to turn something into a topic – 'news of microplastics', 'news of bees' – is something of a superpower. Amplifying those messages is another gift, one the public can create by and for themselves.

Crowdsourcing as connection: a constant star over a sea of change / Établir des connexions: un invariant des projets de crowdsourcing

As I'm speaking today at an event that's mostly in French, I'm sharing my slides outline so it can be viewed at leisure, or copy-and-pasted into a translation tool like Google Translate.

Colloque de clôture du projet Testaments de Poilus, Les Archives nationales de France, 25 Novembre 2022

Crowdsourcing as connection: a constant star over a sea of change, Mia Ridge, British Library

GLAM values as a guiding star

(Or, how will AI change crowdsourcing?) My argument is that technology is changing rapidly around us, but our skills in connecting people and collections are as relevant as ever:

  • Crowdsourcing connects people and collections
  • AI is changing GLAM work
  • But the values we express through crowdsourcing can light the way forward

(GLAM – galleries, libraries, archives and museums)

A sea of change

AI-based tools can now do many crowdsourced tasks:

  • Transcribe audio; typed and handwritten text
  • Classify / label images and text – objects, concepts, 'emotions'

AI-based tools can also generate new images, text

  • Deep fakes, emerging formats – collecting and preservation challenges

AI is still work-in-progress

Automatic transcription, translation failure from this morning: 'the encephalogram is no longer the mother of weeks'

  • Results have many biases; cannot be used alone
  • White, Western, 21st century view
  • Carbon footprint
  • Expertise and resources required
  • Not easily integrated with GLAM workflows

Why bother with crowdsourcing if AI will soon be 'good enough'?

The elephant in the room; been on my mind for a couple of years now

The rise of AI means we have to think about the role of crowdsourcing in cultural heritage. Why bother if software can do it all?

Crowdsourcing brings collections to life

  • Close, engaged attention to 'obscure' collection items
  • Opportunities for lifelong learning; historical and scientific literacy
  • Gathers diverse perspectives, knowledge

Crowdsourcing as connection

Crowdsourcing in GLAMs is valuable in part because it creates connections around people and collections

  • Between volunteers and staff
  • Between people and collections
  • Between collections

Examples from the British Library

In the Spotlight: designing for productivity and engagement

Living with Machines: designing crowdsourcing projects in collaboration with data scientists that attempt to both engage the public with our research and generate research datasets. Participant comments and questions inspired new tasks, shaped our work.

How do we follow the star?

Bringing 'crowdsourcing as connection' into work with AI

Valuing 'crowdsourcing as connection'

  • Efficiency isn't everything. Participation is part of our mission
  • Help technologists and researchers understand the value in connecting people with collections
  • Develop mutual understanding of different types of data – editions, enhancement, transcription, annotation
  • Perfection isn't everything – help GLAM staff define 'data quality' in different contexts
  • Where is imperfect AI data at scale more useful than perfect but limited data?
  • 'réinjectée' (re-injected data) – when, where, and how?
  • How does crowdsourcing, AI change work for staff?
  • How do we integrate data from different sources (AI, crowdsourcing, cataloguers), at different scales, into coherent systems?
  • How do interfaces show data provenance, confidence?

Transforming access, discovery, use

  • A single digitised item can be infinitely linked to places, people, concepts – how does this change 'discovery'?
  • What other user needs can we meet through a combination of AI, better data systems and public participation?

Merci de votre attention! (Thank you for your attention!)

Pour en savoir plus (to find out more): https://bl.uk/digital https://livingwithmachines.ac.uk

Essayez notre activité de crowdsourcing (try our crowdsourcing activity): http://bit.ly/LivingWithMachines

Nous attendons vos questions (we welcome your questions): digitalresearch@bl.uk

Screenshot of images generated by AI, showing variations on dark blue or green seas and shining stars
Versions of image generation for the text 'a bright star over the sea'

Presenting at Les Archives nationales de France, Paris, from home

Festival of Maintenance talk: Apps, microsites and collections online: innovation and maintenance in digital cultural heritage

I came to Liverpool for the 'Festival of Maintenance', a celebration of maintainers. I'm blogging my talk notes so that I'm not just preaching to the converted in the room. As they say:

'Maintenance and repair are just as important as innovation, but sometimes these ideas seem left behind. Amidst the rapid pace of innovation, have we missed opportunities to design things so that they can be fixed?'.

Liverpool 2019: Maintenance in Complex and Changing Times

Apps, microsites and collections online: innovation and maintenance in digital cultural heritage

My talk was about different narratives about 'digital' in cultural heritage organisations and how they can make maintenance harder or easier to support and resource. If last year's innovation is this year's maintenance task, how do we innovate to meet changing needs while making good decisions about what to maintain? At one museum job I calculated that c.85% of my time was spent on legacy systems, leaving less than a day a week for new work, so it's a subject close to my heart.

I began with an introduction to 'What does a cultural heritage technologist do?'. I might be a digital curator now but my roots lie in creating and maintaining systems for managing and sharing collections information and interpretative knowledge. This includes making digitised items available as individual items or computationally-ready datasets. There was also a gratuitous reference to Abba to illustrate the GLAM (galleries, libraries, archives and museums) acronym.

What do galleries, libraries, archives and museums have to maintain?

Exhibition apps and audio guides. Research software. Microsites by departments including marketing, education, fundraising. Catalogues. More catalogues. Secret spreadsheets. Digital asset management systems. Collections online pulled from the catalogue. Collections online from a random database. Student projects. Glueware. Ticketing. Ecommerce. APIs. Content on social media sites, other 3rd party sites and aggregators. CMS. CRM. DRM. VR, AR, MR.

Stories considered harmful

These stories mean GLAMs aren't making the best decisions about maintaining digital resources:

  • It's fine for social media content to be ephemeral
  • 'Digital' is just marketing, no-one expects it to be kept
  • We have limited resources, and if we spend them all maintaining things then how will we build the new cool things the Director wants?
  • We're a museum / gallery / library / archive, not a software development company, what do you mean we have to maintain things?
  • What do you mean, software decays over time? People don't necessarily know that digital products are embedded in a network of software dependencies. User expectations about performance and design also change over time.
  • 'Digital' is just like an exhibition; once it's launched you're done. You work really hard in the lead-up to the opening, but after the opening night you're free to move onto the next thing
  • That person left, it doesn't matter anymore. But people outside won't know that – you can't just let things drop.

Why do these stories matter?

If you don't make conscious choices about what to maintain, you're leaving it to fate.

Today's ephemera is tomorrow's history. Organisations need to be able to tell their own history. They also need to collect digital ephemera so that we can tell the history of wider society. (Social media companies aren't archives for your photos, events and stories.)

Better stories for the future

  • You can't save everything: make the hard choices. Make conscious decisions about what to maintain and how you'll close the things you can't maintain. Assess the likely lifetime of a digital product before you start work and build it into the roadmap.
  • Plan for a graceful exit – for all stakeholders. What lessons need to be documented and shared? Do you need to let any collaborators, funders, users or fans know? Can you make it web archive ready? How can you export and document the data? How can you document the interfaces and contextual reasons for algorithmic logic?
  • Refresh little and often, where possible. It's a pain, but it means projects stay in institutional memory
  • Build on standards, work with communities. Every collection is a special butterfly, but if you work on shared software and standards, someone else might help you maintain it. IIIF is a great example of this.

Also:

  • Check whether your websites are 'archive ready' using archiveready.com (and nominate UK websites for the UK Web Archive)
  • Look to expert advice on digital preservation
  • Support GLAMs with the legislative, rights and technical challenges of collecting digital ephemera. It's hard to collect social media, websites, podcasts, games, emerging formats, but if we don't, how will we tell the story of 'now' in the future?

And it's been on my mind a lot lately, but I didn't include it: consider the carbon footprint of cloud computing and machine learning, because we also need to maintain the planet.

In closing, I'd slightly adapt the Festival's line: 'design things so that they can be fixed or shut down when their job is done'. I'm sure I've missed some better stories that cultural institutions could tell themselves – let me know what you think!

Two of the organisers introducing the Festival of Maintenance event

'In search of the sweet spot: infrastructure at the intersection of cultural heritage and data science'

It's not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I've shared mine here.

In search of the sweet spot: infrastructure at the intersection of cultural heritage and data science

Mia Ridge, British Library

My slides: https://www.slideshare.net/miaridge/in-search-of-the-sweet-spot-infrastructure-at-the-intersection-of-cultural-heritage-and-data-science

This paper explores some of the challenges and paradoxes in the application of data science methods to cultural heritage collections. It is drawn from long experience in the cultural heritage sector, predating but broadly aligned to the 'OpenGLAM' and 'Collections as Data' movements. Experiences that have shaped this thinking include providing open cultural data for computational use; creating APIs for catalogue and interpretive records, running hackathons, and helping cultural organisations think through the preparation of 'collections as data'; and supervising undergraduate and MSc projects for students of computer science.

The opportunities are many. Cultural heritage institutions (aka GLAMs – galleries, libraries, archives and museums) hold diverse historical, scientific and creative works – images, printed and manuscript works, objects, audio or video – that could be turned into some form of digital 'data' for use in data science and digital humanities research. GLAM staff have expert knowledge about the collections and their value to researchers. Data scientists bring rigour, specialist expertise and skills, and a fresh perspective to the study of cultural heritage collections.

While the quest to publish cultural heritage records and digital surrogates for use in data science is relatively new, the barriers within cultural organisations to creating suitable infrastructure with others are historically numerous. They include different expectations about the pace and urgency of work, different levels of technical expertise, resourcing and infrastructure, and different goals. They may even include different expectations about what 'data' is – metadata drawn from GLAM catalogues is the most readily available and shared data, but not only is this rarely complete, often untidy and inconsistent (being the work of decades or centuries and many hands over that time), it is also a far cry from datasets rich with images or transcribed text that data scientists might expect.

Copyright, data protection and commercial licensing can limit access to digitised materials (though this varies greatly). 'Orphaned works', where the rights holder cannot be traced in order to licence the use of in-copyright works, mean that up to 40% of some collections, particularly sound or video collections, are unavailable for risk-free use (2012).

While GLAMs have experimented with APIs, downloadable datasets and SPARQL endpoints, they rarely have the resources or institutional will to maintain and refresh these indefinitely. Records may be available through multi-national aggregators such as Europeana, DPLA, or national aggregators, but as aggregation often requires that metadata is mapped to the lowest common denominator, their value for research may be limited.

The area of overlap between 'computationally interesting problems' and 'solutions useful for GLAMs' may be smaller than expected to date, but collaboration between cultural institutions and data scientists on shared projects in the 'sweet spot' – where new data science methods are explored to enhance the discoverability of collections – may provide a way forward. Sector-wide collaborations like the International Image Interoperability Framework (IIIF, https://iiif.io/) provide modern models for lightweight but powerful standards. Pilot projects with students or others can help test the usability of collection data and infrastructure while exploring the applicability of emerging technologies and methods. It is early days for these collaborations, but the future is bright.

Panel overview

An excerpt from the longer panel description by David Beavan and Barbara McGillivray.

This panel highlights the emerging collaborations and opportunities between the fields of Digital Humanities (DH), Data Science (DS) and Artificial Intelligence (AI). It charts the enthusiastic progress of the Alan Turing Institute, the UK national institute for data science and artificial intelligence, as it engages with cultural heritage institutions and academics from arts, humanities and social sciences disciplines. We discuss the exciting work and learnings from various new activities, across a number of high-profile institutions. As these initiatives push the intellectual and computational boundaries, the panel considers both the gains, benefits, and complexities encountered. The panel latterly turns towards the future of such interdisciplinary working, considering how DS & DH collaborations can grow, with a view towards a manifesto. As Data Science grows globally, this panel session will stimulate new discussion and direction, to help ensure the fields grow together, that arts & humanities remain a strong focus of DS & AI, and that DH methods and practices continue to benefit from new developments in DS, enabling future research avenues and questions.

'The Past, Present and Future of Digital Scholarship with Newspaper Collections'

It's not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I've shared mine here. The panel was designed to bring together a range of interdisciplinary newspaper-based digital humanities and/or data science projects, with 'provocations' from two senior scholars who would provide context for current ambitions, and to start conversations among practitioners.

Short Paper: Living with Machines

Paper authors: Mia Ridge, Giovanni Colavizza with Ruth Ahnert, Claire Austin, David Beavan, Kaspar Beelens, Mariona Coll Ardanuy, Adam Farquhar, Emma Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray, André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.

My slides: https://www.slideshare.net/miaridge/living-with-machines-at-the-past-present-and-future-of-digital-scholarship-with-newspaper-collections-154700888

Living with Machines is a five-year interdisciplinary research project, whose ambition is to blend data science with historical enquiry to study the human impact of the industrial revolution. Set to be one of the biggest and most ambitious digital humanities research initiatives ever to launch in the UK, Living with Machines is developing a large-scale infrastructure to perform data analyses on a variety of historical sources, and in so doing provide vital insights into the debates and discussions taking place in response to today’s digital industrial revolution.

Seeking to make the most of a self-described 'radical collaboration', the project will iteratively develop research questions as computational linguists, historians, library curators and data scientists work on a shared corpus of digitised newspapers, books and biographical data (census, birth, death, marriage, etc. records). For example, in the process of answering historical research questions, the project could take advantage of access to expertise in computational linguistics to overcome issues with choosing unambiguous and temporally stable keywords for analysis, previously reported by others (Lansdall-Welfare et al., 2017). A key methodological objective of the project is to 'translate' history research questions into data models, in order to inspect and integrate them into historical narratives. In order to enable this process, a digital infrastructure is being collaboratively designed and developed, whose purpose is to marshal and interlink a variety of historical datasets, including newspapers, and allow for historians and data scientists to engage with them.

In this paper we will present our vision for Living with Machines, focusing on how we plan to approach it, and the ways in which digital infrastructure enables this multidisciplinary exchange. We will also showcase preliminary results from the different research 'laboratories', and detail the historical sources we plan to use within the project.

The Past, Present and Future of Digital Scholarship with Newspaper Collections

Mia Ridge (British Library), Giovanni Colavizza (Alan Turing Institute)

Historical newspapers are of interest to many humanities scholars, valued as sources of information and language closely tied to a particular time, social context and place. Following library and commercial microfilming and, more recently, digitisation projects, newspapers have been an accessible and valued source for researchers. The ability to use keyword searches through more data than ever before via digitised newspapers has transformed the work of researchers.[1]

Digitised historic newspapers are also of interest to many researchers who seek large bodies of relatively easily computationally-transcribed text on which they can try new methods and tools. Intensive digitisation over the past two decades has seen smaller-scale or repository-focused projects flourish in the Anglophone and European world (Holley, 2009; King, 2005; Neudecker et al., 2014). However, just as earlier scholarship was potentially over-reliant on The Times of London and other metropolitan dailies, this has been replicated and reinforced by digitisation projects (for a Canadian example, see Milligan 2013).

In recent years, several large consortia projects proposing to apply data science and computational methods to historical newspapers at scale have emerged, including NewsEye, impresso, Oceanic Exchanges and Living with Machines. This panel has been convened by some consortia members to cast a critical eye over past and ongoing digital scholarship with newspaper collections, and to inform its future.

Digitisation can involve both complexities and simplifications. Knowledge about the imperfections of digitisation, cataloguing, corpus construction, text transcription and mining is rarely shared outside cultural institutions or projects. How can these imperfections and absences be made visible to users of digital repositories? Furthermore, how does the over-representation of some aspects of society through the successive winnowing and remediation of potential sources – from creation to collection, microfilming, preservation, licensing and digitisation – affect scholarship based on digitised newspapers? How can computational methods address some of these issues?

The panel proposes the following format: short papers will be delivered by existing projects working on large collections of historical newspapers, presenting their vision and results to date. Each project is at a different stage of development and will discuss their choice to work with newspapers, and reflect on what they have learnt to date on practical, methodological and user-focused aspects of this digital humanities work. The panel is additionally an opportunity to consider important questions of interoperability and legacy beyond the life of the project. Two further papers will follow, given by scholars with significant experience using these collections for research, in order to provide the panel with critical reflections. The floor will then open for debate and discussion.

This panel is a unique opportunity to bring senior scholars with a long perspective on the uses of newspapers in scholarship together with projects at formative stages. More broadly, convening this panel is an opportunity for the DH2019 community to ask their own questions of newspaper-based projects, and for researchers to map methodological similarities between projects. Our hope is that this panel will foster a community of practice around the topic and encourage discussions of the methodological and pedagogical implications of digital scholarship with newspapers.

[1] For an overview of the impact of keyword search on historical research see (Putnam, 2016) (Bingham, 2010).

From piles of material to patchwork: How do we embed the production of usable collections data into library work?

How do we embed the production of usable collections data into library work?

These notes were prepared for a panel discussion at the 'Always Already Computational: Collections as Data' (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I've also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called 'From libraries as patchwork to datasets as assemblages?' but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I'm focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I've condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it's not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. 'Production' is there because infrastructure and workflow is unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL's collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there's treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institutions deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the 'gold standard' cataloguing already in progress. If they're made widely available, these other methods might help bridge the resourcing gaps that mean it's easier to find items from richer institutions and countries than from poorer ones.

Photo of piles of material

You probably already all know this, but it's worth remembering: our collections aren't even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use, are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don't yet have digital catalogue records. Some items can't be digitised because they're too big, small or fragile for scanning or photography; others can't be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, 'data' may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements, or any combination of these and more.)

Image credit: https://www.flickr.com/photos/teen_s/6251107713/

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data it's often as great big lumps of things. It's hard for normal scholars to process (or even just unzip) 4GB of data.

Currently, datasets are often created outside normal processes, and over time they become 'stale' as they're not updated when the source collection records change. And when researchers do manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.

The BL's metadata team have experimented with 'researcher format' CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don't mean SPARQL but rather faceted browsing or structured search forms).
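
As a rough illustration of what 'create their own datasets' could mean in practice, here's a sketch of a faceted query over a hypothetical catalogue export. The file name, column names and values are all illustrative, and a real version would sit behind a search form rather than a script.

```python
# A sketch of a faceted 'build your own dataset' query over a hypothetical
# catalogue export; catalogue.csv and its columns (title, place, date,
# subject) are illustrative, with date assumed to be a simple year.
import pandas as pd

catalogue = pd.read_csv("catalogue.csv")

# The kind of query a structured search form might express:
# items on a subject, within a date range, from a given place.
subset = catalogue[
    catalogue["subject"].str.contains("botany", case=False, na=False)
    & catalogue["date"].between(1800, 1850)
    & (catalogue["place"] == "London")
]

# Save the results as a themed, researcher-friendly CSV.
subset.to_csv("botany_london_1800-1850.csv", index=False)
```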

Image credit: screenshot from http://data.bl.uk/

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out of text and images. Sometimes it's a technologist running records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it's a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they're finally recorded in a bespoke database.
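
For the 'technologist with OpenRefine' case, the underlying step is a lookup from a name string to candidate identifiers. Here's a rough sketch against Wikidata's public search API – OpenRefine's reconciliation service does this with far more care over disambiguation, so treat it as an illustration of the idea rather than a recipe.

```python
# A rough sketch of turning a name string into candidate Wikidata entities
# via the public wbsearchentities API. Disambiguation still needs a human
# (or much better heuristics) – this only surfaces candidates.
import requests

def wikidata_candidates(name, limit=3):
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    # Each candidate has an id, label and description to help a person
    # (or later heuristics) pick the right one.
    return [
        (item["id"], item.get("label"), item.get("description"))
        for item in response.json().get("search", [])
    ]

print(wikidata_candidates("Ada Lovelace"))
```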

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we're to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by cataloguers, which by software, which by external groups, and so on.
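
One hedged sketch of what 'signalling the source of an assertion' might look like in data: attaching provenance to each enhancement rather than silently merging it into the record. The structure and field names below are entirely illustrative.

```python
# Entirely illustrative: keeping cataloguer, software and external
# contributions distinguishable by recording who or what asserted each
# enhancement, rather than flattening them into one anonymous record.
record_enhancements = [
    {
        "field": "creator",
        "value": "https://www.wikidata.org/entity/Q7259",  # Ada Lovelace
        "asserted_by": "cataloguer",
        "date": "2017-02-01",
    },
    {
        "field": "place_of_publication",
        "value": "https://www.wikidata.org/entity/Q84",  # London
        "asserted_by": "software:entity-recognition-v0.3",  # invented name
        "confidence": 0.72,
    },
    {
        "field": "transcription_enhancement",
        "value": "annotated place names",
        "asserted_by": "external:crowdsourcing-volunteers",
    },
]
```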

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There's a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and a published biography in the legal deposit library.

A lot of this work has already been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should now turn to converting those local references into URIs and making the most of that investment.

Applying identifiers is hard – it takes expert care to disambiguate personal names, places and concepts, even with all the hinting that context-aware systems might be able to provide as machine learning and related techniques improve. Catalogues can't easily record possible attributions, and there's understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we're not to be held back by the need for records to be perfectly complete before they're published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections, and of allowing qualified descriptors for possible links to people, places and so on. Then we need to explain the difference to users, so that they don't rely too heavily on our descriptions or make assumptions about the presence or absence of information where that isn't appropriate.
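
As a sketch of what 'capturing ambiguity' could mean in a system, here's one way of holding a possible attribution without forcing a single authoritative answer – again, the structure, qualifiers and URIs are invented for illustration.

```python
# Invented structure: an uncertain attribution held as qualified candidates
# instead of a single 'perfect' link, so the record can be published while
# the ambiguity stays visible to users and to downstream datasets.
possible_creator = {
    "recorded_string": "J. Smith, engraver",
    "candidates": [
        {
            "uri": "https://example.org/person/1234",  # placeholder identifier
            "qualifier": "probably",
            "note": "active in London in the right period",
        },
        {
            "uri": "https://example.org/person/5678",  # placeholder identifier
            "qualifier": "possibly",
            "note": "similar name, but dates uncertain",
        },
    ],
}
```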

Image credit: http://europeana.eu/portal/record/2021648/0180_N_31601.html

Photo of pipes over a building

A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL's systems are designed around the 'deliverable unit' – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.
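
IIIF manifests are one existing way of making items addressable below the 'deliverable unit'. As a rough sketch (with a placeholder manifest URL rather than a real endpoint), each page is a canvas with its own URI that annotations and links can point at.

```python
# A rough sketch of page-level addressing via a IIIF Presentation 2.x
# manifest: each canvas has its own URI, so annotations and links can point
# at a page or image rather than the whole bound volume or archive box.
# The manifest URL is a placeholder, not a real endpoint.
import requests

manifest = requests.get(
    "https://example.org/iiif/some-manuscript/manifest.json", timeout=10
).json()

for sequence in manifest.get("sequences", []):
    for canvas in sequence.get("canvases", []):
        # e.g. a URI for an individual folio, addressable independently
        print(canvas.get("@id"), canvas.get("label"))
```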

(I'd love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren't yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357

My final thought – we can start small and iterate, which is just as well, because we still need to work on understanding what users of collections data need and how they want to use it. We're making a start and there's a lot of thoughtful work behind the scenes, but perhaps a little more investment is needed if research libraries are to become as comfortable with data users as they are with the readers who pass through their physical doors.

Keynote online: 'Reaching out: museums, crowdsourcing and participatory heritage'

In September I was invited to give a keynote at the Museum Theme Days 2016 in Helsinki. I spoke on 'Reaching out: museums, crowdsourcing and participatory heritage'. In lieu of my notes or slides, the video is below. (Great image, thanks YouTube!)

Network visualisations and the 'so what?' problem

This week I was in Luxembourg for a workshop on Network Visualisation in the Cultural Heritage Sector, organised by Marten Düring and held on the Belval campus of the University of Luxembourg.

In my presentation, I responded to some of the questions posed in the workshop outline:

In this workshop we want to explore how network visualisations and infrastructures will change the research and outreach activities of cultural heritage professionals and historians. Among the questions we seek to discuss during the workshop are for example: How do users benefit from graphs and their visualisation? Which skills do we expect from our users? What can we teach them? Are SNA [social network analysis] theories and methods relevant for public-facing applications? How do graph-based applications shape a user’s perception of the documents/objects which constitute the data? How can applications benefit from user engagement? How can applications expand and tap into other resources?

A rough version of my talk notes is below. The original slides are also online.

Network visualisations and the 'so what?' problem

Caveat

While I may show examples of individual network visualisations, this talk isn't a critique of them in particular. There's lots of good practice around, and these lessons probably aren't needed for people in the room.

Fundamentally, I think network visualisations can be useful for research, but to make them more effective tools for outreach, some challenges should be addressed.

Context

I'm a Digital Curator at the British Library, mostly working with pre-1900 collections of manuscripts, printed material, maps, etc. Part of my job is to help people get access to our digital collections. Visualisations are a great way to firstly help people get a sense of what's available, and then to understand the collections in more depth.

I've been teaching versions of an 'information visualisation 101' course at the BL and digital humanities workshops since 2013. Much of what I'm saying now is based on comments and feedback I get when presenting network visualisations to academics and cultural heritage staff (who should be a key audience for social network analyses).

Provocation: digital humanists love network visualisations, but ordinary people say, 'so what'?

And this is a problem. We're not conveying what we're hoping to convey.

Network visualisation, via http://fredbenenson.com/

When teaching datavis, I give people time to explore examples like this, then ask questions like 'Can you tell what is being measured or described? What do the relationships mean?'. After talking about the pros and cons of network visualisations, discussion often reaches a 'yes, but so what?' moment.

Here are some examples of problems ordinary people have with network visualisations…

Location matters

Spatial layout based on the pragmatic business of fitting things on the screen – physics simulations, rules of attraction and repulsion – doesn't match what people expect to see. It's really hard for some to let go of the idea that spatial layout has meaning; the sense that location on a page means something is very deeply linked to their idea of what a visualisation is.
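
To illustrate the point rather than prove it: with a force-directed layout, the coordinates are a by-product of the simulation, not a property of the data. A small sketch, assuming a recent networkx is installed.

```python
# Position is a by-product of the physics simulation, not of the data:
# the same graph laid out with different random seeds lands in different
# places, even though the relationships are identical.
import networkx as nx

G = nx.les_miserables_graph()  # built-in character co-occurrence graph

layout_a = nx.spring_layout(G, seed=1)   # force-directed layout, run 1
layout_b = nx.spring_layout(G, seed=42)  # same graph, different seed

print(layout_a["Valjean"])  # arbitrary coordinates
print(layout_b["Valjean"])  # different coordinates, same relationships
```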

Animated physics is … pointless?

People sometimes like the sproinginess when a network visualisation resettles after a node has been dragged, but waiting for the animation to finish can also be slow and irritating. Does it convey meaning? If not, why is it there?

Size, weight, colour = meaning?

The relationship between size, colour and weight isn't always intuitive – people assume meaning where there might be none.

In general, network visualisations are more abstract than people expect a visualisation to be.

'What does this tell me that I couldn't learn as quickly from a sentence, list or table?'

Table of data, via http://fredbenenson.com/

Scroll down the page that contains the network graph above and you get other visualisations. Sometimes they're much more positively received, particularly when people feel they learn more from them than from the network visualisation.

Onto other issues with 'network visualisations as communication'…

Which algorithmic choices are significant?

Mike Bostock's force-directed and curved-line versions of character co-occurrence in Les Misérables

It's hard for novices to know which algorithmic and data-cleaning choices are significant, and which have a more superficial impact.

Untethered images

Images travel extremely well on social media. When they do, they often leave their context behind and end up floating in space. Who created this, and why? What world view does it represent? What source material underlies it, and how was it manipulated to produce the image? Can I trust it?

'Can't see the wood for the trees'

Network visualisation from the Viral Texts project

When I showed this to a class recently, one participant was frustrated that they couldn't 'see the wood for the trees'. The visualisation gives a general impression of density, but it's not easy to dive deeper into the detail.

Stories vs hairballs

But when I started to explain what was being represented – the ways in which stories were copied from one newspaper to another – they were fascinated. They might have found their way there if they'd read the text but again, the visualisation is so abstract that it didn't hint at what lay underneath. (Also I have only very, very rarely seen someone stop to read the text before playing with a visualisation.)

No sense of change over time

This flattening of time into one simultaneous moment matters more for historical networks than for literary ones, but even so, you might want to compare relationships between sections of a literary work.

No sense of texture, detail of sources

All network visualisations look similar, whether they're about historical texts or cans of baked beans. Dots and lines mask texture, and don't always hint at the depth of information they represent.

Jargon

Node. Edge. Graph. Directed, undirected. Betweenness. Closeness. Eccentricity.

There's a lot to take on to really understand what's being expressed in a network graph.

There is some hope…

Onto the positive bit!

Interactivity is engaging

People find the interactive movement, and the ability to zoom and highlight links, engaging, even if they have no idea what's being expressed. In class, people started to come up with questions about the data as I told them more about what was represented. That moment of curiosity is an opportunity, if they can then dive in and start to explore what's going on and what the relationships mean.

…but different users have different interaction needs

For some, there's the frustration expressed earlier – they can't get to 'see a particular tree' in the dense woods of a network visualisation. People often want to get to the detail of an instance of a relationship – the lines of text, images of the original document – from a graph.

This mightn't be how network visualisations are used in research, but it's something to consider for public-facing visualisations. How can we connect abstract lines or dots to detail, provide more information about what a relationship means, or show the quantification expressed as people highlight or filter parts of a graph? A harder, but more interesting, task is hinting at the texture or detail of those relationships.

Proceed, with caution

One of the workshop questions was 'Are social network analysis theories and methods relevant for public-facing applications?' – and maybe the answer is a qualified yes. As working tools they're great for generating hypotheses, but they need a lot more care before being put in front of the public.

[As an aside, I’d always taken the difference between visualisations as working tools for exploring data – part of the process of investigating a research question – and visualisation as an output – a product of the process, designed for explanation rather than exploration – as fundamental, but maybe we need to make that distinction more explicit.]

But first – who are your 'users'?

During this workshop, at different points we may be talking about different 'users' – it's useful to scope who we mean at any given point. In this presentation, I was talking about end users who encounter visualisations, not scholars who may be organising and visualising networks for analysis.

Sometimes a network visualisation isn't the answer … even if it was part of the question.

As an outcome of an exploratory process, network visualisations are not necessarily the best way to present the final product. Be disciplined – make yourself justify the choice to use network visualisations.

No more untethered images

Include an extended caption – the data source, and the tools and algorithms used. Provide a link to find out more – why this data, why this form? What was interesting but not easily visualised? And could you let people download the dataset to explore for themselves?
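
Purely as an illustration of what could travel with the image, here's the sort of information an extended caption (or a machine-readable sidecar to it) might carry – every value below is a placeholder.

```python
# Every value here is a placeholder: the point is what an extended caption
# (or a machine-readable version of it) could carry alongside the image.
extended_caption = {
    "title": "Character co-occurrence network",
    "data_source": "https://example.org/dataset",
    "tools_and_algorithms": ["Gephi", "ForceAtlas2 layout"],
    "what_was_left_out": "characters appearing fewer than three times",
    "find_out_more": "https://example.org/blog-post-about-the-data",
    "download_the_data": "https://example.org/dataset.csv",
    "created_by": "Example Project Team",
}
```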

Present visualisations as the tip of the data iceberg

Visualisations are the tip of the iceberg

Lots of interesting data doesn't make it into a visualisation. Talking about what isn't included and why it was left out is important context.

Talk about data that couldn't exist

Beyond the (fuzzy, incomplete, messy) data that's left out because it's hard to visualise, data that never existed in the first place is also important:

'because we're only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.'
Scott Weingart, 'Networks Demystified 8: When Networks are Inappropriate'

Help users learn the skills and knowledge they need to interpret network visualisations in context.

How? Good question! This is the point at which I hand over to you…