57 Varieties of Digital History? Towards the future of looking at the past

Back in November 2015, Tara Andrews invited me to give a guest lecture on 'digital history' for the Introduction to Digital Humanities course at the University of Bern, where she was then a professor. This is a slightly shortened version of my talk notes, finally posted in 2024 as I go back to thinking about what 'digital history' actually is.

Illustration of a tin of Heinz Cream of Tomato Soup '57 varieties'

I called my talk '57 varieties of digital history' as a play on the number of activities and outputs called 'digital history'. While digital history and digital humanities are often linked and have many methods in common, digital history also draws on the use of computers for quantitative work, and digitisation projects undertaken in museums, libraries, archives and academia. Digital tools have enhanced many of the tasks in the research process (which itself has many stages – I find the University of Minnesota Libraries' model with stages of 'discovering', 'gathering', 'creating' and 'sharing' useful), but at the moment the underlying processes often remain the same.

So, what is digital history?

…using computers for writing, publishing

A historian on twitter once told me about a colleague who said they're doing digital history because they're using PowerPoint. On reflection, I think they have a point. These simple tools might be linked to fairly traditional scholarship – writing journal articles or creating presentations – but text created in them is infinitely quotable, shareable, and searchable, unlike the more inert paper equivalents. Many scholars use Word documents to keep bits of text they've transcribed from historical source materials, or to keep track of information from other articles or books. These become part of their personal research collections, which can build up over years into substantial resources in their own right. Even 'helper' applications like reference managers such as Zotero or EndNote can free up significant amounts of time that can then be devoted to research.

…the study of computers

When some people hear 'digital history', they imagine that it's the study of computers, rather than the use of digital methods by historians. While this isn't a serious definition of digital history, it's a reminder that viewing digital tools through a history of science and technology lens can be fruitful.

…using digitised material

Digitisation takes many forms, including creating or transcribing catalogue records about heritage collections, writing full descriptions of items, and making digital images of books, manuscripts, artworks etc. Metadata – information about the item, such as when and where it was made – is the minimum required to make collections discoverable. Increasingly, new forms of photography may be applied to particular types of objects to capture more information than the naked eye can see. Text may be transcribed, place names mapped, marginalia annotated and more.

The availability of free (or comparatively inexpensive) historical records through heritage institutions and related commercial or grassroots projects means we can access historical material without having to work around physical locations and opening hours, negotiate entry to archives (some of which require users to be 'bona fide scholars'), or navigate unknown etiquettes. Text transcription allows readers who lack the skills to read manuscript or hand-written documents to make use of these resources, as well as making the text searchable.

For some historians, this is about as digital as they want to get. They're very happy with being able to access more material more conveniently; their research methods and questions are still pretty unchanged.

…creating digital repositories

Most digitised items live in some broader system that aggregates and presents material from a particular institution, or related to a particular topic. While some digital repositories are based on sub-sets of official institutional collections, most aren't traditional 'archives'. One archivist describes digital repositories as a 'purposeful collection of surrogates'.

Repositories aren't always created by big, funded projects. Personal research collections assembled over time are one form of ad hoc repository – they may contain material from many different archives collected by one researcher over a number of years.

Themed collections may be the result of large, scholarly projects with formal partners who've agreed to contribute material about a particular time, place, group in society or topic. They might also be the result of work by a local history society with volunteers who digitise material and share it online.

'Commons' projects (like Flickr or Wikimedia Commons) tend to be less focused – they might contain collections from specific institutions, but these specific collections are aggregated into the whole repository, where their identity (and the provenance of individual items) may be subsumed. While 'commons' platforms technically enable sharing, the cultural practices around sharing are yet to change, particularly for academic historians and many cultural institutions.

Repositories can provide different functionality. In some 'scholarly workbenches' you can collect and annotate material; in others you can bookmark records or download images. They also support different levels of access. Some allow you to download and re-use material without restriction, some only allow non-commercial use, and some are behind paywalls.

…creating datasets

The Old Bailey Online project has digitised the proceedings of the Old Bailey, making court cases from 1674 to 1913 available online. They haven't just transcribed text from digital images, they've added structure to the text. For example, the defendant's name, the crime he was accused of and the victim's name have all been tagged. The addition of this structure means that the material can be studied as text, or analysed statistically.

Adding structure to data can enable innovative research activities. If the markup is well-designed, it can support the exploration of questions that were not envisaged when the data was created. Adding structure to other datasets may become less resource-intensive as new computational techniques become available.
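As a minimal sketch of why tagged structure matters, assume a few trial records marked up with made-up element names (the tags, attributes and sample records below are invented for illustration and are not the Old Bailey Online project's actual schema or data). The same file can then be counted statistically or read as plain text:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup and invented records, for illustration only.
SAMPLE = """
<trials>
  <trial year="1750">
    <defendant>John Smith</defendant> was indicted for
    <offence category="theft">stealing a silver spoon</offence> from
    <victim>Mary Jones</victim>.
  </trial>
  <trial year="1751">
    <defendant>Ann Brown</defendant> was indicted for
    <offence category="theft">stealing a linen sheet</offence> from
    <victim>Thomas Green</victim>.
  </trial>
  <trial year="1751">
    <defendant>William White</defendant> was indicted for
    <offence category="deception">forging a bill of exchange</offence> to defraud
    <victim>James Black</victim>.
  </trial>
</trials>
"""

def offences_by_category(xml_text):
    """Statistical view: count tagged offences by their category attribute."""
    root = ET.fromstring(xml_text)
    return Counter(offence.get("category") for offence in root.iter("offence"))

def plain_text(trial):
    """Textual view: drop the tags and normalise whitespace."""
    return " ".join("".join(trial.itertext()).split())

root = ET.fromstring(SAMPLE)
counts = offences_by_category(SAMPLE)
first_trial_text = plain_text(root.find("trial"))
```

Here `counts` holds the number of offences per category, while `first_trial_text` is the first record as an ordinary sentence. A question nobody planned for (say, offences per year) only needs another small function over the same markup.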

…creating visualisations and innovative interfaces

Some people or projects create specialist interfaces to help people explore their datasets. They might be maps or timelines that help people understand the scope of a collection in time and place, while others are more interpretive, presenting a scholarly argument through their arrangement of interface elements, the material they have assembled, the labels they use and the search or browse queries they support. Ideally, these interfaces should provide access to the original records underlying the visualisation so that scholars can investigate potential new research questions that arise from their use of the interface.

…creating linked data (going from strings to things)

As well as marking up records with information like 'this bit is a defendant's name', we can also link a particular person's name to other records about them online. One way to do this is to link their name to published lists of names online. These stable identifiers mean that we could link any mention of a particular person in a text to this online identifier, so that 'Captain Cook' or 'James Cook' are understood to be different strings about the same person.

A screenshot of structured data on the dbpedia site e.g. dbo:birthPlace = 1728-01-01
dbpedia page for 'James Cook', 2015

This also helps create a layer of semantic meaning about these strings of text. Software can learn that strings that represent people can have relationships with other things – in this case, historical voyages, other people, natural history and ethnographic collections, and historical events.
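The 'strings to things' idea can be sketched in a few lines, assuming a lookup from name variants to a single identifier. The example.org URIs and the `commanded` relationship name are placeholders standing in for a real published authority list and vocabulary:

```python
# Hypothetical identifiers: example.org stands in for a real authority file.
COOK = "https://example.org/person/james-cook"
ENDEAVOUR_VOYAGE = "https://example.org/voyage/endeavour-1768"

# Different strings that should resolve to the same 'thing'.
NAME_TO_ID = {
    "captain cook": COOK,
    "james cook": COOK,
    "capt. james cook": COOK,
}

# Statements are attached to the identifier, not to any one spelling.
RELATIONSHIPS = {
    COOK: [("commanded", ENDEAVOUR_VOYAGE)],
}

def resolve(name):
    """Return the identifier for a name string, or None if it is unknown."""
    return NAME_TO_ID.get(" ".join(name.lower().split()))

def same_entity(name_a, name_b):
    """True only when both strings resolve to the same known identifier."""
    id_a, id_b = resolve(name_a), resolve(name_b)
    return id_a is not None and id_a == id_b
```

With this in place, `same_entity("Captain Cook", "James Cook")` is true even though the strings differ, and anything recorded in `RELATIONSHIPS` applies to every variant of the name.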

…applying computational methods, tools to digitised sources

So far some of what we've seen has been heavily reliant on manual processing – someone has had to sit at a desk and decide which bit of text is about the defendant and which about the victim in an Old Bailey case.

So people are developing software algorithms to find concepts – people, places, events, etc – within text. This is partly a response to the amount of digitised text now available, and partly a response to recognition of the power of structured data. Techniques like 'named entity recognition' help create structure from unstructured data. This allows data to be queried, contextualised and presented in more powerful ways.

The named entity recognition software here [screenshot lost?] knows some things about the world – the names of places, people, dates, some organisations. It also gets lots of things wrong – it doesn't understand 'category five storm' as a concept, it mixes up people and organisations – but as a first pass, it has potential. Software can be trained to understand the kinds of concepts and things that occur in particular datasets. This also presents a problem for historians, who may have to use software trained for modern, commercial data.
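To make the idea concrete, here is a deliberately simple sketch of entity tagging using a hand-made lookup list (a 'gazetteer'). Real named entity recognition tools generally rely on trained statistical models rather than a fixed list, so this is a swapped-in toy technique, and the entries and labels are invented for illustration:

```python
import re

# Toy gazetteer: a hand-made lookup list, invented for illustration.
GAZETTEER = {
    "James Cook": "PERSON",
    "Old Bailey": "ORGANISATION",
    "London": "PLACE",
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (start, matched_text, label) tuples, ordered by position in the text."""
    found = []
    for name, label in gazetteer.items():
        # Word boundaries stop 'London' matching inside 'Londonderry'.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    return sorted(found)

sentence = "James Cook sailed from Plymouth; the Old Bailey is in London."
entities = tag_entities(sentence)
```

The tagger finds the three names it has been told about and silently misses 'Plymouth', which is a small-scale version of the problem described above: software only recognises the kinds of things it has been given or trained on.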

This is part of a wider exploration of 'distant reading', methods for understanding what's in a corpus by processing the text en masse rather than by reading each individual novel or document. For example, it might be used to find linguistic differences between genres of literature, or between authors from different countries.

In this example [screenshot of topic modelling lost?], statistically unlikely combinations of words have been grouped together into 'topics'. This provides a form of summary of the contents of text files.

Image tagging – 'machine learning' techniques mean that software can learn how to do things rather than having to be precisely programmed in advance. This will have more impact on the future of digital history as these techniques become mainstream.

Audio tagging – software suggests tags, humans verify them. Quicker than doing them from scratch, but possible for software to miss significant moments that a person would spot. (e.g. famous voices, cultural references, etc).

Handwritten text recognition will transform manuscript sources as much as optical character recognition has transformed typed sources!

…studying born-digital material (web archives, social media corpus etc)

Important historical moments, such as the 'Arab spring', happened on social media platforms like twitter, youtube and facebook. The British Library and the Internet Archive have various 'snapshots' of websites, but they can only hope to capture a part of online material. We've already lost significant chunks of web history – every time a social media platform is shut without being archived, future historians have lost valuable data. (Not to mention people's personal data losses).

This also raises questions about how we should study 'digital material culture'. Websites like Facebook really only make sense when they're used in a social context. The interaction design of 'likes' and comments, the way a newsfeed is constructed in seconds based on a tiny part of everything done in your network – these are hard to study as a series of static screenshots or data dumps.

…sharing history online

Sharing research outputs is great. At some point it starts to intersect with public history. But questions remain about 'broadcast' vs 'discursive' modes of public history – could we do more than model old formats online? Websites and social media can be just as one-way broadcast as television unless they're designed for two-way participation.

What's missing?

Are there other research objects or questions that should be included under the heading 'digital history'? [A question to allow for discussion time]

Towards the future of looking at the past

To sum up what we've seen so far – we've seen the transformation of unorganised, unprocessed data into 'information' through research activities like 'classification, rearranging/sorting, aggregating, performing calculations, and selection'.

Historical material is being transformed from a 'page' to a 'dataset'. As some of this process is automated, it raises new questions – how do we balance the convenience of automatic processing with the responsibility to review and verify the results? How do we convey the processes that went into creating a dataset so that another researcher can understand its gaps, the mixture of algorithmic and expert processes applied to it? My work at the British Library has made the importance of versioning a dataset or corpus clear – if a historian bases an argument on one version of OCR text, and the next version is better, they should be able to link to the version they based their work on.
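One possible way to make a version citable, sketched here as an assumption rather than a description of any institution's practice, is to derive a short fingerprint from the text itself. The sample OCR strings and the `source` label are invented:

```python
import hashlib

def version_id(text):
    """Short fingerprint of a text: identical text always gives the same id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# Invented examples: a noisy OCR transcription and a later, improved one.
ocr_v1 = "The prifoner was indidted for ftealing."
ocr_v2 = "The prisoner was indicted for stealing."

# A citation can record which version of the text an argument was based on.
citation = {
    "source": "hypothetical-newspaper-issue",
    "ocr_version": version_id(ocr_v1),
}

def matches_cited_version(text, cited):
    """Check whether a text is the exact version that was cited."""
    return version_id(text) == cited["ocr_version"]
```

A later reader holding the improved text would find that `matches_cited_version(ocr_v2, citation)` is false, which signals that results based on the earlier version may not reproduce exactly.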

We've thought about how digital text and media allow for new forms of analysis, using methods such as data visualisation, topic modelling or data mining. These methods can yield new insights and provoke new research questions, but most are not yet accessible to the ordinary historian. While automated processes help, preparing data for digital history is still incredibly detailed, time-consuming work.

What are the pros and cons of the forms of digital history discussed?

Cons

The ability to locate records on consumer-facing services like Google Maps is valuable, but commercial, general use mapping tools are not always suitable for historical data, which is often fuzzy, messy, and of highly variable coverage and precision. For example, placing text or points on maps can suggest a degree of certainty not supported by the data. Locating historical addresses can be inherently uncertain in instances where street numbers were not yet in use, but most systems expect a location to be placed as a precise dot (point of interest) on a map; drawing a line to mark a location would at least allow the length of a street to be marked as a possible address.

There is an unmet need for everyday geospatial tools suitable for historians. For example, those with datasets containing historical locations would appreciate the ability to map addresses from specific periods on historical maps that are georeferenced, georectified and displayable on a modern, copyright-free map or the historical map. Similarly, biographical software, particularly when used for family history, collaborative prosopographical or community history projects would benefit from the ability to record the degree of certainty for potential-but-not-yet-proven relationships or identifications, and to link uncertain information to specific individuals.

The complexity of some software packages (or the combination of packages assembled to meet various needs) is a barrier for those short on time, unable to access dedicated support or training, or who do not feel capable of learning the specialist jargon and skills required to assess and procure software to meet their needs. The need for equipment and software licences can be a financial barrier; unclear licensing requirements and costs for purchasing high-resolution historical maps are another. Copyright and licensing are also complex issues.

Sensible historians worry about the sustainability of digital sites – their personal research collection might be around for 30 years or more; and they want to cite material that will be findable later.

There are issues with representing historical data, particularly in modern tools that cannot represent uncertainty or contingency. Here [screenshot lost?] the curator's necessarily fuzzy label of 'early 17th century' has been assigned to a falsely precise date. Many digital tools are not (yet) suitable for historical data. Their abilities have been over-stated or their limits not clearly communicated/understood.
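A minimal sketch of the alternative is to store a fuzzy date as a range alongside the curator's original label, instead of collapsing it to a single day. The class and the reading of 'early' as roughly the first third of the century are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DateRange:
    """A date known only approximately: keep the label and the plausible bounds."""
    label: str      # the curator's original wording
    earliest: int   # first plausible year
    latest: int     # last plausible year

    def could_include(self, year):
        """True if the given year falls within the plausible range."""
        return self.earliest <= year <= self.latest

    def overlaps(self, other):
        """True if two uncertain dates could refer to the same period."""
        return self.earliest <= other.latest and other.earliest <= self.latest

# Assumption: 'early' is read here as roughly the first third of the century.
early_17th = DateRange("early 17th century", 1601, 1633)
the_1630s = DateRange("1630s", 1630, 1639)
```

Searching or sorting can then use the bounds (`early_17th.could_include(1610)`), while displays keep showing the original label rather than a falsely precise date.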

Very few peer-reviewed journals are able to host formats other than articles, inhibiting historians' ability to explore emerging digital formats for presenting research.

Faculty historians might dream of creating digital projects tailored for the specific requirements of their historical dataset, research question and audience, but their peers may not be confident in their ability to evaluate the results and assign credit appropriately.

Pros

Material can be recontextualised, transcluded, linked, contextualised. The distance between a reference and the original item is reduced to just a link (unless a paywall etc gets in the way). Material can be organised in multiple ways independent of its physical location. Digital tools can represent multiple commentaries or assertions on a single image or document through linked annotations.

Computational techniques for processing data could reduce the gap between well-funded projects and others, thereby reducing the likelihood of digital history projects reinscribing the canon.

Digitised resources have made it easier to write histories of ordinary lives. You can search through multiple databases to quickly collate biographical info (births, deaths, marriages etc) and other instances when their existence might be documented. This isn't just a change in speed, but also in the accessibility of resources without travel, expense.

Screenshot of a IIIF viewer showing search results highlighted on a digitised historical text
Wellcome's IIIF viewer showing a highlighted search result

Search – any word in a digitised text can be a search result – we're not limited to keywords in a catalogue record. We can also discover some historical material via general search engines. Phonetic and fuzzy searches have also improved the ability to discover sources.
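As a rough sketch of fuzzy searching, Python's standard `difflib` module can rank index entries by their similarity to a query. This is similarity matching on spelling, not a phonetic algorithm such as Soundex, and the surname index and the cutoff value are invented for illustration:

```python
import difflib

# Invented index of surnames as they might appear in transcribed records.
INDEX = ["Smith", "Smyth", "Smythe", "Schmidt", "Jones", "Johnson"]

def fuzzy_lookup(query, names=INDEX, cutoff=0.7):
    """Return index entries whose spelling is close to the query, best match first."""
    by_lowercase = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=5, cutoff=cutoff
    )
    return [by_lowercase[hit] for hit in hits]

results = fuzzy_lookup("smith")
```

Spelling variants such as 'Smyth' are returned alongside the exact match, while unrelated names like 'Jones' are not. Raising or lowering `cutoff` trades missed variants against false matches, which is a judgement call for each collection.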

Historians like Professor Katrina Navickas have shown new models for the division of labour between people and software; previously most historical data collection and processing was painstakingly done by historians. She and others have shown how digital techniques can be applied to digitised sources in the pursuit of a historical research question.

Conclusion and questions: digital history, digital historiography?

The future is here, it's just not evenly distributed (this is the downer bit)

Academic historians might find it difficult to explore new forms of digital creation if they are hindered by the difficulties of collaborating on interdisciplinary digital projects and their need for credit and attribution when publishing data or research. More advanced forms of digital history also require access to technical expertise. While historians should know the basics of computational thinking, most may not be able to train as a programmer and as a historian – how much should we expect people to know about making software?

I've hinted at the impact of convenience in accessing digitised historical materials, and in those various stages of 'discovering', 'gathering', 'creating' and 'sharing'… We must also consider how experiences of digital technologies have influenced our understanding of what is possible in historical research, and the factors that limit the impact of digital technologies. The ease with which historians transform data from text notes to spreadsheets to maps to publications and presentations is almost taken for granted, but it shows the impact of digitality on enhancing everyday research practices.

So digital history has potential, is being demonstrated, but there's more to do…

Useful distractions: help cultural heritage and scientific projects from home

Today I came across the term 'terror-scrolling', a good phrase to describe the act of glancing from one COVID-19 update to another. While you can check out galleries, libraries, archives and museums content online or explore the ebooks, magazines and other digital items available from your local library, you might also want to help online projects from scientific and cultural heritage organisations. You can call it 'online volunteering' or 'crowdsourcing', but the key point is that these projects offer a break from the everyday while contributing to a bigger goal.

Not commuting at the moment? Need to channel some energy into something positive? You can help transcribe historical text that computers can't read, or sort scientific images. And don't worry – these sites will let you know what skills are required, you can often try a task before registering, and they have built-in methods for dealing with any mistakes you might make at the start.

Here's a list of sites that have a variety of different kinds of tasks / content to work on:

Some of these sites offer projects in languages other than English, and I've collected additional multi-lingual / international sites at Crowdsourcing the world’s heritage – I'm working on an update that'll make it easy to find current, live projects but (ironically, for someone who loves taking part in projects) I can't spend much time at my desk right now so it's not ready just yet.

Stuck at home? View cultural heritage collections online

With people self-isolating to slow the spread of the COVID-19 pandemic, parents and educators (as well as people looking for an art or history fix) may be looking to replace in-person trips to galleries, libraries, archives and museums* with online access to images of artefacts and information about them. GLAMs have spent decades getting some of the collections digitised and online so that you can view items and information from home.

* Collectively known as 'GLAMs' because it's a mouthful to say each time

Search a bunch of GLAM portals at once

I've made a quick 'custom search engine' so you can search most of the sites above with one Google search box. Search a range of portals that collect digitised objects, texts and media from galleries, libraries, archives and museums internationally:

The direct link is https://cse.google.com/cse?cx=006190492493219194770:xw0b7dfwb6b (it's just a search box, without any context, but it means you can do a search without loading this whole post)

Collections, deep zoom and virtual tour portals

Various platforms have large collections of objects from different institutions, in formats ranging from 'virtual exhibitions' or 'tours' to 'deep zooms' to catalogue-style pages about objects. I've focused on sites that include collections from multiple institutions, but this also means some of them are huge and you'll have to explore a bit to find relevant content. Try:

Other links

Various articles have collected institution-specific links to different forms of virtual tours. Try:

Things are moving fast, so let me know about other sets of links to collections, stories and tours online that'll help people staying home get their fix of history and culture and I'll update this post. Comment below, email me or @mia_out on twitter.

Screenshot from https://www.europeana.eu/portal/en
Europeana is just one of many online portals to images, stories, deep zooms and virtual tours / exhibitions from galleries, libraries, archives and museums internationally

'The Past, Present and Future of Digital Scholarship with Newspaper Collections'

It's not easy to find the abstracts for presentations within panels on the Digital Humanities 2019 (DH2019) site, so I've shared mine here. The panel was designed to bring together a range of interdisciplinary newspaper-based digital humanities and/or data science projects, with 'provocations' from two senior scholars who will provide context for current ambitions, and to start conversations among practitioners.

Short Paper: Living with Machines

Paper authors: Mia Ridge, Giovanni Colavizza with Ruth Ahnert, Claire Austin, David Beavan, Kaspar Beelen, Mariona Coll Ardanuy, Adam Farquhar, Emma Griffin, James Hetherington, Jon Lawrence, Katie McDonough, Barbara McGillivray, André Piza, Daniel van Strien, Giorgia Tolfo, Alan Wilson, Daniel Wilson.

My slides: https://www.slideshare.net/miaridge/living-with-machines-at-the-past-present-and-future-of-digital-scholarship-with-newspaper-collections-154700888

Living with Machines is a five-year interdisciplinary research project, whose ambition is to blend data science with historical enquiry to study the human impact of the industrial revolution. Set to be one of the biggest and most ambitious digital humanities research initiatives ever to launch in the UK, Living with Machines is developing a large-scale infrastructure to perform data analyses on a variety of historical sources, and in so doing provide vital insights into the debates and discussions taking place in response to today’s digital industrial revolution.

Seeking to make the most of a self-described 'radical collaboration', the project will iteratively develop research questions as computational linguists, historians, library curators and data scientists work on a shared corpus of digitised newspapers, books and biographical data (census, birth, death, marriage, etc. records). For example, in the process of answering historical research questions, the project could take advantage of access to expertise in computational linguistics to overcome issues with choosing unambiguous and temporally stable keywords for analysis, previously reported by others (Lansdall-Welfare et al., 2017). A key methodological objective of the project is to 'translate' history research questions into data models, in order to inspect and integrate them into historical narratives. In order to enable this process, a digital infrastructure is being collaboratively designed and developed, whose purpose is to marshal and interlink a variety of historical datasets, including newspapers, and allow for historians and data scientists to engage with them.

In this paper we will present our vision for Living with Machines, focusing on how we plan to approach it, and the ways in which digital infrastructure enables this multidisciplinary exchange. We will also showcase preliminary results from the different research 'laboratories', and detail the historical sources we plan to use within the project.

The Past, Present and Future of Digital Scholarship with Newspaper Collections

Mia Ridge (British Library), Giovanni Colavizza (Alan Turing Institute)

Historical newspapers are of interest to many humanities scholars, valued as sources of information and language closely tied to a particular time, social context and place. Following library and commercial microfilming and, more recently, digitisation projects, newspapers have been an accessible and valued source for researchers. The ability to use keyword searches through more data than ever before via digitised newspapers has transformed the work of researchers.[1]

Digitised historic newspapers are also of interest to many researchers who seek large bodies of relatively easily computationally-transcribed text on which they can try new methods and tools. Intensive digitisation over the past two decades has seen smaller-scale or repository-focused projects flourish in the Anglophone and European world (Holley, 2009; King, 2005; Neudecker et al., 2014). However, just as earlier scholarship was potentially over-reliant on The Times of London and other metropolitan dailies, this has been replicated and reinforced by digitisation projects (for a Canadian example, see Milligan 2013).

In recent years, several large consortia projects proposing to apply data science and computational methods to historical newspapers at scale have emerged, including NewsEye, impresso, Oceanic Exchanges and Living with Machines. This panel has been convened by some consortia members to cast a critical view on past and ongoing digital scholarship with newspaper collections, and to inform its future.

Digitisation can involve both complexities and simplifications. Knowledge about the imperfections of digitisation, cataloguing, corpus construction, text transcription and mining is rarely shared outside cultural institutions or projects. How can these imperfections and absences be made visible to users of digital repositories? Furthermore, how does the over-representation of some aspects of society through the successive winnowing and remediation of potential sources – from creation to collection, microfilming, preservation, licensing and digitisation – affect scholarship based on digitised newspapers? How can computational methods address some of these issues?

The panel proposes the following format: short papers will be delivered by existing projects working on large collections of historical newspapers, presenting their vision and results to date. The projects are at different stages of development; each will discuss its choice to work with newspapers, and reflect on what it has learnt to date on practical, methodological and user-focused aspects of this digital humanities work. The panel is additionally an opportunity to consider important questions of interoperability and legacy beyond the life of the project. Two further papers will follow, given by scholars with significant experience using these collections for research, in order to provide the panel with critical reflections. The floor will then open for debate and discussion.

This panel is a unique opportunity to bring senior scholars with a long perspective on the uses of newspapers in scholarship together with projects at formative stages. More broadly, convening this panel is an opportunity for the DH2019 community to ask their own questions of newspaper-based projects, and for researchers to map methodological similarities between projects. Our hope is that this panel will foster a community of practice around the topic and encourage discussions of the methodological and pedagogical implications of digital scholarship with newspapers.

[1] For an overview of the impact of keyword search on historical research see Putnam (2016) and Bingham (2010).

Cross-post: Seeking researchers to work on an ambitious data science and digital humanities project

I rarely post here at the moment, in part because I post on the work blog. Here's a cross-post to help spread the word about some exciting opportunities currently available: Seeking researchers to work on an ambitious data science and digital humanities project at the British Library and Alan Turing Institute (London)

'If you follow @BL_DigiSchol or #DigitalHumanities hashtags on twitter, you might have seen a burst of data science, history and digital humanities jobs being advertised. In this post, Dr Mia Ridge of the Library's Digital Scholarship team provides some background to contextualise the jobs advertised with the 'Living with Machines' project.

We are seeking to appoint several new roles who will collaborate on an exciting new project developed by the British Library and The Alan Turing Institute, the national centre for data science and artificial intelligence.

Jobs currently advertised:

The British Library jobs are now advertised, closing September 21:

You may have noticed that the British Library is also currently advertising for a Curator, Newspaper Data (closes Sept 9). This isn’t related to Living with Machines, but with an approach of applying data-driven journalism and visualisation techniques to historical collections, it should have some lovely synergies and opportunities to share work in progress with the project team. There's also a Research Software Engineer advertised that will work closely with many of the same British Library teams.

If you're applying for these posts, you may want to check out the Library's visions and values on the refreshed 'Careers' website.'

Network visualisations and the 'so what?' problem

This week I was in Luxembourg for a workshop on Network Visualisation in the Cultural Heritage Sector, organised by Marten Düring and held on the Belval campus of the University of Luxembourg.

In my presentation, I responded to some of the questions posed in the workshop outline:

In this workshop we want to explore how network visualisations and infrastructures will change the research and outreach activities of cultural heritage professionals and historians. Among the questions we seek to discuss during the workshop are for example: How do users benefit from graphs and their visualisation? Which skills do we expect from our users? What can we teach them? Are SNA [social network analysis] theories and methods relevant for public-facing applications? How do graph-based applications shape a user’s perception of the documents/objects which constitute the data? How can applications benefit from user engagement? How can applications expand and tap into other resources?

A rough version of my talk notes is below. The original slides are also online.

Caveat

While I may show examples of individual network visualisations, this talk isn't a critique of them in particular. There's lots of good practice around, and these lessons probably aren't needed for people in the room.

Fundamentally, I think network visualisations can be useful for research, but to make them more effective tools for outreach, some challenges should be addressed.

Context

I'm a Digital Curator at the British Library, mostly working with pre-1900 collections of manuscripts, printed material, maps, etc. Part of my job is to help people get access to our digital collections. Visualisations are a great way to firstly help people get a sense of what's available, and then to understand the collections in more depth.

I've been teaching versions of an 'information visualisation 101' course at the BL and digital humanities workshops since 2013. Much of what I'm saying now is based on comments and feedback I get when presenting network visualisations to academics and cultural heritage staff (who should be a key audience for social network analyses).

Provocation: digital humanists love network visualisations, but ordinary people say, 'so what'?

And this is a problem. We're not conveying what we're hoping to convey.

Network visualisation, via http://fredbenenson.com/

When teaching datavis, I give people time to explore examples like this, then ask questions like 'Can you tell what is being measured or described? What do the relationships mean?'. After talking about the pros and cons of network visualisations, discussion often reaches a 'yes, but so what?' moment.

Here are some examples of problems ordinary people have with network visualisations…

Location matters

Spatial layout based on the pragmatic aspects of fitting something on the screen using physics, rules of attraction and repulsion doesn't match what people expect to see. It's really hard for some to let go of the idea that spatial layout has meaning. The idea that location on a page has meaning of some kind is very deeply linked to their sense of what a visualisation is.
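
To make that concrete, here is a minimal sketch of a toy force-directed layout in Python. It is an illustration under stated assumptions rather than the algorithm used by any particular tool: the function name `force_layout`, the constants and the four-node graph are all invented for this example. It shows that positions come from random starting points plus attraction and repulsion, so the same data can settle into different pictures.

```python
import math
import random


def force_layout(edges, seed, steps=200, step_size=0.05):
    """Toy force-directed layout: all nodes repel, linked nodes attract."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    # Starting positions are random; nothing in the data decides them.
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}

    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes, stronger when they are close.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = 1.0 / dist
                force[a][0] += dx / dist * push
                force[a][1] += dy / dist * push
                force[b][0] -= dx / dist * push
                force[b][1] -= dy / dist * push
        # Attraction along each edge, stronger when the nodes are far apart.
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist
            force[a][0] -= dx / dist * pull
            force[a][1] -= dy / dist * pull
            force[b][0] += dx / dist * pull
            force[b][1] += dy / dist * pull
        # Move each node a small, capped distance along its net force.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy) or 1e-9
            scale = step_size * min(mag, 1.0) / mag
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale

    return {n: (round(x, 2), round(y, 2)) for n, (x, y) in pos.items()}


# Hypothetical graph: the same four nodes and four links each time.
EDGES = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
layout_one = force_layout(EDGES, seed=1)
layout_two = force_layout(EDGES, seed=2)
```

The two layouts describe exactly the same relationships, but the coordinates differ because only the random seed changed. A node that sits 'top left' in one version has no special claim to that spot.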

Animated physics is … pointless?

People sometimes like the sproinginess when a network visualisation resettles after a node has been dragged, but waiting for the animation to finish can also be slow and irritating. Does it convey meaning? If not, why is it there?

Size, weight, colour = meaning?

The relationship between size, colour and weight isn't always intuitive – people assume meaning where there might be none.

In general, network visualisations are more abstract than people expect a visualisation to be.

'What does this tell me that I couldn't learn as quickly from a sentence, list or table?'

Table of data, via http://fredbenenson.com/

Scroll down the page that contains the network graph above and you get other visualisations. Sometimes they're much more positively received, particularly when people feel they learn more from them than from the network visualisation.

Onto other issues with 'network visualisations as communication'…

Which algorithmic choices are significant?

screenshot of network graphs
Mike Bostock's force-directed and curved line versions of character co-occurrence in Les Misérables

It's hard for novices to know which algorithmic and data-cleaning choices are significant, and which have a more superficial impact.

Untethered images

Images travel extremely well on social media. When they do so, they often leave information behind and end up floating in space. Who created this, and why? What world view does it represent? What source material underlies it, how was it manipulated to produce the image? Can I trust it?

'Can't see the wood for the trees'

viral texts

When I showed this to a class recently, one participant was frustrated that they couldn't 'see the wood for the trees'. The visualisation gives a general impression of density, but it's not easy to dive deeper into detail.

Stories vs hairballs

But when I started to explain what was being represented – the ways in which stories were copied from one newspaper to another – they were fascinated. They might have found their way there if they'd read the text but again, the visualisation is so abstract that it didn't hint at what lay underneath. (Also I have only very, very rarely seen someone stop to read the text before playing with a visualisation.)

No sense of change over time

This flattening of time into one simultaneous moment is more vital for historical networks than for literary ones, but even so, you might want to compare relationships between sections of a literary work.

No sense of texture, detail of sources

All network visualisations look similar, whether they're about historical texts or cans of baked beans. Dots and lines mask texture, and don't always hint at the depth of information they represent.

Jargon

Node. Edge. Graph. Directed, undirected. Betweenness. Closeness. Eccentricity.

There's a lot to take on to really understand what's being expressed in a network graph.

There is some hope…

Onto the positive bit!

Interactivity is engaging

People find the interactive movement, the ability to zoom and highlight links engaging, even if they have no idea what's being expressed. In class, people started to come up with questions about the data as I told them more about what was represented. That moment of curiosity is an opportunity if they can dive in and start to explore what's going on and what the relationships mean.

…but different users have different interaction needs

For some, there's that frustration expressed earlier that they 'can't get to see a particular tree' in the dense woods of a network visualisation. People often want to get to the detail of an instance of a relationship – the lines of text, images of the original document – from a graph.

This mightn't be how network visualisations are used in research, but it's something to consider for public-facing visualisations. How can we connect abstract lines or dots to detail, or provide more information about what the relationship means, show the quantification expressed as people highlight or filter parts of a graph? A harder, but more interesting task is hinting at the texture or detail of those relationships.

Proceed, with caution

One of the workshop questions was 'Are social network analysis theories and methods relevant for public-facing applications?' – and maybe the answer is a qualified yes. As a working tool, they're great for generating hypotheses, but they need a lot more care before exposing them to the public.

[As an aside, I’d always taken the difference between visualisations as working tools for exploring data – part of the process of investigating a research question – and visualisation as an output – a product of the process, designed for explanation rather than exploration – as fundamental, but maybe we need to make that distinction more explicit.]

But first – who are your 'users'?

During this workshop, at different points we may be talking about different 'users' – it's useful to scope who we mean at any given point. In this presentation, I was talking about end users who encounter visualisations, not scholars who may be organising and visualising networks for analysis.

Sometimes a network visualisation isn't the answer … even if it was part of the question.

As an outcome of an exploratory process, network visualisations are not necessarily the best way to present the final product. Be disciplined – make yourself justify the choice to use network visualisations.

No more untethered images

Include an extended caption – data source, tools and algorithms used. Provide a link to find out more – why this data, this form? What was interesting but not easily visualised? Let people download the dataset to explore themselves?

Present visualisations as the tip of the data iceberg

Visualisations are the tip of the iceberg

Lots of interesting data doesn't make it into a visualisation. Talking about what isn't included and why it was left out is important context.

Talk about data that couldn't exist

Beyond the (fuzzy, incomplete, messy) data that's left out because it's hard to visualise, data that never existed in the first place is also important:

'because we're only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.'
Scott Weingart, 'Networks Demystified 8: When Networks are Inappropriate'

Help users learn the skills and knowledge they need to interpret network visualisations in context.

How? Good question! This is the point at which I hand over to you…

April news in crowdsourcing, citizen science, citizen history

Another quick post with news on crowdsourcing in cultural heritage, citizen science and citizen history in April(ish) 2016…

Acceptances for our DH2016 Expert Workshop: Beyond The Basics: What Next For Crowdsourcing? have been sent out. If you missed the boat, don't panic! We're taking a few more applications on a rolling basis to allow for people with late travel approval for the DH2016 conference in July.

Probably the biggest news is the launch of citizenscience.gov, as it signals the importance of citizen science and crowdsourcing to the US government.

From the press release: 'the White House announced that the U.S. General Services Administration (GSA) has partnered with the Woodrow Wilson International Center for Scholars (WWICS), a Trust instrumentality of the U.S. Government, to launch CitizenScience.gov as the new hub for citizen science and crowdsourcing initiatives in the public sector.

CitizenScience.gov provides information, resources, and tools for government personnel and citizens actively engaged in or looking to participate in citizen science and crowdsourcing projects. … Citizen science and crowdsourcing are powerful approaches that engage the public and provide multiple benefits to the Federal government, volunteer participants, and society as a whole.'

There's also work to 'standardize data and metadata related to citizen science, allowing for greater information exchange and collaboration both within individual projects and across different projects'.

Other news:

Responses to questions about if the volunteers agreed that the Zooniverse… From Science Learning via Participation in Online Citizen Science

Have I missed something important? Let me know in the comments or @mia_out.

From grey dots to trenches to field books – news in heritage crowdsourcing

Apparently you can finish a thesis but you can't stop scanning for articles and blog posts on your topic. Sharing them here is a good way to shake the 'I should be doing something with this' feeling.* This is a fairly random sample of recent material, but if people find it useful I can go back and pull out other things I've collected.

Victoria Van Hyning, ‘What’s up with those grey dots?’ you ask – brief blog post on using software rather than manual processes to review multiple text transcriptions, and on the interface challenges that brings.

Melissa Terras, 'Crowdsourcing in the Digital Humanities' – pre-print PDF for a chapter in A New Companion to Digital Humanities.

Richard Grayson, 'A Life in the Trenches? The Use of Operation War Diary and Crowdsourcing Methods to Provide an Understanding of the British Army’s Day-to-Day Life on the Western Front' – a peer-reviewed article based on data created through Operation War Diary.

The Impact of Coordinated Social Media Campaigns on Online Citizen Science Engagement – a poster by Lesley Parilla and Meghan Ferriter reported on the Biodiversity Heritage Library blog.

Ben Brumfield, Crowdsourcing Transcription Failures – a response to a mailing list post asking 'where are the failures?'

And finally, something related to my interest in participatory history commons: Martin Luther King Jr. Memorial Library – Central Library launches Memory Lab, a 'DIY space where you can digitize your home movies, scan photographs and slides, and learn how to care for your physical and digital family heirlooms'. I was so excited when I heard about this project – it's addressing such important issues. Jaime Mears is blogging about the project.

* How long after a PhD does it take for that feeling to go? Asking for a friend.

The good, the bad, and the unstructured… Open data in cultural heritage

I was in London this week for the Linked Pasts event, where I presented on trends and practices for open data in cultural heritage. Linked Pasts was a colloquium on linked open data in cultural heritage organised by the Pelagios project (Leif Isaksen, Elton Barker and Rainer Simon with Pau de Soto). I really enjoyed the other papers, which included thoughtful, grounded approaches to structured data for historical periods, places and people, recognition of the importance of designing projects around audience needs (including user research), the relationship between digital tools and scholarly inquiry, visualisations as research tools, and the importance of good infrastructure for digital history.

My talk notes are below the embedded slides.

Warning: generalisations ahead.

My discussion points are based on years of conversations with other cultural heritage technologists in museums, libraries, and archives, but inevitably I'll have blind spots. For example, I'm focusing on the English-speaking world, which means I'm not discussing the great work that Dutch and Japanese organisations are doing. I've undoubtedly left out brilliant specific examples in the interests of focusing on broader trends. The point is to start conversations, to bring issues out into the open so we can collectively decide how to move forward.

The good

The good news is that more and more open cultural data is being published. Organisations have figured out that a) nothing bad is likely to happen and that b) they might get some kudos for releasing open data.

Generally, organisations are publishing the data that they have to hand – this means it's mostly collections data. This data is often as messy, incomplete and fuzzy as you'd expect from records created by many different people using many different systems over a hundred or more years.

…the bad…

Copyright restrictions mean that images mightn't be included. Furthermore, because it's often collections data, it's not necessarily rich in interpretative information. It's metadata rather than data. It doesn't capture the scholarly debates, the uncertain attributions, the biases in collecting… It certainly doesn't capture the experience of viewing the original object.

Licensing issues are still a concern. Until cultural organisations are rewarded by their funders for releasing open data, and funders free organisations from expectations for monetising data, there will be damaging uncertainty about the opportunity cost of open data.

Non-commercial licences are also an issue – organisations and scholars might feel exploited if others who have not contributed to the process of creating it can commercially publish their work. Finally, attribution is an important currency for organisations and scholars but most open licences aren't designed with that in mind.

…and the unstructured

The data that's released is often pretty unstructured. CSV files are very easy to use, so they help more people get access to information (assuming they can figure out GitHub), but a giant dump like this doesn't provide stable URIs for each object. Records in data dumps rarely link to external identifiers like the Getty's Thesaurus of Geographic Names, Art & Architecture Thesaurus (AAT) or Union List of Artist Names, or vernacular sources for place and people names such as Geonames or DBPedia. And that's fair enough, because people using a CSV file probably don't want all the hassle of dereferencing each URI to grab the place name so they can visualise data on a map (or whatever they're doing with the data). But it also means that it's hard for someone to reliably look for matching artists in their database, and link these records with data from other organisations.
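
As a rough illustration of why shared identifiers matter, here is a minimal Python sketch. Everything in it is invented for the example: the two 'museum' CSV dumps, the column names and the `http://example.org/` URIs are assumptions, not real records or real vocabulary entries. It compares matching artists across two data dumps by name string with matching by a shared identifier column.

```python
import csv
import io

# Two hypothetical CSV dumps from different organisations (invented records).
MUSEUM_A = """object,artist,artist_uri
Harbour at dusk,"Smith, J.",http://example.org/artists/42
Still life,"Jones, Ann",http://example.org/artists/7
"""

MUSEUM_B = """title,maker,maker_uri
View of the harbour,John Smith,http://example.org/artists/42
Portrait,A. Jones,http://example.org/artists/7
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


a_rows = read_rows(MUSEUM_A)
b_rows = read_rows(MUSEUM_B)

# Matching on the name as typed: each organisation records names differently.
name_matches = [
    (a["object"], b["title"])
    for a in a_rows
    for b in b_rows
    if a["artist"] == b["maker"]
]

# Matching on a shared external identifier: spelling no longer matters.
uri_matches = [
    (a["object"], b["title"])
    for a in a_rows
    for b in b_rows
    if a["artist_uri"] == b["maker_uri"]
]

print(name_matches)  # no pairs found
print(uri_matches)   # both pairs of works by the same artist are linked
```

In this toy case the name comparison finds nothing, because 'Smith, J.' and 'John Smith' are different strings, while the identifier column links both pairs of records. The CSV stays easy to use; the extra column is what makes it linkable.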

So it's open, but it's often not very linked. If we're after a 'digital ecosystem of online open materials', this open data is only a baby step. But it's often where cultural organisations finish their work.

Classics > Cultural Heritage?

But many others, particularly in the classical and ancient world, have managed to overcome these issues to publish and use linked open data. So why do museums, libraries and archives seem to struggle? I'll suggest some possible reasons as conversation starters…

Not enough time

Organisations are often busy enough keeping their internal systems up and running, dealing with the needs of visitors in their physical venues, working on ecommerce and picture library systems…

Not enough skills

Cultural heritage technologists are often generalists, and apart from being too time-stretched to learn new technologies for the fun of it, they might not have the computational or information science skills necessary to implement the full linked data stack.

Some cultural heritage technologists argue that they don't know of any developers who can negotiate the complexities of SPARQL endpoints, so why publish it? The complexity is multiplied when complex data models are used with complex (or at least, unfamiliar) technologies. For some, SPARQL puts the 'end' in 'endpoint', and 'RDF triples' can seem like an abstraction too far. In these circumstances, the instruction to provide linked open data as RDF is a barrier they won't cross.
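
For anyone who finds 'RDF triples' an abstraction too far, this minimal Python sketch may help. It is not RDF or SPARQL itself: the `ex:` names, the `match` helper and the data are hypothetical, and a real triple store does far more. It only shows the underlying idea, that a triple is a three-part statement (subject, predicate, object) and a query is a pattern with some parts left blank.

```python
# Hypothetical statements, each a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:painting1", "ex:createdBy", "ex:artist42"),
    ("ex:painting2", "ex:createdBy", "ex:artist42"),
    ("ex:artist42", "ex:bornIn", "ex:place9"),
    ("ex:painting3", "ex:createdBy", "ex:artist7"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None means 'anything'."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# 'Which works were created by artist 42?'
# Roughly the shape of: SELECT ?work WHERE { ?work ex:createdBy ex:artist42 }
works = [s for s, _, _ in match(TRIPLES, p="ex:createdBy", o="ex:artist42")]
print(works)  # ['ex:painting1', 'ex:painting2']
```

The complexity people run into with real endpoints comes from the data models and vocabularies layered on top, rather than from the three-part statements themselves.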

But sometimes it feels as if some heritage technologists are unnecessarily allergic to complexity. Avoiding unnecessary complexity is useful, but progress can stall if they demand that everything remains simple enough for them to feel comfortable. Some technologists might benefit from working with people more used to thinking about structured data, such as cataloguers, registrars etc. Unfortunately, linked open data falls in the gap between the technical and the informatics silos that often exist in cultural organisations.

And organisations are also not yet using triples or structured data provided by other organisations [with the exception of identifiers for e.g. people, places and specific vocabularies]. They're publishing data in broadcast mode; it's not yet a dialogue with other collections.

Not enough data

In a way, this is the collections documentation version of the technical barriers. If the data doesn't already exist, it's hard to publish. If it needs work to pull it out of different departments, or different individuals, who's going to resource that work? Similarly, collections staff are unlikely to have time to map their data to CIDOC-CRM unless there's a compelling reason to do so. (And some of the examples given might use cultural heritage collections but are a better fit with the work of researchers outside the institution than the institution's own work).

It may be easier for some types of collections than others – art collections tend to be smaller and better described; natural history collections can link into international projects for structured data, and libraries can share cataloguing data. Classicists have also been able to get a critical mass of data together. Your local records office or small museum may have more heterogeneous collections, and there are fewer widely used ontologies or vocabularies for historical collections. The nature of historical collections means that 'small ontologies, loosely joined', may be more effective, but creating these, or mapping collections to them, is still a large piece of work. While there are tools for mapping to data structures like Europeana's data model, it seems the reasons for doing so haven't been convincing enough, so far. Which brings me to…

Not enough benefits

This is an important point, and an area the community hasn't paid enough attention to in the past. Too many conversations have jumped straight to discussion about the specific standards to use, and not enough have been about the benefits for heritage audiences, scholars and organisations.

Many technologists – who are the ones making decisions about digital standards, alongside the collections people working on digitisation – are too far removed from the consumers of linked open data to see the benefits of it unless we show them real world needs.

There's a cost in producing data for others, so it needs to be linked to the mission and goals of an organisation. Organisations are not generally able to prioritise the potential, future audiences who might benefit from tools someone else creates with linked open data when they have so many immediate problems to solve first.

While some cultural and historical organisations have done good work with linked open data, the purpose can sometimes seem rather academic. Linked data is not always explained in a way that convinces the average, over-worked collections or digital team that the benefits outweigh the financial and intellectual investment.

No-one's drinking their own champagne

You don't often hear of people beating on the door of a museum, library or archive asking for linked open data, and most organisations are yet to map their data to specific, widely-used vocabularies because they need to use them in their own work. If technologists in the cultural sector are isolated from people working with collections data and/or research questions, then it's hard for them to appreciate the value of linked data for research projects.

The classical world has benefited from small communities of scholar-technologists – so they're not only drinking their own champagne, they're throwing parties. Smaller, more contained collections of sources and research questions help create stronger connections and give people a reason to link their sources. And as we're learning throughout the day, community really helps motivate action.

(I know it's normally called 'eating your own dog food' or 'dogfooding' but I'm vegetarian, so there.)

Linked open data isn't built into collections management systems

Getting linked open data into collections management systems should mean that publishing linked data is an automatic part of sharing data online.

Chicken or the egg?

So it's all a bit 'chicken or the egg' – will it stay that way? Until there's a critical mass, probably. These conversations about linked open data in cultural heritage have been going around for years, but it also shows how far we've come.

[And if you've published open data from cultural heritage collections, linked open data on the classical or ancient world, or any other form of structured data about the past, please add it to the wiki page for museum, gallery, library and archive APIs and machine-readable data sources for open cultural data.]

Drink your own champagne! (Nasjonalbiblioteket image)

The rise of interpolated content?

One thing that might stand out when we look back at 2014 is the rise of interpolated content. We've become used to translating around auto-correct errors in texts and emails but we seem to be at a tipping point where software is going ahead and rewriting content rather than prompting you to notice and edit things yourself.

iOS doesn't just highlight or fix typos, it changes the words you've typed. To take one example, iOS users might use 'ill' more than they use 'ilk', but if I typed 'ilk' I'm not happy when it's replaced by an algorithmically-determined 'ill'. As a side note, understanding the effect of auto-correct on written messages will be a challenge for future historians (much as it is for us sometimes now).

And it's not only text. In 2014, Adobe previewed GapStop, 'a new video technology that eases transitions and removes pauses from video automatically'. It's not just editing out pauses, it's creating filler images from existing images to bridge the gaps so the image doesn't jump between cuts. It makes it a lot harder to tell when someone's words have been edited to say something different to what they actually said – again, editing audio and video isn't new, but making it so easy to remove the artefacts that previously provided clues to the edits is.

Photoshop has long let you edit the contrast and tone in images, but now their Content-Aware Move, Fill and Patch tools can seamlessly add, move or remove content from images, making it easy to create 'new' historical moments. The images on extrapolated-art.com, which uses '[n]ew techniques in machine learning and image processing […] to extrapolate the scene of a painting to see what the full scenery might have looked like' show the same techniques applied to classic paintings.

But photos have been manipulated since they were first used, so what's new? As one Google user reported in It’s Official: AIs are now re-writing history, 'Google’s algorithms took the two similar photos and created a moment in history that never existed, one where my wife and I smiled our best (or what the algorithm determined was our best) at the exact same microsecond, in a restaurant in Normandy.' The important difference here is that he did not create this new image himself: Google's scripts did, without asking or specifically notifying him. In twenty years' time, this fake image may become part of his 'memory' of the day. Automatically generated content like this also takes the question of intent entirely out of the process of determining 'real' from interpolated content. And if software starts retrospectively 'correcting' images, what does that mean for our personal digital archives, for collecting institutions and for future historians?

Interventions between the act of taking a photo and posting it on social media might be one of the trends of 2015. Facebook are about to start 'auto-enhancing' your photos, and apparently, Facebook Wants To Stop You From Uploading Drunk Pictures Of Yourself. Apparently this is to save your mum and boss seeing them; the alternative path of building a social network that doesn't show everything you do to your mum and boss was lost long ago. Would the world be a better place if Facebook or Twitter had a 'this looks like an ill-formed rant, are you sure you want to post it?' function?

So 2014 seems to have brought the removal of human agency from the process of enhancing, and even creating, text and images. Algorithms writing history? Where do we go from here? How will we deal with the increase of interpolated content when looking back at this time? I'd love to hear your thoughts.