From piles of material to patchwork: How do we embed the production of usable collections data into library work?

How do we embed the production of usable collections data into library work?These notes were prepared for a panel discussion at the ‘Always Already Computational: Collections as Data‘ (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I’ve also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called ‘From libraries as patchwork to datasets as assemblages?‘ but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I’m focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I’ve condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it’s not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. ‘Production’ is there because infrastructure and workflow is unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL’s collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there’s treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institution deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the ‘gold standard’ cataloguing already in progress. If they’re made widely available, these other methods might help bridge the resourcing gaps that mean it’s easier to find items from richer institutions and countries than from poorer ones.

Photo of piles of materialYou probably already all know this, but it’s worth remembering: our collections aren’t even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don’t yet have digital catalogue records. Some items can’t be digitised because they’re too big, small or fragile for scanning or photography; others can’t be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, ‘data’ may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements or any combination or these and more)

Image credit: https://www.flickr.com/photos/teen_s/6251107713/

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it’s often as great big lumps of things. It’s hard for normal scholars to process (or just unzip) 4gb of data.

Currently, datasets are often created outside normal processes, and over time they become ‘stale’ as they’re not updated when source collections records change. And when they manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.

The BL’s metadata team have experimented with ‘researcher format’ CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don’t mean SPARQL but rather faceted browsing or structured search forms).

Image credit: screenshot from http://data.bl.uk/

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out or text and images. Sometimes it’s operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it’s a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they’re finally recorded in a bespoke database.

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we’re to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by catalogues, which by software, by external groups, etc.

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There’s a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.

A lot of this work has been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should turn to turning local references into URIs and making the most of that investment.

Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can’t easily record possible attributions, and there’s understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we’re not to be held back by the need for records to be perfectly complete before they’re published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don’t overly rely on our descriptions, making assumptions about the presence or absence of information when it’s not appropriate.

Image credit: http://europeana.eu/portal/record/2021648/0180_N_31601.html

Photo of pipes over a buildingA lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL’s systems are designed around the ‘deliverable unit’ – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.

(I’d love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren’t yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357
My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We’re making a start and there’s a lot of thoughtful work behind the scenes, but maybe a bit more investment is needed from research libraries to become as comfortable with data users as they are with the readers who pass through their physical doors.

Trying computational data generation and entity extraction

I’ve developed this exercise on computational data generation and entity extraction for various information/data visualisation workshops I’ve been teaching lately. As these methods have become more accessible, my dataviz workshops have included more discussion of computational methods for generating data to be visualised. I used to do a text-based version of this, but found that using services that describe images was more accessible and generated richer discussion in class. If you try something like this in your classes I’d love to hear from you.

It’s also a chance to talk about the uses of these technologies in categorising and labelling our posts on social media. We can tell people that their social media posts are analysed for personality traits and mentions of brands, but seeing it in action is much more powerful.

Exercise: trying computational data generation and entity extraction

Time: c. 10 minutes with discussion.

Goal: explore methods for extracting information from text or an image and reflect on what the results tell you about the algorithms

1 Find a sample image

Find an image (e.g. from a news site or digitised text) you can download and drag into the window. It may be most convenient to save a copy to your desktop. Many sites let you load images from a URL, so right- or control-clicking to copy an image location for pasting into the site can be useful.

2 Work in your browser

It’s probably easiest to open each of these links in a new browser window. It’s best to use Firefox or Chrome, if you can. Safari and Internet Explorer may behave slightly differently on some sites. You should not need to register to use these sites – please read the tips below or ask for help if you get stuck.

3 Review the outputs

Make notes, or discuss with your neighbour. Be prepared to report back to the group.

  • What attributes does each tool report on?
  • Which attributes, if any, were unique to a service?
  • Based on this, what do Clarifai, Google, IBM and Microsoft seem to think is important to them (or to their users)?
  • How many of possible entities (concepts, people, places, events, references to time or dates, etc) did it pick up?
  • Is any of the information presented useful?
  • Did it label anything incorrectly?
  • What options for exporting or saving the results did the demo offer? What about the underlying service or software?
  • For tools with configuration options – what could you configure? What difference did changing classifiers or other parameters  make?
  • If you tried it with a few images, did it do better with some than others? Why might that be?

This exercise focuses on images, but you can try a similar exercise with text-based tools like Stanford’s Natural Language Processing (NLP) demo http://corenlp.run/, DBPedia https://dbpedia-spotlight.github.io/demo/ and Ontotext http://tag.ontotext.com/

Spoiler alert!

screenshot
Clarifai’s image recognition tool with a historical image

Crowdsourcing workshop at DH2016 – session overview

A quick signal boost for the collaborative notes taken at the DH2016 Expert Workshop: Beyond The Basics: What Next For Crowdsourcing? (held in Kraków, Poland, on 12 July as part of the Digital Humanities 2016 conference, abstract below). We’d emphasised the need to document the unconference-style sessions (see FAQ) so that future projects could benefit from the collective experiences of participants. Since it can be impossible to find Google Docs or past tweets, I’ve copied the session overview below. The text is a summary of key takeaways or topics discussed in each session, created in a plenary session at the end of the workshop.

Participant introductions and interests – live notes
Ethics, Labour, sensitive material

Key takeaway – questions for projects to ask at the start; don’t impose your own ethics on a project, discussing them is start of designing the project.

Where to start
Engaging volunteers, tips including online communities, being open to levels of contribution, being flexible, setting up standards, quality
Workflow, lifecycle, platforms
What people were up to, the problems with hacking systems together, iiif.io, flexibility and workflows
Public expertise, education, what’s unique to humanities crowdsourcing
The humanities are contestable! Responsibility to give the public back the results of the process in re-usable
Options, schemas and goals for text encoding
Encoding systems will depend on your goals; full-text transcription always has some form of encoding, data models – who decides what it is, and when? Then how are people guided to use it?Trying to avoid short-term solutions
UX, flow, motivation
Making tasks as small as possible; creating a sense of contribution; creating a space for volunteers to communicate; potential rewards, issues like badgefication and individual preferences. Supporting unexpected contributions; larger-scale tasks
Project scale – thinking ahead to ending projects technically, and in terms of community – where can life continue after your project ends
Finding and engaging volunteers
Using social media, reliance on personal networks, super-transcribers, problematic individuals who took more time than they gave to the project. Successful strategies are very-project dependent. Something about beer (production of Itinera Nova beer with label containing info on the project and link to website).
Ecosystems and automatic transcription
Makes sense for some projects, but not all – value in having people engage with the text. Ecosystem – depending on goals, which parts work better? Also as publication – editions, corpora – credit, copyright, intellectual property
Plenary session, possible next steps – put information into a wiki. Based around project lifecycle, critical points? Publication in an online journal? Updateable, short-ish case studies. Could be categorised by different attributes. Flexible, allows for pace of change. Illustrate principles, various challenges.

Short-term action: post introductions, project updates and new blog posts, research, etc to https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CROWDSOURCING – a central place to send new conference papers, project blog posts, questions, meet-ups.

The workshop abstract:

Crowdsourcing – asking the public to help with inherently rewarding tasks that contribute to a shared, significant goal or research interest related to cultural heritage collections or knowledge – is reasonably well established in the humanities and cultural heritage sector. The success of projects such as Transcribe Bentham, Old Weather and the Smithsonian Transcription Center in processing content and engaging participants, and the subsequent development of crowdsourcing platforms that make launching a project easier, have increased interest in this area. While emerging best practices have been documented in a growing body of scholarship, including a recent report from the Crowd Consortium for Libraries and Archives symposium, this workshop looks to the next 5 – 10 years of crowdsourcing in the humanities, the sciences and in cultural heritage. The workshop will gather international experts and senior project staff to document the lessons to be learnt from projects to date and to discuss issues we expect to be important in the future.

Photo by Digital Humanities ‏@DH_Western
Photo by Digital Humanities ‏@DH_Western

The workshop is organised by Mia Ridge (British Library), Meghan Ferriter (Smithsonian Transcription Centre), Christy Henshaw (Wellcome Library) and Ben Brumfield (FromThePage).

If you’re new to crowdsourcing, here’s a reading list created for another event.

 

Network visualisations and the ‘so what?’ problem

This week I was in Luxembourg for a workshop on Network Visualisation in the Cultural Heritage Sector, organised by Marten Düring and held on the Belval campus of the University of Luxembourg.

In my presentation, I responded to some of the questions posed in the workshop outline:

In this workshop we want to explore how network visualisations and infrastructures will change the research and outreach activities of cultural heritage professionals and historians. Among the questions we seek to discuss during the workshop are for example: How do users benefit from graphs and their visualisation? Which skills do we expect from our users? What can we teach them? Are SNA [social network analysis] theories and methods relevant for public-facing applications? How do graph-based applications shape a user’s perception of the documents/objects which constitute the data? How can applications benefit from user engagement? How can applications expand and tap into other resources?

A rough version of my talk notes is below. The original slides are also online.

Network visualisations and the ‘so what?’ problem

Caveat

While I may show examples of individual network visualisations, this talk isn’t a critique of them in particular. There’s lots of good practice around, and these lessons probably aren’t needed for people in the room.

Fundamentally, I think network visualisations can be useful for research, but to make them more effective tools for outreach, some challenges should be addressed.

Context

I’m a Digital Curator at the British Library, mostly working with pre-1900 collections of manuscripts, printed material, maps, etc. Part of my job is to help people get access to our digital collections. Visualisations are a great way to firstly help people get a sense of what’s available, and then to understand the collections in more depth.

I’ve been teaching versions of an ‘information visualisation 101’ course at the BL and digital humanities workshops since 2013. Much of what I’m saying now is based on comments and feedback I get when presenting network visualisations to academics, cultural heritage staff (who should be a key audience for social network analyses).

Provocation: digital humanists love network visualisations, but ordinary people say, ‘so what’?

Fig1And this is a problem. We’re not conveying what we’re hoping to convey.

Network visualisation, via Table of data, via http://fredbenenson.com/
Network visualisation http://fredbenenson.com

When teaching datavis, I give people time to explore examples like this, then ask questions like ‘Can you tell what is being measured or described? What do the relationships mean?’. After talking about the pros and cons of network visualisations, discussion often reaches a ‘yes, but so what?’ moment.

Here are some examples of problems ordinary people have with network visualisations…

Location matters

Spatial layout based on the pragmatic aspects of fitting something on the screen using physics, rules of attraction and repulsion doesn’t match what people expect to see. It’s really hard for some to let go of the idea that spatial layout has meaning. The idea that location on a page has meaning of some kind is very deeply linked to their sense of what a visualisation is.

Animated physics is … pointless?

People sometimes like the sproinginess when a network visualisation resettles after a node has been dragged, but waiting for the animation to finish can also be slow and irritating. Does it convey meaning? If not, why is it there?

Size, weight, colour = meaning?

The relationship between size, colour, weight isn’t always intuitive – people assume meaning where there might be none.

In general, network visualisations are more abstract than people expect a visualisation to be.

‘What does this tell me that I couldn’t learn as quickly from a sentence, list or table?’

Table of data, via http://fredbenenson.com/
Table of data, via http://fredbenenson.com/

Scroll down the page that contains the network graph above and you get other visualisations. Sometimes they’re much more positively received, particularly people feel they learn more from them than from the network visualisation.

Onto other issues with ‘network visualisations as communication’…

Which algorithmic choices are significant?

screenshot of network graphs
Mike Bostock’s force-directed and curved line versions of character co-occurrence in Les Misérables

It’s hard for novices to know which algorithmic and data-cleaning choices are significant, and which have a more superficial impact.

Untethered images

Images travel extremely well on social media. When they do so, they often leave information behind and end up floating in space. Who created this, and why? What world view does it represent? What source material underlies it, how was it manipulated to produce the image? Can I trust it?

‘Can’t see the wood for the trees’

viral texts

When I showed this to a class recently, one participant was frustrated that they couldn’t ‘see the wood for the trees’. The visualisations gives a general impression of density, but it’s not easy to dive deeper into detail.

Stories vs hairballs

But when I started to explain what was being represented – the ways in which stories were copied from one newspaper to another – they were fascinated. They might have found their way there if they’d read the text but again, the visualisation is so abstract that it didn’t hint at what lay underneath. (Also I have only very, very rarely seen someone stop to read the text before playing with a visualisation.)

No sense of change over time

This flattening of time into one simultaneous moment is more vital for historical networks than for literary ones, but even so, you might want to compare relationships between sections of a literary work.

No sense of texture, detail of sources

All network visualisations look similar, whether they’re about historical texts or cans of baked beans. Dots and lines mask texture, and don’t always hint at the depth of information they represent.

Jargon

Node. Edge. Graph. Directed, undirected. Betweenness. Closeness. Eccentricity.

There’s a lot to take on to really understand what’s being expressed in a network graph.

There is some hope…

Onto the positive bit!

Interactivity is engaging

People find the interactive movement, the ability to zoom and highlight links engaging, even if they have no idea what’s being expressed. In class, people started to come up with questions about the data as I told them more about what was represented. That moment of curiosity is an opportunity if they can dive in and start to explore what’s going on, what do the relationships mean?

…but different users have different interaction needs

For some, there’s that frustration expressed earlier they ‘can’t get to see a particular tree’ in the dense woods of a network visualisation. People often want to get to the detail of an instance of a relationship – the lines of text, images of the original document – from a graph.

This mightn’t be how network visualisations are used in research, but it’s something to consider for public-facing visualisations. How can we connect abstract lines or dots to detail, or provide more information about what the relationship means, show the quantification expressed as people highlight or filter parts of a graph? A  harder, but more interesting task is hinting at the texture or detail of those relationships.

Proceed, with caution

One of the workshop questions was ‘Are social network analysis theories and methods relevant for public-facing applications?’ – and maybe the answer is a qualified yes. As a working tool, they’re great for generating hypotheses, but they need a lot more care before exposing them to the public.

[As an aside, I’d always taken the difference between visualisations as working tools for exploring data – part of the process of investigating a research question – and visualisation as an output – a product of the process, designed for explanation rather than exploration – as fundamental, but maybe we need to make that distinction more explicit.]

But first – who are your ‘users’?

During this workshop, at different points we may be talking about different ‘users’ – it’s useful to scope who we mean at any given point. In this presentation, I was talking about end users who encounter visualisations, not scholars who may be organising and visualising networks for analysis.

Sometimes a network visualisation isn’t the answer … even if it was part of the question.

As an outcome of an exploratory process, network visualisations are not necessarily the best way to present the final product. Be disciplined – make yourself justify the choice to use network visualisations.

No more untethered images

Include an extended caption – data source, tools and algorithms used. Provide a link to find out more – why this data, this form? What was interesting but not easily visualised? Let people download the dataset to explore themselves?

Present visualisations as the tip of the data iceberg

Visualisations are the tip of the iceberg
Visualisations are the tip of the iceberg

Lots of interesting data doesn’t make it into a visualisation. Talking about what isn’t included and why it was left out is important context.

Talk about data that couldn’t exist

Beyond the (fuzzy, incomplete, messy) data that’s left out because it’s hard to visualise, data that never existed in the first place is also important:

‘because we’re only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.’
Scott Weingart, ‘Networks Demystified 8: When Networks are Inappropriate’

Help users learn the skills and knowledge they need to interpret network visualisations in context.

How? Good question! This is the point at which I hand over to you…

Two questions for the digital humanities from Laura Mandell

I came across Joshua Sternfeld’s definition of ‘digital historiography’ while I was writing my thesis, and two parts of it very neatly described what I was up to – firstly, the ‘interdisciplinary study of the interaction of digital technology with historical practice’ – and secondly, seeking to understand the ‘construction, use, and evaluation of digital historical representations’.[1] However, the size and shape of the gap between digital historiography and ‘digital history’ is where I tend to get stuck. I’ve got a draft post on the various types of ‘digital history’ that’s never quite ready to go live.* Is digital history like art history – a field with its own theoretical concerns and objects of study – or will it eventually merge into ‘history’ as everyone starts integrating digital methods/tools and digitised sources into their work, in the same way that social or economic history have influenced other fields?

Anyway. While I may one day write that post, Melissa Dinsman’s interview with Laura Mandell, The Digital in the Humanities: An Interview with Laura Mandell, puts two vital questions that help focus my enquiry:

The real reason for me for talking about the digital humanities is that we need to realize the humanities never were the humanities. They are the print humanities and they are conditioned by print. So the question the term “digital humanities” poses is: How must humanities disciplines change if we are no longer working in a print world? This question, to me, is crucial. It is an intellectual question. And the question being proposed is: What happens to the humanities when digital methodologies are applied to them or when they start to interrogate digital methodologies? Both of these questions are crucial and that is what this term — “digital humanities” — keeps front and center.

The whole series of ‘The Digital in the Humanities’ interviews Dinsman has conducted provide a thoughtful insight into the state of the field.

* Partly because ‘digital history’ changes at a fairly constant rate and my thoughts shift correspondingly.

[1] Joshua Sternfeld, ‘Archival Theory and Digital Historiography: Selection, Search, and Metadata as Archival Processes for Assessing Historical Contextualization’, American Archivist 74, no. 2 (2011): 544–75, http://archivists.metapress.com/index/644851P6GMG432H0.pdf.

SXSW, project anniversaries and more – news on heritage crowdsourcing

Photo of programme
Our panel listing at SXSW

I’ve just spent two weeks in Texas, enjoying the wonderful hospitality and probing questions after giving various talks at universities in Houston and Austin before heading to SXSW. I was there for a panel on ‘Build the Crowdsourcing Community of Your Dreams’ (link to our slides and collected resources) with Ben Brumfield, Siobhan Leachman, and Meghan Ferriter. Siobhan, a ‘super-volunteer’ in more ways than one, posted her talk notes on ‘How cultural institutions encouraged me to participate in crowdsourcing & the factors I consider before donating my time‘.

In other news, we (me, Ben, Meghan and Christy Henshaw from the Wellcome Library) have had a workshop accepted for the Digital Humanities 2016 conference, to be held in Kraków in July. We’re looking for people with different kinds of expertise for our DH2016 Expert Workshop: Beyond The Basics: What Next For Crowdsourcing?.  You can apply via this form.

One of the questions at our SXSW panel was about crowdsourcing in teaching, which reminded me of this recent post on ‘The War Department in the Classroom‘ in which Zayna Bizri ‘describes her approach to using the Papers of the War Department in the classroom and offers suggestions for those who wish to do the same’. In related news, the PWD project is now five years old! There’s also this post on Primary School Zooniverse Volunteers.

The Science Gossip project is one year old, and they’re asking their contributors to decide which periodicals they’ll work on next and to start new discussions about the documents and images they find interesting.

The History Harvest project have released their Handbook (PDF).

The Danish Nationalmuseet is having a ‘Crowdsource4dk‘ crowdsourcing event on April 9. You can also transcribe Churchill’s WWII daily appointments, 1939 – 1945 or take part in Old Weather: Whaling (and there’s a great Hyperallergic post with lots of images about the whaling log books).

I’ve seen a few interesting studentships and jobs posted lately, hinting at research and projects to come. There’s a funded PhD in HCI and online civic engagement and a (now closed) studentship on Co-creating Citizen Science for Innovation.

And in old news, this 1996 post on FamilySearch’s collaborative indexing is a good reminder that very little is entirely new in crowdsourcing.

From grey dots to trenches to field books – news in heritage crowdsourcing

Apparently you can finish a thesis but you can’t stop scanning for articles and blog posts on your topic. Sharing them here is a good way to shake the ‘I should be doing something with this’ feeling.* This is a fairly random sample of recent material, but if people find it useful I can go back and pull out other things I’ve collected.

Victoria Van Hyning, ‘What’s up with those grey dots?’ you ask – brief blog post on using software rather than manual processes to review multiple text transcriptions, and on the interface challenges that brings.

Melissa Terras, ‘Crowdsourcing in the Digital Humanities‘ – pre-print PDF for a chapter in A New Companion to Digital Humanities.

Richard Grayson, ‘A Life in the Trenches? The Use of Operation War Diary and Crowdsourcing Methods to Provide an Understanding of the British Army’s Day-to-Day Life on the Western Front‘ – a peer-reviewed article based on data created through Operation War Diary.

The Impact of Coordinated Social Media Campaigns on Online Citizen Science Engagement – a poster by Lesley Parilla and Meghan Ferriter reported on the Biodiversity Heritage Library blog.

The Impact of Coordinated Social Media Campaigns on Online Citizen Science Engagement

Ben Brumfield, Crowdsourcing Transcription Failures – a response to a mailing list post asking ‘where are the failures?’

And finally, something related to my interest in participatory history commonsMartin Luther King Jr. Memorial Library – Central Library launches Memory Lab, a ‘DIY space where you can digitize your home movies, scan photographs and slides, and learn how to care for your physical and digital family heirlooms’. I was so excited when I about this project – it’s addressing such important issues. Jaime Mears is blogging about the project.

 

* How long after a PhD does it take for that feeling to go? Asking for a friend.

The good, the bad, and the unstructured… Open data in cultural heritage

I was in London this week for the Linked Pasts event, where I presented on trends and practices for open data in cultural heritage. Linked Pasts was a colloquium put on by the Pelagios project (Leif Isaksen, Elton Barker and Rainer Simon with Pau de Soto). I really enjoyed the other papers, which included thoughtful, grounded approaches to structured data for historical periods, places and people,  recognition of the importance of designing projects around audience needs (including user research), the relationship between digital tools and scholarly inquiry, visualisations as research tools, and the importance of good infrastructure for digital history.

My talk notes are below the embedded slides.

Warning: generalisations ahead.

My discussion points are based on years of conversations with other cultural heritage technologists in museums, libraries, and archives, but inevitably I’ll have blind spots. For example, I’m focusing on the English-speaking world, which means I’m not discussing the great work that Dutch and Japanese organisations are doing. I’ve undoubtedly left out brilliant specific examples in the interests of focusing on broader trends.  The point is to start conversations, to bring issues out into the open so we can collectively decide how to move forward.

The good

The good news is that more and more open cultural data is being published. Organisations have figured out that a) nothing bad is likely to happen and that b) they might get some kudos for releasing open data.

Generally, organisations are publishing the data that they have to hand – this means it’s mostly collections data. This data is often as messy, incomplete and fuzzy as you’d expect from records created by many different people using many different systems over a hundred or more years.

…the bad…

Copyright restrictions mean that images mightn’t be included. Furthermore, because it’s often collections data, it’s not necessarily rich in interpretative information. It’s metadata rather than data. It doesn’t capture the scholarly debates, the uncertain attributions, the biases in collecting… It certainly doesn’t capture the experience of viewing the original object.

Licensing issues are still a concern. Until cultural organisations are rewarded by their funders for releasing open data, and funders free organisations from expectations for monetising data, there will be damaging uncertainty about the opportunity cost of open data.

Non-commercial licenses are also an issue – organisations and scholars might feel exploited if others who have not contributed to the process of creating it can commercially publish their work. Finally, attribution is an important currency for organisations and scholars but most open licences aren’t designed with that in mind.

…and the unstructured

The data that’s released is often pretty unstructured. CSV files are very easy to use, so they help more people get access to information (assuming they can figure out GitHub), but a giant dump like this doesn’t provide stable URIs for each object. Records in data dumps rarely link to external identifiers like the Getty’s Thesaurus of Geographic Names, Art & Architecture Thesaurus (AAT) or Union List of Artist Names, or vernacular sources for place and people names such as Geonames or DBPedia. And that’s fair enough, because people using a CSV file probably don’t want all the hassle of dereferencing each URI to grab the place name so they can visualise data on a map (or whatever they’re doing with the data). But it also means that it’s hard for someone to reliably look for matching artists in their database, and link these records with data from other organisations.

So it’s open, but it’s often not very linked. If we’re after a ‘digital ecosystem of online open materials’, this open data is only a baby step. But it’s often where cultural organisations finish their work.

Classics > Cultural Heritage?

But many others, particularly in the classical and ancient world, have managed to overcome these issues to publish and use linked open data. So why do museums, libraries and archives seem to struggle? I’ll suggest some possible reasons as conversation starters…

Not enough time

Organisations are often busy enough keeping their internal systems up and running, dealing with the needs of visitors in their physical venues, working on ecommerce and picture library systems…

Not enough skills

Cultural heritage technologists are often generalists, and apart from being too time-stretched to learn new technologies for the fun of it, they might not have the computational or information science skills necessary to implement the full linked data stack.

Some cultural heritage technologists argue that they don’t know of any developers who can negotiate the complexities of SPARQL endpoints, so why publish it? The complexity is multiplied when complex data models are used with complex (or at least, unfamiliar) technologies. For some, SPARQL puts the ‘end’ in ‘endpoint’, and ‘RDF triples‘ can seem like an abstraction too far. In these circumstances, the instruction to provide linked open data as RDF is a barrier they won’t cross.

But sometimes it feels as if some heritage technologists are unnecessarily allergic to complexity. Avoiding unnecessary complexity is useful, but progress can stall if they demand that everything remains simple enough for them to feel comfortable. Some technologists might benefit from working with people more used to thinking about structured data, such as cataloguers, registrars etc. Unfortunately, linked open data falls in the gap between the technical and the informatics silos that often exist in cultural organisations.

And organisations are also not yet using triples or structured data provided by other organisations. They’re publishing data in broadcast mode; it’s not yet a dialogue with other collections.

Not enough data

In a way, this is the collections documentation version of the technical barriers. If the data doesn’t already exist, it’s hard to publish. If it needs work to pull it out of different departments, or different individuals, who’s going to resource that work? Similarly, collections staff are unlikely to have time to map their data to CIDOC-CRM unless there’s a compelling reason to do so. (And some of the examples given might use cultural heritage collections but are a better fit with the work of researchers outside the institution than the institution’s own work).

It may be easier for some types of collections than others – art collections tend to be smaller and better described; natural history collections can link into international projects for structured data, and libraries can share cataloguing data. Classicists have also been able to get a critical mass of data together. Your local records office or small museum may have more heterogeneous collections, and there are fewer widely used ontologies or vocabularies for historical collections. The nature of historical collections means that ‘small ontologies, loosely joined’, may be more effective, but creating these, or mapping collections to them, is still a large piece of work. While there are tools for mapping to data structures like Europeana’s data model, it seems the reasons for doing so haven’t been convincing enough, so far. Which brings me to…

Not enough benefits

This is an important point, and an area the community hasn’t paid enough attention to in the past. Too many conversations have jumped straight to discussion about the specific standards to use, and not enough have been about the benefits for heritage audiences, scholars and organisations.

Many technologists – who are the ones making decisions about digital standards, alongside the collections people working on digitisation – are too far removed from the consumers of linked open data to see the benefits of it unless we show them real world needs.

There’s a cost in producing data for others, so it needs to be linked to the mission and goals of an organisation. Organisations are not generally able to prioritise the potential, future audiences who might benefit from tools someone else creates with linked open data when they have so many immediate problems to solve first.

While some cultural and historical organisations have done good work with linked open data, the purpose can sometimes seem rather academic. Linked data is not always explained so that the average, over-worked collections or digital team will that convinced by the benefits outweigh the financial and intellectual investment.

No-one’s drinking their own champagne

You don’t often hear of people beating on the door of a museum, library or archive asking for linked open data, and most organisations are yet to map their data to specific, widely-used vocabularies because they need to use them in their own work. If technologists in the cultural sector are isolated from people working with collections data and/or research questions, then it’s hard for them to appreciate the value of linked data for research projects.

The classical world has benefited from small communities of scholar-technologists – so they’re not only drinking their own champagne, they’re throwing parties. Smaller, more contained collections of sources and research questions helps create stronger connections and gives people a reason to link their sources. And as we’re learning throughout the day, community really helps motivate action.

(I know it’s normally called ‘eating your own dog food’ or ‘dogfooding’ but I’m vegetarian, so there.)

Linked open data isn’t built into collections management systems

Getting linked open data into collections management systems should mean that publishing linked data is an automatic part of sharing data online.

Chicken or the egg?

So it’s all a bit ‘chicken or the egg’ – will it stay that way? Until there’s a critical mass, probably. These conversations about linked open data in cultural heritage have been going around for years, but it also shows how far we’ve come.

[And if you’ve published open data from cultural heritage collections, linked open data on the classical or ancient world, or any other form of structured data about the past, please add it to the wiki page for museum, gallery, library and archive APIs and machine-readable data sources for open cultural data.]

Drink your own champagne! (Nasjonalbiblioteket image)
Drink your own champagne! (Nasjonalbiblioteket image)

Save

Sentiment analysis is opinion turned into code

Modern elections are data visualisation bonanzas, and the 2015 UK General Election is no exception.

Last night seven political leaders presented their views in a televised debate. This morning the papers are full of snap polls, focus groups, body language experts, and graphs based on public social media posts describing the results. Graphs like the one below summarise masses of text using a technique called ‘sentiment analysis’, a form of computational language processing.* After a twitter conversation with @benosteen and @MLBrook I thought it was worth posting about the inherent biases in the tools that create these visualisations. Ultimately, ‘sentiment analysis’ is someone’s opinion turned into code – so whose opinion are you seeing?

ChartThis is a great time to remember that sentiment analysis – mining text to see what people are talking about and how they feel about it – is based on algorithms and software libraries that were created and configured by people who’ve made a series of small, accumulative decisions that affect what we see. You can think of sentiment analysis as a sausage factory with the text of tweets as the mince going in one end, and pretty pictures as the product coming out the other end. A healthy democracy needs the list of secret ingredients added during processing, not least because this election prominently features spin rooms and party lines.

What are those ‘ingredients’? The software used for sentiment analysis is ‘trained’ on existing text, and the type of text used affects what the software assumes about the world. For example, software trained on business articles is great at recognising company names but does not do so well on content taken from museum catalogues (unless the inventor of an object went on to found a company and so entered the trained vocabulary). The algorithms used to process text change the output, as does the length of the phrase analysed. The results are riddled with assumptions about tone, intent, the demographics of the poster and more.

In the case of an election, we’d also want to know when the text used for training was created, whether it looks at previous posts by the same person, and how long the software was running over the given texts. Where was the baseline of sentiment on various topics set? Who defines what ‘neutral’ looks like to an algorithm?

We should ask the same questions about visualisations and computational analysis that we’d ask about any document. The algorithmic ‘black box’ is a human construction, and just like every other text, software is written by people. Who’s paying for it? What sources did they use? If it’s an agency promoting their tools, do they mention the weaknesses and probable error rates or gloss over it? If it’s a political party (or a company owned by someone associated with a party), have they been scrupulous in weeding out bots? Do retweets count? Are some posters weighted more heavily? Which visualisations were discarded and how did various news outlets choose the visualisations they featured? Which parties are left out?

It matters because, all software has biases, and, as Brandwatch say, ‘social media will have a significant role in deciding the outcome of the general election’. And finally, as always, who’s not represented in the dataset?

* If you already know this, hopefully you’ll know the rest too. This post is deliberately light on technical detail but feel free to add more detailed information in the comments.

Creating simple graphs with Excel’s Pivot Tables and Tate’s artist data

I’ve been playing with Tate’s collections data while preparing for a workshop on data visualisation. On the day I’ll probably use Google Fusion Tables as an example, but I always like to be prepared so I’ve prepared a short exercise for creating simple graphs in Excel as an alternative.

The advantage of Excel is that you don’t need to be online, your data isn’t shared, and for many people, gaining additional skills in Excel might be more useful than learning the latest shiny web tool. PivotTables are incredibly useful for summarising data, so it’s worth trying them even if you’re not interested in visualisations. Pivot tables let you run basic functions – summing, averaging, grouping, etc – on spreadsheet data. If you’ve ever wanted spreadsheets to be as powerful as databases, pivot tables can help. I could create a pivot table then create a chart from it, but Excel has an option to create a pivot chart directly that’ll also create a pivot table for you to see how it works.

For this exercise, you will need Excel and a copy of the sample data: tate_artist_data_cleaned_v1_groupedbybirthyearandgender.xlsx
(A plain text CSV version is also available for broader compatibility: tate_artist_data_cleaned_v1_groupedbybirthyearandgender.csv.)

Work out what data you’re interested in

In this example, I’m interested in when the artists in Tate’s collection were born, and the overall gender mix of the artists represented. To make it easier to see what’s going on, I’ve copied those two columns of data from the original ‘artists’ file and copied them over to a new spreadsheet. As a row by row list of births, these columns aren’t ideal for charting as they are, so I want a count of artists per year, broken down by gender.

Insert PivotChart

On the ‘Insert’ menu, click on PivotTable to open the menu and display the option for PivotCharts.
Excel pivot table Insert PivotChart detail

Excel will select our columns as being the most likely thing we want to chart. That all looks fine to me so click ‘OK’.

Excel pivot table OK detailConfigure the PivotChart

This screen asking you to ‘choose fields from the PivotTable Field List’ might look scary, but we’ve only got two columns of data so you can’t really go wrong. Excel pivot table Choose fields

The columns have already been added to the PivotTable Field List on the right, so go ahead and tick the box next to ‘gender’ and ‘yearofBirth’. Excel will probably put them straight into the ‘Axis Fields’ box.

Leave yearofBirth under Axis Fields and drag ‘gender’ over to the ‘Values’ box next to it. Excel automatically turns it into ‘count of gender’, assuming that we want to sum the number of births per year.

The final task is to drag ‘gender’ down from the PivotTable Field List to ‘Legend Fields’ to create a key for which colours represent which gender. You should now see the pivot table representing the calculated values on the left and a graph in the middle.

Close-up of the pivot fields

 

When you click off the graph, the PivotTable options disappear – just click on the graph or the data again to bring them up.

Excel pivot table Results

You’ve made your first pivot chart!

You might want to drag it out a bit so the values aren’t so squished. Tate’s data covers about 500 years so there’s a lot to fit in.

Now you’ve made a pivot chart, have a play – if you get into a mess you can always start again!

Colophon: the screenshots are from Excel 2010 for Windows because that’s what I have.

About the data: this data was originally supplied by Tate. The full version on Tate’s website includes name, date of birth, place of birth, year of death, place of death and URL on Tate’s website. The latest versions of their data can be downloaded from http://www.tate.org.uk/about/our-work/digital/collection-data The source data for this file can be downloaded from https://github.com/tategallery/collection/blob/master/artist_data.csv This version was simplified so it only contains a list of years of birth and the gender of the artist. Some blank values for gender were filled in based on the artist’s name or a quick web search; groups of artists or artists of unknown gender were removed as were rows without a birth year. This data was prepared in March 2015 for a British Library course on ‘Data Visualisation for Analysis in Scholarly Research’ by Mia Ridge.

I’d love to hear if you found this useful or have any suggestions for tweaks.