2016 was the year that deep fractures came to the surface, but they’d been building for some time. We might live in the same country as each other, but we can experience it very differently. What we know about the state of the world is affected by where we live, our education, and by how (if?) we get our news.
Life in 2017
'This is fine' (KC Green)
We can't pretend that it'll all go away and that society will heal itself. Divisions over Brexit, the role of propaganda in elections, climate change, the role of education, what we value as a society – they're all awkward to address, but if we don't it's hard to see how we can move forward. And since we're here to talk about museums – what role do museums have in divided societies? How much do they need to reflect voices they mightn't agree with? Do we need to make ourselves a bit uncomfortable in order to make spaces for sharing experiences and creating empathy? Can (digital) experiences, collections and exhibitions in cultural heritage help create a shared understanding of the world?
I've been struck lately by the observation that empathy can bridge divides, and give people the power to understand others. The arts and culture provide opportunities to 'understand and share in another person's feelings and experiences' and connect the past to the present. How can museums – in all their different forms – contribute to a more empathic (and maybe eventually less divided) society?
'The greatest benefit we owe to the artist, whether painter, poet, or novelist, is the extension of our sympathies. … Art is the nearest thing to life; it is a mode of amplifying experience and extending our contact with our fellow-men beyond the bounds of our personal lot.' George Eliot, as quoted in Peter Bazalgette's The Empathy Instinct
Digital experiences aren't shared in the same way as physical ones, and ‘social’ media isn't the same as being in the same space as someone experiencing the same thing, but they have other advantages – I hope we'll learn about some today.
We need to tell better stories about museums and computers
Shifting from the public to staff in museums… Museums have been using technology to serve audiences and manage collections for decades. But still it feels like museums are criticised for simultaneously having too much and too little technology. Shiny apps make the news, but they're built on decades of digitisation and care from heritage organisations. There's a lot museums could do better, and digital expertise is not evenly distributed or recognised, but there's a lot that's done well, too. My challenge to you is to find and share better stories about cultural heritage technologies connecting collections, people and knowledge. If we don't tell those stories, they'll be told about us. Too many articles and puff pieces ignore the thoughtful, quotidian and/or experimental work of experts across the digital cultural heritage sector.
[Later in the day I mentioned that the conference had an excellent response to the call for papers – we learnt about more interesting projects than we had room to fit in, so perhaps we should encourage more people to post case studies to the MCG's discussion list and website.]
Dealing with distance; bringing the museum to the people
How can museums use sound and chatbots?
Looking (back to look) forward
Speaking of better stories – I'm looking forward to hearing from all our speakers today – they're covering an incredible range of topics, approaches and technologies, so hopefully each of you will leave full of ideas. Join us for drinks afterwards to keep the conversation going. And to set the tone for the day, it's a great time to hear Hannah Fox on the topic of 'what makes a museum'…
Speaking of the conference – a lot of people helped out in different ways, so thanks to them all!
These notes were prepared for a panel discussion at the 'Always Already Computational: Collections as Data' (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I've also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called 'From libraries as patchwork to datasets as assemblages?' but in hindsight, piles and patchwork of material seemed a better analogy.
The invitation to this panel asked us to share our experience and perspective on various themes. I'm focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I've condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?
It has to be usable, because if it's not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. 'Production' is there because infrastructure and workflow are unsexy but necessary for access to the material that makes digital scholarship possible.
One of the biggest issues the British Library (BL) faces is scale. The BL's collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.
Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there's treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institutions deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the 'gold standard' cataloguing already in progress. If they're made widely available, these other methods might help bridge the resourcing gaps that mean it's easier to find items from richer institutions and countries than from poorer ones.
You probably already all know this, but it's worth remembering: our collections aren't even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don't yet have digital catalogue records. Some items can't be digitised because they're too big, small or fragile for scanning or photography; others can't be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.
(Here, 'data' may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements or any combination of these and more.)
In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it's often as great big lumps of things. It's hard for normal scholars to process (or just unzip) 4GB of data.
Currently, datasets are often created outside normal processes, and over time they become 'stale' as they're not updated when source collections records change. And when scholars do manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.
The BL's metadata team have experimented with 'researcher format' CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don't mean SPARQL but rather faceted browsing or structured search forms).
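As a rough illustration of the 'structured search form' idea, here is a minimal sketch in Python. The records, the field names and the `facet_filter` and `to_csv` helpers are all hypothetical (they are not the BL's catalogue schema or any real system); the point is only that a handful of facet values can be turned into a small, self-made CSV dataset.

```python
import csv
import io

# Hypothetical catalogue records; the field names are invented for illustration.
RECORDS = [
    {"id": "rec1", "title": "Map of London", "type": "map", "place": "London", "year": 1746},
    {"id": "rec2", "title": "Letter to a friend", "type": "manuscript", "place": "Paris", "year": 1802},
    {"id": "rec3", "title": "Plan of the docks", "type": "map", "place": "London", "year": 1831},
]


def facet_filter(records, **facets):
    """Return the records whose fields equal every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def to_csv(records, fieldnames):
    """Serialise the selected records as CSV text, keeping only the named fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# A 'query' is just a set of facet values, as a search form might collect them.
london_maps = facet_filter(RECORDS, type="map", place="London")
dataset = to_csv(london_maps, ["id", "title", "year"])
print(dataset)
```

A real service would run the filter against the live catalogue rather than an in-memory list, which is also what would stop the resulting datasets going stale.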
Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out of text and images. Sometimes it's operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it's a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they're finally recorded in a bespoke database.
This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we're to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by cataloguers, which by software, by external groups, etc.
The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.
The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There's a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.
A lot of this work has already been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should now turn to converting local references into URIs and making the most of that investment.
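To make 'strings into things' a little more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that local references are stored as `scheme:identifier` strings; the `to_uri` helper, the sample record and its placeholder identifiers are hypothetical, and the base URI patterns for the Getty vocabularies should be checked against the publisher's documentation before being relied on.

```python
# Assumed base URIs for two Getty vocabularies; verify these patterns
# against the publisher's documentation before using them for real data.
BASE_URIS = {
    "ulan": "http://vocab.getty.edu/ulan/",
    "aat": "http://vocab.getty.edu/aat/",
}


def to_uri(local_ref):
    """Turn a 'scheme:identifier' string into a URI, or None if it can't be mapped."""
    scheme, sep, identifier = local_ref.partition(":")
    base = BASE_URIS.get(scheme.strip().lower())
    if not sep or base is None or not identifier.strip():
        return None  # leave unmapped references for a person to review
    return base + identifier.strip()


# Hypothetical record; the identifiers are placeholders, not real authority IDs.
record = {"title": "Untitled sketch", "creator_ref": "ULAN:0000001", "subject_ref": "local:42"}

creator_uri = to_uri(record["creator_ref"])
subject_uri = to_uri(record["subject_ref"])
print(creator_uri, subject_uri)
```

Returning `None` for anything that can't be mapped, rather than guessing, keeps the unresolved references visible as a worklist.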
Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can't easily record possible attributions, and there's understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we're not to be held back by the need for records to be perfectly complete before they're published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don't overly rely on our descriptions, making assumptions about the presence or absence of information when it's not appropriate.
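One way to picture 'qualified descriptors' is as links that carry their own certainty and source. The Python sketch below is only an illustration under assumptions: the class names, the three certainty levels, the `asserted_by` values and the `example:` identifiers are all invented here, not taken from any cataloguing standard or BL system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QualifiedLink:
    target: str       # identifier for the person, place or concept
    role: str         # e.g. "creator", "place"
    certainty: str    # "certain", "probable" or "possible" (assumed levels)
    asserted_by: str  # e.g. "cataloguer", "software", "external group"

    def describe(self):
        """Build a display label that shows the qualifier and who made the claim."""
        prefix = {"certain": "", "probable": "probably ", "possible": "possibly "}[self.certainty]
        return f"{self.role}: {prefix}{self.target} (asserted by {self.asserted_by})"


@dataclass
class CatalogueRecord:
    title: str
    links: List[QualifiedLink] = field(default_factory=list)

    def confirmed_links(self):
        """Only the links a user could treat as settled."""
        return [link for link in self.links if link.certainty == "certain"]


record = CatalogueRecord(
    title="Field notebook",
    links=[
        QualifiedLink("example:person/1", "creator", "certain", "cataloguer"),
        QualifiedLink("example:place/7", "place", "possible", "software"),
    ],
)
for link in record.links:
    print(link.describe())
```

Keeping the qualifier and the source on each link means an imperfect record can be published without implying that every attribution is settled.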
A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL's systems are designed around the 'deliverable unit' – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.
(I'd love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren't yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)
Image credit: https://www.flickr.com/photos/snorski/34543357
My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We're making a start and there's a lot of thoughtful work behind the scenes, but maybe a bit more investment is needed from research libraries to become as comfortable with data users as they are with the readers who pass through their physical doors.
I've developed this exercise on computational data generation and entity extraction for various information/data visualisation workshops I've been teaching lately. These exercises help demonstrate the biases embedded in machine learning and 'AI' tools. As these methods have become more accessible, my dataviz workshops have included more discussion of computational methods for generating data to be visualised. There are two versions of the exercise – the first works with images, the second with text.
In teaching I've found that services that describe images were more accessible and generated richer discussion in class than text-based sites, but it's handy to have the option for people who work with text. If you try something like this in your classes I'd love to hear from you.
It's also a chance to talk about the uses of these technologies in categorising and labelling our posts on social media. We can tell people that their social media posts are analysed for personality traits and mentions of brands, but seeing it in action is much more powerful.
Image exercise: trying computational data generation and entity extraction
Time: c. 5 – 10 minutes plus discussion.
Goal: explore methods for extracting information from text or an image and reflect on what the results tell you about the algorithms
1. Find a sample image
Find an image (e.g. from a news site or digitised text) you can download and drag into the window. It may be most convenient to save a copy to your desktop. Many sites let you load images from a URL, so right- or control-clicking to copy an image location for pasting into the site can be useful.
2. Work in your browser
It's probably easiest to open each of these links in a new browser window. It's best to use Firefox or Chrome, if you can. Safari and Internet Explorer may behave slightly differently on some sites. You should not need to register to use these sites – please read the tips below or ask for help if you get stuck.
Clarifai https://www.clarifai.com/demo – you can drag and drop, open the file explorer to find an image, or load one from a URL via the large '+' in the bottom right-hand corner. You can adjust settings via the 'Configure' tab.
Google Cloud Vision API https://cloud.google.com/vision/ – don't sign up, scroll down to the 'Try the API' box. Drag and drop your image on the box or click the box to open the file finder. You may need to go through the 'I am not a robot' process.
IBM Watson Visual Recognition https://visual-recognition-demo.mybluemix.net/ – scroll to 'Try the service'. Drag an image onto the grey box or click in the grey box to open the file finder. You can also load an image directly from a URL. (You can no longer try this without signing up so it doesn't work for a quick exercise).
Make notes, or discuss with your neighbour. Be prepared to report back to the group.
What attributes does each tool report on?
Which attributes, if any, were unique to a service?
Based on this, what do companies like Clarifai, Google, IBM and Microsoft seem to think is important to them (or to their users)? (e.g. what does 'safe for work' really mean?)
Who are their users – the public or platform administrators?
How many of the possible entities (concepts, people, places, events, references to time or dates, etc) did it pick up?
Is any of the information presented useful?
Did it label anything incorrectly?
What options for exporting or saving the results did the demo offer? What about the underlying service or software?
For tools with configuration options – what could you configure? What difference did changing classifiers or other parameters make?
If you tried it with a few images, did it do better with some than others? Why might that be?
Text exercise: trying computational data generation and entity extraction
Time: c. 5 minutes plus discussion
Goal: explore the impact of source data and algorithms on input text
1. Grab some text
You will need some text for this exercise. The more 'entities' – people, places, dates, concepts – discussed, the better. If you have some text you're working on handy, you can use that. If you're stuck for inspiration, pick a front page story from an online news site. Keep the page open so you can copy a section of text to paste into the websites.
2. Compare text entity labelling websites
Open four or more browser windows or tabs. Open the links below in separate tabs or windows so you can easily compare the results.
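Separately from the websites, if you'd like to see how much the rules behind a tool shape its results, here is a deliberately naive Python sketch. It is not how any of the commercial services work; the `naive_entities` function and its two regular expressions are assumptions made up for this illustration, and the sample sentence is invented.

```python
import re


def naive_entities(text):
    """Spot 'entities' with two crude rules: four-digit years and runs of capitalised words."""
    # Rule 1: any four-digit number from 1000 to 2099 is treated as a year.
    years = re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text)
    # Rule 2: two or more capitalised words in a row are treated as a name.
    phrases = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)
    return {"years": years, "capitalised_phrases": phrases}


sample = "In 1859 Charles Darwin published a book in London. The British Library holds a copy."
found = naive_entities(sample)
print(found)
```

On this sample the rules find 'Charles Darwin' but miss 'London' (a single word) and sweep 'The' into 'The British Library'. Small choices in the algorithm decide what counts as an entity, which is the same question the exercise asks of the bigger tools.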
In my presentation, I responded to some of the questions posed in the workshop outline:
In this workshop we want to explore how network visualisations and infrastructures will change the research and outreach activities of cultural heritage professionals and historians. Among the questions we seek to discuss during the workshop are for example: How do users benefit from graphs and their visualisation? Which skills do we expect from our users? What can we teach them? Are SNA [social network analysis] theories and methods relevant for public-facing applications? How do graph-based applications shape a user’s perception of the documents/objects which constitute the data? How can applications benefit from user engagement? How can applications expand and tap into other resources?
A rough version of my talk notes is below. The original slides are also online.
Network visualisations and the 'so what?' problem
While I may show examples of individual network visualisations, this talk isn't a critique of them in particular. There's lots of good practice around, and these lessons probably aren't needed for people in the room.
Fundamentally, I think network visualisations can be useful for research, but to make them more effective tools for outreach, some challenges should be addressed.
I'm a Digital Curator at the British Library, mostly working with pre-1900 collections of manuscripts, printed material, maps, etc. Part of my job is to help people get access to our digital collections. Visualisations are a great way to firstly help people get a sense of what's available, and then to understand the collections in more depth.
I've been teaching versions of an 'information visualisation 101' course at the BL and digital humanities workshops since 2013. Much of what I'm saying now is based on comments and feedback I get when presenting network visualisations to academics and cultural heritage staff (who should be a key audience for social network analyses).
Provocation: digital humanists love network visualisations, but ordinary people say, 'so what'?
And this is a problem. We're not conveying what we're hoping to convey.
When teaching datavis, I give people time to explore examples like this, then ask questions like 'Can you tell what is being measured or described? What do the relationships mean?'. After talking about the pros and cons of network visualisations, discussion often reaches a 'yes, but so what?' moment.
Here are some examples of problems ordinary people have with network visualisations…
Spatial layout based on the pragmatic aspects of fitting something on the screen using physics, rules of attraction and repulsion doesn't match what people expect to see. It's really hard for some to let go of the idea that spatial layout has meaning. The idea that location on a page has meaning of some kind is very deeply linked to their sense of what a visualisation is.
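To show why position in this kind of layout carries no meaning of its own, here is a minimal sketch in plain Python. It is a simplified spring layout written for illustration, not the algorithm used by any particular visualisation tool; the `force_layout` function, its parameters and the tiny example network are all assumptions.

```python
import math
import random


def force_layout(nodes, edges, seed, steps=200, k=1.0, step_size=0.05):
    """Place nodes using repulsion between all pairs and attraction along edges."""
    rng = random.Random(seed)
    # Starting positions are random: this is where the arbitrariness comes in.
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes pushes apart, more strongly when close together.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force
        # Connected nodes pull together, more strongly when far apart.
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        # Move each node a small, capped distance in the direction of the net force.
        for n in nodes:
            length = math.hypot(*disp[n]) or 1e-9
            move = min(length, 1.0) * step_size
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move
    return {n: tuple(p) for n, p in pos.items()}


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]
layout_one = force_layout(nodes, edges, seed=1)
layout_two = force_layout(nodes, edges, seed=2)
for n in nodes:
    print(n, layout_one[n], layout_two[n])
```

The two runs use identical data but different random starting points, so the same node generally ends up at different coordinates each time. Only relative closeness is driven by the connections; 'top left' or 'centre of the page' is an accident of the physics.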
Animated physics is … pointless?
People sometimes like the sproinginess when a network visualisation resettles after a node has been dragged, but waiting for the animation to finish can also be slow and irritating. Does it convey meaning? If not, why is it there?
Size, weight, colour = meaning?
The relationship between size, colour, weight isn't always intuitive – people assume meaning where there might be none.
In general, network visualisations are more abstract than people expect a visualisation to be.
'What does this tell me that I couldn't learn as quickly from a sentence, list or table?'
Scroll down the page that contains the network graph above and you get other visualisations. Sometimes they're much more positively received, particularly when people feel they learn more from them than from the network visualisation.
Onto other issues with 'network visualisations as communication'…
Which algorithmic choices are significant?
It's hard for novices to know which algorithmic and data-cleaning choices are significant, and which have a more superficial impact.
Images travel extremely well on social media. When they do so, they often leave information behind and end up floating in space. Who created this, and why? What world view does it represent? What source material underlies it, how was it manipulated to produce the image? Can I trust it?
'Can't see the wood for the trees'
When I showed this to a class recently, one participant was frustrated that they couldn't 'see the wood for the trees'. The visualisation gives a general impression of density, but it's not easy to dive deeper into detail.
Stories vs hairballs
But when I started to explain what was being represented – the ways in which stories were copied from one newspaper to another – they were fascinated. They might have found their way there if they'd read the text but again, the visualisation is so abstract that it didn't hint at what lay underneath. (Also I have only very, very rarely seen someone stop to read the text before playing with a visualisation.)
No sense of change over time
This flattening of time into one simultaneous moment is more vital for historical networks than for literary ones, but even so, you might want to compare relationships between sections of a literary work.
No sense of texture, detail of sources
All network visualisations look similar, whether they're about historical texts or cans of baked beans. Dots and lines mask texture, and don't always hint at the depth of information they represent.
There's a lot to take on to really understand what's being expressed in a network graph.
There is some hope…
Onto the positive bit!
Interactivity is engaging
People find the interactive movement, the ability to zoom and highlight links engaging, even if they have no idea what's being expressed. In class, people started to come up with questions about the data as I told them more about what was represented. That moment of curiosity is an opportunity if they can dive in and start to explore what's going on, what do the relationships mean?
…but different users have different interaction needs
For some, there's that frustration expressed earlier that they 'can't get to see a particular tree' in the dense woods of a network visualisation. People often want to get to the detail of an instance of a relationship – the lines of text, images of the original document – from a graph.
This mightn't be how network visualisations are used in research, but it's something to consider for public-facing visualisations. How can we connect abstract lines or dots to detail, or provide more information about what the relationship means, show the quantification expressed as people highlight or filter parts of a graph? A harder, but more interesting task is hinting at the texture or detail of those relationships.
Proceed, with caution
One of the workshop questions was 'Are social network analysis theories and methods relevant for public-facing applications?' – and maybe the answer is a qualified yes. As a working tool, they're great for generating hypotheses, but they need a lot more care before exposing them to the public.
[As an aside, I’d always taken the difference between visualisations as working tools for exploring data – part of the process of investigating a research question – and visualisation as an output – a product of the process, designed for explanation rather than exploration – as fundamental, but maybe we need to make that distinction more explicit.]
But first – who are your 'users'?
During this workshop, at different points we may be talking about different 'users' – it's useful to scope who we mean at any given point. In this presentation, I was talking about end users who encounter visualisations, not scholars who may be organising and visualising networks for analysis.
Sometimes a network visualisation isn't the answer … even if it was part of the question.
As an outcome of an exploratory process, network visualisations are not necessarily the best way to present the final product. Be disciplined – make yourself justify the choice to use network visualisations.
No more untethered images
Include an extended caption – data source, tools and algorithms used. Provide a link to find out more – why this data, this form? What was interesting but not easily visualised? Let people download the dataset to explore themselves?
Present visualisations as the tip of the data iceberg
Lots of interesting data doesn't make it into a visualisation. Talking about what isn't included and why it was left out is important context.
Talk about data that couldn't exist
Beyond the (fuzzy, incomplete, messy) data that's left out because it's hard to visualise, data that never existed in the first place is also important:
'because we're only looking on one axis (letters), we get an inflated sense of the importance of spatial distance in early modern intellectual networks. Best friends never wrote to each other; they lived in the same city and drank in the same pubs; they could just meet on a sunny afternoon if they had anything important to say. Distant letters were important, but our networks obscure the equally important local scholarly communities.' Scott Weingart, 'Networks Demystified 8: When Networks are Inappropriate'
Help users learn the skills and knowledge they need to interpret network visualisations in context.
How? Good question! This is the point at which I hand over to you…
The Science Gossip project is one year old, and they're asking their contributors to decide which periodicals they'll work on next and to start new discussions about the documents and images they find interesting.
I've seen a few interesting studentships and jobs posted lately, hinting at research and projects to come. There's a funded PhD in HCI and online civic engagement and a (now closed) studentship on Co-creating Citizen Science for Innovation.
Some of their key findings for museums (PDF) are below, interspersed with my comments. I read this section before the event, and found I didn't really recognise the picture of museums it presented. 'Museums' mightn't be the most useful grouping for a survey like this – the material that MTM London's Ed Corn presented on the day broke the results down differently, and that made more sense. The c2,500 museums in the UK are too varied in their collections (from dinosaurs to net art), their audiences, and their local and organisational context (from tiny village museums open one afternoon a week, to historic houses, to university museums, to city museums with exhibitions that were built in the 70s, to white cube art galleries, to giants like the British Museum and Tate) to be squished together in one category. Museums tend to be quite siloed, so I'd love to know who fills out the survey, and whether they ask the whole organisation to give them data beforehand.
According to the survey, museums are significantly less likely to engage in:
email marketing (67 per cent vs. 83 per cent for the sector as a whole) – museums are missing out! Email marketing is relatively cheap, and it's easy to write newsletters. It's also easy to ask people to sign up when they're visiting online sites or physical venues, and they can unsubscribe anytime they want to. Social media figures can look seductively huge, but Facebook is a frenemy for organisations as you never know how many people will actually see a post.
publish content to their own website (55 per cent vs. 72 per cent) – I wasn't sure how to interpret this – does this mean museums don't have their own websites? Or that they can't update them? Or is 'content' a confusing term? At the event it was said that 10% of orgs have no email marketing, website or Facebook, so there are clearly some big gaps to fill still.
sell event tickets online (31 per cent vs. 45 per cent) – fair enough, how many museums sell tickets to anything that really need to be booked in advance?
post video or audio content (31 per cent vs. 43 per cent) – for most museums, this would require an investment to create as many don't already have filmable material or archived films to hand. Concerns about 'polish' might also be holding some museums back – they could try periscoping tours or sharing low-fi videos created by front of house staff or educators. Like questions about offering 'online interactive tours of real-world spaces' and 'artistic projects', this might reflect initial assumptions based on ACE's experience with the performing arts. A question about image sharing would make more sense for museums. Similarly, the kinds of storytelling that blog posts allow can sometimes work particularly well for history and science museums (who don't have gorgeous images of art that tell their own story).
make use of social media video advertising (18 per cent vs. 32 per cent) – again, video is a more natural format for performing arts than for museums
use crowdfunding (8 per cent vs. 19 per cent) – crowdfunding requires a significant investment of time and is often limited to specific projects rather than core business expenses, so it might be seen as too risky, but is this why museums are less likely to try it?
livestream performances (2 per cent vs. 12 per cent) – again, this is less likely to apply to museums than performing arts organisations
One of the key messages in Ed Corn's talk was that organisations are experimenting less, evaluating the impact of digital work less, and not using data in digital decision making. They're also scaling back on non-core work; some are focusing on consolidation – fixing the basics like websites (and mobile-friendly sites). Barriers include lack of funding, lack of in-house time, lack of senior digital managers, slow/limited IT systems, and lack of a digital supplier. (Many of those barriers were also listed in a small-scale survey on 'issues facing museum technologists' I ran in 2010.)
When you consider the impact of the cuts year on year since 2010, and that 'one in five regional museums at least part closed in 2015', some of those continued barriers are less surprising. At one point everyone I know still in museums seemed to be doing at least one job on top of theirs, as people left and weren't replaced. The cuts might have affected some departments more deeply than others – have many museums lost learning teams? I suspect we've also lost two generations of museum technologists – the retiring generation who first set up mainframe computers in basements, and the first generation of web-ish developers who moved on to other industries as conditions in the sector got more grim/good pay became more important. Fellow panelist Ros Lawler also made the point that museums have to deal with legacy systems while also trying to look at the future, and that museum projects tend to be slow when they could be more agile.
Like many in the audience, I really wanted to know who the 'digital leaders' – the 10% of organisations who thought digital was important, did more digital activities and reaped the most benefits from their investment – were, and what made them so successful. What can other organisations learn from them?
It seems that we still need to find ways to share lessons learnt, and to help everyone in the arts and cultural sectors learn how to make the most of digital technologies and social media. Training that meets the right need at the right time is really hard to organise and fund, and there are already lots of pockets of expertise within organisations – we need to get people talking to each other more! As I said at the event, most technology projects are really about people. Front of house staff, social media staff, collections staff – everyone can contribute something.
If you were there, have read the report or explored the data, I'd love to know what you think. And I'll close with a blatant plug: the MCG has two open calls for papers a year, so please keep an eye out for those calls and suggest talks or volunteer to help out!
I'm at the British Museum today for the Museums Computer Group's annual UK 'Museums on the Web' conference. UKMW15 has a packed line-up full of interesting presentations. As Chair of the MCG, I briefly introduced the event. My notes are below, in part to make sure that everyone who should be thanked is thanked! You can read a more polished version of this written with my Programme Committee Co-Chair Danny Birchall in a Guardian Culture Professionals article, 'How digital tech can bridge gaps between museums and audiences'.
UK Museums on the Web 2015: 'Bridging Gaps, Making Connections' #UKMW15
I'd like to start by thanking everyone who helped make today happen, and by asking the MCG Committee Members who are here today to stand up, so that you can chat to them, ideally even thank them, during the day. For those who don't know us, the Museums Computer Group is a practitioner-led group who work to connect, support and inspire anyone working in museum technology. (There are lots of ways to get involved – we're electing new committee members at our AGM at lunchtime, and we will also be asking for people to host next year's event at their museum or help organise a regional event.)
I'd particularly like to thank Ina Pruegel and Jennifer Ross, who coordinated the event, the MCG Committee members who did lots of work on the event (Andrew, Dafydd, Danny, Ivan, Jess, Kath, Mia, Rebecca, Rosie), and the Programme Committee members who reviewed presentation proposals sent in. They were: co-chairs: Danny Birchall and Mia Ridge, with Chris Michaels (British Museum), Claire Bailey Ross (Durham University), Gill Greaves (Arts Council England), Jenny Kidd (Cardiff University), Jessica Suess (Oxford University Museums), John Stack (Science Museum Group), Kim Plowright (Mildly Diverting), Matthew Cock (Vocal Eyes), Rachel Coldicutt (Friday), Sara Wajid (National Maritime Museum), Sharna Jackson (Hopster), Suse Cairns (Baltimore Museum of Art), Zak Mensah (Bristol Museums, Galleries & Archives).
And of course I'd like to thank the speakers and session chairs, the British Museum, Matt Caines at the Guardian, and in advance I'd like to thank all the tweeters, bloggers and photographers who'll help spread this event beyond the walls of this room.
Which brings me to the theme of the event, 'Bridging Gaps, Making Connections'. We've been running UK Museums on the Web since 2001; last year our theme was 'museums beyond the web' in recognition that barriers between 'web teams' and 'web projects' and the rest of the organisation were breaking down. But it's also apparent that the gap between tiny, small, and even medium-sized museums and the largest, best-funded museums meant that digital expertise and knowledge had not reached the entire sector. The government's funding cuts and burnout mean that old museum hands have left, and some who replace them need time to translate their experience in other sectors into museums. Our critics and audiences are confused about what to expect, and museums are simultaneously criticised for investing too much in technologies that disrupt the traditional gallery and for being 'dull and dusty'. Work is duplicated across museums, libraries, archives and other cultural organisations; academic and commercial projects sometimes seem to ignore the wealth of experience in the sector.
So today is about bridging those gaps, and about making new connections. (I've made my own steps in bridging gaps by joining the British Library as a Digital Curator.) We have a fabulous line-up representing the wealth and diversity of experience in museum technologies.
Ironically, the internet was down on the evening of Ada Lovelace Day 2015, an annual, international 'celebration of the achievements of women in science, technology, engineering and maths (STEM)', so I couldn't post at the time. Belatedly, the people whose achievements I've admired are:
Professor Monica Grady, whose joy when the probe Philae successfully landed on comet 67P/Churyumov-Gerasimenko during the Rosetta mission is just about the most wonderful thing on the internet (and she worked on one of the instruments on board, which is very cool). Like New Horizons sending back images of Pluto, it's a reminder of the awe-inspiring combination of planning, foresight, science and engineering in space that has made 2015 so interesting.
Finally, I love this image of Margaret Hamilton, lead software engineer on Project Apollo (1969), with some of the Apollo Guidance Computer (AGC) source code.
Back in September last year I blogged about what advances in machine learning mean for cultural heritage and digital humanities crowdsourcing projects that use simple tasks as the first step in public engagement: fun, easy tasks like image tagging and text transcription could be done by computers. (Broadly speaking, 'machine learning' is a label for technologies that allow computers to learn from the data available to them. It means they don't have to be specifically programmed to know how to do a task like categorising images – they can learn from the material they're given.)
One reason I like crowdsourcing in cultural heritage so much is that time spent on simple tasks can provide opportunities for curiosity, help people find new research interests, and help them develop historical or scientific skills as they follow those interests. People can notice details that computers would overlook, and those moments of curiosity can drive all kinds of new inquiries. I concluded that, rather than taking the best tasks from human crowdsourcers, 'human computation' systems that combine the capabilities of people and machines can free up our time for the harder tasks and more interesting questions.
I've been thinking about 'ecosystems' of crowdsourcing tasks since I worked on museum metadata games back in 2010. An ecosystem of tasks – for example, classifying images into broad types and topics in one workflow so that people can find text to transcribe on subjects they're interested in, and marking up that text with relevant subjects in a final workflow – means that each task can be smaller (and thereby faster and more enjoyable). Other workflows might validate the classifications or transcribed text, allowing participants with different interests, motivations and time constraints to make meaningful contributions to a project.
The New York Public Library's Building Inspector is an excellent example of this – they offer five tasks (checking or fixing automatically-detected building 'footprints', entering street numbers, classifying colours or finding place names), each as tiny as possible, which together result in a complete set of checked and corrected building footprints and addresses. (They've also pre-processed the maps to find the building footprints so that most of the work has already been done before they asked people to help.)
After teaching 'crowdsourcing cultural heritage' at HILT over the summer, where the concept of 'ecosystems' of crowdsourced tasks was put into practice as we thought about combining classification-focused systems like Zooniverse's Panoptes with full-text transcription systems, I thought it could be useful to give some specific examples of ecosystems for human computation in cultural heritage. If there are daunting data cleaning, preparation or validation tasks necessary before or after a core crowdsourcing task, computational ecosystems might be able to help. So how can computational ecosystems help pre- and post-process cultural heritage data for a better crowdsourcing experience?
While older ecosystems like Project Gutenberg and Distributed Proofreaders have been around for a while, we're only just seeing the huge potential for combining people + machines into crowdsourcing ecosystems. The success of the Smithsonian Transcription Center points to the value of 'niche' mini-projects, but breaking vast repositories into smaller sets of items about particular topics, times or places also takes resources. Machines can learn to classify source material by topic, by type, by difficulty or any other system that crowdsourcers can teach them. You can improve machine learning by giving systems 'ground truth' datasets with (for example) a crowdsourced transcription of the text in images, and as Ted Underwood pointed out on my last post, comparing the performance of machine learning and crowdsourced transcriptions can provide useful benchmarks for the accuracy of each method. Small, easy correction tasks can help improve machine learning processes while producing cleaner data.
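To make the benchmarking idea concrete, here's a minimal sketch of one common way to compare transcriptions: the character error rate, which counts the edits needed to turn a transcription into a trusted 'ground truth' version. This is an illustration rather than the method any project mentioned here actually uses; the function names and the sample sentences are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(reference, candidate):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, candidate) / len(reference)


# Hypothetical sample data: a checked transcription and two attempts at it.
ground_truth = "the helm was lashed"
machine_attempt = "the holm was lashed"
crowd_attempt = "the helm was lashed"

for label, attempt in [("machine", machine_attempt), ("crowd", crowd_attempt)]:
    print(label, round(character_error_rate(ground_truth, attempt), 3))
```

A lower rate means a transcription is closer to the ground truth. Running the same measure over machine-generated and crowdsourced versions of the same pages gives a rough, comparable benchmark for each method.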
Computational ecosystems might be able to provide better data validation methods. Currently, tagging tasks often rely on raw consensus counts when deciding whether a tag is valid for a particular image. This is a pretty crude measure – while three non-specialists might apply terms like 'steering' to a picture of a ship, a sailor might enter 'helm', 'tiller' or 'wheelhouse', but their terms would be discarded if no-one else enters them. Mining discipline-specific literature for relevant specialist terms, or finding other signals for subject-specific expertise would make more of that sailor's knowledge.
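As a rough sketch of the difference, the example below validates tags by raw consensus count, then optionally keeps single-vote tags that appear in a specialist vocabulary. Everything here is hypothetical: the `validate_tags` function, the two-vote threshold and the tiny maritime term list are assumptions for illustration, and a real project would need a much richer source of specialist terms.

```python
from collections import Counter


def validate_tags(entries, min_votes=2, specialist_terms=frozenset()):
    """Return accepted tags mapped to the reason they were accepted.

    entries: one string per tag entered by a participant.
    min_votes: how many matching entries count as consensus (an assumption).
    specialist_terms: lower-case terms, e.g. mined from discipline-specific
        literature, that are kept even if only one person enters them.
    """
    counts = Counter(tag.strip().lower() for tag in entries)
    accepted = {}
    for tag, votes in counts.items():
        if votes >= min_votes:
            accepted[tag] = "consensus"
        elif tag in specialist_terms:
            accepted[tag] = "specialist vocabulary"
    return accepted


# Hypothetical tags entered for one picture of a ship.
entries = ["steering", "Steering", "steering", "ship", "ship", "helm"]
maritime_terms = {"helm", "tiller", "wheelhouse"}

print(validate_tags(entries))                                   # consensus only
print(validate_tags(entries, specialist_terms=maritime_terms))  # keeps 'helm' too
```

With consensus alone, the sailor's 'helm' is thrown away because nobody else entered it; checking against the specialist vocabulary keeps it, and records why it was kept so it can be reviewed later.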
Computational ecosystems can help at the personal, as well as the project level. One really exciting development is computational assistance during crowdsourcing tasks. In Transcribing Bentham … with the help of a machine?, Tim Causer discusses TSX, a new crowdsourced transcription platform from the Transcribe Bentham and tranScriptorium projects. You can correct transcriptions generated computationally by handwritten text recognition (HTR), which is a big advance in itself. Most importantly, you can also request help if you get stuck transcribing a specific word. Previously, you'd have to find a friendly human to help with this task. And from here, it shouldn't be too difficult to combine HTR with computational systems to give people individualised feedback on their transcriptions. The potential for helping people learn palaeography is huge!
Better validation techniques would also improve the participants' experience. Providing personalised feedback on the first tasks a participant completes would help reassure them while nudging them to improve weaker skills.
Most science and heritage projects working on human computation are very mindful of the impact of their choices on the participants' experience. However, there's a risk that anyone who treats human computation like a computer science problem (for example, computationally assigning tasks to the people with the best skills for them) will lose sight of the 'human' part of the project. Individual agency is important, and learning or mastering skills is an important motivation. Non-profit crowdsourcing should never feel like homework. We're still learning about the best ways to design crowdsourcing tasks, and that job is only going to get more interesting.