Notes from ‘AI, Society & the Media: How can we Flourish in the Age of AI’

Before we start: in the spirit of the mid-2000s, I thought I’d have a go at blogging about events again. I’ve realised I miss the way that blogging and reading other people’s posts from events made me feel part of a distributed community of fellow travellers. Journal articles don’t have the same effect (they’re too long and jargony for leisure readers, assuming they’re accessible outside universities at all), and tweets are great for connecting with people, but they’re very ephemeral. Here goes…

BBC Broadcasting House

On September 3 I was at BBC Broadcasting House for ‘AI, Society & the Media: How can we Flourish in the Age of AI?’, organised by the BBC, the Leverhulme Centre for the Future of Intelligence (LCFI) and The Alan Turing Institute. Artificial intelligence is a hot topic, so it was a sell-out event. My notes are very partial (in both senses of the word), and please do let me know if there are errors. The event hashtag will provide more coverage: https://twitter.com/hashtag/howcanweflourish.

The first session was ‘AI – What you need to know!’. Matthew Postgate began by providing context for the BBC’s interest in AI. ‘We need a plurality of business models for AI – not just ad-funded’ – yes! The need for different models for AI (and related subjects like machine learning) was a theme that recurred throughout the day (and at other events I was at this week).

Adrian Weller spoke on the limitations of AI. It’s data hungry, compute intensive, poor at representing uncertainty, easily fooled by adversarial examples (and more that I missed). We need sensible measures of trustworthiness including robustness, fairness, protection of privacy, transparency.

Been Kim shared Google’s AI principles: https://ai.google/principles She’s focused on interpretability – the goals are to ensure that our values are aligned and our knowledge is reflected. She emphasised the need to understand your data (another theme across the day and other events this week). You can build an inherently interpretable machine learning model (so it can explain its reasoning), or you can build an interpreter, enabling conversations between humans and machines. You can then uncover bias using the interpreter, asking what weight it gave to different aspects in making decisions.

Jonnie Penn (who won me over with an early shout-out to the work of Jon Agar) asked: from where does AI draw its authority? AI is feeding a Google-Amazon-Facebook monopoly that controls the majority of internet traffic and advertising spend. Power lies in choosing what to optimise for, and in choosing what not to do (a tragically poor paraphrase of his example of advertising to children, but you get the idea). We need ‘bureaucratic biodiversity’ – lots of models of diverse systems to avoid calcification.

Kate Coughlan – only 10% of people feel they can influence AI. They looked at media narratives about AI on axes of time (ease vs obsolescence), power (domination vs uprising), desire (gratification vs alienation) and life (immortality vs inhumanity). Their survey found that each aspect was equally disempowering. Passivity drives negative feelings about change and tech – but if people have agency, it’s different. We need to empower citizens to take an active role in shaping AI.

The next session was ‘Fake News, Real Problems: How AI both builds and destroys trust in news’. Ryan Fox spoke on ‘manufactured consensus’ – we’re hardwired to agree with our community so you can manipulate opinion by making it look like everyone else thinks a certain way. Manipulating consensus is currently legal, though against social network T&S. ‘Viral false narratives can jeopardise brand trust and integrity in an instant’. Manufactured outrage campaigns etc. They’re working on detecting inorganic behaviour through the noise – it’s rapid, repetitive, sticky, emotional (missed some).

One of the panel questions was, would AI replace journalists? No, it’s more like having lots of interns – you wouldn’t have them write articles. AI is good for tasks you can explain to a smart 16 year old in the office for a day. The problematic ad-based model came up again – who is the arbiter of truth (e.g. fake news on Facebook). Who’s paying for those services and what power does it give them?

This panel made me think about discussions about machine learning and AI at work. There are so many technical, contextual and ethical challenges for collecting institutions in AI, from capturing the output of an interactive voice experience with Alexa, to understanding and recording the difference between Russia Today as a broadcast news channel and as a manipulator of YouTube rankings.

Next was a panel on ‘AI as a Creative Enabler’. Cassian Harrison spoke about ‘Made By Machine’, an experiment with AI and archive programming. They used scene detection, subtitle analysis, visual ‘energy’, machine learning on the BBC’s Redux archive of programmes. Programmes were ranked by how BBC4 they were; split into sections then edited down to create mini BBC4 programmes.

Kanta Dihal and Stephen Cave asked why AI fascinates us in a thoughtful presentation. It’s between dead and alive, uncanny (and lots more but clearly my post-lunch notetaking isn’t the best).

Anna Ridler and Amy Cutler have created an AI-scripted nature documentary (trained on and re-purposing a range of tropes and footage from romance novels and nature documentaries) and gave a brilliant presentation about AI as a medium and as a process. Anna calls herself a dataset artist, rather than a machine learning artist. You need to get to know the dataset, look out for biases and mistakes, understand the humanness of decisions about what was included or excluded. Machines enact distorted versions of language.

Slide: Diane Coyle on ‘Lessons for the era of AI’ (text transcribed in the notes below)

I don’t have notes from ‘Next Gen AI: How can the next generation flourish in the age of AI?’ but it was great to hear about hackathons where teenagers could try applying AI. The final session was ‘The Conditions for Flourishing: How to increase citizen agency and social value’. Hannah Fry – once something is dressed up as an algorithm it gains some authority that’s hard to question. Diane Coyle talked about ‘general purpose technologies’, which transform one industry then others: printing, steam, electricity, the internal combustion engine, digital computing, AI. Her ‘lessons for the era of AI’ were: ‘all technology is social; all technologies are disruptive and have unpredictable consequences; all successful technologies enhance human freedoms’, and accordingly she suggested we ‘think in systems; plan for change; be optimistic’.

Konstantinos Karachalios called for a show of hands re who feels they have control over their data and what’s done with it? Very few hands were raised. ‘If we don’t act now we’ll lose our agency’.

I’m going to give the final word to Terah Lyons as the key takeaway from the day: ‘technology is not destiny’.

I didn’t hear a solution to the problems of ‘fake news’ that doesn’t require work from all of us. If we don’t want technology to be destiny, we all need to pay attention to the applications of AI in our lives, and be prepared to demand better governance and accountability from private and government agents.

(A bonus ‘question I didn’t ask’ for those who’ve read this far: how do the BBC’s aims for ethical AI relate to the introduction of compulsory registration to access TV and radio? If I turn on the radio in my kitchen, my listening habits aren’t tracked; if I listen via the app, they’re linked to my personal ID).

Updates from Digital Scholarship at the British Library

I’ve been posting on the work blog far more frequently than I have here. Launching and running In the Spotlight, crowdsourcing the transcription of the British Library’s historic playbills collection, was a focus in 2017-18. Some blog posts:

And a press release and newsletters:

Other updates from work, including a new project, information about the Digital Scholarship Reading Group I started, student projects, and an open data project I shepherded:

Cross-post: Seeking researchers to work on an ambitious data science and digital humanities project

I rarely post here at the moment, in part because I post on the work blog. Here’s a cross-post to help spread the word about some exciting opportunities currently available: Seeking researchers to work on an ambitious data science and digital humanities project at the British Library and Alan Turing Institute (London)

‘If you follow @BL_DigiSchol or #DigitalHumanities hashtags on twitter, you might have seen a burst of data science, history and digital humanities jobs being advertised. In this post, Dr Mia Ridge of the Library’s Digital Scholarship team provides some background to contextualise the jobs advertised with the ‘Living with Machines’ project.

We are seeking to appoint people to several new roles; they will collaborate on an exciting new project developed by the British Library and The Alan Turing Institute, the national centre for data science and artificial intelligence.

Jobs currently advertised:

The British Library jobs are now advertised, closing September 21:

You may have noticed that the British Library is also currently advertising for a Curator, Newspaper Data (closes Sept 9). This isn’t related to Living with Machines, but with its approach of applying data-driven journalism and visualisation techniques to historical collections, it should have some lovely synergies and opportunities to share work in progress with the project team. There’s also a Research Software Engineer post advertised that will work closely with many of the same British Library teams.

If you’re applying for these posts, you may want to check out the Library’s visions and values on the refreshed ‘Careers’ website.’

My opening remarks for MCG’s Museums+Tech 2017

Below are my notes introducing the theme of the Museums Computer Group’s 2017 conference, with a call to action for people working in cultural heritage technology.

A divided world

2016 was the year that deep fractures came to the surface, but they’d been building for some time. We might live in the same country as each other, but we can experience it very differently. What we know about the state of the world is affected by where we live, our education, and by how (if?) we get our news.

Life in 2017

Cartoon of a dog surrounded by fire drinking coffee

    ‘This is fine’ (KC Green)

We can’t pretend that it’ll all go away and that society will heal itself. Divisions over Brexit, the role of propaganda in elections, climate change, the role of education, what we value as a society – they’re all awkward to address, but if we don’t it’s hard to see how we can move forward. And since we’re here to talk about museums – what role do museums have in divided societies? How much do they need to reflect voices they mightn’t agree with? Do we need to make ourselves a bit uncomfortable in order to make spaces for sharing experiences and creating empathy? Can (digital) experiences, collections and exhibitions in cultural heritage help create a shared understanding of the world?

‘arts and cultural engagement [helps] shape reflective individuals, facilitating greater understanding of themselves and their lives, increasing empathy with respect to others, and an appreciation of the diversity of human experience and cultures.’ From Understanding the value of arts & culture: The AHRC Cultural Value Project by Geoffrey Crossick & Patrycja Kaszynska

I’ve been struck lately by the observation that empathy can bridge divides, and give people the power to understand others. The arts and culture provide opportunities to ‘understand and share in another person’s feelings and experiences’ and connect the past to the present. How can museums – in all their different forms – contribute to a more empathic (and maybe eventually less divided) society?

‘The greatest benefit we owe to the artist, whether painter, poet, or novelist, is the extension of our sympathies. … Art is the nearest thing to life; it is a mode of amplifying experience and extending our contact with our fellow-men beyond the bounds of our personal lot.’ George Eliot, as quoted in Peter Bazalgette’s The Empathy Instinct

Digital experiences aren’t shared in the same way as physical ones, and ‘social’ media isn’t the same as being in the same space as someone experiencing the same thing, but they have other advantages – I hope we’ll learn about some today.

We need to tell better stories about museums and computers

Woman with buckets of computer cables
Engineer Karen Leadlay in Analog Computer Lab

Shifting from the public to staff in museums… Museums have been using technology to serve audiences and manage collections for decades. But still it feels like museums are criticised for simultaneously having too much and too little technology. Shiny apps make the news, but they’re built on decades of digitisation and care from heritage organisations. There’s a lot museums could do better, and digital expertise is not evenly distributed or recognised, but there’s a lot that’s done well, too. My challenge to you is to find and share better stories about cultural heritage technologies connecting collections, people and knowledge. If we don’t tell those stories, they’ll be told about us. Too many articles and puff pieces ignore the thoughtful, quotidian and/or experimental work of experts across the digital cultural heritage sector.

[Later in the day I mentioned that the conference had an excellent response to the call for papers – we learnt about more interesting projects than we had room to fit in, so perhaps we should encourage more people to post case studies to the MCG’s discussion list and website.]

The Museums+Tech 2017 programme

  • Keynote: ‘What makes a Museum?’
  • Museums in a post-truth world of fake news
  • Challenging Expectations
  • Dealing with distance; bringing the museum to the people
  • How can museums use sound and chatbots?
  • Looking (back to look) forward

Speaking of better stories – I’m looking forward to hearing from all our speakers today – they’re covering an incredible range of topics, approaches and technologies, so hopefully each of you will leave full of ideas. Join us for drinks afterwards to keep the conversation going. And to set the tone for the day, it’s a great time to hear Hannah Fox on the topic of ‘what makes a museum’.

Speaking of the conference – a lot of people helped out in different ways, so thanks to them all!


Do people want access to digitised collections?

Drawing of the Battle of Lincoln from Henry of Huntingdon’s Historia Anglorum, British Library, Arundel 48. Viewed 33 million times on the front page of Italian Wikipedia in Feb 2017.

Someone asked me recently if there’s any evidence that people really want access to digitised collections, so I popped onto twitter and asked, ‘Does anyone have a good example of a digitised image on Wikimedia or similar that reached a huge audience compared to the GLAM’s own site?’. Here are the responses I received:

Michael Gasser @M_Gasser mentioned a photo from Zurich’s ETH Library that by mid-September had 160,000 views on the Wikipedia page about Sagrada Familia, dwarfing views on their own site. He also shared a blog post about their project, Reaching out to new users. ETH Library’s archives in the world of Wikimedia.

Jason Evans, (@WIKI_NLW), Wikimedian at the National Library of Wales said, ‘We shared around 15,000 images from @NLWales about 2 years ago and they have been viewed over 300 million times on Wiki’, and ‘This image by Magnum Photographer Philip Jones Griffiths is our most viewed with around half a mil views each month [link to stats on BaGLAMa]‘.

Pat Hadley (@PatHadley) said ‘Coins from @YorkshireMuseum get loads of traffic [link to stats on BaGLAMa] thanks to @YMT_Coins work long after my residency!’. Andrew Woods @YMT_Coins expanded that the project wasn’t just about getting big numbers: ‘My aims were more associated w proof of concept. Can we do this? How long does it take? Possible with volunteers with no previous exp? Etc’. It’s fantastic to see this sort of experiment with specialist collections.

Helge David (@helge_david) shared a link to a YouTube video of The Roentgens’ Berlin Secretary Cabinet, saying ‘14.1 million views of an 18th century cabinet suggests the right object can catch people’s imagination when some care is taken to make it intellectually accessible and freely available online.’ The video proves that perfectly, I think.

Sara Devine (@SaraDevine) replied to say ‘Yes! We have several @brooklynmuseum examples from past project[s]’, linking to “Africanizing” Wikipedia, one of Brooklyn Museum’s experiments with sharing images and improving content on Wikipedia.

Merete Sanderhoff (@MSanderhoff) said ‘This painting @smkmuseum is not on display but widely used on Wikipedia i.e. in entry on Lions [Christian VIII og Caroline Amalie i salvingsdragt.jpg] (thx @LizzyJongma :)’ and that ‘Some of the most popular @rijksmuseum images on Wikimedia are hidden treasures like Het kanonschot, Willem van de Velde (II), ca. 1680 and Het kasteel van Batavia, Andries Beeckman, ca. 1661‘.

Aron Ambrosiani‏ (@AronAmbrosiani) said ‘this one, on the “walrus” wikipedia page, had 280 000 views last month 🙂 Photo from @Skansen in 1908: [a man in a top hat feeding a walrus]’.

Illtud Daniel‏ (@illtud) simply linked to a tweet saying that a National Library of Wales image was used on Europeana’s 404 page, asking ‘Is this cheating?’.

Discussing images from the British Library, my colleague Ben O’Steen (@benosteen) noted that a manuscript image of Stephen of England had 735,324,085 views when it was on the front page of the English-language Wikipedia in October 2016.

Maarten Brinkerink and Johan Oomen provided an update on a 2011 post on usage of the Dutch Open Images platform for audiovisual material via email:

As of May 2017, ‘On average we get 19 million page views a month on articles that feature material from our archive. This exposure is generated by the 9,000 articles that reuse our material (spread over more than 100 languages versions of Wikipedia).

Since we’ve been available for reuse on Wikimedia Commons, in total, pages that reuse our content have generated 668 million page views.

To date we have donated about 10,000 digital objects to Wikimedia Commons, of which 35% are actually being reused in one article or more.’

As you can tell by the number of links to stats on BaGLAMa, this tool is key for organisations that want to understand where their images are being viewed across Wikimedia. The huge spike in the chart shows the month mentioned by Ben, when Stephen of England hit the front page of Wikipedia. (A few years ago I posted tips on Who loves your stuff? How to collect links to your site.)

British Library stats on BaGLAMa.
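BaGLAMa aggregates numbers from the public Wikimedia Pageviews REST API, which you can also query directly. As a minimal sketch (the URL template follows the Pageviews API documentation; the helper name and the example request are my own), this builds the request for monthly views of a single article:

```python
from urllib.parse import quote

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="monthly"):
    """Build a Wikimedia Pageviews REST API per-article request URL.
    Article titles use underscores for spaces and must be URL-encoded;
    start/end are YYYYMMDD date strings."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    title = quote(article.replace(" ", "_"), safe="")
    return f"{base}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"

# Monthly user (non-bot) views of one article during October 2016:
url = pageviews_url("en.wikipedia.org", "Stephen, King of England",
                    "20161001", "20161031")
print(url)
```

Fetching that URL returns JSON with a views count per month; BaGLAMa effectively does this for every page that embeds an institution’s images.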

 

Thanks to the examples shared in response to a single tweet, it seems clear that even if people don’t say to themselves, ‘what I really want is an image from a museum, archive or library’, when they want the answer to a question, content from cultural institutions helps make that answer a good one. Views of images on an institution’s own site might be relatively low, but making those images reusable by Wikimedia and other sites like Retronaut clearly has an impact. It’s not just that someone has done the work to put items in context and make them intellectually (or emotionally) accessible; it’s also that they’re placed on sites and platforms that people are already used to visiting. Access to digitised collections provides a useful public service, provoking curiosity and wonder, and teaching us about the past.


From piles of material to patchwork: How do we embed the production of usable collections data into library work?

These notes were prepared for a panel discussion at the ‘Always Already Computational: Collections as Data’ (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I’ve also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called ‘From libraries as patchwork to datasets as assemblages?’ but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I’m focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I’ve condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it’s not, then why are we doing it? It has to be embedded, because data in one-off projects gets isolated and stale. ‘Production’ is there because infrastructure and workflows are unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL’s collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelming, jumble-sale feeling of using a huge aggregator like Europeana, Trove or the DPLA, where you feel there’s treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institutions deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the ‘gold standard’ cataloguing already in progress. If they’re made widely available, these methods might help bridge the resourcing gaps that mean it’s easier to find items from richer institutions and countries than from poorer ones.

Photo of piles of material

You probably already all know this, but it’s worth remembering: our collections aren’t even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use, are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don’t yet have digital catalogue records. Some items can’t be digitised because they’re too big, small or fragile for scanning or photography; others can’t be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, ‘data’ may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements, or any combination of these and more.)

Image credit: https://www.flickr.com/photos/teen_s/6251107713/

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it’s often as great big lumps of things. It’s hard for normal scholars to process (or even just unzip) 4GB of data.

Currently, datasets are often created outside normal processes, and over time they become ‘stale’ as they’re not updated when source collections records change. And even when scholars manage to unzip them, the records rely on internal references – name authorities for people, places, etc. – that can only be seen as strings rather than things until extra work is undertaken.

The BL’s metadata team have experimented with ‘researcher format’ CSV exports around specific themes (e.g. an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don’t mean SPARQL, but rather faceted browsing or structured search forms.)

Image credit: screenshot from http://data.bl.uk/
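To make the ‘researcher format’ idea concrete, here’s a small hypothetical sketch (the records, field names and theme values are all invented): filter catalogue records by a theme and emit a tidy CSV that a researcher could open in a spreadsheet.

```python
import csv
import io

# A few hypothetical, heavily simplified catalogue records.
records = [
    {"id": "BL001", "title": "A playbill from Drury Lane", "theme": "theatre", "date": "1807"},
    {"id": "BL002", "title": "Map of the Thames estuary", "theme": "maps", "date": "1823"},
    {"id": "BL003", "title": "Playbill for a benefit night", "theme": "theatre", "date": "1812"},
]

def researcher_csv(records, theme):
    """Write the records matching one theme to CSV, in a stable column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "title", "date"])
    writer.writeheader()
    for rec in records:
        if rec["theme"] == theme:
            writer.writerow({k: rec[k] for k in ("id", "title", "date")})
    return out.getvalue()

print(researcher_csv(records, "theatre"))
```

The real work, of course, is upstream: deciding which fields are meaningful to researchers and keeping the export in sync with the live catalogue.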

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out of text and images. Sometimes it’s operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it’s a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they’re finally recorded in a bespoke database.
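The OpenRefine-style step is, at heart, fuzzy matching of name strings against an authority list. Here’s a toy sketch using Python’s standard library (the authority headings and identifiers are invented, and real reconciliation services score candidates far more cleverly):

```python
import difflib

# A tiny invented authority file: preferred heading -> identifier.
authority = {
    "Dickens, Charles, 1812-1870": "auth:dickens-c",
    "Eliot, George, 1819-1880": "auth:eliot-g",
    "Darwin, Charles, 1809-1882": "auth:darwin-c",
}

def reconcile(name, cutoff=0.6):
    """Suggest the closest authority heading for a raw name string.
    Returns (heading, identifier), or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, authority, n=1, cutoff=cutoff)
    if not matches:
        return None
    heading = matches[0]
    return heading, authority[heading]

print(reconcile("Dickens, Charles"))   # close enough to the Dickens heading
print(reconcile("Mortimer Wheeler"))   # no plausible match
```

In OpenRefine this matching is interactive, with a human confirming or rejecting each suggestion – exactly the ‘technologist’ mode described above.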

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we’re to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by cataloguers, which by software, which by external groups, etc.

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There’s a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.

A lot of this work has been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should turn to turning local references into URIs and making the most of that investment.

Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can’t easily record possible attributions, and there’s understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we’re not to be held back by the need for records to be perfectly complete before they’re published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don’t overly rely on our descriptions, making assumptions about the presence or absence of information when it’s not appropriate.

Image credit: http://europeana.eu/portal/record/2021648/0180_N_31601.html
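One way to stop imperfect records blocking publication is to record the qualifier alongside each link, so a ‘possible’ attribution can be published as exactly that. A sketch of such a structure (the field names and vocabulary are illustrative, not a proposed standard):

```python
from dataclasses import dataclass, asdict

@dataclass
class QualifiedLink:
    """A link from a collection item to an external identifier,
    with the certainty and provenance of the attribution made explicit."""
    target: str     # placeholder identifier, e.g. an authority-file or Wikidata ID
    relation: str   # e.g. 'creator', 'depicts', 'place_of_publication'
    qualifier: str  # e.g. 'certain', 'probable', 'possible', 'disputed'
    source: str     # who asserted it: 'cataloguer', 'software', 'crowdsourced'

link = QualifiedLink(target="auth:example-person", relation="creator",
                     qualifier="possible", source="software")
print(asdict(link))
```

Interfaces can then surface the qualifier to users, so a machine-suggested ‘possible creator’ is never mistaken for a cataloguer’s confirmed attribution.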

Photo of pipes over a building

A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL’s systems are designed around the ‘deliverable unit’ – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.
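This page- and image-level addressability is what the IIIF Image API standardises: every image, and every region of an image, gets its own URI built from a fixed template. A sketch (the server and identifier are invented; the template itself follows the IIIF Image API specification, with ‘max’ as the size keyword from version 3.0):

```python
def iiif_image_url(server, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request URI following the IIIF Image API template:
    {scheme}://{server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
    return f"https://{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# A whole (invented) manuscript page, then a cropped detail of the same page:
page = iiif_image_url("iiif.example.org/images", "ms-1234-f001r")
detail = iiif_image_url("iiif.example.org/images", "ms-1234-f001r",
                        region="100,100,600,400")
print(page)
print(detail)
```

Because the URI addresses a single image rather than a bound volume, annotations and authority links can attach at exactly the right level.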

(I’d love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren’t yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357
My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We’re making a start and there’s a lot of thoughtful work behind the scenes, but research libraries may need to invest a bit more to become as comfortable with data users as they are with the readers who pass through their physical doors.

Trying computational data generation and entity extraction

I’ve developed this exercise on computational data generation and entity extraction for various information/data visualisation workshops I’ve been teaching lately. As these methods have become more accessible, my dataviz workshops have included more discussion of computational methods for generating data to be visualised. There are two versions of the exercise – the first works with images, the second with text.

In teaching I’ve found that services that describe images are more accessible and generate richer discussion in class than text-based sites, but it’s handy to have the option for people who work with text. If you try something like this in your classes I’d love to hear from you.

It’s also a chance to talk about the uses of these technologies in categorising and labelling our posts on social media. We can tell people that their social media posts are analysed for personality traits and mentions of brands, but seeing it in action is much more powerful.

Image exercise: trying computational data generation and entity extraction

Time: c. 5 minutes plus discussion.

Goal: explore methods for extracting information from text or an image and reflect on what the results tell you about the algorithms

1. Find a sample image

Find an image (e.g. from a news site or digitised text) you can download and drag into the window. It may be most convenient to save a copy to your desktop. Many sites let you load images from a URL, so right- or control-clicking to copy an image location for pasting into the site can be useful.

2. Work in your browser

It’s probably easiest to open each of these links in a new browser window. It’s best to use Firefox or Chrome, if you can. Safari and Internet Explorer may behave slightly differently on some sites. You should not need to register to use these sites – please read the tips below or ask for help if you get stuck.

3. Review the outputs

Make notes, or discuss with your neighbour. Be prepared to report back to the group.

  • What attributes does each tool report on?
  • Which attributes, if any, were unique to a service?
  • Based on this, what do Clarifai, Google, IBM and Microsoft seem to think is important to them (or to their users)?
  • How many possible entities (concepts, people, places, events, references to time or dates, etc) did it pick up?
  • Is any of the information presented useful?
  • Did it label anything incorrectly?
  • What options for exporting or saving the results did the demo offer? What about the underlying service or software?
  • For tools with configuration options – what could you configure? What difference did changing classifiers or other parameters make?
  • If you tried it with a few images, did it do better with some than others? Why might that be?
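If you want to compare the services’ outputs more systematically, the reported tags can be treated as sets: the tags every service agrees on, and the tags unique to each, hint at what each vendor prioritises. A minimal sketch (the service names and tag lists here are invented for illustration, not real API output):

```python
# Compare tag sets returned by different image-labelling services.
# The service names and tags below are invented examples, not real output.
service_tags = {
    "Service A": {"people", "street", "monochrome", "vintage", "crowd"},
    "Service B": {"people", "street", "photograph", "history"},
    "Service C": {"crowd", "street", "building", "people"},
}

# Tags every service agreed on.
shared = set.intersection(*service_tags.values())

# Tags unique to each service: its tags minus everyone else's combined tags.
unique = {
    name: tags - set.union(*(t for n, t in service_tags.items() if n != name))
    for name, tags in service_tags.items()
}

print("Shared:", sorted(shared))
for name, tags in unique.items():
    print(f"Unique to {name}:", sorted(tags))
```

With the invented tags above, all three services agree on ‘people’ and ‘street’, while the monochrome/vintage labels appear only in one – the kind of difference the discussion questions above are probing.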

Text exercise: trying computational data generation and entity extraction

Time: c. 5 minutes plus discussion.

Goal: explore the impact of source data and algorithms on input text

1. Grab some text

You will need some text for this exercise. If you have something you’re working on handy, you can use that. If you’re stuck for inspiration, pick a front page story from an online news site. Keep the page open so you can copy a section of text to paste into the websites.

2. Compare text entity labelling websites

  • Open three more browser windows or tabs
  • In one, go to DBpedia Spotlight https://dbpedia-spotlight.github.io/demo/. Paste your copied text into the box, or keep the sample text in the box. Hit ‘Annotate’.
  • In another, go to Ontotext http://tag.ontotext.com/. You may need to click through the opening screen. Paste your copied text into the box. Hit ‘annotate’.
  • Finally, go to Stanford Named Entity Tagger http://nlp.stanford.edu:8080/ner/. Paste your text into the box. Hit ‘Submit query’.

3. Review the outputs

  • How many possible entities (concepts, people, places, events, references to time or dates, etc) did each tool pick up? Is any of the other information presented useful?
  • Did it label anything incorrectly?
  • What if you change classifiers or other parameters?
  • Does it do better with different source material?
  • What differences did you find between the tools? What do you think caused those differences?
  • How much can you find out about the tools and the algorithms they use to create labels?
  • Where does the data underlying the process come from?
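The differences you see come down to each tool’s underlying data and method. As a toy illustration (not how any of the tools above actually work), contrast a gazetteer tagger – dictionary lookup against a list of known entities, loosely analogous to knowledge-base-backed tools – with a crude capitalisation heuristic standing in for a statistical guesser. Both ‘taggers’ here are invented for this sketch:

```python
import re

def gazetteer_tagger(text, gazetteer):
    """Label only entities that appear in a known-entity dictionary."""
    return [(name, label) for name, label in gazetteer.items() if name in text]

def capitalisation_tagger(text):
    """Crudely guess that any capitalised word after the first is an entity."""
    words = re.findall(r"[A-Za-z]+", text)
    # Skip the first word, which is capitalised regardless.
    return sorted({w for w in words[1:] if w[0].isupper()})

text = "Ada Lovelace worked with Charles Babbage in London."
gazetteer = {"London": "PLACE", "Ada Lovelace": "PERSON"}

print(gazetteer_tagger(text, gazetteer))
print(capitalisation_tagger(text))
```

The gazetteer gives confident labels but misses ‘Charles Babbage’ entirely (he isn’t in its dictionary); the heuristic catches every name but can’t group multi-word entities or say what kind of thing they are. Real tools sit somewhere between these extremes, which is why their source data matters so much.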

Spoiler alert!

[Screenshot: Clarifai’s image recognition tool with a historical image]

Keynote online: ‘Reaching out: museums, crowdsourcing and participatory heritage’

In September I was invited to give a keynote at the Museum Theme Days 2016 in Helsinki. I spoke on ‘Reaching out: museums, crowdsourcing and participatory heritage’. In lieu of my notes or slides, the video is below. (Great image, thanks YouTube!)

Crowdsourcing in cultural heritage, citizen science – September 2016

More new projects and project updates I’ve noticed over September 2016.

Gillian Lattimore @Irl_HeritageDig has posted some of her dissertation research on Crowdsourcing Motivations in a GLAM Context: A Research Survey of Transcriber Motivations of the Meitheal Dúchas.ie Crowdsourcing Project. dúchas.ie is ‘a project to digitize the National Folklore Collection of Ireland, one of the largest folklore collections in the world’.

A long read on Brighton Pavilion and Museums’ Map The Museum, ‘#HeritageEveryware Map The Museum: connecting collections to the street‘ includes some great insights from Kevin Bacon.

Meghan Ferriter and Christine Rosenfeld have produced a special edition of a journal, ‘Exploring the Smithsonian Institution Transcription Center‘ with articles on ‘Crowdsourcing as Practice and Method in the Smithsonian Transcription Center’ and more.

Two YouGov posts on American and British people’s knowledge of their recent family history provide some useful figures on how many people in each region have researched family history.

Richard Light has posted some interesting questions and feedback for crowdsourcing projects at ‘The GB1900.org project – first look’.

‘Archiving the Civil War’s Text Messages’ provides more information about the Decoding the Civil War project.

Zooniverse blog post ‘Why Cyclone Center is the CrockPot of citizen science projects‘ gives some insight into why some projects appear ‘slower’ than others.

A December 2015 post, ‘How a citizen science app with over 70,000 users is creating local community’ (HT Jill Nugent ‏@ntxscied), is an interesting contrast to ‘Volunteer field technicians are bad for wildlife ecology‘. A nice quote from the first piece: ‘Young says that the number one thing that keeps iNaturalist users involved is the community that they create: “meeting other people who are into the same thing I am”’.

iNaturalist Bioblitzes are also more evidence for the value of time-limited challenges, or as they describe them, ‘a communal citizen-science effort to record as many species within a designated location and time period as possible’.

Micropasts continue to add historical and archaeological projects.

Survey of London and CASA launched the Histories of Whitechapel website, providing ‘a new interactive map for exploring the Survey’s ongoing research into Whitechapel’ and ‘inviting people to submit their own memories, research, photographs, and videos of the area to help us uncover Whitechapel’s long and rich history’.

New Zooniverse project Mapping Change: ‘Help us use over a century’s worth of specimens to map the distribution of animals, plants, and fungi. Your data will let us know where species have been and predict where they may end up in the future!’

New Europeana project Europeana Transcribe: ‘a crowdsourcing initiative for the transcription of digital material from the First World War, compiled by Europeana 1914-1918. With your help, we can create a vast and fully digital record of personal documents from the collection.’

‘Holiday pictures help preserve the memory of world heritage sites’ introduces Curious Travellers, a ‘data-mining and crowd sourced infrastructure to help with digital documentation of archaeological sites, monuments and heritage at risk’. Or in non-academese, send them your photos and videos of threatened historic sites, particularly those in ‘North Africa, including Cyrene in Libya, as well as those in Syria and the Middle East’.

I’ve added two new international projects, Les herbonautes, a French herbarium transcription project led by the Paris Natural History Museum, and Loki, a Finnish project on maritime and coastal history, to my post on Crowdsourcing the world’s heritage – as always, let me know of other projects that should be included.

[Screenshot: Survey of London site]

Crowdsourcing in cultural heritage, citizen science – recent updates

A small* collection of links from the past little while.

Projects

  • A new Zooniverse project, Decoding the Civil War, launched in June: ‘Witness the United States Civil War by transcribing and deciphering messages and codes from the United States Military Telegraph’.
  • Another Zooniverse project, Camera CATalogue: ‘Analyze Wildlife Photos to Help Panthera Protect Big Cats’.

Articles

  • Palmer, Stuart, and Deb Verhoeven, ‘Crowdfunding Academic Researchers–the Importance of Academic Social Media Profiles’, in ECSM 2016: Proceedings of the 3rd European Conference on Social Media (Academic Conferences and Publishing International, 2016), pp. 291–299
  • Preece, Jennifer, ‘Citizen Science: New Research Challenges for Human–Computer Interaction’, International Journal of Human-Computer Interaction, 32 (2016), 585–612 <http://dx.doi.org/10.1080/10447318.2016.1194153>
  • Dillon, Justin, Robert B. Stevenson, and Arjen E. J. Wals, ‘Introduction: Special Section: Moving from Citizen to Civic Science to Address Wicked Conservation Problems’, Conservation Biology, 30 (2016), 450–55 <http://dx.doi.org/10.1111/cobi.12689> – has an interesting new model, putting citizen sciences ‘on a continuum from highly instrumental forms driven by experts or science to more emancipatory forms driven by public concern. The variations explain why citizens participate in CS and why scientists participate too. To advance the conversation, we distinguish between three strands or prototypes: science-driven CS, policy-driven CS, and transition-driven civic science.’

    ‘We combined Jickling and Wals’ (2008) heuristic for understanding environmental and sustainability education (Jickling & Wals 2008) and M. Fox and R. Gibson’s problem typology (Fig. 1) to provide an overview of the different possible configurations of citizen science (Fig. 2). The heuristic has 2 axes. We call the horizontal axis the participation axis, along which extend the possibilities (increasing from left to right) for stakeholders, including the public, to participate in setting the agenda; determining the questions to be addressed; deciding the mechanisms and tools to be used; choosing how to monitor, evaluate, and interpret data; and choosing the course of action to take. The vertical (goal) axis shows the possibilities for autonomy and self-determination in setting goals and objectives. The resulting quadrants correspond to a particular strand of citizen science. All three occupied quadrants are important and legitimate.’

    A heuristic of citizen science based on Wals and Jickling (2008). From Dillon, Justin, Robert B. Stevenson, and Arjen E. J. Wals (2016)

    * It’s a short list this month as I’ve been busy and things seem quieter over the northern hemisphere summer.