Helping us fly? Machine learning and crowdsourcing

Image of a man in a flying contraption powered by birds
Moon Machine by Bernard Brussel-Smith via Serendip-o-matic

Over the past few years we've seen an increasing number of projects that take the phrase 'human-computer interaction' literally (perhaps turning 'HCI' into human-computer integration), organising tasks done by people and by computers into a unified system. One of the most obvious benefits of crowdsourcing on digital platforms has been the ability to coordinate the distribution and validation of tasks. Increasingly, data manually classified through crowdsourcing is being fed into computers to improve machine learning so that computers can learn to recognise images or words almost as well as we do. I've outlined a few projects putting this approach to work below.

This creates new challenges for the future: if fun, easy tasks like image tagging and text transcription can be done by computers, what are the implications for cultural heritage and digital humanities crowdsourcing projects that used simple tasks as the first step in public engagement? After all, Fast Company reported that 'at least one Zooniverse project, Galaxy Zoo Supernova, has already automated itself out of existence'. What impact will this have on citizen science and history communities? How might machine learning free us to fly further, taking on more interesting tasks with cultural heritage collections?

The Public Catalogue Foundation has taken tags created through Your Paintings Tagger and achieved impressive results in the art of computer image recognition: 'Using the 3.5 million or so tags provided by taggers, the research team at Oxford 'educated' image-recognition software to recognise the top tagged terms'. All paintings tagged with a particular subject (e.g. 'horse') were fed into feature extraction processes to build an 'object model' of a horse (a set of characteristics that would indicate that a horse is depicted), then tested to see whether the system could correctly tag horses.
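To make the tag-to-'object model' idea concrete, here is a minimal sketch. It is not the Oxford team's method: it assumes feature extraction has already turned each painting into a short list of numbers, and it swaps in a simple nearest-centroid classifier as the 'object model'. The feature values and the function names are made up for illustration.

```python
from collections import defaultdict
from math import dist


def build_object_models(tagged_features):
    """Average the feature vectors of every image that shares a crowd tag."""
    grouped = defaultdict(list)
    for tag, features in tagged_features:
        grouped[tag].append(features)
    return {
        tag: [sum(column) / len(vectors) for column in zip(*vectors)]
        for tag, vectors in grouped.items()
    }


def predict_tag(models, features):
    """Return the tag whose averaged 'object model' is closest to the features."""
    return min(models, key=lambda tag: dist(models[tag], features))


# Hypothetical two-number feature vectors standing in for real image features.
training = [
    ("horse", [0.9, 0.1]),
    ("horse", [0.8, 0.2]),
    ("ship", [0.1, 0.9]),
    ("ship", [0.2, 0.7]),
]
models = build_object_models(training)

# Hold back some tagged images to test whether the system tags them correctly.
held_out = [("horse", [0.85, 0.15]), ("ship", [0.15, 0.8])]
correct = sum(predict_tag(models, features) == tag for tag, features in held_out)
accuracy = correct / len(held_out)
```

The held-out step mirrors the 'then tested' part of the description: crowd tags provide both the training labels and the answers to check against.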

The BBC World Service archive used an 'open-source speech recognition toolkit to listen to every programme and convert it to text' and keywords, then asked people to check the correctness of the data created (Algorithms and Crowd-Sourcing for Digital Archives; see also What we learnt by crowdsourcing the World Service archive).
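As a rough illustration of the 'machine suggests, people check' pattern, the sketch below tallies volunteer votes on machine-generated keywords. It is not the BBC's actual pipeline; the thresholds, the status labels and the sample keywords are all assumptions.

```python
from collections import Counter


def validate_keywords(machine_keywords, votes, min_votes=3, min_agreement=0.6):
    """Sort machine-generated keywords by how volunteers voted on them.

    votes is a list of (keyword, is_correct) pairs, one per volunteer check.
    """
    tallies = {keyword: Counter() for keyword in machine_keywords}
    for keyword, is_correct in votes:
        if keyword in tallies:
            tallies[keyword][is_correct] += 1

    statuses = {}
    for keyword, tally in tallies.items():
        total = tally[True] + tally[False]
        if total < min_votes:
            statuses[keyword] = "needs more votes"
        elif tally[True] / total >= min_agreement:
            statuses[keyword] = "accepted"
        else:
            statuses[keyword] = "rejected"
    return statuses


# Hypothetical keywords and votes for one programme.
statuses = validate_keywords(
    ["radio", "jazz", "tractor"],
    [
        ("radio", True), ("radio", True), ("radio", True),
        ("jazz", True), ("jazz", False), ("jazz", False),
        ("tractor", True),
    ],
)
```

Keywords without enough votes stay in the queue rather than being accepted or rejected on one person's say-so.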

The CUbRIK project combines 'machine, human and social computation for multimedia search' in their technical demonstrator, HistoGraph. The SOCIAM: The Theory and Practice of Social Machines project is looking at 'a new kind of emergent, collective problem solving', including 'citizen science social machines'.

And of course the Zooniverse is working on this, most recently with Galaxy Zoo. A paper summarised on their Milky Way project blog outlines the powerful synergy between citizen scientists, professional scientists, and machine learning: 'citizens can identify patterns that machines cannot detect without training, machine learning algorithms can use citizen science projects as input training sets, creating amazing new opportunities to speed-up the pace of discovery', addressing the weakness of each approach if deployed alone.

Further reading: an early discussion of human input into machine learning is in Quinn and Bederson's 2011 Human Computation: A Survey and Taxonomy of a Growing Field. You can get a sense of the state of the field from various conference papers, including ICML ’13 Workshop: Machine Learning Meets Crowdsourcing and ICML ’14 Workshop: Crowdsourcing and Human Computing. There's also a mega-list of academic crowdsourcing conferences and workshops, though it doesn't include much on the tiny corner of the world that is crowdsourcing in cultural heritage.

Last update: March 2015. This post collects my thoughts on machine learning and human-computer integration as I finish my thesis. Do you know of examples I've missed, or implications we should consider?

QR tags edging towards mainstream?

A London-based 'tech PR' blog post said this week:

When Kelly Brooks starts appearing in ads featuring QR codes you know that the 2D dot matrix bar code technology is close to a tipping point. Brooks features in a Pepsi campaign that has gone live this week and images of her clutching a QR code have featured in most of the tabloids.

Source: QR codes and the Kelly Brooks Pepsi campaign, hat tip for link: Heleana Quartey.

See also p8tch.com, who say 'think of it as a TinyURL you can wear', and emmacott.com, who say 'wear your profile'. There's even a Facebook 'add to friends' QR app and a Google Charts QR Code API.
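For the curious, a request to the Google Charts QR Code API was just a URL with a few query parameters. The sketch below builds one using the parameter names as I remember them (cht, chs, chl); the function name and example address are mine, and Google has since retired the service, so treat this as a historical illustration rather than something to rely on.

```python
from urllib.parse import urlencode


def qr_chart_url(data, size=150):
    """Build a Google Charts style URL asking for a QR code image of data."""
    params = {
        "cht": "qr",               # chart type: QR code
        "chs": f"{size}x{size}",   # image size in pixels
        "chl": data,               # the text or URL to encode
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)


url = qr_chart_url("http://example.org/")
```

Pasting a URL like this into an image tag was enough to display a tag, which is part of why QR codes were so easy to experiment with.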

It's interesting timing, as QR codes were discussed in an MCG thread on 'Putting web addresses on interpretation' that was in essence about linking the offline physical world and online content.

While they're not mainstream enough to be a viable solution yet, we could be getting close to the tipping point where QR tags might become a viable way of bookmarking real world objects and locations. QR tags also provide a way of linking locations to online content without the requirements for a location-aware device.

Finding problems for QR tags to solve

QR tags (square or 2D barcodes that can hold up to 4,296 characters) are famously 'big in Japan'. Outside of Japan they've often seemed a solution in search of a problem, but we're getting closer to recognising the situations where they could be useful.
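The 4,296 figure is the ceiling for the largest QR symbol at the lowest error-correction level when the text uses only the 'alphanumeric' character set (digits, capital letters and a few symbols). As a minimal sketch, assuming those published maximums and ignoring kanji mode and mixed-mode encoding, you could check which mode a piece of text needs and whether it would fit in one tag. The function names are mine, and I've assumed UTF-8 for byte mode.

```python
DIGITS = set("0123456789")
ALPHANUMERIC = DIGITS | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")

# Maximum characters for the largest symbol at the lowest error correction level.
MAX_CAPACITY = {"numeric": 7089, "alphanumeric": 4296, "byte": 2953}


def qr_mode(text):
    """Pick the densest single encoding mode that can represent the text."""
    if all(ch in DIGITS for ch in text):
        return "numeric"
    if all(ch in ALPHANUMERIC for ch in text):
        return "alphanumeric"
    return "byte"


def fits_in_one_symbol(text):
    """Check the text against the maximum capacity for its mode."""
    mode = qr_mode(text)
    # Assumption: byte mode content is UTF-8 encoded.
    length = len(text.encode("utf-8")) if mode == "byte" else len(text)
    return length <= MAX_CAPACITY[mode]
```

One practical consequence: a lower-case URL falls back to byte mode, while the same URL in capitals fits the denser alphanumeric mode.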

There's a great idea in this blog post, Video Print:

By placing something like a QR code in the margin text at the point you want the reader to watch the video, you can provide an easy way of grabbing the video URL, and let the reader use a device that's likely to be at hand to view the video with…

I would use this a lot myself – my laptop usually lives on my desk, but that's not where I tend to read print media, so in the past I've ripped URLs out of articles or taken a photo on my phone to remind myself to look at them later, but I never get around to it. But since I always have my phone with me I'd happily snap a QR code (the Nokia barcode software is usually hidden a few menus down, but it's worth digging out because it works incredibly well and makes a cool noise when it snaps onto a tag) and use the home wifi connection to view a video or an extended text online.

As a 'call to action' a QR tag may work better than a printed URL because it saves typing in a URL on a mobile keyboard.

QR tags would also work well as physical world hyperlinks, providing a visible sign that information about a particular location is available online or as a short piece of text encoded in the QR tag. They could work as well for a guerrilla campaign to make contested or forgotten histories visible again – stickers are easy to produce and can be replaced if they weather – as for official projects to take cultural heritage content outside the walls of the museum.

The Powerhouse Museum have also experimented with QR tags, creating special offer vouchers.

Here's the obligatory sample QR – if your phone has a barcode reader you should get the URL of this blog*:

qrcode

* which is totally not optimised for mobile reading as the main pages tend to be quite long but it works ok over wifi broadband.

[Update – I just came across this post about Barcode wikipedia that suggests: "People would be able to access the info by entering/scanning the barcode number. The kind of information that would be stored against the product would be things like reviews, manufacturing conditions, news stories about the product/manufacturer, farm subsidies paid to the manufacturer etc." I'm a bit (ok, a lot) of a hippie and check product labels before I buy – I love this idea because it's like a version of the ethical shopping guide small enough to fit inside my wap phone.]

[Update 2 – more discussion of a 'what are QR codes good for' ilk over at http://blog.paulwalk.net/2008/10/24/quite-resourceful/]

Nice summary of web 2.0 for the digital humanities

It's an old post (2006, gasp!) but the points Web 2.0 and the Digital Humanities raises are still just as relevant in the digital cultural heritage sector today:

In summary:

  • Give users tools to visualise and network their own data. And make it easy.
  • Harness the self-interest of your users – "help the user with their own research interests as a first priority".
  • Have an API – "You don’t know what you’ve got until you give it away"; "Sharing data in a machine readable and retrievable format, is the most important feature. It lets other people build features for you".
  • Embrace the chaos of knowledge – "a bottom-up method of knowledge representation can be more powerful and more accurate than traditional top-down methods".

Microupdates and you (a.k.a. 'twits in the museum')

I was trying to describe Twitter-esque applications for a presentation today, and I wasn't really happy with 'microblogging' so I described them as 'micro-updates'. Partly because I think of them as a bit like Facebook status updates for geeks, and partly because they're a lot more actively social than blog posts.

In case you haven't come across them, Twitter, Pownce, Jaiku, tumblr, etc, are services that let you broadcast short (140 characters) messages via a website or mobile device. I find them useful for finding like-minded people (or just those who also fancy a drink) at specific events (thanks to Brian Kelly for convincing me to try it).

You can promote a 'hash tag' for use at your event – yes, it's a tag with a # in front of it, low tech is cool. Ideally your tag should be short and snappy yet distinct, because it has to be typed in manually (mistakes happen easily, especially from a mobile device) and it's using up precious characters. You can use tools like Summize, hashtags, Quotably or Twemes to see if anyone else has used the same tag recently.
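To show why a short tag matters, here is a small sketch that pulls hash tags out of a message and works out how many of the 140 characters are left once your event tag is appended. The pattern is a simplification (a # followed by letters, digits or underscores), not any service's official rule, and the function names and sample tag are made up.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MESSAGE_LIMIT = 140


def hashtags(message):
    """Return the tags used in a message, lower-cased so variants match."""
    return [tag.lower() for tag in HASHTAG.findall(message)]


def characters_left(message, tag):
    """Characters remaining after adding ' #tag' to the end of the message."""
    suffix = f" #{tag}"
    return MESSAGE_LIMIT - len(message) - len(suffix)


found = hashtags("Great talk on APIs #mw2008 #Museums")
remaining = characters_left("x" * 130, "mw2008")
```

A negative result from characters_left means the message needs trimming before the tag will fit.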

You can also ask people to use your event tag on blog posts, photos and videos to help bring together all the content about your event and create an ad hoc community of participants. Be aware that especially with Twitter-type services you may get fairly direct criticism as well as praise – incredibly useful, but it can seem harsh out of context (e.g. in a report to your boss).

More generally, you can use the same services above to search twitter conversations to find posts about your institution, events, venues or exhibitions. You can add in a search term and subscribe to an RSS feed to be notified when that term is used. For example, I tried http://summize.com/search?q="museum+of+london" and discovered a great review of the last 'Lates' event that described it as 'like a mini festival'. You should also search for common variations or misspellings, though they may return more false positives. When someone tweets (posts) using your search phrase it'll show up in your RSS reader and you can then reply to the poster or use the feedback to improve your projects.
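If you'd rather script this than watch an RSS reader, the sketch below filters a search feed for your institution's name and a common misspelling. The feed here is a made-up sample in plain RSS 2.0 rather than real output from any search service, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; a real one would be fetched from the search service.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>search results</title>
<item><title>Loved the Lates at the Museum of London, like a mini festival</title>
<link>http://example.org/1</link></item>
<item><title>Queueing outside the musuem of london</title>
<link>http://example.org/2</link></item>
<item><title>Lunch in the park</title>
<link>http://example.org/3</link></item>
</channel></rss>"""


def mentions(feed_xml, phrases):
    """Return (title, link) for feed items whose title contains any phrase."""
    wanted = [phrase.lower() for phrase in phrases]
    found = []
    for item in ET.fromstring(feed_xml).iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(phrase in title.lower() for phrase in wanted):
            found.append((title, link))
    return found


results = mentions(SAMPLE_FEED, ["museum of london", "musuem of london"])
```

Adding the misspelling catches the second post, at the cost of more false positives as the list of variations grows.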

This can be a powerful way to interact with your audience because you can respond directly and immediately to questions, complaints or praise. Of course you should also set up Google Alerts for blog posts and other websites, but micro-update services allow for an incredible immediacy and directness of response.

As an example, yesterday I tweeted (or twitted, if you prefer):

me: does anyone know how to stop firefox 3 resizing pages? it makes images look crappy

I did some searching [1] and found a solution, and posted again:

me: aha, it's browser.zoom.full or "View → Zoom → Zoom Text Only" on windows, my firefox is sorted now

Then, to my surprise, I got a message from someone involved with Firefox [2]:

firefox_answers: Command/Control+0 (zero, not oh) will restore the default size for a page that's been zoomed. Also View->Zoom->Reset

me: Impressed with @firefox_answers providing the answer I needed. I'd been looking in the options/preferences tabs for ages

firefox_answers: Also, for quick zooming in & out use control plus or control minus. in Firefox 3, the zoom sticks per site until you change it.

Not only have I learnt some useful tips through that exchange, I feel much more confident about using Firefox 3 now that I know authoritative help is so close to hand, and in a weird way I have established a relationship with them.

Finally, twitter et al have a social function – tonight I met someone who was at the same event I was last week who vaguely recognised me because of the profile pictures attached to Twitter profiles on tweets about the event. Incidentally, he's written a good explanation of twitter, so I needn't have written this!

[1] Folksonomies to the rescue! I'd been searching for variations on 'firefox shrink text', 'firefox fit screen', 'firefox screen resize' but since the article that eventually solved my problem called it 'zoom', it took me ages to find it. If the page was tagged with other terms that people might use to describe 'my page jumps, everything resizes and looks a bit crappy' in their own words, I'd have found the solution sooner.

[2] Anyone can create a username and post away, though I assume Downing Street is the real thing.

Introducing modern bluestocking

[Update, May 2012: I've tweaked this entry so it makes a little more sense.  These other posts from around the same time help put it in context: Some ideas for location-linked cultural heritage projects, Exposing the layers of history in cityscapes, and a more recent approach, '…and they all turn on their computers and say "yay!"' (aka, 'mapping for humanists'). I'm also including below some content rescued from the ning site, written by Joanna:

What do historian Catharine Macauley, scientist Ada Lovelace, and photographer Julia Margaret Cameron have in common? All excelled in fields where women’s contributions were thought to be irrelevant. And they did so in ways that pushed the boundaries of those disciplines and created space for other women to succeed. And, sadly, much of their intellectual contribution and artistic intervention has been forgotten.

Inspired by the achievements and exploits of the original bluestockings, Modern Bluestockings aims to celebrate and record the accomplishments not just of women like Macauley, Lovelace and Cameron, but also of women today whose actions within their intellectual or professional fields are inspiring other women. We want to build up an interactive online resource that records these women’s stories. We want to create a feminist space where we can share, discuss, commemorate, and learn.

So if there is a woman whose writing has inspired your own, whose art has challenged the way you think about the world, or whose intellectual contribution you feel has gone unacknowledged for too long, do join us at http://modernbluestocking.ning.com/, and make sure that her story is recorded. You'll find lots of suggestions and ideas there for sharing content, and plenty of willing participants ready to join the discussion about your favourite bluestocking.

And more explanation from modernbluestocking on freebase:

Celebrating the lives of intellectual women from history…

Wikipedia lists bluestocking as 'an obsolete and disparaging term for an educated, intellectual woman'.  We'd prefer to celebrate intellectual women, often feminist in intent or action, who have pushed the boundaries in their discipline or field in a way that has created space for other women to succeed within those fields.

The original impetus was a discussion at the National Portrait Gallery in London held during the exhibition 'Brilliant Women, 18th Century Bluestockings' (http://www.npg.org.uk/live/wobrilliantwomen1.asp) where it was embarrassingly obvious that people couldn't name young(ish) intellectual women they admired.  We need to find and celebrate the modern bluestockings.  Recording and celebrating the lives of women who've gone before us is another way of doing this.

However, at least one of the morals of this story is 'don't get excited about a project, then change jobs and start a part-time Masters degree'.  On the other hand, my PhD proposal was shaped by the ideas expressed here, particularly the idea of mapping as a tool for public history, e.g. by using geo-located stories to place links to content in the physical location.

While my PhD has drifted away from early scientific women, I still read around the subject and occasionally add names to modernbluestocking.freebase.com.  If someone's not listed in Wikipedia it's a lot harder to add them, but I've realised that if you want to make a difference to the representation of intellectual women, you need to put content where people look for information – i.e. Wikipedia.

And with the launch of Google's Knowledge Graph, getting history articles into Wikipedia then into Freebase is even more important for the visibility of women's history: "The Knowledge Graph is built using facts and schema from Freebase so everyone who has contributed to Freebase had a part in making this possible." (Source: this post to the Freebase list).  I'd go so far as to say that if it's worth writing a scholarly article on an intellectual woman, it's worth re-using your references to create or improve their Wikipedia entry.]

Anyway. On with the original post…]

I keep meaning to find the time to write a proper post explaining one of the projects I'm working on, but in the absence of time a copy and paste job and a link will have to do…

I've started a project called 'modern bluestocking' that's about celebrating and commemorating intellectual women activists from the past and present while reclaiming and redefining the term 'bluestocking'.  It was inspired by the National Portrait Gallery's exhibition, 'Brilliant Women: 18th-Century Bluestockings'.  (See also the review, Not just a pretty face).

It will be a website of some sort, with a community of contributors and it'll also incorporate links to other resources.

We've started talking about what it might contain and how it might work at modernbluestocking.ning.com (ning died, so it's at modernbluestocking.freebase.com…)

Museum application (something to make for mashed museum day?): collect feminist histories, stories, artefacts, images, locations, etc; support the creation of new or synthesised content with content embedded and referenced from a variety of sources. Grab something, tag it, display them, share them; comment, integrate, annotate others. Create a collection to inspire, record, commemorate, and build on.
What, who, how should this website look? Join and help us figure it out.

Why modernbluestocking? Because knowing where you've come from helps you know where you're going.

Sources could include online exhibition materials from the NPG (tricky interface to pull records from).  How can this be a geek/socially friendly project and still get stuff done?  Run a Modernbluestocking, community and museum hack day app to get stuff built and data collated?  Have list of names, portraits, objects for query. Build a collection of links to existing content on other sites? Role models and heroes from current life or history. Where is relatedness stored? 'Significance' – thorny issue? Personal stories cf other more mainstream content?  Is it like a museum made up of loan objects with new interpretation? How much is attribution of the person who added the link required? Login v not? Vandalism? How to deal with changing location or format of resources? Local copies or links? E.g. images. Local don't impact bandwidth, but don't count as visits on originating site. Remote resources might disappear – moved, permissions changed, format change, taken offline, etc, or be replaced with different content. Examine the sources, look at their format, how they could be linked to, how stable they appear to be, whether it's possible to contact the publisher…

Could also be interesting to make explicit, transparent, the processes of validation and canonisation.

Yahoo! SearchMonkey, the semantic web – an example from last.fm

I had meant to blog about SearchMonkey ages ago, but last.fm's post 'Searching with my co-monkey' about a live example they've created on the SearchMonkey platform has given me the kick I needed. They say:

The first version of our application deals with artist, album and track pages giving you a useful extract of the biography, links to listen to the artist if we have them available, tags, similar artists and the best picture we can muster for the page in question.

Some background on SearchMonkey from ReadWriteWeb:

At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they need to incentivize and enable publishers to control search result presentation.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience – enhanced results contain what Yahoo! calls an "infobar" – a set of overlays to present additional information.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best – the publishers – to define what should be presented and how.

(From Making the Web Searchable: The Story of SearchMonkey)

And from Yahoo!'s search blog:

This new developer platform, which we're calling SearchMonkey, uses data web standards and structured data to enhance the functionality, appearance and usefulness of search results. Specifically, with SearchMonkey:

  • Site owners can build enhanced search results that will provide searchers with a more useful experience by including links, images and name-value pairs in the search results for their pages (likely resulting in an increase in traffic quantity and quality)
  • Developers can build SearchMonkey apps that enhance search results, access Yahoo! Search's user base and help shape the next generation of search
  • Users can customize their search experience with apps built by or for their favorite sites

This could be an interesting new development – the question is, how well does the data we currently output play with it; could we easily adapt our pages so they're compatible with SearchMonkey; should we invest the time it might take? Would a simple increase in the visibility and usefulness of search results be enough? Could there be a greater benefit in working towards federated searches across the cultural heritage sector or would this require a coordinated effort and agreement on data standards and structure?

Update to link to the Yahoo! Search Blog post 'The Yahoo! Search Gallery is Open for Business', which has a few more examples.

Some ideas for location-linked cultural heritage projects

I loved the Fire Eagle presentation I saw at the WSG Findability event [my write-up] because it got me all excited again about ideas for projects that take cultural heritage outside the walls of the museum, and more importantly, it made some of those projects seem feasible.

There's also been a lot of talk about APIs into museum data recently and hopefully the time has come for this idea. It'd be ace if it was possible to bring museum data into the everyday experience of people who would be interested in the things we know about but would never think to have 'a museum experience'.

For example, you could be on your way to the pub in Stoke Newington, and your phone could let you know that you were passing one of Daniel Defoe's hang outs, or the school where Mary Wollstonecraft taught, or that you were passing a 'Neolithic working area for axe-making' and that you could see examples of the Neolithic axes in the Museum of London or Defoe's headstone in Hackney Museum.

That's a personal example, and those are some of my interests – Defoe wrote one of my favourite books (A Journal of the Plague Year), and I've been thinking about a project about 'modern bluestockings' that will collate information about early feminists like Wollstonecraft (contact me for more information) – but ideally you could tailor the information you receive to your interests, whether it's football, music, fashion, history, literature or soap stars in Melbourne, Mumbai or Malmo. If I can get some content sources with good geo-data I might play with this at the museum hack day.

I'm still thinking about functionality, but a notification might look something like "did you know that [person/event blah] [lived/did blah/happened] around here? Find out more now/later [email me a link]; add this to your map for sharing/viewing later".
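As a very rough sketch of how that might work, the code below uses the standard haversine formula to find places within a short walk of the phone's position and fills in the notification template for the ones that match your interests. The coordinates are made up and do not mark real locations; the field names, the 200 metre radius and the function names are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    half_dphi = (phi2 - phi1) / 2
    half_dlambda = radians(lon2 - lon1) / 2
    a = sin(half_dphi) ** 2 + cos(phi1) * cos(phi2) * sin(half_dlambda) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearby_notifications(here, places, interests, radius_m=200):
    """Build a message for each nearby place that matches the user's interests."""
    messages = []
    for place in places:
        if place["topic"] not in interests:
            continue
        if distance_m(here[0], here[1], place["lat"], place["lon"]) <= radius_m:
            messages.append(
                f"Did you know that {place['who']} {place['what']} around here?"
            )
    return messages


# Hypothetical records; the coordinates are placeholders, not real sites.
places = [
    {"who": "Daniel Defoe", "what": "lived", "topic": "literature",
     "lat": 51.5615, "lon": -0.0765},
    {"who": "Mary Wollstonecraft", "what": "taught", "topic": "history",
     "lat": 51.5200, "lon": -0.0900},
    {"who": "a local football club", "what": "was founded", "topic": "football",
     "lat": 51.5611, "lon": -0.0761},
]
messages = nearby_notifications((51.5610, -0.0760), places, {"literature", "history"})
```

Filtering by interest before measuring distance keeps the notifications personal, which is the point of the exercise.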

I've always been fascinated with the idea of making the invisible and intangible layers of history linked to any one location visible again. Millions of lives, ordinary or notable, have been lived in London (and in your city); imagine waiting at your local bus stop and having access to the countless stories and events that happened around you over the centuries. Wikinear is a great example, but it's currently limited to content on Wikipedia, and this content has to pass a 'notability' test that doesn't reflect local concepts of notability or 'interestingness'. Wikipedia isn't interested in the finds associated with an archaeological dig that happened at the end of your road in the 1970s, but with a bit of tinkering (or a nudge to me to find the time to make a better programmatic interface) you could get that information from the LAARC catalogue.

The nice thing about local data is that there are lots of people making content; the not nice thing about local data is that it's scattered all over the web, in all kinds of formats with all kinds of 'trustability', from museums/libraries/archives, to local councils to local enthusiasts and the occasional raving lunatic. If an application developer or content editor can't find information from trusted sources that fits the format required for their application, they'll use whatever they can find on other encyclopaedic repositories, hack federated searches, or they'll screen-scrape our data and generate their own set of entities (authority records) and object records. But what happens if a museum corrects and republishes a record that was wrong – will that change be reflected in various ad hoc data solutions? Surely it's better to acknowledge and play with this new information environment – better for our data and better for our audiences.

Preparing the data and/or the interface is not necessarily a project that should be specific to any one museum – it's the kind of project that would work well if it drew on resources from across the cultural heritage sector (assuming we all made our geo-located object data and authority records available and easily queryable; whether with a commonly agreed core schema or our own schemas that others could map between).
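The 'schemas that others could map between' option can be as simple as a lookup table per institution. This is only a sketch: the museum names, field names and records below are all hypothetical, and a real mapping would need to handle units, vocabularies and missing data.

```python
# Hypothetical per-museum mappings from a shared core field to the local field name.
FIELD_MAPS = {
    "museum_a": {"title": "object_name", "latitude": "lat", "longitude": "lng"},
    "museum_b": {"title": "name", "latitude": "geo_y", "longitude": "geo_x"},
}


def to_core(source, record):
    """Translate one museum's record into the shared core schema."""
    mapping = FIELD_MAPS[source]
    # Fields a museum doesn't supply come through as None rather than failing.
    return {core: record.get(local) for core, local in mapping.items()}


axe = to_core("museum_a", {"object_name": "Neolithic axe", "lat": 51.56, "lng": -0.07})
stone = to_core("museum_b", {"name": "Headstone", "geo_y": 51.55, "geo_x": -0.06})
```

Once records share field names, the same query or map display can work across institutions without each one changing its own database.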

Location-linked data isn't only about official cultural heritage data; it could be used to display, preserve and commemorate histories that aren't 'notable' or 'historic' enough for recording officially, whether that's grime pirate radio stations in East London high-rise roofs or the sites of Turkish social clubs that are now new apartment buildings. Museums might not generate that data, but we could look at how it fits with user-generated content and with our collecting policies.

Or getting away from traditional cultural heritage, I'd love to know when I'm passing over the site of one of London's lost rivers, or a location that's mentioned in a film, novel or song.

[Updated December 2008 to add – as QR tags get more mainstream, they could provide a versatile and cheap way to provide links to online content, or 250 characters of information. That's more information than the average Blue Plaque.]

Notes from 'How Can Culture Really Connect? Semantic Front Line Report' at MW2008

These are my notes from the workshop on 'How Can Culture Really Connect? Semantic Front Line Report' at Museums and the Web 2008. This session was expertly led by Ross Parry.

The paper, "Semantic Dissonance: Do We Need (And Do We Understand) The Semantic Web?" (written by Ross Parry, Jon Pratty and Nick Poole) and the slides are online. The blog from the original Semantic Web Think Tank (SWTT) sessions is also public.

These notes are pretty rough so apologies for any mistakes; I hope they're a bit useful to people, even though it's so late after the event. I've tried to include most of what was discussed but it's taken me a while to catch up.

There's so much to see at MW I missed the start of this session; when we arrived Ross had the participants debating the meaning of terms like 'Web 2.0', 'Web 3.0', 'semantic web', 'Semantic Web'.

So what is the semantic web (sw) about? It's about intelligent and efficient searching; discovering resources (e.g. URIs of picture, news story, video, biographical detail, museum object) rather than pages; machine-to-machine linking and processing of data.

Discussion: how much/what level of discourse do we need to take to curators and other staff in museums?
me: we need to show people what it can do, not bother them with acronyms.
Libby Neville: believes in involving content/museum people, not sure viewing through the prism of technology.
[?]: decisions about where data lives have an effect.

Slide 39 shows various axes against which the Semantic Web (as formally defined) and the semantic web (the SW 'lite'?) can be assessed.
Discussion: Aaron: it's context-dependent.

'expectations increase in proportion to the work that can be done' so the work never decreases.

sw as 'webby way to link data'; 'machine processable web' saves getting hung up on semantics [slide 40 quoting Emma Tonkin in BECTA research report, ‘If it quacks like a duck…’ Developments in search technologies].

What should/must/could we (however defined) do/agree/build/try next (when)?

Discussion: Aaron: tagging, clusters. Machine tags (namespace: predicate: value).
me: let's build semantic webby things into what we're doing now to help facilitate the conversations and agreements, provide real world examples – attack the problem from the bottom up and the top down.
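As a small example of a 'semantic webby thing' that could be built now, here is a sketch that splits machine tags into their three parts. It assumes the Flickr-style written form, namespace:predicate=value, and the pattern is a simplification rather than any site's exact validation rule. The sample tags are made up.

```python
import re

# Assumed form: namespace:predicate=value, e.g. geo:lat=51.5178
MACHINE_TAG = re.compile(
    r"^([a-z][a-z0-9_]*):([a-z][a-z0-9_]*)=(.+)$", re.IGNORECASE
)


def parse_machine_tag(tag):
    """Return (namespace, predicate, value), or None for an ordinary tag."""
    match = MACHINE_TAG.match(tag)
    return match.groups() if match else None


def group_by_namespace(tags):
    """Collect predicate/value pairs under each namespace, skipping plain tags."""
    grouped = {}
    for tag in tags:
        parsed = parse_machine_tag(tag)
        if parsed:
            namespace, predicate, value = parsed
            grouped.setdefault(namespace, {})[predicate] = value
    return grouped


grouped = group_by_namespace(["museum", "geo:lat=51.5178", "geo:lon=-0.0967"])
```

Ordinary tags pass through untouched, so machine tags can be layered onto an existing tagging system without breaking it.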

Slide 49 shows three possible modes: make collections machine-processable via the web; build ontologies and frameworks around added tags; develop more layered and localised meaning. [The data (the data around the data) gets smarter and richer as you move through those modes.]

I was reminded of this 'mash it' video during this session, because it does a good jargon-free job of explaining the benefits of semantic webby stuff. I also rather cynically tweeted that the semantic web will "probably happen out there while we talk about it".