Sentiment analysis is opinion turned into code

Modern elections are data visualisation bonanzas, and the 2015 UK General Election is no exception.

Last night seven political leaders presented their views in a televised debate. This morning the papers are full of snap polls, focus groups, body language experts, and graphs based on public social media posts describing the results. Graphs like the one below summarise masses of text using a technique called ‘sentiment analysis’, a form of computational language processing.* After a Twitter conversation with @benosteen and @MLBrook I thought it was worth posting about the inherent biases in the tools that create these visualisations. Ultimately, ‘sentiment analysis’ is someone’s opinion turned into code – so whose opinion are you seeing?

This is a great time to remember that sentiment analysis – mining text to see what people are talking about and how they feel about it – is based on algorithms and software libraries that were created and configured by people who’ve made a series of small, cumulative decisions that affect what we see. You can think of sentiment analysis as a sausage factory with the text of tweets as the mince going in one end, and pretty pictures as the product coming out the other end. A healthy democracy needs the list of secret ingredients added during processing, not least because this election prominently features spin rooms and party lines.

What are those ‘ingredients’? The software used for sentiment analysis is ‘trained’ on existing text, and the type of text used affects what the software assumes about the world. For example, software trained on business articles is great at recognising company names but does not do so well on content taken from museum catalogues (unless the inventor of an object went on to found a company and so entered the trained vocabulary). The algorithms used to process text change the output, as does the length of the phrase analysed. The results are riddled with assumptions about tone, intent, the demographics of the poster and more.

In the case of an election, we’d also want to know when the text used for training was created, whether it looks at previous posts by the same person, and how long the software was running over the given texts. Where was the baseline of sentiment on various topics set? Who defines what ‘neutral’ looks like to an algorithm?
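To make that concrete, here’s a minimal sketch in Python comparing two freely available sentiment libraries on the same tweet-length sentence. (The tools behind any given election graphic will differ; the point is that each library inherits the assumptions of its training data, so their scores regularly disagree.)

```python
# A minimal sketch: two off-the-shelf sentiment tools, one sentence, two
# different 'opinions turned into code'. Requires: pip install nltk textblob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download('vader_lexicon', quiet=True)  # VADER's built-in word list

text = "The debate wasn't a total disaster, but hardly a triumph either."

# VADER: a rule-based scorer tuned for social media text;
# 'compound' runs from -1 (most negative) to +1 (most positive).
vader = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]

# TextBlob: polarity from a lexicon built largely from product and movie
# reviews; also -1 to +1, but resting on different assumptions.
blob = TextBlob(text).sentiment.polarity

print(f"VADER compound:   {vader:+.3f}")
print(f"TextBlob polarity: {blob:+.3f}")
```

Neither number is ‘the’ sentiment of the sentence; each is one tool’s opinion, which is exactly the point.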

We should ask the same questions about visualisations and computational analysis that we’d ask about any document. The algorithmic ‘black box’ is a human construction, and just like every other text, software is written by people. Who’s paying for it? What sources did they use? If it’s an agency promoting their tools, do they mention the weaknesses and probable error rates, or gloss over them? If it’s a political party (or a company owned by someone associated with a party), have they been scrupulous in weeding out bots? Do retweets count? Are some posters weighted more heavily? Which visualisations were discarded, and how did various news outlets choose the visualisations they featured? Which parties are left out?

It matters because all software has biases and, as Brandwatch say, ‘social media will have a significant role in deciding the outcome of the general election’. And finally, as always, who’s not represented in the dataset?

* If you already know this, hopefully you’ll know the rest too. This post is deliberately light on technical detail but feel free to add more detailed information in the comments.

Creating simple graphs with Excel’s Pivot Tables and Tate’s artist data

I’ve been playing with Tate’s collections data while preparing for a workshop on data visualisation. On the day I’ll probably use Google Fusion Tables as an example, but I always like to be prepared, so I’ve put together a short exercise for creating simple graphs in Excel as an alternative.

The advantage of Excel is that you don’t need to be online, your data isn’t shared, and for many people, gaining additional skills in Excel might be more useful than learning the latest shiny web tool. PivotTables are incredibly useful for summarising data, so it’s worth trying them even if you’re not interested in visualisations. Pivot tables let you run basic functions – summing, averaging, grouping, etc – on spreadsheet data. If you’ve ever wanted spreadsheets to be as powerful as databases, pivot tables can help. I could create a pivot table then create a chart from it, but Excel has an option to create a pivot chart directly that’ll also create a pivot table for you to see how it works.

For this exercise, you will need Excel and a copy of the sample data: tate_artist_data_cleaned_v1_groupedbybirthyearandgender.xlsx
(A plain text CSV version is also available for broader compatibility: tate_artist_data_cleaned_v1_groupedbybirthyearandgender.csv.)

Work out what data you’re interested in

In this example, I’m interested in when the artists in Tate’s collection were born, and the overall gender mix of the artists represented. To make it easier to see what’s going on, I’ve copied those two columns of data from the original ‘artists’ file into a new spreadsheet. As a row-by-row list of births, these columns aren’t ideal for charting as they are, so I want a count of artists per year, broken down by gender.
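If you’d rather script the same summary than point and click, here’s a minimal sketch in Python with pandas, assuming the CSV version of the file mentioned above with one row per artist and the ‘yearofBirth’ and ‘gender’ columns used in this exercise.

```python
# A minimal pandas equivalent of the PivotChart built in this exercise:
# count artists per birth year, broken down by gender.
import pandas as pd
import matplotlib.pyplot as plt

# The plain-text CSV version of the workshop file, as linked above
df = pd.read_csv("tate_artist_data_cleaned_v1_groupedbybirthyearandgender.csv")

# pivot_table with aggfunc='size' counts rows per (year, gender) pair,
# which is what Excel's 'Count of gender' does.
counts = df.pivot_table(index="yearofBirth", columns="gender",
                        aggfunc="size", fill_value=0)

counts.plot()  # one line per gender across roughly 500 years of births
plt.ylabel("Number of artists born")
plt.show()
```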

Insert PivotChart

On the ‘Insert’ menu, click on PivotTable to open the menu and display the option for PivotCharts.
[Screenshot: the Insert PivotChart option]

Excel will select our columns as being the most likely thing we want to chart. That all looks fine to me so click ‘OK’.

[Screenshot: clicking ‘OK’]

Configure the PivotChart

This screen asking you to ‘choose fields from the PivotTable Field List’ might look scary, but we’ve only got two columns of data so you can’t really go wrong.
[Screenshot: choosing fields from the PivotTable Field List]

The columns have already been added to the PivotTable Field List on the right, so go ahead and tick the boxes next to ‘gender’ and ‘yearofBirth’. Excel will probably put them straight into the ‘Axis Fields’ box.

Leave yearofBirth under Axis Fields and drag ‘gender’ over to the ‘Values’ box next to it. Excel automatically turns it into ‘Count of gender’, assuming that we want to count the number of births per year.

The final task is to drag ‘gender’ down from the PivotTable Field List to ‘Legend Fields’ to create a key for which colours represent which gender. You should now see the pivot table representing the calculated values on the left and a graph in the middle.

[Screenshot: close-up of the pivot fields]


When you click off the graph, the PivotTable options disappear – just click on the graph or the data again to bring them up.

[Screenshot: the resulting pivot table and chart]

You’ve made your first pivot chart!

You might want to drag it out a bit so the values aren’t so squished. Tate’s data covers about 500 years so there’s a lot to fit in.

Now you’ve made a pivot chart, have a play – if you get into a mess you can always start again!

Colophon: the screenshots are from Excel 2010 for Windows because that’s what I have.

About the data: this data was originally supplied by Tate. The full version includes name, date of birth, place of birth, year of death, place of death and the URL of each artist’s page on Tate’s website. The latest versions of their data can be downloaded from http://www.tate.org.uk/about/our-work/digital/collection-data and the source data for this file can be downloaded from https://github.com/tategallery/collection/blob/master/artist_data.csv. This version was simplified so it only contains a list of years of birth and the gender of the artist. Some blank values for gender were filled in based on the artist’s name or a quick web search; groups of artists or artists of unknown gender were removed, as were rows without a birth year. This data was prepared in March 2015 for a British Library course on ‘Data Visualisation for Analysis in Scholarly Research’ by Mia Ridge.

I’d love to hear if you found this useful or have any suggestions for tweaks.

My 25 most popular posts on Open Objects / Open Objects has moved

[Image: sign saying ‘Move on, nothing to see here’]
Time to move on…

After nine years with Blogger/Blogspot, the little niggles have become too much, and I’ve moved Open Objects over to a self-hosted WordPress blog. Once the dust has settled (and posts have been indexed by search engines) I’ll set up a refresh on Blogger so you’ll end up here automatically. If you’ve been redirected, you can use the search box to find specific posts.

Or check out the 25 most popular posts (since 2008, when I added stats):

And two bonus favourite posts:

A New Year’s resolution for start-ups, PRs and journalists writing about museums

[Image: some technology in a museum]

Dear journalists, start-ups, agencies and PR folk,

I get that you want to talk about how amazing some new app, product or company is, but can you please do so without resorting to lazy, outdated cliches?

I’ve seen far too many articles make un-evidenced claims like ‘museums don’t realise people have different preferences in their galleries’ or that museums are ‘repeatedly turning a blind eye to technology, rather than recognizing it could be used to deliver an experience unique to every visitor’. If your app, product or company is good enough, you shouldn’t need to do the ‘competition’ down to stand out, and besides, sometimes my eyes hurt from rolling so hard.

I know that traditionally everyone makes New Year’s resolutions for themselves, but in the spirit of disruption (ha! not really) I’d like to suggest a New Year’s resolution for you: leave those cliches about dusty old museums behind and find out what people in your city love about their museums. Find a new angle for your piece, one that recognises that museums don’t always get it right but that they’ve probably been thinking about the best uses of technology for their audiences longer than you have.

Museums have been experimenting with new technologies for decades. The post-2008 financial cuts might have reduced the number of digital pilot projects across the sector as a whole but most museums are still investing in improving the visitor experience, engaging wider audiences and making a difference in the lives of their communities. You probably don’t need to lecture them on what they could be doing – they already know, and wish they had more resources to do cool things.

You could even check out past papers and discussions at conferences and groups like the Museum Computer Network (MCN), Museums and the Web, the Museums Computer Group (MCG), MuseumNext, the Visitor Studies Group (VSG), the many fantastic museum technology, design and audience research blogs, the #musetech hashtag (when agencies aren’t spamming it) and much, much more if you wanted some inspiration or to learn what’s been tried in the past and how it worked out…

Yours in museums,

Mia

The rise of interpolated content?

One thing that might stand out when we look back at 2014 is the rise of interpolated content. We’ve become used to translating around auto-correct errors in texts and emails but we seem to be at a tipping point where software is going ahead and rewriting content rather than prompting you to notice and edit things yourself.

iOS doesn’t just highlight or fix typos, it changes the words you’ve typed. To take one example, iOS users might use ‘ill’ more than they use ‘ilk’, but if I type ‘ilk’ I’m not happy when it’s replaced by an algorithmically-determined ‘ill’. As a side note, understanding the effect of auto-correct on written messages will be a challenge for future historians (much as it is for us sometimes now).

And it’s not only text. In 2014, Adobe previewed GapStop, ‘a new video technology that eases transitions and removes pauses from video automatically’. It’s not just editing out pauses, it’s creating filler images from existing images to bridge the gaps so the image doesn’t jump between cuts. It makes it a lot harder to tell when someone’s words have been edited to say something different to what they actually said – again, editing audio and video isn’t new, but making it so easy to remove the artefacts that previously provided clues to the edits is.

Photoshop has long let you edit the contrast and tone in images, but now its Content-Aware Move, Fill and Patch tools can seamlessly add, move or remove content from images, making it easy to create ‘new’ historical moments. The images on extrapolated-art.com, which uses ‘[n]ew techniques in machine learning and image processing […] to extrapolate the scene of a painting to see what the full scenery might have looked like’, show the same techniques applied to classic paintings.

But photos have been manipulated since they were first used, so what’s new? As one Google user reported in It’s Official: AIs are now re-writing history, ‘Google’s algorithms took the two similar photos and created a moment in history that never existed, one where my wife and I smiled our best (or what the algorithm determined was our best) at the exact same microsecond, in a restaurant in Normandy.’ The important difference here is that he did not create this new image himself: Google’s scripts did, without asking or specifically notifying him. In twenty years’ time, this fake image may become part of his ‘memory’ of the day. Automatically generated content like this also takes the question of intent entirely out of the process of determining ‘real’ from interpolated content. And if software starts retrospectively ‘correcting’ images, what does that mean for our personal digital archives, for collecting institutions and for future historians?

Interventions between the act of taking a photo and posting it on social media might be one of the trends of 2015. Facebook are about to start ‘auto-enhancing’ your photos, and apparently, Facebook Wants To Stop You From Uploading Drunk Pictures Of Yourself. The stated reason is to save your mum and boss from seeing them; the alternative path of building a social network that doesn’t show everything you do to your mum and boss was lost long ago. Would the world be a better place if Facebook or Twitter had a ‘this looks like an ill-formed rant, are you sure you want to post it?’ function?

So 2014 seems to have brought the removal of human agency from the process of enhancing, and even creating, text and images. Algorithms writing history? Where do we go from here? How will we deal with the increase of interpolated content when looking back at this time? I’d love to hear your thoughts.

Three ways you can help with ‘In their own words: collecting experiences of the First World War’ (and a CENDARI project update)

Somehow it’s a month since I posted about my CENDARI research project (in Moving forward: modelling and indexing WWI battalions) on this site. That probably reflects the rhythm of the project – less trying to work out what I want to do and more getting on with doing it. A draft post I started last month simply said, ‘A lot of battalions were involved in World War One’. I’ll do a retrospective post soon, and here’s a quick summary of on-going work.

First, a quick recap. My project has two goals – one, to collect a personal narrative for each battalion in the Allied armies of the First World War; two, to create a service that would allow someone to ask ‘where was a specific battalion at a specific time?’. Together, they help address a common situation for people new to WWI history who might ask something like ‘I know my great-uncle was in the 27th Australian battalion in March 1916, where would he have been and what would he have experienced?’.

I’ve been working on streamlining and simplifying the public-facing task of collecting a personal narrative for each battalion, and have written a blog post, Help collect soldiers’ experiences of WWI in their own words, that reduces it to three steps:

  1. Take one of the diaries, letters and memoirs listed on the Collaborative Collections wiki,
  2. Match its author with a specific regiment or battalion, and
  3. Send in the results via this form.

If you know of a local history society, family historian or anyone else who might be interested in helping, please send them along to this post: Help collect soldiers’ experiences of WWI in their own words.

Work on specifying the relevant data structures to support a look-up service to answer questions about a specific unit’s location and activities at a specific time largely moved to the wiki:

You can see the infobox structures in progress by flipping from the Talk to the Template tabs. You’ll need to request an account to join in, but more views, sample data and edge cases would be really welcome.

Populating the list of battalions and other units has been a huge task in itself, partly because very few cultural institutions have definitive lists of units they can (or want to) share, but it’s necessary to support both core goals. I’ve been fortunate to have help (see ‘Thanks and recent contributions’ on ‘How you can help‘) but the task is on-going so get in touch if you can help!

So there are three different ways you can help with ‘In their own words: collecting experiences of the First World War’:

Finally, last week I was in New Zealand to give a keynote on this work at the National Digital Forum. The video for ‘Collaborative collections through a participatory commons’ is online, so you can catch up on the background for my project if you’ve got 40 minutes or so to spare. Should you be in Dublin, I’m giving a talk on ‘A pilot with public participation in historical research: linking lived experiences of the First World War’ at the Trinity Long Room Hub today (thus the poster).

And if you’ve made it this far, perhaps you’d like to apply for a CENDARI Visiting Research Fellowship 2015 yourself?

All the things I didn’t say in my welcome to UKMW14 ‘Museums beyond the web’…

Here are all the things I (probably) didn’t say in my Chair’s welcome for the Museums Computer Group annual conference… Other notes, images and tweets from the day are linked from ‘UKMW14 round-up: posts, tweets, slides and images‘.

Welcome to MCG’s UKMW14: Museums beyond the web! We’ve got great speakers lined up, and we’ve built in lots of time to catch up and get to know your peers, so we hope you’ll enjoy the day.

It’s ten years since the MCG’s Museums on the Web became an annual event, and it’s 13 years since it was first run in 2001. It feels like a lot has changed since then, but, while the future is very definitely here, it’s also definitely not evenly distributed across the museum sector. It’s also an interesting moment for the conference, as ‘the web’ has broadened to include ‘digital’, which in turn spans giant distribution networks and tiny wearable devices. ‘The web’ has become a slightly out-dated shorthand term for ‘audience-facing technologies’.

When looking back over the last ten years of programmes, I found myself thinking about planetary orbits. Small planets closest to the sun whizz around quickly, while the big gas giants move incredibly slowly. If technology start-ups are like Mercury, completing a year in just 88 Earth days, and our audiences are firmly on Earth time, museum time might be a bit closer to Mars, taking two Earth years for each Mars year, or sometimes even Jupiter, completing a circuit once every twelve years or so.

But museums aren’t planets, so I can only push that metaphor so far. Different sections of a museum move at different speeds. While heroic front of house staff can observe changes in audience behaviours on a daily basis and social media platforms can be adopted overnight, websites might be redesigned every few years, but galleries are only updated every few decades (if you’re lucky). For a long time it felt like museums were using digital platforms to broadcast at audiences without really addressing the challenges of dialogue or collaborating with external experts.

But at this point, it seems that, finally, working on digital platforms like the web has pushed museums to change how they work. On a personal level, the need for specific technical skills hasn’t changed, but more content, education and design jobs work across platforms, are consciously ‘multi-channel’ and audience- rather than platform-centred in their focus. Web teams seem to be settling into public engagement, education, marketing etc departments as the idea of a ‘digital’ department slowly becomes an oxymoron. Frameworks from software development are slowly permeating organisations that used to think in terms of print runs and physical gallery construction. Short rounds of agile development are replacing the ‘build and abandon after launch’ model, voices from a range of departments are replacing the disembodied expert voice, and catalogues are becoming publications that change over time.

While many of us here are comfortable with these webby methods, how will we manage the need to act as translators between digital and museums while understanding the impact of new technologies? And how can we help those who are struggling to keep up, particularly with the impact of the cuts?

Today is a chance to think about the technologies that will shape the museums of the future. What will audiences want from us? Where will they go looking for information and expertise, and how much of that information and expertise should be provided by museums? How can museums best provide access to their collections and knowledge over the next five, ten years?

We’re grateful to our sponsors, particularly as their support helps keep ticket prices affordable. Firstly I’d like to thank our venue sponsors, the Natural History Museum. Secondly, I’d like to thank Faversham & Moss for their sponsorship of this conference. Go chat to them and find out more about their work!

Moving forward: modelling and indexing WWI battalions

A super-quick update from my CENDARI Fellowship this week. I set up the wiki for In their own words: linking lived experiences of the First World War a week ago but only got stuck into populating it with lists of various national battalions this week. My current task list, copied from the front page, is to:

If you can help with any of that, let me know! Or just get stuck in and edit the site.

I’ve started another Google Doc with very sketchy Notes towards modelling information about World War One Battalions. I need to test it with more battalion histories and update it iteratively. At this stage my thinking is to turn it into an InfoBox format to create structured data via the wiki. It’s all very lo-fi and much less designed than my usual projects, but I’m hoping people will be able to help regardless.

So, in this phase of the project, the aim is to find a personal narrative – a diary, letters, memoirs or images – for each military unit in the British Army. Can you help?

Looking for (crowdsourcing) love in all the right places

One of the most important exercises in the crowdsourcing workshops I run is the ‘speed dating’ session. The idea is to spend some time looking at a bunch of crowdsourcing projects until you find a project you love. Finding a project you enjoy gives you a deeper insight into why other people participate in crowdsourcing, and will see you through the work required to get a crowdsourcing project going. I think making a personal connection like this helps reduce some of the cynicism I occasionally encounter about why people would volunteer their time to help cultural heritage collections. Trying lots of projects also gives you a much better sense of the types of barriers projects can accidentally put in the way of participation. It’s also a good reminder that everyone is a nerd about something, and that there’s a community of passion for every topic you can think of.

If you want to learn more about designing history or cultural heritage crowdsourcing projects, trying out lots of projects is a great place to start. The more time you can spend on this the better – an hour is ideal – but trying just one or two projects is better than nothing. In a workshop I get people to note how a project made them feel – what they liked most and least about a project, and who they’d recommend it to. You can also note the input and output types to help build your mental database of relevant crowdsourcing projects.

The list of projects I suggest varies according to the background of workshop participants, and I’ll often throw in suggestions tailored to specific interests, but here’s a generic list to get you started.

10 Most Wanted http://10most.org.uk/ Research object histories
Ancient Lives http://ancientlives.org/ Humanities, language, text transcription
British Library Georeferencer http://www.bl.uk/maps/ Locating and georeferencing maps (warning: if it’s running, only hard maps may be left!)
Children of the Lodz Ghetto http://online.ushmm.org/lodzchildren/ Citizen history, research
Describe Me http://describeme.museumvictoria.com.au/ Describe objects
DIY History http://diyhistory.lib.uiowa.edu/ Transcribe historical letters, recipes, diaries
Family History Transcription Project http://www.flickr.com/photos/statelibrarync/collections/ Document transcription (Flickr/Yahoo login required to comment)
Herbaria@home http://herbariaunited.org/atHome/ (for bonus points, compare it with Notes from Nature https://www.zooniverse.org/project/notes_from_nature) Transcribing specimen sheets (or biographical research)
HistoryPin Year of the Bay ‘Mysteries’ https://www.historypin.org/attach/project/22-yearofthebay/mysteries/index/ Help find dates, locations, titles for historic photographs; overlay images on StreetView
iSpot http://www.ispotnature.org/ Help ‘identify wildlife and share nature’
Letters of 1916 http://dh.tcd.ie/letters1916/ Transcribe letters and/or contribute letters
London Street Views 1840 http://crowd.museumoflondon.org.uk/lsv1840/ Help transcribe London business directories
Micropasts http://crowdsourced.micropasts.org/app/photomasking/newtask Photo-masking to help produce 3D objects; also structured transcription
Museum Metadata Games: Dora http://museumgam.es/dora/ Tagging game with cultural heritage objects (my prototype from 2010)
NYPL Building Inspector http://buildinginspector.nypl.org/ A range of tasks, including checking building footprints, entering addresses
Operation War Diary http://operationwardiary.org/ Structured transcription of WWI unit diaries
Papers of the War Department http://wardepartmentpapers.org/ Document transcription
Planet Hunters http://planethunters.org/ Citizen science; review visualised data
Powerhouse Museum Collection Search http://www.powerhousemuseum.com/collection/database/menu.php Tagging objects
Reading Experience Database http://www.open.ac.uk/Arts/RED/ Text selection, transcription, description.
Smithsonian Digital Volunteers: Transcription Center https://transcription.si.edu/ Text transcription
Tiltfactor Metadata Games http://www.metadatagames.org/ Games with cultural heritage images
Transcribe Bentham http://www.transcribe-bentham.da.ulcc.ac.uk/ History; text transcription
Trove http://trove.nla.gov.au/newspaper?q= Correct OCR errors, transcribe text, tag or describe documents
US National Archives http://www.amara.org/en/teams/national-archives/ Transcribing videos
What’s the Score at the Bodleian http://www.whats-the-score.org/ Music and text transcription, description
What’s on the menu http://menus.nypl.org/ Structured transcription of restaurant menus
What’s on the menu? Geotagger http://menusgeo.herokuapp.com/ Geolocating historic restaurant menus
Wikisource – random item link http://en.wikisource.org/wiki/Special:Random/Index Transcribing texts
Worm Watch http://www.wormwatchlab.org Citizen science; video
Your Paintings Tagger http://tagger.thepcf.org.uk/ Paintings; free-text or structured tagging

NB: crowdsourcing is a dynamic field, some sites may be temporarily out of content or have otherwise settled in transit. Some sites require registration, so you may need to find another site to explore while you’re waiting for your registration email.

In which I am awed by the generosity of others, and have some worthy goals

[Image: Grand Canal Dock at night, Dublin]

A quick update from my CENDARI fellowship working on a project that’s becoming ‘In their own words: linking lived experiences of the First World War’. I’ve spent the week reading (again a mixture of original diaries and letters, technical stuff like ontology documentation, and also WWI history forums and ‘amateur’ sites) and writing. I put together a document outlining a range of possible goals and some very sketchy tech specs, and opened it up for feedback. The goals I set out are copied below for those who don’t want to delve into detail. The commentable document, ‘Linking lived experiences of the First World War’: possible goals and a bunch of technical questions, goes into more detail.

However, the main point of this post is to publicly thank those who’ve helped by commenting and sharing on the doc, on Twitter or via email. Hopefully I’m not forgetting anyone, as I’ve been blown away by and am incredibly grateful for the generosity of those who’ve taken the time to at least skim 1600 words (!). It’s all helped me clarify my ideas and find solutions I’m able to start implementing next week. In no order at all – at CENDARI, Jennifer Edmond, Alex O’Connor, David Stuart, Benjamin Štular, Francesca Morselli, Deirdre Byrne; online Andrew Gray @generalising; Alex Stinson @DHKState; Jason Webber @jasonmarkwebber; Alastair Dunning @alastairdunning; Ben Brumfield @benwbrum; Christine Pittsley; Owen Stephens @ostephens; David Haskiya @DavidHaskiya; Jeremy Ottevanger @jottevanger; Monika Lechner @lemondesign; Gavin Robinson @merozcursed; Tom Pert @trompet2 – thank you all!

Worthy goals (i.e. things I’m hoping to accomplish, with the help of historians and the public; only some of which I’ll manage in the time)

At the end of this project, someone who wants to research a soldier in WWI but doesn’t know a thing about how armies were structured should be able to find a personal narrative from a soldier in the same bit of the army, to help them understand experiences of the Great War.

Hopefully these personal accounts will provide some context, in their own words, for the lived experiences of WWI. Some goals listed are behind-the-scenes stuff that should just invisibly make personal diaries, letters and memoirs more easily discoverable. It needs datasets that provide structures that support relationships between people and documents; participatory interfaces for creating or enhancing information about contemporary materials (which feed into those supporting structures); and interfaces that use the data created.

More specifically, my goals include:

    • A personal account by someone in each unit linked to that unit’s record, so that anyone researching a WWI name would have at least one account to read. To populate this dataset, personal accounts (diaries, letters, etc) would need to be linked to specific soldiers, who can then be linked to specific units. Linking published accounts such as official unit histories would be a bonus. [Semantic MediaWiki]
    • Researched links between individual men and the units they served in, to allow their personal accounts to be linked to the relevant military unit. I’m hoping I can find historians willing to help with the process of finding and confirming the military unit the writer was in. [Semantic MediaWiki]
    • A platform for crowdsourcing the transcription and annotation of digitised documents. The catch is that the documents for transcription would be held remotely on a range of large and small sites, from Europeana’s collection to library sites that contain just one or two digitised diaries. Documents could be tagged/annotated with the names of people, places, events, or concepts represented in them. [Semantic MediaWiki??]
    • A structured dataset populated with the military hierarchy (probably based on The British order of battle of 1914-1918) that records the start and end dates of each parent-child relationship (an example of how much units moved within the hierarchy). A minimal sketch of what such a dated record might look like follows this list.
    • A published webpage for each unit, to hold those links to official and personal documents about that unit in WWI. In future this page could include maps, timelines and other visualisations tailored to the attributes of a unit, possibly including theatres of war, events, campaigns, battles, number of privates and officers, etc. (Possibly related to CENDARI Work Package 9?) [Semantic MediaWiki]
    • A better understanding of what people want to know at different stages of researching WWI histories. This might include formal data gathering, possibly a combination of interviews, forum discussions or surveys.
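Since the whole project pivots on the ‘where was a specific unit at a specific time?’ question, here’s a hypothetical sketch (in Python rather than Semantic MediaWiki, with invented field names and an invented example unit, purely for illustration) of how dated parent-child records could back that lookup:

```python
# A hypothetical sketch of a dated parent-child record for the military
# hierarchy dataset; the field names and example unit are illustrative only.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class UnitAssignment:
    unit: str                   # e.g. a battalion
    parent: str                 # e.g. the brigade it served under
    start: date                 # when the unit joined this parent
    end: Optional[date] = None  # None = no recorded end date

# One unit appears in several records as it moves within the hierarchy
assignments = [
    UnitAssignment("10th Battalion, Example Regiment",
                   "1st Example Brigade", date(1914, 8, 4), date(1916, 2, 1)),
    UnitAssignment("10th Battalion, Example Regiment",
                   "2nd Example Brigade", date(1916, 2, 1)),
]

def parent_on(unit: str, when: date) -> Optional[str]:
    """Answer 'which parent was this unit under at this time?'."""
    for a in assignments:
        if a.unit == unit and a.start <= when and (a.end is None or when < a.end):
            return a.parent
    return None

print(parent_on("10th Battalion, Example Regiment", date(1915, 6, 1)))
# -> 1st Example Brigade
```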


Goals that are more likely to drop off, or become quick experiments to see how far you can get with accessible tools:
    • Trained ‘named entity recognition’ and ‘natural language processing’ tools that could be run over transcribed text to suggest possible people, places, events, concepts, etc (see the quick sketch after this list) [this might drop off the list as the CENDARI project is working on a tool called Pineapple (PDF poster). That said, I’ll probably still experiment with the Stanford NER tool to see what the results are like]
    • A way of presenting possible matches from the text tools above for verification or correction by researchers. Ideally, this would be tied in with the ability to annotate documents
    • The ability to search across different repositories for a particular soldier, to help with the above.
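As a taste of that first experiment, here’s a quick sketch of off-the-shelf entity suggestion, using NLTK’s built-in chunker as a lightweight stand-in for the Stanford NER tool (which has an NLTK wrapper but needs its Java models installed separately). The sample sentence is invented.

```python
# A quick sketch of named entity recognition over a snippet of transcribed
# text, using NLTK's built-in chunker as a stand-in for Stanford NER.
# Requires: pip install nltk
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

# An invented example of a sentence from a transcribed diary or letter
text = "We marched from Albert towards Ypres with the rest of the battalion."

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags
tree = nltk.ne_chunk(tagged)    # suggested entities, returned as a tree

# Pull out the suggested entities for a researcher to verify or correct
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "GPE", "ORGANIZATION", "LOCATION"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```

The suggestions are only ever candidates; as the second goal above says, they’d need an interface where researchers can confirm or correct them.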