RDFa – Open Objects

Linking museums: machine-readable data in cultural heritage – meetup in London July 7

Somehow I've ended up organising an (very informal) event about 'Linking museums: machine-readable data in cultural heritage' on Wednesday, July 7, at a pub near Liverpool St Station. I have no real idea what to expect, but I'd love some feisty sceptics to show up and challenge people to make all these geeky acronyms work in the real museum world.

As I posted to the MCG list: "A very informal meetup to discuss 'Linking museums: machine-readable data in cultural heritage' is happening next Wednesday. I'm hoping for a good mix of people with different levels of experience and different perspectives on the issue of publishing data that can be re-used outside the institution that created it. … please do pass this on to others who may be interested. If you would like to come but can't get down to that London, please feel free to send me your questions and comments (or beer money)."

The basic details are: July 7, 2010, Shooting Star pub, London. 7:30 – 10pm-ish. More information is available at http://museum-api.pbworks.com/July-2010-meetup and you can let me know you're coming or register your interest.

In more detail…

Why?
I'm trying to cut through the chicken and egg problem – as a museum technologist, I can work towards getting machine-readable data available, but I'm not sure which formats and what data would be most useful for developers who might use it. Without a critical mass of take-up for any one type, the benefits of any one data source are more limited for developers. But museums seem to want a sense of where the critical mass is going to be so they can build for that. How do we cut through this and come up with a sensible roadmap?

Who?
You! If you're interested in using museum data in mashups but find it difficult to get started or find the data available isn't easily usable; if you have data you want to publish; if you work in a museum and have a
data publication problem you'd like help in solving; if you are a cheerleader for your favourite acronym…

Put another way, this event is for you if you're interested in publishing and sharing data about their museums and collections through technologies such as linked data and microformats.

It'll be pretty informal! I'm not sure how much we can get done but it'd be nice to put faces to names, and maybe start some discussions around the various problems that could be solved and tools that could be
created with machine-readable data in cultural heritage.

Google updates search, playing catch-up?

A quick post in case you've missed it elsewhere – whether in response to the ridiculously-titled 'Wolfram Alpha' or to Yahoo's 'open strategy' (YOS) and work on enhancing search engine results pages (SERPs) with structured data, Google have announced a "new set of features that we call Search Options, which are a collection of tools that let you slice and dice your results and generate different views to find what you need faster and easier" and "rich snippets" that "show more useful information from web pages than the preview text". Searchengineland have compared Google and Yahoo's offerings.

Update: some of the criticism rumbling on twitter yesterday has been neatly summarised by Ian Davis in 'Google's RDFa a Damp Squib':

However, a closer look reveals that Google have basically missed the point of RDFa. The RDFa support is limited to the properties and classes defined on a hastily thrown together site called data-vocabulary.org. There you will find classes for Person and Organization and properties for names and addresses, completely ignoring the millions of pieces of data using well established terms from FOAF and the like. That means everyone has to rewrite all their data to use Google's schema if they want to be featured on Google's search engine. Its like saying you have to write your pages using Google's own version of html where all the tags have slightly different spellings to be listed in their search engine!

The result is a hobbled implementation of RDFa. They've taken the worst part – the syntax – and thrown away the best – the decentralized vocabularies of terms. It's like using microformats without the one thing they do well: the simplicity.

Further, in the comments:

the point of decentralization is not to encourage fragmentation and isolation, but to allow people to collaborate without needing permission from a middleman. Google's approach imposes a centralized authority.

There's also a (slightly disingenuous, IMO) response from Google:

For Rich Snippets, Google search need to understand what the data means in order to render it appropriately. We will start incorporating existing vocabularies like FOAF, but there's no way for us to have a decent user experience for brand-new vocabularies that someone defines. We also need a single place where a webmaster can come and find all the terms that Google understands. Which is why we have data-vocabulary.org.

Isn't the point of Google that it can figure stuff out without needing to be told?

Rasmus Lerdorf on Hacking with PHP – tech talk at Open Hack London

Same deal as my first post from today's Open Hack London event – these are (very) rough notes, please let me know of clarifications, questions or comments.

Hacking with PHP, Rasmus Lerdorf

Goal of talk: copy and pastable snippets that just work so you don't have to fight to get things that work [there's not enough of this to help beginners get over that initial hump]. The slides are available at http://talks.php.net/show/openhack and these notes are probably best read as commentary alongside the code examples.

[Since it's a hack day, some] Hack ideas: fix something you use every day; build your own targeted search engine; improve the look of search results; play with semantic web tools to make the web more semantic; tell the world what kind of data you have – if a resume, use hResume or other appropriate microformats/markup; go local – tools for helping your local community; hack for good – make the world a better place.

SearchMonkey and BOSS are blending together a little bit.

What we need to learn
With PHP – enough to handle simple requests; talk to backend datastore; how to parse XML with PHP, how to generate JSON, some basic javasccript, a JavaScript utility library like YUI or jquery.

parsing XML: simpleXML_load_file() – can load entire URL or local file.

Attributes on node show up as array. Namespace attributes call children of node, name namespace as argument.

Now know how to parse XML, can get lots of other stuff.
Context extraction service, Yahoo – doesn't get enough attention. Post all text, gives you back four or five key terms – can then do an image search off them. Or match ads to webpages.

Can use get or post (curl) – usually too much for get.

PHP to JavaScript on initial page load: JSON_encode -> javascript.

Javascript to PHP (and back)
If you can figure out these six lines of code, you can write anything in the world. How every modern web application works.
Server-side php, client-side javascript.

'There's nothing to building web applications, you just have to break everything down into small enough chunks that it all becomes trivial'.

AJAX in 30 seconds.
Inline comments in code would help for people reading it without hearing the talk at the same time.

JavaScript libraries to the rescue
load maps API, create container (div) for the map, then fill it.

Form – on submit call return updateMap(); with new location.

YGeoRSS – if have GeoRSS file… can point to it.

GeoPlanet – assigns a WOE ID to a place. Locations are more than just a lat long – carry way more information. Basically gives you a foreign key. YQL is starting to make the web a giant database. Can make joins across APIs – woeid works as fk.

YQL – 'combines all the APIs on the web into a single API'.

Add a cache – nice to YQL, and also good for demos etc. Copy and paste cache function from his slides – does a local cache on URL. Hashed with md5. Using PHP streams – #defn. Adding a cache speeds up developing when hacking (esp as won't be waiting for the wifi). [This is a pretty damn good tip cos it's really useful and not immediately obvious.]

XPath on URL using PHP's OAuth extension

SearchMonkey – social engineering people into caring about semantic data on the web. For non-geeks, search plug-in mechanism that will spruce up search results page. Encourages people to add semantic data so their search result is as sexy as their competitors – so goal is that people will start adding semantic data.

'If you're doing web stuff, and don't know about microformats, and your resume doesn't have hResume, you're not getting a job with Yahoo.'

Question: how are microformats different to RDFa?
Answer: there are different types of microformats – some very specific ones, eg hResume, hCal. RDFa – adding arbitrary tags to page. even if no specific way to describe your data. But there's a standard set of mark-ups for a resume so can use that. if your data doesn't match anything at microfomats.org then use RDFa or erdf (?).

RDFa, SearchMonkey – tech talks at Open Hack London

While today's Open Hack London event is mostly about the 24-hour hackathon, I signed up just for the Tech Talks because I couldn't afford to miss a whole weekend's study in the fortnight before my exams (stupid exams). I went to the sessions on 'Guardian Data Store and APIs', 'RDFa SearchMonkey', Arduino, 'Hacking with PHP', 'BBC Backstage', Dopplr's 'mashups made of messages' and lightning talks including 'SPARQL and semantic web' stuff you can do now.

I'm putting my rough and ready notes online so that those who couldn't make it can still get some of the benefits. Apologies for any mishearings or mistakes in transcription – leave me a comment with any questions or clarifications.

One of the reasons I was going was to push my thinking about the best ways to provide API-like access to museum information and collections, so my notes will reflect that but I try to generalise where I can. And if you have thoughts on what you'd like cultural heritage institutions to do for developers, let us know! (For background, here's a lightning talk I did at another hack event on happy museums + happy developers = happy punters).

RDFa – now everyone can have an API.
Mark Birkbeck

Going to cover some basic mark-up, and talk about why RDFa is a good thing. [The slides would be useful for the syntax examples, I'll update if they go online.]

RDFa is a new syntax from W3C – a way of embedding metadata (RDF) in HTML documents using attributes.

e.g. <span property="dc:title"> – value of property is the text inside the span.

Because it's inline you don't need to point to another document to provide source of metadata and presentation HTML.

One big advance is that can provide metadata for other items e.g. images, so you can e.g. attach licence info to the image rather than page it's in – e.g. <img src="" rel="licence" resource="[creative commons licence]">

Putting RDFa into web pages means you've now got a feed (the web page is the RSS feed), and a simple static web page can become an API that can be consumed in the same way as stuff from a big expensive system. 'Growing adoption'.

Government department Central Office of Information [?] is quite big on RDFa, have a number of projects with it. [I'd come across the UK Civil Service Job Service API while looking for examples for work presentations on APIs.]

RDFa allows for flexible publishing options. If you're already publishing HTML, you can add RDFa mark-up then get flexible publishing models – different departments can keep publishing data in their own way, a central website can go and request from each of them and create its own database of e.g. jobs. Decentralised way of approaching data distribution.

Can be consumed by: smarter browsers; client-side AJAX, other servers such as SearchMonkey.

He's interested where browsers can do something with it – either enhanced browsers that could e.g. store contact info in a page into your address book; or develop JavaScript libraries that can parse page and do something with it. [screen shot of jobs data in search monkey with enhanced search results]

RDFa might be going into Drupal core.

Example of putting isbn in RDFa in page, then a parser can go through the page, pull out the triples [some explanation of them as mini db?], pull back more info about the book from other APIs e.g. Amazon – full title, thumbnail of cover. e.g. pipes.

Example of FOAF – twitter account marked up in page, can pull in tweets. Could presumably pull in newer services as more things were added, without having to re-mark-up all the pages.

Example of chemist writing a blog who mentions a chemical compound in blog post, a processor can go off and retrieve more info – e.g. add icon for mouseover info – image of molecule, or link to more info.

Next plan is to link with BOSS. Can get back RDFa from search results – augment search results with RDFa from the original page.

Search Monkey (what it is and what you can do with it)
Neil Crosby (European frontend architect for search at Yahoo).

SearchMonkey is (one of) Yahoo's open search platforms (along with BOSS). Uses structured data to enhance search results. You get to change stuff on Yahoo search results page.

SearchMonkey lets you: style results for certain URL patterns; brand those results; make the results more useful for users.

[examples of sites that have done it to see how their results look in Yahoo? I thought he mentioned IMDb but it doesn't look any different – a film search that returns a wikipedia result, OTOH, does.]

Make life better for users – not just what Yahoo thinks results should be, you can say 'actually this is the important info on the page'

Three ways to do it [to change the SERP [search engine results page]: mark up data in a way that Yahoo knows about – 'just structure your data nicely'. e.g. video mark-up; enhance a result directly; make an infobar.

Infobar – doesn't change result see immediately on the page, but it opens on the page. e.g. of auto-enhanced result- playcrafter. Link to developer start page – how to mark it up, with examples, and what it all means.

User-enhanced result – Facebook profile pages are marked up with microformats – can add as friend, poke, send message, view friends, etc from the search results page. Can change the title and abstract, add image, favicon, quicklinks, key/value pairs. Create at [link I can't see but is on slides] Displayed in screen, you fill it out on a template.

Infobar – dropdown in grey bar under results. Can do a lot more, as it's hidden in the infobar and doesn't have to worry people.

Data from: microformats, RDF, XSLT, Yahoo's index, and soon, top tags from delicious.

If no machine data, can write an XSLT. 'isn't that hard'. Lots of documentation on the web.

Examples of things that have been made – a tool that exposes all the metadata known for a page. URL on slide. can install on Yahoo search page, add it in. Use location data to make a map – any page on web with metadata about locations on it – map monkey. Get qype results for anything you search for.

There's a mailing list (people willing and wanting to answer questions) and a tutorial.

Questions

Question: do you need to use a special doctype [for RDFa]?
Answer: added to spec that 'you should use this doctype' but the spec allows for RDFa to be used in situations when can't change doctype e.g. RDFa embedded in blogger blogpost. Most parsers walk the DOM rather than relying on the doctype.

Jim O'D – excited that SearchMonkey supports XSLT – if have website with correctly marked up tables, could expose those as key/value pairs?
Answer: yes. XSLT fantastic tool for when don't have data marked up – can still get to it.

Frankie – question I couldn't hear. About info out to users?
Answer: if you've built a monkey, up to you to tell people about it for the moment. Some monkeys are auto-on e.g. Facebook, wikipedia… possibly in future, if developed a monkey for a site you own, might be able to turn it auto-on in the results for all users… not sure yet if they'll do it or not.
Frankie: plan that people get monkeys they want, or go through gallery?
Answer: would be fantastic if could work out what people are using them for and suggest ones appropriate to people doing particular kinds of searches, rather than having to go to a gallery.

The BBC, accessibility, the hCalendar microformat and RDFa

The BBC have announced (in 'Removing Microformats from bbc.co.uk/programmes') that they'll stop using the hCalendar microformat because of concerns about accessibility, specifically the use of the HTML abbreviation element (the abbr tag):

Our concerns were:

the effect on blind users using screen readers with abbreviation expansion turned on where abbreviations designed for machines would be read out

the effect on partially sighted users using screen readers where tool tips of abbreviations designed for machines would be read out

the effect of incomprehensible tooltips on users with cognitive disabilities

the potential fencing off of abbreviations to domains that need them

Until these issues are resolved the BBC semantic markup standards have been updated to prevent the use of non-human-readable text in abbreviations.

They're looking at using RDFa, which they describe as 'a slightly bigger S semantic web technology similar to microformats but without some of the more unexpected side-effects'.

Their support for RDFa is timely in light of Lee Iverson's presentation at the UK Museums on the Web conference (my notes). It's also an interesting study of what can happen when geek enthusiasm meets existing real world users.

More generally, does the fact that an organisation as big as the BBC hasn't yet produced an API mean that creating an API is not a simple task, or that the organisational issues are bigger than the technical issues?

Notes from 'UK Museums on the Web Conference 2008'

I'm back in London after
UK Museums on the Web Conference 2008 and the mashed museum day.

In the interests of getting my notes up quickly I'm putting them up pretty much 'as is', so they're still rough around the edges. I'll add links to the speaker slides when they are all online. Some photos from the two days are online – a general search for ukmw08 on Flickr will find some. I have some in a set online now, others are still to come, including some photos of slides so I'll update this as I check the text from the slides. These are my notes from the first session.

The keynote speech was given by Tom Loosemore of Ofcom on the Future of Public Service Content.

[For context, Ofcom is the 'independent regulator and competition authority for the UK communications industries' and their recently second review of public service broadcasting, 'The Digital Opportunity', caused a stir in the digital cultural heritage world for its assessment of the extent to which public sector websites delivered on 'public service purposes and characteristics'. You can read the summary or download the full report.]

'How many of you are on the main board of your institution?'

Leadership doesn't have the vision in place to take advantage of the internet.

Sees the internet as platform for public service, [most importantly] enlightenment. He's here today to enlist our help.

We view the internet through lens of expectations from the past, definitely in public service broadcasting – 'let's get our programs on the internet'.

What is value for money?

Would that other sectors did the same soul searching

[On the Ofcom review:] 'You can't really review the web, it's bonkers'

Public service characteristics to create a report card. Of the public service characteristics in the online market (high quality, original, innovative, challenging, engaging, discoverable and accessible), 'challenging' is the hardest.

Museums and cultural sector have amazing potential. What are the barriers between the people here who get it and being able to take that opportunity and redefine public service broadcasting?

It's not skills. Maybe ten years ago, not today. And it's not technology. The crucial missing link is leadership and vision, the lack of recognition by people who govern direction of institutions of the huge potential.

[Which does translate into 'more resources', eventually, but perhaps the missing gap right now is curatorial/interpretative resources? Every online project we do generates more enquiries, stretching these people further, and they don't have time to proactively create content for ad hoc projects as it is, especially as their time tends to be allocated a long time in advance.]

What's behind that reluctance, what can you do to help people on your board understand the opportunities? We can ask 'what business are we in? what's the purpose of our institution?'.

Tate recognise they're not just in the business of getting people to go to the Tate venues, they're in the business of informing people about art. Compare that to the Royal Shakespeare Company which is using its online site purely to get bums on seats.

Next opportunity… how do you take opportunity to digitise your collections and reach a whole new audience? How can you make better use of cultural objects that were previously constrained by physicalty.

What opportunities are native to the internet, can only happen there? How can it help your institution to deliver its purpose?

Recognise that you are in the (public service) media business.

How do you measure enlightenment? You could be changing the way people see the world, etc. but you need to measure it to make a case, to know whether you're succeeding. Metrics really really matter in public service arena.

BBC used to look at page views, but developers gamed the system. Then the metric was 'time online', but it stopped people thinking externally. Metric as proxy for quality.

Value = reach x quality. What kind of experience did they have?

Quality is the really hard part. As defined by BBC: quality is in the eye of the beholder. Did the user have an excellent experience?

BBC measure 'net promoter' – how likely are you to recommend this to a friend or colleague, on a scale of 1 – 10?

[But for our sector, what if you don't have any friends with the same interest in x? Would people extrapolate from their specific page on a Roman buckle to recommend the site generally?]

Throw away the 'soggy British middle' – the 7, 8s (out of ten).

Group them as Promoters (9-10/10), Passive (7-8/10), Detractors (0 – 6/10). The key measure is the difference between how many Promoters and how many Detractors. This was 'fabulously useful' at the BBC. 30% is good benchmark.

They mapped whole BBC portfolio against 'net promoters' % and reach, bubbles show cost.

It's not necessarily about reaching mass audiences. But when producing for niche audiences – they must love it, and it shouldn't cost that much.

He's telling us this because it's the language of funders, of KPIs, this is hard evidence with real people. You might use a different measure of quality but you can't talk about opportunities in abstract, must have numbers behind them.

Suggested the BBC's 15 Web Principles, including 'fall forward, fast'.

A measure of personal success for him would be that in x years when he asked 'who here is on the board of your institution, at least x should put hands up'.

[I really liked this keynote speech as a kick up the arse in case we started to get too complacent about having figured out what matters to us, as museum geeks. It doesn't count unless we can get through our organisations and get that content out to audiences in ways they can use (and re-use).]

In linking the sessions, Ross Parry mused about the legacy of 18th, 19th century ideas of how to build a museum, how would they be different if museums were created today?

Lee Iverson, How does the web connect content? "Semantic Pragmatics"
'Profoundly disagreed' with some of the things Tom was talking about, wants to have a dialogue.
He asked how many know the background to semantic web stuff? Quite a few hands were raised.

Talking about how the web works now and where it's going. Museums have significant opportunity to push things forward, but must understand possibilities and limitations.

Changing classic relationship – museum websites as face of institution to users. Huge opportunity for federating and aggregating content (between museums) – an order of magnitude better.

He's working with 13 museums, with north west native American artefacts. Communities are co-developers, virtually repatriating their (land).

Possibility to connect outside the museum. Powerhouse Museum as an excellent example of why (and how) you should connect.

Becoming connected:
Expose own data from behind presentation layers
Find other data
Integrate – creating a cohesive (situation)
Engage with users

Access to data is core business, curatorial stuff.

RDFa
Pragmatics of standards – get a sense of what it is you're doing [and start, don't try and create the system of everything first], it'll never work. Use existing standards if possible, grab chunks if you can. Never standardise what you minimally need to do to get the utility you need at the moment. Then extend, layers, version 2. A standard is an agreement between a minimum of two people [and doesn't have to be more complicated than that].

"Just do it" – make agreements, get it to work, then engage in the standardisation process.

Relationship between this and semantic web? Semantic web as 'data web'. Competing definitions.

Slide on Tim Berners-Lee on the semantic web in 1999.

Why hasn't it appeared? It's vapourware, you can't make effective standards for it.

Syntax – capability of being interpreted. Semantic – ability to interpret, and to connect interpretations.

Finding data – how much easier would it be if we could just grab the data we want directly from where we want it?

Key is relating what you're doing to what they're doing.

XML vs RDF
Semantic web built on RDF, it's designed for representing metadata. It's substantially different to XML. Lots of reaction against RDF has been reaction against XML encoding, syntactic resistance.

RDF is designed to be manipulated as data, XML is about annotating text. In XML, syntax is the thing, with RDF the data is the thing.

Grab entire XML doc before you can figure out how to smoosh then together. RDF works by reference, you can just build on it.

RDFa. A way of embedding RDF content directly in XHTML, relies on same strategies as microformats. Will be ignored by presentation oriented systems but readable by RDF parsers.

[RDF triples vs machine tags? RDF vs microformats? How RDF-like is OAI PMH?]

You can talk about things you don't have a representation for e.g. people.

Ignore the term 'ontology' – it's just a way of talking about a vocabulary.

Four steps for widespread adoption:
Promote practical applications
Develop applications now
[and the slide was gone and I missed the last two steps!]

There was also some stuff on limitations of lightweight approaches, and hermetically sealed museum data, user experiences. Also a bit on 'give away structured data' but with a good awareness of the need to keep some data private – object location and value, for example.

Ross – we've had the media context and technical context, now for the sector context.

Paul Marty, Engaging Audiences by connecting to collections online.
Vital connections…

What does it mean to say x% of your collection is online? For whom is it useful?

How to engage audiences around your collections? Not just presenting information.

Goes beyond providing access to data. Research shows audiences want engagement. Surveyed 1200 museum visitors about their requirements. [I would love to see the research] Virtuous circle between museum visits and website visits.

Build on interest, give experience that grabs people.

Romans in Sussex website – multiple museums offering collections for multiple audiences. Re-presenting same content in different ways on the fly.

Audiences
Don't just give general public a list of stuff. Give them a way to engage.

"Engaging a community around a collection is harder than providing access to data about a collection"

Photo of the week – says "What do you know about this photo? Please share your thoughts with us" But no link or instructions on how to do it. But at least they're trying…

Discussion – Tom, Lee and Paul.

"Why do you digitise collections before had need in mind?" [Because the driver is internal, not external, needs, would be the generous answer; because they could get funding to do it would be my ungenerous answer].

Tom on RDF – how seriously engaged with it to build audiences, tell stories.

BBC licence terms – couldn't re-use data for commercial purposes/at all.

Leadership need to understand opportunities because otherwise they won't support geek stuff.

Qu: terms of engagement – how is it defined?

Paul – US has made same mistakes re digitisation of collections and websites that don't have reusable data.

Participants must be involved in process from the beginning, need input at start from intended users on how it can engage them.

Fiona: why not use existing resources, go to existing sites with established audiences?

Lee: how did YouTube succeed – people were brought by embedded content. [This issue of using 'wrappers' around your content to help it go viral by being embeddable elsewhere was raised in another session too.]

Tom: letting go is how you win, but it's a profound challenge to institutions and their desire to maintain authority.