RDF – Open Objects

Tom Morris, SPARQL and semweb stuff – tech talk at Open Hack London

Tom Morris gave a lightning talk on 'How to use Semantic Web data in your hack' (aka SPARQL and semantic web stuff).

He's since posted his links and queries – excellent links to endpoints you can test queries in.

Semantic web often thought of as long-promised magical elixir, he's here to say it can be used now by showing examples of queries that can be run against semantic web services. He'll demonstrate two different online datasets and one database that can be installed on your own machine.

First – dbpedia – scraped lots of wikipedia, put it into a database. dbpedia isn't like your average database, you can't draw a UML diagram of wikipedia. It's done in RDF and Linked Data. Can be queried in a language that looks like SQL but isn't. SPARQL – is a w3c standard, they're currently working on SPARQL 2.

Go to dbpedia.org/sparql – submit query as post. [Really nice – I have a thing about APIs and platforms needing a really easy way to get you to 'hello world' and this does it pretty well.]

[Line by line comments on the syntax of the queries might be useful, though they're pretty readable as it is.]

'select thingy, wotsit where [the slightly more complicated stuff]'

Can get back results in xml, also HTML, 'spreadsheet', JSON. Ugly but readable. Typed.
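A minimal sketch of what 'submit query as post' and reading the JSON results might look like from Python, using only the standard library. The query, the helper names and the sample response are illustrative assumptions rather than anything from Tom's slides; the response shape follows the W3C SPARQL JSON results format, where each value carries a type alongside it (hence 'typed').

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"  # the endpoint mentioned in the talk

# Illustrative query (an assumption, not one of Tom's): ten things typed as a city.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?name WHERE {
  ?city a dbo:City ;
        rdfs:label ?name .
} LIMIT 10
"""


def build_request(query: str) -> Request:
    """Build (but do not send) a form-encoded POST request carrying the query."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )


def rows(result_json: str) -> list[dict[str, str]]:
    """Flatten SPARQL JSON results into one plain dict per result row."""
    doc = json.loads(result_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample response so the sketch runs without network access.
SAMPLE = json.dumps({
    "head": {"vars": ["city", "name"]},
    "results": {"bindings": [
        {
            "city": {"type": "uri", "value": "http://dbpedia.org/resource/London"},
            "name": {"type": "literal", "xml:lang": "en", "value": "London"},
        },
    ]},
})

if __name__ == "__main__":
    request = build_request(QUERY)
    print(request.get_method(), request.full_url)
    print(rows(SAMPLE))
```

To run it for real you would pass the request to urllib.request.urlopen and feed the response body to rows(); that step is left out here so nothing depends on the endpoint being up.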

[Trying a query challenge set by others could be fun way to get started learning it.]

One problem – fictional places are in Wikipedia e.g. Liberty City in Grand Theft Auto.

Libris – how library websites should be
[I never used to appreciate how much most library websites suck until I started back at uni and had to use one for more than one query every few years]

Has a query interface through SPARQL

Comment from the audience: the BBC now have a SPARQL endpoint [as of the day before? Go BBC guy!].

Playing with mulgara, open source java triple store. [mulgara looks like a kinda faceted search/browse thing] Has own query language called TQL which can do more interesting things than SPARQL. Why use it? Schemaless data storage. Is to SQL what dynamic typing is to static typing. [did he mean 'is to sparql'?]
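To make 'schemaless' a bit more concrete, here is a toy in-memory triple store in Python. It only sketches the general idea of storing subject/predicate/object facts without declaring columns up front; it says nothing about how mulgara or TQL actually work, and the class name, method names and sample facts are all made up for illustration.

```python
from typing import Optional

Triple = tuple[str, str, str]


class TinyTripleStore:
    """A set of (subject, predicate, object) facts with wildcard matching."""

    def __init__(self) -> None:
        self.triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # No schema: any predicate can be attached to any subject at any time.
        self.triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Triple]:
        """Return matching triples in sorted order; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self.triples
            if all(want is None or want == have
                   for want, have in zip(pattern, triple))
        )


if __name__ == "__main__":
    store = TinyTripleStore()
    store.add("London", "type", "City")
    store.add("London", "country", "United Kingdom")
    store.add("LibertyCity", "type", "FictionalPlace")
    store.add("LibertyCity", "appearsIn", "Grand Theft Auto")
    print(store.match(predicate="type"))
    print(store.match(subject="LibertyCity"))
```

Note that 'appearsIn' was added for one subject without touching anything else, which is roughly the dynamic-typing comparison made in the talk.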

Question from audience: how do you discover what you can query against?
Answer: dbpedia website should list the concepts they have in there. Also some documentation of categories you can look at. [Examples and documentation are so damn important for the uptake of your API/web service.]

Coming soon [?] SPARUL – update language, SPARQL2: new features

The end!

[These are more (very) rough notes from the weekend's Open Hack London event – please let me know of clarifications, questions, links or comments. My other notes from the event are tagged openhacklondon.

Quick plug: if you're a developer interested in using cultural heritage (museums, libraries, archives, galleries, archaeology, history, science, whatever) data – a bunch of cultural heritage geeks would like to know what's useful for you (more background here). You can comment on the #chAPI wiki, or tweet @miaridge (or @mia_out). Or if you work for a company that works with cultural heritage organisations, you can help us work better with you for better results for our users.]

There were other lightning talks on Pachube (pronounced 'patchbay', about trying to build the internet of things, making an API for gadgets because e.g. connecting hardware to the web is hard for small makers) and Homera (an open source 3d game engine).

RDFa, SearchMonkey – tech talks at Open Hack London

While today's Open Hack London event is mostly about the 24-hour hackathon, I signed up just for the Tech Talks because I couldn't afford to miss a whole weekend's study in the fortnight before my exams (stupid exams). I went to the sessions on 'Guardian Data Store and APIs', 'RDFa SearchMonkey', Arduino, 'Hacking with PHP', 'BBC Backstage', Dopplr's 'mashups made of messages' and lightning talks including 'SPARQL and semantic web' stuff you can do now.

I'm putting my rough and ready notes online so that those who couldn't make it can still get some of the benefits. Apologies for any mishearings or mistakes in transcription – leave me a comment with any questions or clarifications.

One of the reasons I was going was to push my thinking about the best ways to provide API-like access to museum information and collections, so my notes will reflect that but I try to generalise where I can. And if you have thoughts on what you'd like cultural heritage institutions to do for developers, let us know! (For background, here's a lightning talk I did at another hack event on happy museums + happy developers = happy punters).

RDFa – now everyone can have an API.
Mark Birbeck

Going to cover some basic mark-up, and talk about why RDFa is a good thing. [The slides would be useful for the syntax examples, I'll update if they go online.]

RDFa is a new syntax from W3C – a way of embedding metadata (RDF) in HTML documents using attributes.

e.g. <span property="dc:title"> – value of property is the text inside the span.

Because it's inline you don't need to point to another document to provide source of metadata and presentation HTML.

One big advance is that you can provide metadata for other items e.g. images, so you can e.g. attach licence info to the image rather than the page it's in – e.g. <img src="" rel="license" resource="[creative commons licence]">
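As a rough illustration of how a consumer might pull these two patterns out of a page, here is a deliberately simplified Python sketch using the standard library's html.parser. It is not a conforming RDFa processor: it ignores prefixes, the about attribute and nested elements, and the class name, function name and sample URLs are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import Optional

Triple = tuple[str, str, str]


class RdfaSketch(HTMLParser):
    """Collect triples from property="..." text and rel/resource attributes."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.triples: list[Triple] = []
        self._open_property: Optional[str] = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "property" in attributes:
            # The element's text becomes the value (assumes no nested elements).
            self._open_property = attributes["property"]
            self._text = []
        if "rel" in attributes and "resource" in attributes:
            # For an image, the statement is about the image, not the page.
            subject = attributes.get("src") or self.page_url
            self.triples.append(
                (subject, attributes["rel"], attributes["resource"])
            )

    def handle_data(self, data):
        if self._open_property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_property is not None:
            value = "".join(self._text).strip()
            self.triples.append((self.page_url, self._open_property, value))
            self._open_property = None


def extract_triples(html: str, page_url: str) -> list[Triple]:
    parser = RdfaSketch(page_url)
    parser.feed(html)
    parser.close()
    return parser.triples


SAMPLE_HTML = """
<p><span property="dc:title">Open Hack London notes</span></p>
<img src="http://example.org/photo.jpg" rel="license" resource="http://creativecommons.org/licenses/by/3.0/">
"""

if __name__ == "__main__":
    for triple in extract_triples(SAMPLE_HTML, "http://example.org/notes"):
        print(triple)
```

The point is only that the metadata travels inside the ordinary page: the same HTML a browser renders is what the parser reads.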

Putting RDFa into web pages means you've now got a feed (the web page is the RSS feed), and a simple static web page can become an API that can be consumed in the same way as stuff from a big expensive system. 'Growing adoption'.

Government department Central Office of Information [?] is quite big on RDFa, have a number of projects with it. [I'd come across the UK Civil Service Job Service API while looking for examples for work presentations on APIs.]

RDFa allows for flexible publishing options. If you're already publishing HTML, you can add RDFa mark-up then get flexible publishing models – different departments can keep publishing data in their own way, a central website can go and request from each of them and create its own database of e.g. jobs. Decentralised way of approaching data distribution.

Can be consumed by: smarter browsers; client-side AJAX, other servers such as SearchMonkey.

He's interested where browsers can do something with it – either enhanced browsers that could e.g. store contact info in a page into your address book; or develop JavaScript libraries that can parse page and do something with it. [screen shot of jobs data in search monkey with enhanced search results]

RDFa might be going into Drupal core.

Example of putting isbn in RDFa in page, then a parser can go through the page, pull out the triples [some explanation of them as mini db?], pull back more info about the book from other APIs e.g. Amazon – full title, thumbnail of cover. e.g. pipes.

Example of FOAF – twitter account marked up in page, can pull in tweets. Could presumably pull in newer services as more things were added, without having to re-mark-up all the pages.

Example of chemist writing a blog who mentions a chemical compound in blog post, a processor can go off and retrieve more info – e.g. add icon for mouseover info – image of molecule, or link to more info.

Next plan is to link with BOSS. Can get back RDFa from search results – augment search results with RDFa from the original page.

Search Monkey (what it is and what you can do with it)
Neil Crosby (European frontend architect for search at Yahoo).

SearchMonkey is (one of) Yahoo's open search platforms (along with BOSS). Uses structured data to enhance search results. You get to change stuff on Yahoo search results page.

SearchMonkey lets you: style results for certain URL patterns; brand those results; make the results more useful for users.

[examples of sites that have done it to see how their results look in Yahoo? I thought he mentioned IMDb but it doesn't look any different – a film search that returns a wikipedia result, OTOH, does.]

Make life better for users – not just what Yahoo thinks results should be, you can say 'actually this is the important info on the page'

Three ways to do it [to change the SERP (search engine results page)]: mark up data in a way that Yahoo knows about – 'just structure your data nicely'. e.g. video mark-up; enhance a result directly; make an infobar.

Infobar – doesn't change the result you see immediately on the page, but it opens on the page. e.g. of auto-enhanced result – playcrafter. Link to developer start page – how to mark it up, with examples, and what it all means.

User-enhanced result – Facebook profile pages are marked up with microformats – can add as friend, poke, send message, view friends, etc from the search results page. Can change the title and abstract, add image, favicon, quicklinks, key/value pairs. Create at [link I can't see but is on slides] Displayed in screen, you fill it out on a template.

Infobar – dropdown in grey bar under results. Can do a lot more, as it's hidden in the infobar and doesn't have to worry people.

Data from: microformats, RDF, XSLT, Yahoo's index, and soon, top tags from delicious.

If no machine data, can write an XSLT. 'isn't that hard'. Lots of documentation on the web.

Examples of things that have been made – a tool that exposes all the metadata known for a page. URL on slide. can install on Yahoo search page, add it in. Use location data to make a map – any page on web with metadata about locations on it – map monkey. Get qype results for anything you search for.

There's a mailing list (people willing and wanting to answer questions) and a tutorial.

Questions

Question: do you need to use a special doctype [for RDFa]?
Answer: added to spec that 'you should use this doctype' but the spec allows for RDFa to be used in situations when can't change doctype e.g. RDFa embedded in blogger blogpost. Most parsers walk the DOM rather than relying on the doctype.

Jim O'D – excited that SearchMonkey supports XSLT – if have website with correctly marked up tables, could expose those as key/value pairs?
Answer: yes. XSLT fantastic tool for when don't have data marked up – can still get to it.

Frankie – question I couldn't hear. About info out to users?
Answer: if you've built a monkey, up to you to tell people about it for the moment. Some monkeys are auto-on e.g. Facebook, wikipedia… possibly in future, if developed a monkey for a site you own, might be able to turn it auto-on in the results for all users… not sure yet if they'll do it or not.
Frankie: plan that people get monkeys they want, or go through gallery?
Answer: would be fantastic if could work out what people are using them for and suggest ones appropriate to people doing particular kinds of searches, rather than having to go to a gallery.

Notes from 'UK Museums on the Web Conference 2008'

I'm back in London after UK Museums on the Web Conference 2008 and the mashed museum day.

In the interests of getting my notes up quickly I'm putting them up pretty much 'as is', so they're still rough around the edges. I'll add links to the speaker slides when they are all online. Some photos from the two days are online – a general search for ukmw08 on Flickr will find some. I have some in a set online now, others are still to come, including some photos of slides so I'll update this as I check the text from the slides. These are my notes from the first session.

The keynote speech was given by Tom Loosemore of Ofcom on the Future of Public Service Content.

[For context, Ofcom is the 'independent regulator and competition authority for the UK communications industries' and their recent second review of public service broadcasting, 'The Digital Opportunity', caused a stir in the digital cultural heritage world for its assessment of the extent to which public sector websites delivered on 'public service purposes and characteristics'. You can read the summary or download the full report.]

'How many of you are on the main board of your institution?'

Leadership doesn't have the vision in place to take advantage of the internet.

Sees the internet as platform for public service, [most importantly] enlightenment. He's here today to enlist our help.

We view the internet through lens of expectations from the past, definitely in public service broadcasting – 'let's get our programs on the internet'.

What is value for money?

Would that other sectors did the same soul searching

[On the Ofcom review:] 'You can't really review the web, it's bonkers'

Public service characteristics to create a report card. Of the public service characteristics in the online market (high quality, original, innovative, challenging, engaging, discoverable and accessible), 'challenging' is the hardest.

Museums and cultural sector have amazing potential. What are the barriers between the people here who get it and being able to take that opportunity and redefine public service broadcasting?

It's not skills. Maybe ten years ago, not today. And it's not technology. The crucial missing link is leadership and vision, the lack of recognition by people who govern direction of institutions of the huge potential.

[Which does translate into 'more resources', eventually, but perhaps the missing gap right now is curatorial/interpretative resources? Every online project we do generates more enquiries, stretching these people further, and they don't have time to proactively create content for ad hoc projects as it is, especially as their time tends to be allocated a long time in advance.]

What's behind that reluctance, what can you do to help people on your board understand the opportunities? We can ask 'what business are we in? what's the purpose of our institution?'.

Tate recognise they're not just in the business of getting people to go to the Tate venues, they're in the business of informing people about art. Compare that to the Royal Shakespeare Company which is using its online site purely to get bums on seats.

Next opportunity… how do you take opportunity to digitise your collections and reach a whole new audience? How can you make better use of cultural objects that were previously constrained by physicality?

What opportunities are native to the internet, can only happen there? How can it help your institution to deliver its purpose?

Recognise that you are in the (public service) media business.

How do you measure enlightenment? You could be changing the way people see the world, etc. but you need to measure it to make a case, to know whether you're succeeding. Metrics really really matter in public service arena.

BBC used to look at page views, but developers gamed the system. Then the metric was 'time online', but it stopped people thinking externally. Metric as proxy for quality.

Value = reach x quality. What kind of experience did they have?

Quality is the really hard part. As defined by BBC: quality is in the eye of the beholder. Did the user have an excellent experience?

BBC measure 'net promoter' – how likely are you to recommend this to a friend or colleague, on a scale of 0 – 10?

[But for our sector, what if you don't have any friends with the same interest in x? Would people extrapolate from their specific page on a Roman buckle to recommend the site generally?]

Throw away the 'soggy British middle' – the 7, 8s (out of ten).

Group them as Promoters (9-10/10), Passive (7-8/10), Detractors (0 – 6/10). The key measure is the difference between the percentage of Promoters and the percentage of Detractors. This was 'fabulously useful' at the BBC. 30% is a good benchmark.
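The arithmetic is simple enough to sketch in a few lines of Python. The function name and the ten sample ratings are made up for illustration; only the groupings and the promoters-minus-detractors calculation come from the talk.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Percentage of Promoters (9-10) minus percentage of Detractors (0-6)."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for rating in ratings if rating >= 9)
    detractors = sum(1 for rating in ratings if rating <= 6)
    # Passives (7-8) count towards the total but towards neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


if __name__ == "__main__":
    sample = [10, 9, 9, 8, 7, 7, 6, 3, 10, 9]  # invented survey responses
    print(net_promoter_score(sample))  # 5 promoters, 2 detractors -> 30.0
```

Because the 7s and 8s still sit in the denominator, a site full of lukewarm visitors scores close to zero however few people actively dislike it.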

They mapped whole BBC portfolio against 'net promoters' % and reach, bubbles show cost.

It's not necessarily about reaching mass audiences. But when producing for niche audiences – they must love it, and it shouldn't cost that much.

He's telling us this because it's the language of funders, of KPIs, this is hard evidence with real people. You might use a different measure of quality but you can't talk about opportunities in abstract, must have numbers behind them.

Suggested the BBC's 15 Web Principles, including 'fall forward, fast'.

A measure of personal success for him would be that in x years, when he asks 'who here is on the board of your institution?', at least x people put their hands up.

[I really liked this keynote speech as a kick up the arse in case we started to get too complacent about having figured out what matters to us, as museum geeks. It doesn't count unless we can get through our organisations and get that content out to audiences in ways they can use (and re-use).]

In linking the sessions, Ross Parry mused about the legacy of 18th, 19th century ideas of how to build a museum, how would they be different if museums were created today?

Lee Iverson, How does the web connect content? "Semantic Pragmatics"
'Profoundly disagreed' with some of the things Tom was talking about, wants to have a dialogue.
He asked how many know the background to semantic web stuff? Quite a few hands were raised.

Talking about how the web works now and where it's going. Museums have significant opportunity to push things forward, but must understand possibilities and limitations.

Changing classic relationship – museum websites as face of institution to users. Huge opportunity for federating and aggregating content (between museums) – an order of magnitude better.

He's working with 13 museums, with north west native American artefacts. Communities are co-developers, virtually repatriating their (land).

Possibility to connect outside the museum. Powerhouse Museum as an excellent example of why (and how) you should connect.

Becoming connected:
Expose own data from behind presentation layers
Find other data
Integrate – creating a cohesive (situation)
Engage with users

Access to data is core business, curatorial stuff.

RDFa
Pragmatics of standards – get a sense of what it is you're doing [and start, don't try and create the system of everything first], it'll never work. Use existing standards if possible, grab chunks if you can. Never standardise more than you minimally need to get the utility you need at the moment. Then extend, layers, version 2. A standard is an agreement between a minimum of two people [and doesn't have to be more complicated than that].

"Just do it" – make agreements, get it to work, then engage in the standardisation process.

Relationship between this and semantic web? Semantic web as 'data web'. Competing definitions.

Slide on Tim Berners-Lee on the semantic web in 1999.

Why hasn't it appeared? It's vapourware, you can't make effective standards for it.

Syntax – capability of being interpreted. Semantic – ability to interpret, and to connect interpretations.

Finding data – how much easier would it be if we could just grab the data we want directly from where we want it?

Key is relating what you're doing to what they're doing.

XML vs RDF
Semantic web built on RDF, it's designed for representing metadata. It's substantially different to XML. Lots of reaction against RDF has been reaction against XML encoding, syntactic resistance.

RDF is designed to be manipulated as data, XML is about annotating text. In XML, syntax is the thing, with RDF the data is the thing.

With XML you have to grab the entire doc before you can figure out how to smoosh them together. RDF works by reference, you can just build on it.

RDFa. A way of embedding RDF content directly in XHTML, relies on same strategies as microformats. Will be ignored by presentation oriented systems but readable by RDF parsers.

[RDF triples vs machine tags? RDF vs microformats? How RDF-like is OAI PMH?]

You can talk about things you don't have a representation for e.g. people.

Ignore the term 'ontology' – it's just a way of talking about a vocabulary.

Four steps for widespread adoption:
Promote practical applications
Develop applications now
[and the slide was gone and I missed the last two steps!]

There was also some stuff on limitations of lightweight approaches, and hermetically sealed museum data, user experiences. Also a bit on 'give away structured data' but with a good awareness of the need to keep some data private – object location and value, for example.

Ross – we've had the media context and technical context, now for the sector context.

Paul Marty, Engaging Audiences by connecting to collections online.
Vital connections…

What does it mean to say x% of your collection is online? For whom is it useful?

How to engage audiences around your collections? Not just presenting information.

Goes beyond providing access to data. Research shows audiences want engagement. Surveyed 1200 museum visitors about their requirements. [I would love to see the research] Virtuous circle between museum visits and website visits.

Build on interest, give experience that grabs people.

Romans in Sussex website – multiple museums offering collections for multiple audiences. Re-presenting same content in different ways on the fly.

Audiences
Don't just give general public a list of stuff. Give them a way to engage.

"Engaging a community around a collection is harder than providing access to data about a collection"

Photo of the week – says "What do you know about this photo? Please share your thoughts with us" But no link or instructions on how to do it. But at least they're trying…

Discussion – Tom, Lee and Paul.

"Why do you digitise collections before had need in mind?" [Because the driver is internal, not external, needs, would be the generous answer; because they could get funding to do it would be my ungenerous answer].

Tom on RDF – how seriously engaged with it to build audiences, tell stories.

BBC licence terms – couldn't re-use data for commercial purposes/at all.

Leadership need to understand opportunities because otherwise they won't support geek stuff.

Qu: terms of engagement – how is it defined?

Paul – US has made same mistakes re digitisation of collections and websites that don't have reusable data.

Participants must be involved in process from the beginning, need input at start from intended users on how it can engage them.

Fiona: why not use existing resources, go to existing sites with established audiences?

Lee: how did YouTube succeed – people were brought by embedded content. [This issue of using 'wrappers' around your content to help it go viral by being embeddable elsewhere was raised in another session too.]

Tom: letting go is how you win, but it's a profound challenge to institutions and their desire to maintain authority.

Yahoo! SearchMonkey, the semantic web – an example from last.fm

I had meant to blog about SearchMonkey ages ago, but last.fm's post 'Searching with my co-monkey' about a live example they've created on the SearchMonkey platform has given me the kick I needed. They say:

The first version of our application deals with artist, album and track pages giving you a useful extract of the biography, links to listen to the artist if we have them available, tags, similar artists and the best picture we can muster for the page in question.

Some background on SearchMonkey from ReadWriteWeb:

At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they need to incentivize and enable publishers to control search result presentation.

…

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience – enhanced results contain what Yahoo! calls an "infobar" – a set of overlays to present additional information.

…

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best – the publishers – to define what should be presented and how.

(From Making the Web Searchable: The Story of SearchMonkey)

And from Yahoo!'s search blog:

This new developer platform, which we're calling SearchMonkey, uses data web standards and structured data to enhance the functionality, appearance and usefulness of search results. Specifically, with SearchMonkey:

Site owners can build enhanced search results that will provide searchers with a more useful experience by including links, images and name-value pairs in the search results for their pages (likely resulting in an increase in traffic quantity and quality)

Developers can build SearchMonkey apps that enhance search results, access Yahoo! Search's user base and help shape the next generation of search

Users can customize their search experience with apps built by or for their favorite sites

This could be an interesting new development – the question is, how well does the data we currently output play with it; could we easily adapt our pages so they're compatible with SearchMonkey; should we invest the time it might take? Would a simple increase in the visibility and usefulness of search results be enough? Could there be a greater benefit in working towards federated searches across the cultural heritage sector or would this require a coordinated effort and agreement on data standards and structure?

Update to link to the Yahoo! Search Blog post 'The Yahoo! Search Gallery is Open for Business' which has a few more examples.

WSG London Findability 'introduction to findability'

Last night I went to the WSG London Findability event at Westminster University. The event was part of London Web Week. As always, apologies for any errors; corrections and comments are welcome.

First up was Cyril Doussin with an 'introduction to findability'.

A lot of it is based on research by Peter Morville, particularly Ambient Findability.

So what do people search for?
Knowledge – about oneself; about concepts/meaning; detailed info (product details, specs); entities in society (people, organisations, etc.)
Opinions – to validate a feeling or judgement; establish trust relationships; find complementary judgements.

What is information? From simple to complex – data -> information -> knowledge.

Findability is 'the quality of being locatable or navigable'.
Item level – to what degree is a particular object easy to discover or locate?
System level – how well does the environment support navigation and retrieval?

Wayfinding requires: knowing where you are; knowing your destination; following the best route; being able to recognise your destination; being able to find your way back.

The next section was about how to make something findable:
The "in your face" discovery principle – expose the item in places known to be frequented by the target audience. He showed an example of a classic irritating Australian TV ad, a Brisbane carpet store in this case. It's disruptive and annoying, but everyone knows it exists. [Sadly, it made me a little bit homesick for Franco Cozzo. 'Megalo megalo megalo' is also a perfect example of targeting a niche audience, in this case the Greek and Italian speakers of Melbourne.]

Hand-guided navigation – sorting/ordering (e.g. sections of a restaurant menu); sign-posting.

Describe and browse (e.g. search engines) – similar to asking for directions or asking random questions; get a list of entry points to pages.

Mixing things up – the Google 'search within a search' and Yahoo!'s 'search assist' box both help users refine searches.

Recommendations (communication between peers) – the searcher describes intent; casual discussions; advice; past experiences.
The web is a referral system. Links are entry doors to your site. There's a need for a relevancy system whether search engines (PageRank) or peer-based systems (Digg).

Measuring relevance (effectiveness):
Precision – if it retrieves only relevant documents
Recall – whether it retrieves all relevant documents.

Good tests for the effectiveness of your relevance mechanism:
Precision = number of relevant and retrieved documents divided by the total number retrieved.
Recall = number of relevant and retrieved documents divided by the total number of relevant documents.
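The two formulas above can be sketched in Python as follows. The function name and the document IDs are invented for the example; treating an empty set as 0.0 is my own convention, not something from the talk.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one search."""
    hits = len(retrieved & relevant)  # documents both relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved = {"doc_a", "doc_b", "doc_c", "doc_d"}  # what the search returned
    relevant = {"doc_a", "doc_b", "doc_e"}            # what it should have found
    precision, recall = precision_recall(retrieved, relevant)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of what came back was relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67), which shows how a search can do reasonably on one measure and still miss things on the other.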

Relevance – need to identify the type of search:
Sample search – small number of documents are sufficient (e.g. first page of Google results)
Existence search – search for a specific document
Exhaustive search – full set of relevant data is needed.
Sample and existence searches require precision; exhaustive searches require recall.

Content organisation:
Taxonomy – organisation through labelling [but it seems in this context there's no hierarchy, the taxa are flat tags].
Ontology – taxonomy and inference rules.
Folksonomy – a social dimension.

[In the discussion he mentioned eRDF (embedded RDF) and microformats. Those magic words – subject : predicate : object.]

Content organisation is increasingly important because of the increasing volume of information and sharing of information. It's also a very good base for search engines.

Measuring findability on the web: count the number of steps to get there. There are many ways to get to data – search engines, peer-based lists and directories.
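One way to read 'count the number of steps' is as the smallest number of clicks from an entry point to the item, which is a breadth-first search over the site's links. The Python sketch below assumes you already have the link structure as a dictionary; the function name and the little sample site are hypothetical.

```python
from collections import deque
from typing import Optional


def clicks_to_reach(links: dict[str, list[str]], start: str, target: str) -> Optional[int]:
    """Fewest link-follows from start to target, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, steps = queue.popleft()
        if page == target:
            return steps
        for next_page in links.get(page, []):
            if next_page not in seen:
                seen.add(next_page)
                queue.append((next_page, steps + 1))
    return None


SAMPLE_SITE = {
    "home": ["collections", "visit"],
    "collections": ["roman"],
    "roman": ["buckle"],
    "visit": [],
}

if __name__ == "__main__":
    print(clicks_to_reach(SAMPLE_SITE, "home", "buckle"))
    print(clicks_to_reach(SAMPLE_SITE, "visit", "buckle"))
```

A result of None is arguably the most useful output: it flags pages that can't be reached by browsing at all and so depend entirely on search engines or external links.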

Recommendations:
Aim to strike a balance between sources e.g. search engine optimisation and peer-based.
Know the path(s) your audience(s) will follow (user testing)
Understand the types of search
Make advertising relevant (difficult, as it's so context-dependent)
Make content rich and relevant
Make your content structured

I've run out of lunch break now, but will write up the talks by Stuart Colville and Steve Marshall later.