Tom Morris, SPARQL and semweb stuff – tech talk at Open Hack London

Tom Morris gave a lightning talk on 'How to use Semantic Web data in your hack' (aka SPARQL and semantic web stuff).

He's since posted his links and queries – excellent links to endpoints you can test queries in.

Semantic web often thought of as long-promised magical elixir, he's here to say it can be used now by showing examples of queries that can be run against semantic web services. He'll demonstrate two different online datasets and one database that can be installed on your own machine.

First – dbpedia – scraped lots of wikipedia, put it into a database. dbpedia isn't like your averge database, you can't draw a UML diagram of wikipedia. It's done in RDF and Linked Data. Can be queried in a language that looks like SQL but isn't. SPARQL – is a w3c standard, they're currently working on SPARQL 2.

Go to dbpedia.org/sparql – submit query as post. [Really nice – I have a thing about APIs and platforms needing a really easy way to get you to 'hello world' and this does it pretty well.]

[Line by line comments on the syntax of the queries might be useful, though they're pretty readable as it is.]

'select thingy, wotsit where [the slightly more complicated stuff]'

Can get back results in xml, also HTML, 'spreadsheet', JSON. Ugly but readable. Typed.

[Trying a query challenge set by others could be fun way to get started learning it.]

One problem – fictional places are in Wikipedia e.g. Liberty City in Grand Theft Auto.

Libris – how library websites should be
[I never used to appreciate how much most library websites suck until I started back at uni and had to use one for more than one query every few years]

Has a query interface through SPARQL

Comment from the audience BBC – now have SPARQL endpoint [as of the day before? Go BBC guy!].

Playing with mulgara, open source java triple store. [mulgara looks like a kinda faceted search/browse thing] Has own query language called TQL which can do more intresting things than SPARQL. Why use it? Schemaless data storage. Is to SQL what dynamic typing is to static typing. [did he mean 'is to sparql'?]

Question from audence: how do you discover what you can query against?
Answer: dbpedia website should list the concepts they have in there. Also some documentation of categories you can look at. [Examples and documentation are so damn important for the update of your API/web service.]

Coming soon [?] SPARUL – update language, SPARQL2: new features

The end!

[These are more (very) rough notes from the weekend's Open Hack London event – please let me know of clarifications, questions, links or comments. My other notes from the event are tagged openhacklondon.

Quick plug: if you're a developer interested in using cultural heritage (museums, libraries, archives, galleries, archaeology, history, science, whatever) data – a bunch of cultural heritage geeks would like to know what's useful for you (more background here). You can comment on the #chAPI wiki, or tweet @miaridge (or @mia_out). Or if you work for a company that works with cultural heritage organisations, you can help us work better with you for better results for our users.]

There were other lightning talks on Pachube (pronounced 'patchbay', about trying to build the internet of things, making an API for gadgets because e.g. connecting hardware to the web is hard for small makers) and Homera (an open source 3d game engine).

RDFa, SearchMonkey – tech talks at Open Hack London

While today's Open Hack London event is mostly about the 24-hour hackathon, I signed up just for the Tech Talks because I couldn't afford to miss a whole weekend's study in the fortnight before my exams (stupid exams). I went to the sessions on 'Guardian Data Store and APIs', 'RDFa SearchMonkey', Arduino, 'Hacking with PHP', 'BBC Backstage', Dopplr's 'mashups made of messages' and lightning talks including 'SPARQL and semantic web' stuff you can do now.

I'm putting my rough and ready notes online so that those who couldn't make it can still get some of the benefits. Apologies for any mishearings or mistakes in transcription – leave me a comment with any questions or clarifications.

One of the reasons I was going was to push my thinking about the best ways to provide API-like access to museum information and collections, so my notes will reflect that but I try to generalise where I can. And if you have thoughts on what you'd like cultural heritage institutions to do for developers, let us know! (For background, here's a lightning talk I did at another hack event on happy museums + happy developers = happy punters).

RDFa – now everyone can have an API.
Mark Birkbeck

Going to cover some basic mark-up, and talk about why RDFa is a good thing. [The slides would be useful for the syntax examples, I'll update if they go online.]

RDFa is a new syntax from W3C – a way of embedding metadata (RDF) in HTML documents using attributes.

e.g. <span property="dc:title"> – value of property is the text inside the span.

Because it's inline you don't need to point to another document to provide source of metadata and presentation HTML.

One big advance is that can provide metadata for other items e.g. images, so you can e.g. attach licence info to the image rather than page it's in – e.g. <img src="" rel="licence" resource="[creative commons licence]">

Putting RDFa into web pages means you've now got a feed (the web page is the RSS feed), and a simple static web page can become an API that can be consumed in the same way as stuff from a big expensive system. 'Growing adoption'.

Government department Central Office of Information [?] is quite big on RDFa, have a number of projects with it. [I'd come across the UK Civil Service Job Service API while looking for examples for work presentations on APIs.]

RDFa allows for flexible publishing options. If you're already publishing HTML, you can add RDFa mark-up then get flexible publishing models – different departments can keep publishing data in their own way, a central website can go and request from each of them and create its own database of e.g. jobs. Decentralised way of approaching data distribution.

Can be consumed by: smarter browsers; client-side AJAX, other servers such as SearchMonkey.

He's interested where browsers can do something with it – either enhanced browsers that could e.g. store contact info in a page into your address book; or develop JavaScript libraries that can parse page and do something with it. [screen shot of jobs data in search monkey with enhanced search results]

RDFa might be going into Drupal core.

Example of putting isbn in RDFa in page, then a parser can go through the page, pull out the triples [some explanation of them as mini db?], pull back more info about the book from other APIs e.g. Amazon – full title, thumbnail of cover. e.g. pipes.

Example of FOAF – twitter account marked up in page, can pull in tweets. Could presumably pull in newer services as more things were added, without having to re-mark-up all the pages.

Example of chemist writing a blog who mentions a chemical compound in blog post, a processor can go off and retrieve more info – e.g. add icon for mouseover info – image of molecule, or link to more info.

Next plan is to link with BOSS. Can get back RDFa from search results – augment search results with RDFa from the original page.

Search Monkey (what it is and what you can do with it)
Neil Crosby (European frontend architect for search at Yahoo).

SearchMonkey is (one of) Yahoo's open search platforms (along with BOSS). Uses structured data to enhance search results. You get to change stuff on Yahoo search results page.

SearchMonkey lets you: style results for certain URL patterns; brand those results; make the results more useful for users.

[examples of sites that have done it to see how their results look in Yahoo? I thought he mentioned IMDb but it doesn't look any different – a film search that returns a wikipedia result, OTOH, does.]

Make life better for users – not just what Yahoo thinks results should be, you can say 'actually this is the important info on the page'

Three ways to do it [to change the SERP [search engine results page]: mark up data in a way that Yahoo knows about – 'just structure your data nicely'. e.g. video mark-up; enhance a result directly; make an infobar.

Infobar – doesn't change result see immediately on the page, but it opens on the page. e.g. of auto-enhanced result- playcrafter. Link to developer start page – how to mark it up, with examples, and what it all means.

User-enhanced result – Facebook profile pages are marked up with microformats – can add as friend, poke, send message, view friends, etc from the search results page. Can change the title and abstract, add image, favicon, quicklinks, key/value pairs. Create at [link I can't see but is on slides] Displayed in screen, you fill it out on a template.

Infobar – dropdown in grey bar under results. Can do a lot more, as it's hidden in the infobar and doesn't have to worry people.

Data from: microformats, RDF, XSLT, Yahoo's index, and soon, top tags from delicious.

If no machine data, can write an XSLT. 'isn't that hard'. Lots of documentation on the web.

Examples of things that have been made – a tool that exposes all the metadata known for a page. URL on slide. can install on Yahoo search page, add it in. Use location data to make a map – any page on web with metadata about locations on it – map monkey. Get qype results for anything you search for.

There's a mailing list (people willing and wanting to answer questions) and a tutorial.

Questions

Question: do you need to use a special doctype [for RDFa]?
Answer: added to spec that 'you should use this doctype' but the spec allows for RDFa to be used in situations when can't change doctype e.g. RDFa embedded in blogger blogpost. Most parsers walk the DOM rather than relying on the doctype.

Jim O'D – excited that SearchMonkey supports XSLT – if have website with correctly marked up tables, could expose those as key/value pairs?
Answer: yes. XSLT fantastic tool for when don't have data marked up – can still get to it.

Frankie – question I couldn't hear. About info out to users?
Answer: if you've built a monkey, up to you to tell people about it for the moment. Some monkeys are auto-on e.g. Facebook, wikipedia… possibly in future, if developed a monkey for a site you own, might be able to turn it auto-on in the results for all users… not sure yet if they'll do it or not.
Frankie: plan that people get monkeys they want, or go through gallery?
Answer: would be fantastic if could work out what people are using them for and suggest ones appropriate to people doing particular kinds of searches, rather than having to go to a gallery.

Yahoo! SearchMonkey, the semantic web – an example from last.fm

I had meant to blog about SearchMonkey ages ago, but last.fm's post 'Searching with my co-monkey' about a live example they've created on the SearchMonkey platform has given me the kick I needed. They say:

The first version of our application deals with artist, album and track pages giving you a useful extract of the biography, links to listen to the artist if we have them available, tags, similar artists and the best picture we can muster for the page in question.

Some background on SearchMonkey from ReadWriteWeb:

At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they need to incentivize and enable publishers to control search result presentation.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience – enhanced results contain what Yahoo! calls an "infobar" – a set of overlays to present additional information.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best – the publishers – to define what should be presented and how.

(From Making the Web Searchable: The Story of SearchMonkey)

And from Yahoo!'s search blog:

This new developer platform, which we're calling SearchMonkey, uses data web standards and structured data to enhance the functionality, appearance and usefulness of search results. Specifically, with SearchMonkey:

  • Site owners can build enhanced search results that will provide searchers with a more useful experience by including links, images and name-value pairs in the search results for their pages (likely resulting in an increase in traffic quantity and quality)
  • Developers can build SearchMonkey apps that enhance search results, access Yahoo! Search's user base and help shape the next generation of search
  • Users can customize their search experience with apps built by or for their favorite sites

This could be an interesting new development – the question is, how well does the data we currently output play with it; could we easily adapt our pages so they're compatible with SearchMonkey; should we invest the time it might take? Would a simple increase in the visibility and usefulness of search results be enough? Could there be a greater benefit in working towards federated searches across the cultural heritage sector or would this require a coordinated effort and agreement on data standards and structure?

Update to link to the Yahoo! Search Blog post ;The Yahoo! Search Gallery is Open for Business' which has a few more examples.

WSG London Findability 'introduction to findability'

Last night I went to the WSG London Findability event at Westminster University. The event was part of London Web Week. As always, apologies for any errors; corrections and comments are welcome.

First up was Cyril Doussin with an 'introduction to findability'.

A lot of it is based on research by Peter Morville, particularly Ambient Findability.

So what do people search for?
Knowledge – about oneself; about concepts/meaning; detailed info (product details, specs); entities in society (people, organisations, etc.)
Opinions – to validate a feeling or judgement; establish trust relationships; find complementary judgements.

What is information? From simple to complex – data -> information -> knowledge.

Findability is 'the quality of being locatable or navigatable'.
Item level – to what degree is a particular object easy to discover or locate?
System level – how well does the environment support navigation and retrieval?

Wayfinding requires: knowing where you are; knowing your destination; following the best route; being able to recognise your destination; being able to find your way back.

The next section was about how to make something findable:
The "in your face" discovery principle – expose the item in places known to be frequented by the target audience. He showed an example of a classic irritating Australian TV ad, a Brisbane carpet store in this case. It's disruptive and annoying, but everyone knows it exists. [Sadly, it made me a little bit homesick for Franco Cozzo. 'Megalo megalo megalo' is also a perfect example of targeting a niche audience, in this case the Greek and Italian speakers of Melbourne.]

Hand-guided navigation – sorting/ordering (e.g. sections of a restaurant menu); sign-posting.

Describe and browse (e.g. search engines) – similar to asking for directions or asking random questions; get a list of entry points to pages.

Mixing things up – the Google 'search within a search' and Yahoo!'s 'search assist' box both help users refine searches.

Recommendations (communication between peers) – the searcher describes intent; casual discussions; advice; past experiences.
The web is a referral system. Links are entry doors to your site. There's a need for a relevancy system whether search engines (PageRank) or peer-based systems (Digg).

Measuring relevance (effectiveness):
Precision – if it retrieves only relevant documents
Recall – whether it retrieves all relevant documents.

Good tests for the effectiveness of your relevance mechanism:
Precision = number of relevant and retrieved documents divided by the total number retrieved.
Recall = number of relevant and retrieved documents divided by the total number of relevant documents.

Relevance – need to identify the type of search:
Sample search – small number of documents are sufficient (e.g. first page of Google results)
Existence search – search for a specific document
Exhaustive search – full set of relevant data is needed.
Sample and existence searches require precision; exhaustive searches require recall.

Content organisation:
Taxonomy – organisation through labelling [but it seems in this context there's no hierarchy, the taxon are flat tags].
Ontology – taxonomy and inference rules.
Folksonomy – a social dimension.

[In the discussion he mentioned eRDF (embedded RDF) and microformats. Those magic words – subject : predicate : object.]

Content organisation is increasingly important because of the increasing volume of information and sharing of information. It's also a very good base for search engines.

Measuring findability on the web: count the number of steps to get there. There are many ways to get to data – search engines, peer-based lists and directories.

Recommendations:
Aim to strike a balance between sources e.g. search engine optimisation and peer-based.
Know the path(s) your audience(s) will follow (user testing)
Understand the types of search
Make advertising relevant (difficult, as it's so context-dependent)
Make content rich and relevant
Make your content structured

I've run out of lunch break now, but will write up the talks by Stuart Colville and Steve Marshall later.