Tom Morris, SPARQL and semweb stuff – tech talk at Open Hack London

Tom Morris gave a lightning talk on 'How to use Semantic Web data in your hack' (aka SPARQL and semantic web stuff).

He's since posted his links and queries – excellent links to endpoints you can test queries in.

The semantic web is often thought of as a long-promised magical elixir; he's here to say it can be used now, by showing examples of queries that can be run against semantic web services. He'll demonstrate two different online datasets and one database that can be installed on your own machine.

First – dbpedia – scraped lots of Wikipedia, put it into a database. dbpedia isn't like your average database, you can't draw a UML diagram of Wikipedia. It's done in RDF and Linked Data. Can be queried in a language that looks like SQL but isn't. SPARQL – is a W3C standard, they're currently working on SPARQL 2.

Go to dbpedia.org/sparql – submit query as POST. [Really nice – I have a thing about APIs and platforms needing a really easy way to get you to 'hello world' and this does it pretty well.]

[Line by line comments on the syntax of the queries might be useful, though they're pretty readable as it is.]

'select thingy, wotsit where [the slightly more complicated stuff]'

Can get back results in XML, also HTML, 'spreadsheet', JSON. Ugly but readable. Typed.
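A minimal sketch of what 'submit query as POST, get JSON back' could look like from Python. The query itself is my own illustration (not one of Tom's), the dbo:City and dbo:populationTotal terms are assumed to be present in the dataset, and the https scheme on the endpoint is an assumption. The rows() helper relies only on the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # endpoint named in the talk; https is an assumption

# Illustrative query (an assumption, not from the talk): five cities and their populations.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
} LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a POST request asking for results in the SPARQL JSON format."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying data makes urllib send a POST
        headers={"Accept": "application/sparql-results+json"},
    )


def rows(results):
    """Flatten a SPARQL JSON result document into plain dicts of values."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query)) as response:
        return rows(json.load(response))
```

Calling run(QUERY) needs a network connection; build_request() and rows() can be tried offline. Each cell in the raw JSON also carries a "type" field, which is presumably the 'typed' part.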

[Trying a query challenge set by others could be fun way to get started learning it.]

One problem – fictional places are in Wikipedia e.g. Liberty City in Grand Theft Auto.

Libris – how library websites should be
[I never used to appreciate how much most library websites suck until I started back at uni and had to use one for more than one query every few years]

Has a query interface through SPARQL

Comment from the audience (from the BBC): they now have a SPARQL endpoint [as of the day before? Go BBC guy!].

Playing with mulgara, open source Java triple store. [mulgara looks like a kinda faceted search/browse thing] Has own query language called TQL which can do more interesting things than SPARQL. Why use it? Schemaless data storage. Is to SQL what dynamic typing is to static typing. [did he mean 'is to sparql'?]
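To make 'schemaless data storage' a bit more concrete, here is a toy in-memory triple store in Python. It is only a sketch of the general idea and has nothing to do with how mulgara or TQL actually work; the class name and the sample facts are made up for illustration.

```python
class TripleStore:
    """A toy in-memory triple store: no tables, no schema, just facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        """Record one (subject, predicate, object) fact."""
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self.triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


store = TripleStore()
store.add("london", "type", "city")
store.add("liberty_city", "type", "city")
# A brand new kind of fact needs no schema change, it is just another triple.
store.add("liberty_city", "appears_in", "grand_theft_auto")

cities = store.match(predicate="type", obj="city")
```

Here store.match(subject="liberty_city") returns everything known about that one subject, whatever predicates happen to have been used, which is roughly the 'dynamic typing' feel he was describing.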

Question from audience: how do you discover what you can query against?
Answer: dbpedia website should list the concepts they have in there. Also some documentation of categories you can look at. [Examples and documentation are so damn important for the uptake of your API/web service.]

Coming soon [?] SPARUL – update language, SPARQL2: new features

The end!

[These are more (very) rough notes from the weekend's Open Hack London event – please let me know of clarifications, questions, links or comments. My other notes from the event are tagged openhacklondon.

Quick plug: if you're a developer interested in using cultural heritage (museums, libraries, archives, galleries, archaeology, history, science, whatever) data – a bunch of cultural heritage geeks would like to know what's useful for you (more background here). You can comment on the #chAPI wiki, or tweet @miaridge (or @mia_out). Or if you work for a company that works with cultural heritage organisations, you can help us work better with you for better results for our users.]

There were other lightning talks on Pachube (pronounced 'patchbay', about trying to build the internet of things, making an API for gadgets because e.g. connecting hardware to the web is hard for small makers) and Homera (an open source 3d game engine).

Google updates search, playing catch-up?

A quick post in case you've missed it elsewhere – whether in response to the ridiculously-titled 'Wolfram Alpha' or to Yahoo's 'open strategy' (YOS) and work on enhancing search engine results pages (SERPs) with structured data, Google have announced a "new set of features that we call Search Options, which are a collection of tools that let you slice and dice your results and generate different views to find what you need faster and easier" and "rich snippets" that "show more useful information from web pages than the preview text". Searchengineland have compared Google and Yahoo's offerings.

Update: some of the criticism rumbling on twitter yesterday has been neatly summarised by Ian Davis in 'Google's RDFa a Damp Squib':

However, a closer look reveals that Google have basically missed the point of RDFa. The RDFa support is limited to the properties and classes defined on a hastily thrown together site called data-vocabulary.org. There you will find classes for Person and Organization and properties for names and addresses, completely ignoring the millions of pieces of data using well established terms from FOAF and the like. That means everyone has to rewrite all their data to use Google's schema if they want to be featured on Google's search engine. Its like saying you have to write your pages using Google's own version of html where all the tags have slightly different spellings to be listed in their search engine!

The result is a hobbled implementation of RDFa. They've taken the worst part – the syntax – and thrown away the best – the decentralized vocabularies of terms. It's like using microformats without the one thing they do well: the simplicity.

Further, in the comments:

the point of decentralization is not to encourage fragmentation and isolation, but to allow people to collaborate without needing permission from a middleman. Google's approach imposes a centralized authority.

There's also a (slightly disingenuous, IMO) response from Google:

For Rich Snippets, Google search need to understand what the data means in order to render it appropriately. We will start incorporating existing vocabularies like FOAF, but there's no way for us to have a decent user experience for brand-new vocabularies that someone defines. We also need a single place where a webmaster can come and find all the terms that Google understands. Which is why we have data-vocabulary.org.

Isn't the point of Google that it can figure stuff out without needing to be told?

Rasmus Lerdorf on Hacking with PHP – tech talk at Open Hack London

Same deal as my first post from today's Open Hack London event – these are (very) rough notes, please let me know of clarifications, questions or comments.

Hacking with PHP, Rasmus Lerdorf

Goal of talk: copy and pastable snippets that just work so you don't have to fight to get things that work [there's not enough of this to help beginners get over that initial hump]. The slides are available at http://talks.php.net/show/openhack and these notes are probably best read as commentary alongside the code examples.

[Since it's a hack day, some] Hack ideas: fix something you use every day; build your own targeted search engine; improve the look of search results; play with semantic web tools to make the web more semantic; tell the world what kind of data you have – if a resume, use hResume or other appropriate microformats/markup; go local – tools for helping your local community; hack for good – make the world a better place.

SearchMonkey and BOSS are blending together a little bit.

What we need to learn
With PHP – enough to handle simple requests; talk to backend datastore; how to parse XML with PHP, how to generate JSON, some basic JavaScript, a JavaScript utility library like YUI or jQuery.

Parsing XML: simplexml_load_file() – can load entire URL or local file.

Attributes on a node show up as an array. For namespaced elements, call children() on the node and pass the namespace as the argument.
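The talk's examples are in PHP (see the slides for the real thing). As a rough Python analogue of the same two points, attributes as a dictionary and namespaced children needing the namespace spelled out, something like the following should work. The feed, the "geo" prefix and its URI are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the "geo" namespace URI and element names are invented.
SAMPLE = """<?xml version="1.0"?>
<feed xmlns:geo="http://example.org/geo">
  <item id="42" lang="en">
    <title>Cutty Sark</title>
    <geo:lat>51.48</geo:lat>
  </item>
</feed>"""

NAMESPACES = {"geo": "http://example.org/geo"}

root = ET.fromstring(SAMPLE)
item = root.find("item")

# Attributes on a node behave like a dictionary.
attributes = item.attrib

# Plain children can be looked up by name...
title = item.findtext("title")

# ...but namespaced children need the namespace supplied explicitly.
latitude = item.findtext("geo:lat", namespaces=NAMESPACES)
```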

Now know how to parse XML, can get lots of other stuff.
Context extraction service, Yahoo – doesn't get enough attention. Post all text, gives you back four or five key terms – can then do an image search off them. Or match ads to webpages.

Can use get or post (curl) – usually too much for get.

PHP to JavaScript on initial page load: json_encode() -> JavaScript.
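The same server-to-browser handoff, sketched in Python rather than PHP: serialise the server-side data as JSON and write it straight into a script block. The function name and the initialData variable are hypothetical, and the escaping step is my own addition rather than something from the slides.

```python
import json


def page_with_data(data):
    """Render a tiny HTML page that hands server-side data to JavaScript as JSON."""
    payload = json.dumps(data)
    # Stop a string containing "</script>" from closing the tag early.
    # "\/" is a legal escape in both JSON and JavaScript strings.
    payload = payload.replace("</", "<\\/")
    return (
        "<html><body>\n"
        "<script>\n"
        f"var initialData = {payload};\n"
        "</script>\n"
        "</body></html>"
    )


html = page_with_data({"terms": ["hack", "london"], "count": 2})
```

JSON is (near enough) a subset of JavaScript literal syntax, which is why the encoded value can be dropped in as-is.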

Javascript to PHP (and back)
If you can figure out these six lines of code, you can write anything in the world. How every modern web application works.
Server-side php, client-side javascript.

'There's nothing to building web applications, you just have to break everything down into small enough chunks that it all becomes trivial'.

AJAX in 30 seconds.
[Inline comments in code would help for people reading it without hearing the talk at the same time.]

JavaScript libraries to the rescue
load maps API, create container (div) for the map, then fill it.

Form – on submit call return updateMap(); with new location.

YGeoRSS – if have GeoRSS file… can point to it.

GeoPlanet – assigns a WOE ID to a place. Locations are more than just a lat long – carry way more information. Basically gives you a foreign key. YQL is starting to make the web a giant database. Can make joins across APIs – woeid works as fk.
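A small sketch of the 'woeid works as a foreign key' idea, done locally in Python instead of in YQL. Every WOEID and record below is a made-up placeholder, not real GeoPlanet data, and join_on_woeid is a hypothetical helper.

```python
# All WOEIDs and records below are made-up placeholders, not real GeoPlanet data.
places = [
    {"woeid": 1001, "name": "Greenwich"},
    {"woeid": 1002, "name": "Camden"},
]
photos = [
    {"title": "Cutty Sark at dusk", "woeid": 1001},
    {"title": "Market stalls", "woeid": 1002},
    {"title": "Observatory", "woeid": 1001},
]


def join_on_woeid(left, right):
    """Inner join two lists of dicts on their shared 'woeid' key."""
    by_id = {row["woeid"]: row for row in left}
    return [
        {**by_id[row["woeid"]], **row}
        for row in right
        if row["woeid"] in by_id
    ]


joined = join_on_woeid(places, photos)
```

The point is that once two unrelated APIs both return the same place identifier, combining their results is an ordinary join.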

YQL – 'combines all the APIs on the web into a single API'.

Add a cache – nice to YQL, and also good for demos etc. Copy and paste cache function from his slides – does a local cache on URL. Hashed with md5. Using PHP streams – #defn. Adding a cache speeds up developing when hacking (esp as won't be waiting for the wifi). [This is a pretty damn good tip cos it's really useful and not immediately obvious.]
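His cache function is PHP and lives in the slides. The same idea sketched in Python might look like this: a local file cache keyed on the md5 of the URL. The function name, the max_age default and the pluggable fetch argument are my assumptions, not his code.

```python
import hashlib
import os
import time


def cached_fetch(url, fetch, cache_dir, max_age=3600):
    """Return the body for url, reusing a local copy if it is fresh enough.

    fetch is any callable that takes a URL and returns bytes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # md5 only turns the URL into a safe filename; it is not used for security.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, "rb") as handle:
            return handle.read()
    body = fetch(url)
    with open(path, "wb") as handle:
        handle.write(body)
    return body
```

Passing the fetcher in as an argument means a canned function can stand in when the wifi is down, and a real HTTP call can be swapped in otherwise.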

XPath on URL using PHP's OAuth extension

SearchMonkey – social engineering people into caring about semantic data on the web. For non-geeks, search plug-in mechanism that will spruce up search results page. Encourages people to add semantic data so their search result is as sexy as their competitors – so goal is that people will start adding semantic data.

'If you're doing web stuff, and don't know about microformats, and your resume doesn't have hResume, you're not getting a job with Yahoo.'

Question: how are microformats different to RDFa?
Answer: there are different types of microformats – some very specific ones, e.g. hResume, hCal. RDFa – adding arbitrary tags to a page, even if there's no specific way to describe your data. But there's a standard set of mark-ups for a resume so can use that. If your data doesn't match anything at microformats.org then use RDFa or eRDF (?).

RDFa, SearchMonkey – tech talks at Open Hack London

While today's Open Hack London event is mostly about the 24-hour hackathon, I signed up just for the Tech Talks because I couldn't afford to miss a whole weekend's study in the fortnight before my exams (stupid exams). I went to the sessions on 'Guardian Data Store and APIs', 'RDFa SearchMonkey', Arduino, 'Hacking with PHP', 'BBC Backstage', Dopplr's 'mashups made of messages' and lightning talks including 'SPARQL and semantic web' stuff you can do now.

I'm putting my rough and ready notes online so that those who couldn't make it can still get some of the benefits. Apologies for any mishearings or mistakes in transcription – leave me a comment with any questions or clarifications.

One of the reasons I was going was to push my thinking about the best ways to provide API-like access to museum information and collections, so my notes will reflect that but I try to generalise where I can. And if you have thoughts on what you'd like cultural heritage institutions to do for developers, let us know! (For background, here's a lightning talk I did at another hack event on happy museums + happy developers = happy punters).

RDFa – now everyone can have an API.
Mark Birbeck

Going to cover some basic mark-up, and talk about why RDFa is a good thing. [The slides would be useful for the syntax examples, I'll update if they go online.]

RDFa is a new syntax from W3C – a way of embedding metadata (RDF) in HTML documents using attributes.

e.g. <span property="dc:title"> – value of property is the text inside the span.

Because it's inline you don't need to point to another document to provide source of metadata and presentation HTML.

One big advance is that you can provide metadata for other items e.g. images, so you can e.g. attach licence info to the image rather than the page it's in – e.g. <img src="" rel="license" resource="[creative commons licence]">
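To show why inline attributes make a page machine-readable, here is a deliberately naive Python sketch that pulls predicate/value pairs out of property= and rel=/resource= attributes. It is nowhere near a conforming RDFa parser: it ignores subjects, prefixes, the content attribute and nested mark-up. The sample page and class name are made up.

```python
from html.parser import HTMLParser


class RDFaSketch(HTMLParser):
    """Collect (predicate, value) pairs from property= and rel=/resource= attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_property = None  # property whose text content is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel + resource: the value is the resource URL.
        if "rel" in attributes and "resource" in attributes:
            self.pairs.append((attributes["rel"], attributes["resource"]))
        # property: the value is the text inside the element.
        if "property" in attributes:
            self._open_property = attributes["property"]
            self._text = []

    def handle_data(self, data):
        if self._open_property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Naive: assumes the property element has no child elements.
        if self._open_property is not None:
            self.pairs.append((self._open_property, "".join(self._text).strip()))
            self._open_property = None


# Hypothetical page fragment for illustration.
SAMPLE = """
<div>
  <span property="dc:title">Cutty Sark at dusk</span>
  <img src="ship.jpg" rel="license"
       resource="http://creativecommons.org/licenses/by/3.0/">
</div>
"""

parser = RDFaSketch()
parser.feed(SAMPLE)
pairs = parser.pairs
```

A real consumer would use a proper RDFa library and get full triples (subject, predicate, object) rather than these bare pairs.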

Putting RDFa into web pages means you've now got a feed (the web page is the RSS feed), and a simple static web page can become an API that can be consumed in the same way as stuff from a big expensive system. 'Growing adoption'.

Government department Central Office of Information [?] is quite big on RDFa, have a number of projects with it. [I'd come across the UK Civil Service Job Service API while looking for examples for work presentations on APIs.]

RDFa allows for flexible publishing options. If you're already publishing HTML, you can add RDFa mark-up then get flexible publishing models – different departments can keep publishing data in their own way, a central website can go and request from each of them and create its own database of e.g. jobs. Decentralised way of approaching data distribution.

Can be consumed by: smarter browsers; client-side AJAX, other servers such as SearchMonkey.

He's interested where browsers can do something with it – either enhanced browsers that could e.g. store contact info in a page into your address book; or develop JavaScript libraries that can parse page and do something with it. [screen shot of jobs data in search monkey with enhanced search results]

RDFa might be going into Drupal core.

Example of putting isbn in RDFa in page, then a parser can go through the page, pull out the triples [some explanation of them as mini db?], pull back more info about the book from other APIs e.g. Amazon – full title, thumbnail of cover. e.g. pipes.

Example of FOAF – twitter account marked up in page, can pull in tweets. Could presumably pull in newer services as more things were added, without having to re-mark-up all the pages.

Example of chemist writing a blog who mentions a chemical compound in blog post, a processor can go off and retrieve more info – e.g. add icon for mouseover info – image of molecule, or link to more info.

Next plan is to link with BOSS. Can get back RDFa from search results – augment search results with RDFa from the original page.

Search Monkey (what it is and what you can do with it)
Neil Crosby (European frontend architect for search at Yahoo).

SearchMonkey is (one of) Yahoo's open search platforms (along with BOSS). Uses structured data to enhance search results. You get to change stuff on Yahoo search results page.

SearchMonkey lets you: style results for certain URL patterns; brand those results; make the results more useful for users.

[examples of sites that have done it to see how their results look in Yahoo? I thought he mentioned IMDb but it doesn't look any different – a film search that returns a wikipedia result, OTOH, does.]

Make life better for users – not just what Yahoo thinks results should be, you can say 'actually this is the important info on the page'

Three ways to do it [to change the SERP (search engine results page)]: mark up data in a way that Yahoo knows about – 'just structure your data nicely'. e.g. video mark-up; enhance a result directly; make an infobar.

Infobar – doesn't change the result you see immediately on the page, but it opens on the page. e.g. of auto-enhanced result – playcrafter. Link to developer start page – how to mark it up, with examples, and what it all means.

User-enhanced result – Facebook profile pages are marked up with microformats – can add as friend, poke, send message, view friends, etc from the search results page. Can change the title and abstract, add image, favicon, quicklinks, key/value pairs. Create at [link I can't see but is on slides] Displayed in screen, you fill it out on a template.

Infobar – dropdown in grey bar under results. Can do a lot more, as it's hidden in the infobar and doesn't have to worry people.

Data from: microformats, RDF, XSLT, Yahoo's index, and soon, top tags from delicious.

If no machine data, can write an XSLT. 'isn't that hard'. Lots of documentation on the web.

Examples of things that have been made – a tool that exposes all the metadata known for a page. URL on slide. can install on Yahoo search page, add it in. Use location data to make a map – any page on web with metadata about locations on it – map monkey. Get qype results for anything you search for.

There's a mailing list (people willing and wanting to answer questions) and a tutorial.

Questions

Question: do you need to use a special doctype [for RDFa]?
Answer: added to spec that 'you should use this doctype' but the spec allows for RDFa to be used in situations when can't change doctype e.g. RDFa embedded in blogger blogpost. Most parsers walk the DOM rather than relying on the doctype.

Jim O'D – excited that SearchMonkey supports XSLT – if have website with correctly marked up tables, could expose those as key/value pairs?
Answer: yes. XSLT fantastic tool for when don't have data marked up – can still get to it.
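SearchMonkey itself would want an XSLT for this. Purely to illustrate the extraction being described (label/value table rows becoming key/value pairs), here is the same transformation sketched in Python rather than XSLT. The table is a made-up, well-formed XHTML fragment; real pages would need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment for illustration.
TABLE = """<table>
  <tr><th>Material</th><td>Samian ware</td></tr>
  <tr><th>Period</th><td>Roman</td></tr>
</table>"""


def table_to_pairs(markup):
    """Turn rows of <th>label</th><td>value</td> into a dict of key/value pairs."""
    pairs = {}
    for row in ET.fromstring(markup).iter("tr"):
        label, value = row.findtext("th"), row.findtext("td")
        # Skip rows that lack either a header cell or a data cell.
        if label is not None and value is not None:
            pairs[label.strip()] = value.strip()
    return pairs


pairs = table_to_pairs(TABLE)
```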

Frankie – question I couldn't hear. About info out to users?
Answer: if you've built a monkey, up to you to tell people about it for the moment. Some monkeys are auto-on e.g. Facebook, wikipedia… possibly in future, if developed a monkey for a site you own, might be able to turn it auto-on in the results for all users… not sure yet if they'll do it or not.
Frankie: plan that people get monkeys they want, or go through gallery?
Answer: would be fantastic if could work out what people are using them for and suggest ones appropriate to people doing particular kinds of searches, rather than having to go to a gallery.

One step closer to intelligent searching?

The BBC have a story on a new search engine, Search site aims to rival Google:

Called Cuil [pronounced 'cool'], from the Gaelic for knowledge and hazel, its founders claim it does a better and more comprehensive job of indexing information online.

The technology it uses to index the web can understand the context surrounding each page and the concepts driving search requests, say the founders.

But analysts believe the new search engine, like many others, will struggle to match and defeat Google.

Instead of just looking at the number and quality of links to and from a webpage as Google's technology does, Cuil attempts to understand more about the information on a page and the terms people use to search. Results are displayed in a magazine format rather than a list.

From the Cuil FAQ:

So Cuil searches the Web for pages with your keywords and then we analyze the rest of the text on those pages. This tells us that the same word has several different meanings in different contexts. Are you looking for jaguar the cat, the car or the operating system?

We sort out all those different contexts so that you don't have to waste time rephrasing your query when you get the wrong result.

Different ideas are separated into tabs; we add images and roll-over definitions for each page and then make suggestions as to how you might refine your search. We use columns so you can see more results on one page.

They also provide 'drill-downs' on the results page.

Cuil will direct you to this additional information. By looking at these suggestions, you may discover search data, concepts, or related areas of interest that you hadn’t expected. This is particularly useful when you are researching a subject you don't know much about and aren't sure how to compose the "right" query to find the information you need.

I haven't used it enough to work out exactly how it differentiates concepts (tabs) and 'additional information' (drill-downs/categories).

It does a good job on something like the Cutty Sark. Under 'Explore by Category' it offered:

  • Buildings And Structures In Greenwich
  • Sailboat Names
  • Museums In London
  • Neighbourhoods Of Greenwich
  • School Ships

It picked up search results for Cutty Sark whisky and news of the Cutty Sark fire but they weren't reflected in the categories, and the search term didn't trigger the tabs. The tabs kick in when you search for something like 'orange'.

It didn't do as well with 'samian ware' – the categories picked up all sorts of places and peoples, (and randomly 'American Films'), but while the search results all say that it's 'a kind of bright red Roman pottery' that's not reflected in the categories. Fair enough, there may not be enough information easily available online so that 'Types of Roman pottery' registers as a category.

Incidentally, most of the results listed for 'samian ware' are just recycled entries from Wikipedia. It's a shame the results aren't filtered to remove entries that have just duplicated Wikipedia text. The FAQ says they don't index duplicate content; I guess the overall site or page is just different enough to be retained.

It might take a while for museum content to appear in the most useful ways, but it looks like it might be a useful search engine for niche content. From the FAQ again:

We've found that a lot of Web pages have been designed with a small audience in mind—perhaps they are blogs or academic papers with specific interests or pages with family photos. We think that even though these pages aren't necessarily for a wide audience, they contain content that one day you might need.

Our job is to index all these pages and examine their content for relevancy to your search. If they contain information you need, then they should be available to you.

It's all sounding a bit semantic web-ish (and quite a bit 'reacting to Google-ish') and I'll use it for a while to see how it compares to Google. The webmaster information doesn't give any indication of how you could mark up content so the relationships between terms in different contexts are clear, but I guess nice semantic markup would help.

Refreshingly, it doesn't retain search info – privacy is one of their big differentiators from Google.

Yahoo! SearchMonkey, the semantic web – an example from last.fm

I had meant to blog about SearchMonkey ages ago, but last.fm's post 'Searching with my co-monkey' about a live example they've created on the SearchMonkey platform has given me the kick I needed. They say:

The first version of our application deals with artist, album and track pages giving you a useful extract of the biography, links to listen to the artist if we have them available, tags, similar artists and the best picture we can muster for the page in question.

Some background on SearchMonkey from ReadWriteWeb:

At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they need to incentivize and enable publishers to control search result presentation.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience – enhanced results contain what Yahoo! calls an "infobar" – a set of overlays to present additional information.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best – the publishers – to define what should be presented and how.

(From Making the Web Searchable: The Story of SearchMonkey)

And from Yahoo!'s search blog:

This new developer platform, which we're calling SearchMonkey, uses data web standards and structured data to enhance the functionality, appearance and usefulness of search results. Specifically, with SearchMonkey:

  • Site owners can build enhanced search results that will provide searchers with a more useful experience by including links, images and name-value pairs in the search results for their pages (likely resulting in an increase in traffic quantity and quality)
  • Developers can build SearchMonkey apps that enhance search results, access Yahoo! Search's user base and help shape the next generation of search
  • Users can customize their search experience with apps built by or for their favorite sites

This could be an interesting new development – the question is, how well does the data we currently output play with it; could we easily adapt our pages so they're compatible with SearchMonkey; should we invest the time it might take? Would a simple increase in the visibility and usefulness of search results be enough? Could there be a greater benefit in working towards federated searches across the cultural heritage sector or would this require a coordinated effort and agreement on data standards and structure?

Update to link to the Yahoo! Search Blog post 'The Yahoo! Search Gallery is Open for Business' which has a few more examples.