It’s a wonderful, wonderful web

First, the news that Google are starting to crawl the deep or invisible web via HTML forms on a sample of ‘high quality’ sites (via The Walker Art Center’s New Media Initiatives blog):

This experiment is part of Google’s broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.

You’re probably already well indexed if you have a browsable interface that leads to every single one of your collection records and images and whatever; but if you’ve got any content that is hidden behind a search form (and I know we have some in older sites), this could give it much greater visibility.
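Since Google say the form crawling abides by robots.txt, it's worth checking what yours currently allows for search result pages. Here's a rough sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the example.org URLs and the `is_crawlable` helper are all made up for illustration; swap in your own site's file and paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a museum site: search results are blocked,
# everything else (including individual record pages) is allowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed URL shapes, purely for illustration.
search_url = "http://example.org/search?q=teapot"
record_url = "http://example.org/collections/record/123"

print("search results crawlable:", is_crawlable(ROBOTS_TXT, search_url))
print("record page crawlable:", is_crawlable(ROBOTS_TXT, record_url))
```

With rules like these, a well-behaved crawler would leave the search results alone, so records reachable only through the form would stay invisible unless there is a browsable route to them too.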

Secondly, Mike Ellis has done a sterling job synthesising some of the official, backchannel and informal conversations about the semantic web at MW2008 and adding his own perspective on his blog.

Talking about Flickr’s 20 gazillion tags:

To take an example: at the individual tag level, the flaws of misspellings and inaccuracies are annoying and troublesome, but at a meta level these inaccuracies are ironed out; flattened by sheer mass: a kind of bell-curve peak of correctness. At the same time, inferences can be drawn from the connections and proximity of tags. If the word “cat” appears consistently – in millions and millions of data items – next to the word “kitten” then the system can start to make some assumptions about the related meaning of those words. Out of the apparent chaos of the folksonomy – the lack of formal vocabulary, the anti-taxonomy – comes a higher-level order. Seb put it the other way round by talking about the “shanty towns” of museum data: “examine order and you see chaos”.

The total “value” of the data, in other words, really is way, way greater than the sum of the parts.

So far, so ace. We’ve been excited before about using the implicit links created between data, whether people make them consciously by recording information with tags or unconsciously through the paths they take between data, to create those ‘small ontologies, loosely joined’; and about the possibilities of multilingual tagging, etc. Tags are cool.
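The “cat” next to “kitten” idea in Mike’s quote can be played with in a few lines. This is only a toy sketch of tag co-occurrence counting, assuming tags arrive as plain lists of strings per item; the `cooccurrence` and `related` helpers and the sample photos are invented for illustration, and it says nothing about how Flickr actually does it.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tagged_items):
    """Count how often each pair of tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        # Normalise case and whitespace, and drop duplicates within an item.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        pairs.update(combinations(unique, 2))
    return pairs


def related(tag, pairs, top=3):
    """Return the tags seen most often alongside the given tag."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == tag:
            scores[b] += count
        elif b == tag:
            scores[a] += count
    return scores.most_common(top)


# Made-up sample data standing in for millions of tagged items.
photos = [
    ["cat", "kitten", "cute"],
    ["Cat", "kitten"],
    ["cat", "sofa"],
    ["dog", "puppy"],
]

pairs = cooccurrence(photos)
print(related("cat", pairs))
```

On four photos this proves nothing, of course; the point of the quote is that with enough items the one-off pairings fade into the noise and the consistent ones stand out.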

But the applications of this could go further:

I got thinking about how this can all be applied to the Semantic Web. It increasingly strikes me that the distributed nature of the machine processable, API-accessible web carries many similar hallmarks. Each of those distributed systems – the Yahoo! Content Analysis API, the Google postcode lookup, Open Calais – are essentially dumb systems. But hook them together; start to patch the entire thing into a distributed framework, and things take on an entirely different complexion.

Here’s what I’m starting to gnaw at: maybe it’s here. Maybe if it quacks like a duck, walks like a duck (as per the recent Becta report by Emma Tonkin at UKOLN) then it really is a duck. Maybe the machine-processable web that we see in mashups, API’s, RSS, microformats – the so-called “lightweight” stuff that I’m forever writing about – maybe that’s all we need. Like the widely accepted notion of scale and we-ness in the social and tagged web, perhaps these dumb synapses when put together are enough to give us the collective intelligence – the Semantic Web – that we have talked and written about for so long.

I’d say those capital letters in ‘Semantic Web’ might scare some of the hardcore SW crowd, but that’s ok, isn’t it? Semantics (sorry) aside, we’re all working towards the same goal – the machine-processable web.

And in the meantime, if we can put our data out there so others can tag it, and so that we’re exposing our internal ‘tags’ (even if they have fancier names in our collections management systems), we’re moving in the right direction.

(Now I’ve got Black’s “Wonderful Life” stuck in my head, doh. Luckily it’s the cover version without the cheesy synths.)

Right, now I’m off to the Museum in Docklands to talk about MultiMimsy database extractions and repositories. Rock.

I came across this really nice definition of ontologies while browsing the Digital Curation Centre site:

Ontologies provide rich semantics as well as the structured relationships needed to interpret data. As interoperability between information, metadata and standards becomes more important, it will become increasingly relevant that digital curators have a means of understanding the wide range of information associated with digital objects. Ontologies are created by a community of people who want to provide tools for describing and querying resources within a particular domain. This might include metadata schemas and classification systems, and are useful to specify concepts of information within a domain of interest. Interoperability between various ontologies will also become increasingly important in enabling members of disparate communities to re-use and understand digital information over time.