search – Open Objects

It's a good week for search engine gossip

Dare Obasanjo quotes Nick Carr as a lead in to a post on Google's Assault on Wikipedia:

Clearly Nick Carr wasn't the only one that realized that Google was slowly turning into a Wikipedia redirector. Google wants to be the #1 source for information or at least be serving ads on the #1 sites on the Internet in specific area. Wikipedia was slowly eroding the company's effectivenes at achieving both goals. So it is unsurprising that Google has launched Knol and is trying to entice authors away from Wikipedia by offering them a chance to get paid.

What is surprising is that Google is tipping it's search results to favor Knol. Or at least that is the conclusion of several search engine optimization (SEO) experts and also jibes with my experiences.

After looking at some test cases he concludes:

Google is clearly favoring Knol content over content from older, more highly linked sites on the Web. I won't bother with the question of whether Google is doing this on purpose or whether this is some innocent mistake. The important question is "What are they going to do about it now that we've found out?"

It's early days for Knol so maybe the placement of Google search results will settle down over time.

Via other links I found confirmation that '[f]or years, Google's link: command (and see here) has deliberately failed to show all the links to a website.' Old news but I missed it at the time, but since I'd always wondered why the link: thing never seemed to work properly I thought it was worth mentioning.

One step closer to intelligent searching?

The BBC have a story on a new search engine, Search site aims to rival Google:

Called Cuil [pronounced 'cool'], from the Gaelic for knowledge and hazel, its founders claim it does a better and more comprehensive job of indexing information online.

The technology it uses to index the web can understand the context surrounding each page and the concepts driving search requests, say the founders.

But analysts believe the new search engine, like many others, will struggle to match and defeat Google.

…

Instead of just looking at the number and quality of links to and from a webpage as Google's technology does, Cuil attempts to understand more about the information on a page and the terms people use to search. Results are displayed in a magazine format rather than a list.

From the Cuil FAQ:

So Cuil searches the Web for pages with your keywords and then we analyze the rest of the text on those pages. This tells us that the same word has several different meanings in different contexts. Are you looking for jaguar the cat, the car or the operating system?

We sort out all those different contexts so that you don't have to waste time rephrasing your query when you get the wrong result.

Different ideas are separated into tabs; we add images and roll-over definitions for each page and then make suggestions as to how you might refine your search. We use columns so you can see more results on one page.

They also provide 'drill-downs' on the results page.

Cuil will direct you to this additional information. By looking at these suggestions, you may discover search data, concepts, or related areas of interest that you hadn’t expected. This is particularly useful when you are researching a subject you don't know much about and aren't sure how to compose the "right" query to find the information you need.

I haven't used it enough to work out exactly how it differentiates concepts (tabs) and 'additional information' (drill-downs/categories).

It does a good job on something like the Cutty Sark. Under 'Explore by Category' it offered:

Buildings And Structures In Greenwich
Sailboat Names
Museums In London
Neighbourhoods Of Greenwich
School Ships

It picked up search results for Cutty Sark whisky and news of the Cutty Sark fire but they weren't reflected in the categories, and the search term didn't trigger the tabs. The tabs kick in when you search for something like 'orange'.

It didn't do as well with 'samian ware' – the categories picked up all sorts of places and peoples, (and randomly 'American Films'), but while the search results all say that it's 'a kind of bright red Roman pottery' that's not reflected in the categories. Fair enough, there may not be enough information easily available online so that 'Types of Roman pottery' registers as a category.

Incidentally, most of the results listed for 'samian ware' are just recycled entries from Wikipedia. It's a shame the results aren't filtered to remove entries that have just duplicated Wikipedia text. The FAQ says they don't index duplicate content I guess the overall site or page is just different enough to be retained.

It might take a while for museum content to appear in the most useful ways, but it looks like it might be a useful search engine for niche content. From the FAQ again:

We've found that a lot of Web pages have been designed with a small audience in mind—perhaps they are blogs or academic papers with specific interests or pages with family photos. We think that even though these pages aren't necessarily for a wide audience, they contain content that one day you might need.

Our job is to index all these pages and examine their content for relevancy to your search. If they contain information you need, then they should be available to you.

It's all sounding a bit semantic web-ish (and quite a bit 'reacting to Google-ish') and I'll use it for a while to see how it compared to Google. The webmaster information doesn't give any indication of how you could mark up content so the relationships between terms in different contexts is clear, but I guess nice semantic markup would help.

Refreshingly, it doesn't retain search info – privacy is one of their big differentiators from Google.

Microupdates and you (a.ka. 'twits in the museum')

I was trying to describe Twitter-esque applications for a presentation today, and I wasn't really happy with 'microblogging' so I described them as 'micro-updates'. Partly because I think of them as a bit like Facebook status updates for geeks, and partly because they're a lot more actively social than blog posts.

In case you haven't come across them, Twitter, Pownce, Jaiku, tumblr, etc, are services that let you broadcast short (140 characters) messages via a website or mobile device. I find them useful for finding like-minded people (or just those who also fancy a drink) at specific events (thanks to Brian Kelly for convincing me to try it).

You can promote a 'hash tag' for use at your event – yes, it's a tag with a # in front of it, low tech is cool. Ideally your tag should be short and snappy yet distinct, because it has to be typed in manually (mistakes happen easily, especially from a mobile device) and it's using up precious characters. You can use tools like Summize, hashtags, Quotably or Twemes to see if anyone else has used the same tag recently.

You can also ask people to use your event tag on blog posts, photos and videos to help bring together all the content about your event and create an ad hoc community of participants. Be aware that especially with Twitter-type services you may get fairly direct criticism as well as praise – incredibly useful, but it can seem harsh out of context (e.g. in a report to your boss).

More generally, you can use the same services above to search twitter conversations to find posts about your institution, events, venues or exhibitions. You can add in a search term and subscribe to an RSS feed to be notified when that term is used. For example, I tried http://summize.com/search?q="museum+of+london" and discovered a great review of the last 'Lates' event that described it as 'like a mini festival'. You should also search for common variations or misspellings, though they may return more false positives. When someone tweets (posts) using your search phrase it'll show up in your RSS reader and you can then reply to the poster or use the feedback to improve your projects.

This can be a powerful way to interact with your audience because you can respond directly and immediately to questions, complaints or praise. Of course you should also set up google alerts for blog posts and other websites but micro-update services allow for an incredible immediacy and directness of response.

As an example, yesterday I tweeted (or twitted, if you prefer):

me: does anyone know how to stop firefox 3 resizing pages? it makes images look crappy

I did some searching [1] and found a solution, and posted again:

me: aha, it's browser.zoom.full or "View → Zoom → Zoom Text Only" on windows, my firefox is sorted now

Then, to my surprise, I got a message from someone involved with Firefox [2]:

firefox_answers: Command/Control+0 (zero, not oh) will restore the default size for a page that's been zoomed. Also View->Zoom->Reset

me: Impressed with @firefox_answers providing the answer I needed. I'd been looking in the options/preferences tabs for ages

firefox_answers: Also, for quick zooming in & out use control plus or control minus. in Firefox 3, the zoom sticks per site until you change it.

Not only have I learnt some useful tips through that exchange, I feel much more confident about using Firefox 3 now that I know authoritative help is so close to hand, and in a weird way I have established a relationshp with them.

Finally, twitter et al have a social function – tonight I met someone who was at the same event I was last week who vaguely recognised me because of the profile pictures attached to Twitter profiles on tweets about the event. Incidentally, he's written a good explanation of twitter, so I needn't have written this!

[1] Folksonomies to the rescue! I'd been searching for variations on 'firefox shrink text', 'firefox fit screen', 'firefox screen resize' but since the article that eventually solved my problem called it 'zoom', it took me ages to find it. If the page was tagged with other terms that people might use to describe 'my page jumps, everything resizes and looks a bit crappy' in their own words, I'd have found the solution sooner.

[2] Anyone can create a username and post away, though I assume Downing Street is the real thing.

Nice information design/visualisation pattern browser

infodesignpatterns.com is a Flash-based site that presents over 50 design patterns 'that describe the functional aspects of graphic components for the display, behaviour and user interaction of complex infographics'.

The development of a design pattern taxonomy for data visualisation and information design is a work in progress, but the site already has a useful pattern search, based on order principle, user goal, graphic class and number of dimensions.

'Building websites with findability in mind' at WSG London Findability

Last night I went to the WSG London Findability event at Westminster University, part of London Web Week; here's part two of my notes.

Stuart Colville's session on 'Building websites with findability in mind' was full of useful, practical advice.

Who needs to find your content?

Basic requirements:
Understand potential audience(s)
Content
Semantic markup
Accessibility (for people and user agents)
Search engine friendly

Content [largely about blogs]:
Make it compelling for your audience
There's less competition in niche subjects
Originality (synthesising content, or representing existing content in new ways is also good)
Stay on topic
Provide free, useful information or tools
Comments and discussion (from readers, and interactions with readers) are good

Tagging:
Author or user-generated, or both
Good for searching
Replaces fixed categories
Enables arbitrary associations
Rich

Markup (how to make content more findable):
Use web standards. They're not a magic fix but they'll put you way ahead. The content:code ratio is improved, and errors are reduced.
Use semantic markup. Adds meaning to content.
Try the roundabout SEO test
Make your sites accessible. Accessible content is indexable content.

Markup meta:
Keywords versus descriptions. Tailor descriptions for each page; they can be automatically generated; they can be used as summaries in search results.
WordPress has good plugins – metadescription for auto-generated metadata, there are others for manual metadata.

Markup titles and headings:
Make them good – they'll appear as bookmark etc titles.
One H1 per page; the most meaningful title for that page
Separate look from heading structure.

Markup text:
Use semantically correct elements to describe content. Strong, em, etc.

Markup imagery:
Background images are fine if they're only a design element.
Use image replacement if the images have meaning. There are some accessibility issues.
Use attributes correctly, and make sure they're relevant.

Markup microformats:
Microformats are a simple layer of structure around content
They're easy to add to your site
Yahoo! search and technorati index them, Firefox 3 will increase exposure.

Markup Javascript:
Start unobtrusive and enhance according to the capabilities of the user agent.
Don't be stupid. Use onClick, don't kill href (e.g. href="#").
Use event delegation – no inline events. It's search engine accessible, has nice clean markup and you still get all the functionality.
[Don't break links! I like opening a bunch of search results in new tabs, and I can't do that on your online catalogue, I'll shop somewhere I can. Rant over.]

Performance and indexation:
Use 'last modified' headers – concentrate search agents on fresh content
Sites with Google Ads are indexed more often.

URLs:
Hackable URLs are good [yay!].
Avoid query strings, they won't be indexed
Put keywords in your URL path
Use mod_rewrite, etc.

URI permanence:
"They should be forever". But you need to think about them so they can be forever even if you change your mind about implementation or content structure.
Use rewrites if you do change them.

De-indexing (if you've moved content)
Put up a 404 page with proper http headers. 410 'intentionally gone' is nice.
There's a tool on Google to quickly de-index content.
Make 404s useful to users – e.g. run an internal search and display likely results from your site based on their search engine keywords [or previous page title].

Robots.txt – really good but use carefully. Robots-Nocontent – Yahoo! introduced 'don't index' for e.g. divs but it hasn't caught on.

Moving content:
Use 301. Redirect users and get content re-indexed by search engines.

Tools for analysing your findability:
Google webmaster tools, Google analytics, log files. It's worth doing, check for broken links etc.

Summary:
Think about findability before you write a line of code.
Start with good content, then semantic markup and accessibility.
Use sensible headings, titles, links.

WSG London Findability 'introduction to findability'

Last night I went to the WSG London Findability event at Westminster University. The event was part of London Web Week. As always, apologies for any errors; corrections and comments are welcome.

First up was Cyril Doussin with an 'introduction to findability'.

A lot of it is based on research by Peter Morville, particularly Ambient Findability.

So what do people search for?
Knowledge – about oneself; about concepts/meaning; detailed info (product details, specs); entities in society (people, organisations, etc.)
Opinions – to validate a feeling or judgement; establish trust relationships; find complementary judgements.

What is information? From simple to complex – data -> information -> knowledge.

Findability is 'the quality of being locatable or navigatable'.
Item level – to what degree is a particular object easy to discover or locate?
System level – how well does the environment support navigation and retrieval?

Wayfinding requires: knowing where you are; knowing your destination; following the best route; being able to recognise your destination; being able to find your way back.

The next section was about how to make something findable:
The "in your face" discovery principle – expose the item in places known to be frequented by the target audience. He showed an example of a classic irritating Australian TV ad, a Brisbane carpet store in this case. It's disruptive and annoying, but everyone knows it exists. [Sadly, it made me a little bit homesick for Franco Cozzo. 'Megalo megalo megalo' is also a perfect example of targeting a niche audience, in this case the Greek and Italian speakers of Melbourne.]

Hand-guided navigation – sorting/ordering (e.g. sections of a restaurant menu); sign-posting.

Describe and browse (e.g. search engines) – similar to asking for directions or asking random questions; get a list of entry points to pages.

Mixing things up – the Google 'search within a search' and Yahoo!'s 'search assist' box both help users refine searches.

Recommendations (communication between peers) – the searcher describes intent; casual discussions; advice; past experiences.
The web is a referral system. Links are entry doors to your site. There's a need for a relevancy system whether search engines (PageRank) or peer-based systems (Digg).

Measuring relevance (effectiveness):
Precision – if it retrieves only relevant documents
Recall – whether it retrieves all relevant documents.

Good tests for the effectiveness of your relevance mechanism:
Precision = number of relevant and retrieved documents divided by the total number retrieved.
Recall = number of relevant and retrieved documents divided by the total number of relevant documents.

Relevance – need to identify the type of search:
Sample search – small number of documents are sufficient (e.g. first page of Google results)
Existence search – search for a specific document
Exhaustive search – full set of relevant data is needed.
Sample and existence searches require precision; exhaustive searches require recall.

Content organisation:
Taxonomy – organisation through labelling [but it seems in this context there's no hierarchy, the taxon are flat tags].
Ontology – taxonomy and inference rules.
Folksonomy – a social dimension.

[In the discussion he mentioned eRDF (embedded RDF) and microformats. Those magic words – subject : predicate : object.]

Content organisation is increasingly important because of the increasing volume of information and sharing of information. It's also a very good base for search engines.

Measuring findability on the web: count the number of steps to get there. There are many ways to get to data – search engines, peer-based lists and directories.

Recommendations:
Aim to strike a balance between sources e.g. search engine optimisation and peer-based.
Know the path(s) your audience(s) will follow (user testing)
Understand the types of search
Make advertising relevant (difficult, as it's so context-dependent)
Make content rich and relevant
Make your content structured

I've run out of lunch break now, but will write up the talks by Stuart Colville and Steve Marshall later.

Another model for connecting repositories

Dr Klaus Werner has been working with Intelligent Cultural Resources Information Management (ICRIM) on connecting repositories or information silos from "different cultural heritage organizations – museums, superintendencies, environmental and architectural heritage organizations" to make "information resources accessible, searchable, re-usable and interchangeable via the internet".

You can read more on these CAA07 conference slides: ICRIM: Interconnectivity of information resources across a network of federated repositories (pdf download), and the abstract from the CAA07 paper might also provide some useful context:

The HyperRecord system, used by the Capitoline Museums (Rome) and the Bibliotheca Hertziana (Max-Planck Institute, Rome) and developed as Culture2000 project, is a framework for the inter-connectivity of information resources from museums, archives and cultural institutes.
…
The repositories offer both the usual human interface for research (fulltext, title, etc.) and a smart REST API with a powerful behind-the-scenes direct machine-to-machine facility for querying and retrieving data.
…
The different information resources use digital object identifiers in the form of URNs (up to now, mostly for museum objects) for identification and direct-access. These allow easy aggregation of contents (data, records, documents) not only inside a repository but also across boundaries using the REST API for serving XML over a plain HTTP connection, in fact creating a loosely coupled network of repositories.

Thanks to Leif Isaksen for putting Dr Werner in contact with me after he saw his paper at CAA07.

Friday fun(ner) bits

Cornelius Holtorf asks, "is Indiana Jones good or bad for archaeology?" in Hero! Real archaeology and "Indiana Jones and the Kingdom of the Crystall Skull".

And on YouTube, the SEO Rapper tells you how it's done.