Tag Archives: search

UK Archives Discovery Forum


I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although his talk was not as detailed as the one he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay, who invented the screw-cutting lathe, which enabled standardisation of nuts and bolts. This basically enabled the Industrial Revolution to happen. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together but at the moment, all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly, Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easier to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! He is always an engaging speaker, and his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound for the way we do business than the changes brought about by the web itself. When I chatted to him over tea, he said he has the impression that a year ago people were curious about Linked Data and just wanted to find out what it could do, but this year they are feeling a bit more comfortable with the concepts and are starting to ask about how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped, but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories Project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and that they still take delivery of data on drives through the post. They found this was much easier to manage than FTP or other electronic transfer, just because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySQL as processing formats. They found that MySQL couldn’t handle the volumes, and they found RDF too “verbose”. They chose to use a fully Lucene solution, which enabled them to bolt in new indexes, rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.
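The idea of bolting in a new index without rebuilding the existing ones can be sketched in a few lines. This is only an illustration in plain Python, not Lucene and not the project's actual code: the records, the tiny gazetteer, and the function names (extract_years, extract_places, build_index) are all hypothetical.

```python
from collections import defaultdict
import re

# Hypothetical records; not taken from the Connected Histories data.
RECORDS = {
    "r1": "John Smith was tried at the Old Bailey, London, in 1785.",
    "r2": "Mary Jones left Bristol for London in 1801.",
}

def extract_years(text):
    """Pick out four-digit years between 1000 and 2099."""
    return re.findall(r"\b(1\d{3}|20\d{2})\b", text)

def extract_places(text, gazetteer=("London", "Bristol")):
    """Naive place spotting against a tiny stand-in gazetteer."""
    return [place for place in gazetteer if place in text]

def build_index(records, extractor):
    """Map each extracted value to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for value in extractor(text):
            index[value].add(record_id)
    return index

# Each index is built independently, so adding one leaves the others untouched.
indexes = {"year": build_index(RECORDS, extract_years)}
indexes["place"] = build_index(RECORDS, extract_places)  # bolted in later

print(sorted(indexes["place"]["London"]))  # ['r1', 'r2']
print(sorted(indexes["year"]["1801"]))     # ['r2']
```

Even this toy version hints at the quality control problem: a place name that is also a surname, or a four-digit number that is not a year, would be indexed wrongly.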

Historypin

Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos, and soon also audio and video, and set them in Google Street View. Although Flickr has some similar functions, Historypin has volunteers who help to place the image in exactly the right place, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made the distinction between objective and subjective metadata: by his definition, objective metadata is metadata that can be corrected and subjective metadata is metadata that can’t. So, he sees objective metadata as the time and the place that a photo was taken – if it is wrong someone might know better and be able to correct it – and subjective metadata as the stories, comments, and opinions that people have about the content, which others cannot correct – if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more use for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, etc. and he said these are important issues they are striving to work through.
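One way to picture the distinction is as a record where the correctable fields keep a trail of who changed what, while stories can only ever be added. This is a minimal sketch of that idea, not Historypin's data model; the class PhotoRecord and its field names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PhotoRecord:
    # "Objective" metadata: facts that someone else may correct.
    place: str
    year: int
    # "Subjective" metadata: stories are only ever added, never overwritten.
    stories: list = field(default_factory=list)
    # Provenance trail of corrections: (field name, old value, new value, who).
    corrections: list = field(default_factory=list)

    def correct(self, field_name, new_value, corrected_by):
        """Change an objective field and keep the old value for provenance."""
        if field_name not in ("place", "year"):
            raise ValueError("only objective fields can be corrected")
        old_value = getattr(self, field_name)
        setattr(self, field_name, new_value)
        self.corrections.append((field_name, old_value, new_value, corrected_by))

    def add_story(self, author, text):
        """Append a memory or comment; existing stories are left untouched."""
        self.stories.append((author, text))

photo = PhotoRecord(place="Oxford Street, London", year=1950)
photo.add_story("alice", "My grandmother worked in the shop on the left.")
photo.correct("year", 1953, corrected_by="bob")
print(photo.year, photo.corrections)
```

A “taken from” location, as opposed to the place the photo is “of”, would simply be another objective field in a structure like this.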

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession. She pointed out that it is an achievement that archives are still running despite the cuts, and that this shows how valued data and archives are in the national infrastructure and how important recording our history is. While archivists continue to value their collections, enjoy their visitors and users, and want their data to reach a wider audience, the sector will continue to progress. She described how federating the Genesis project within the Archives Hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding is typically allocated on visitor numbers through the doors, and this puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.

Serendipity and large video collections


I enjoyed this blog post: On Serendipity. Ironically, it was recommended to me, and I am now recommending it!

Serendipity is rarely of use to the asset manager, who wants to find exactly what they expect to find, but is a delight for the consumer or leisure searcher. People sometimes cite serendipity as being a reason to abandon classification, but in my experience classification often enhances serendipity, which can be lost in simple online search systems.

For example, when browsing an alphabetically ordered collection in print, such as an encyclopedia or dictionary, you just can’t help noticing the entries that sit next to the one you were looking for. This can lead you to all sorts of interesting connections – for example, looking up crescendo, I couldn’t help noticing that crepuscular means relating to twilight, and that there is a connection between crepe paper and the crepes you can eat (from the French for “wrinkly”), but crepinette has a different derivation (from the French for “caul”). What was really interesting was the fact that there was no connection, other than an accident of alphabetical order. I wasn’t interested in things crepuscular, or crepes and crepinettes, and I can’t imagine anyone deliberately modelling connections between all these things as “related concepts”.

Wikipedia’s “random article” function is an attempt to generate serendipity algorithmically. On other sites the “what people are reading/borrowing/watching now” functions use chronological order to throw out unsought items from a collection in the hope that they will be interesting. Twitter’s “trending topics” use a combination of chronological order and statistics on the assumption that what is popular just now is intrinsically interesting. These techniques look for “interestingness” in what can be calculated and it is easy to see how they work, but the semantic web enthusiasts aim to open up to automated processing the kind of free associative links that human brains are so good at generating.
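To show how little is going on in the “chronological order plus statistics” approach, here is a rough sketch that counts topic mentions inside a recent time window. It is not how Twitter actually computes trends; the function name trending, the one-hour window, and the sample events are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def trending(events, now, window=timedelta(hours=1), top_n=3):
    """Rank topics by how often they occur within the most recent window."""
    cutoff = now - window
    recent = Counter(topic for topic, seen_at in events if seen_at >= cutoff)
    return [topic for topic, _ in recent.most_common(top_n)]

now = datetime(2011, 3, 1, 12, 0)  # arbitrary reference time
events = [
    ("archives", now - timedelta(minutes=5)),
    ("archives", now - timedelta(minutes=20)),
    ("linked data", now - timedelta(minutes=30)),
    ("taxonomies", now - timedelta(days=2)),  # popular once, but too old to count
]
print(trending(events, now))  # ['archives', 'linked data']
```

Nothing here knows what any topic means, which is exactly the gap between calculated “interestingness” and free association.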

For digital asset management, search is not enough


I was very flattered to be asked by Kate Simpson to write another article – For digital asset management, search is not enough – for the excellent resource FUMSI.

The article sums up some of the latest DAM trends and technologies with the intention of demystifying some of the services that software vendors are offering. It owes much to Theresa Regli’s excellent work as a “professional cynic”.

Google is not perfect


Perhaps I am starting to suffer from “déformation professionnelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.

The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things – just like Google – they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.

Google is excellent when searching text for something specific and known: a PDF of a tube map of London; “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.

Of course most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, think up lots of alternative words if the first ones you tried didn’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source (if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, rather than the most up-to-date analysis to help you to decide whether or not to invest a large sum in a trading company). The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.

Disambiguation
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.; “budget” and “Spain” as in the Spanish economy, not budget holidays in Spain). This sort of search can produce thousands, if not millions of irrelevant results, so classification that can provide disambiguation – sorting Spanish holiday pages from Spanish economy pages – has real value in terms of saved time. This is why enterprise search solutions – where employees’ wasted time is an expense to the company – offer classification as a fundamental aspect of the service. This is why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, the planet, etc., depression in economics, meteorology, geography, psychiatry, etc., and is why Wikipedia’s disambiguation pages are so useful.
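As a toy illustration of disambiguation, the same word search can be narrowed by a category attached to each document. The documents, category labels, and the function name search are invented for this sketch and do not describe any particular enterprise search product.

```python
# Hypothetical documents, each carrying one classification label.
DOCUMENTS = [
    {"title": "Mercury poisoning in hat makers", "category": "chemistry"},
    {"title": "Mercury, messenger of the gods", "category": "mythology"},
    {"title": "The orbit of Mercury", "category": "astronomy"},
]

def search(documents, word, category=None):
    """Free-text match on the title, optionally restricted to one category."""
    hits = [d for d in documents if word.lower() in d["title"].lower()]
    if category is not None:
        hits = [d for d in hits if d["category"] == category]
    return [d["title"] for d in hits]

print(len(search(DOCUMENTS, "mercury")))          # 3: every sense of the word
print(search(DOCUMENTS, "mercury", "astronomy"))  # ['The orbit of Mercury']
```

With three documents the saving is trivial; with millions, the category filter is what stops the searcher wading through the wrong Mercury.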

Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (what’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? are there any laws about importing pet parrots from France? what was that sad music I heard on the radio last night?).

It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search, making it more likely that you will find what you need (short stories>science fiction immediately means you are not searching the whole of literature; a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).
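The narrowing effect of a heading such as short stories>science fiction can be sketched by treating each classification as a path and keeping only the items whose path starts with the one you have chosen. The catalogue entries and the function name under are hypothetical.

```python
# Hypothetical catalogue: title -> classification path, broadest class first.
CATALOGUE = {
    "Story about a cage": ("short stories", "science fiction"),
    "Story about a picnic": ("short stories", "romance"),
    "Novel about a spaceship": ("novels", "science fiction"),
}

def under(catalogue, *path):
    """Return the titles whose classification path starts with the given path."""
    return sorted(
        title
        for title, classes in catalogue.items()
        if classes[:len(path)] == path
    )

print(under(CATALOGUE, "short stories"))
# ['Story about a cage', 'Story about a picnic']
print(under(CATALOGUE, "short stories", "science fiction"))
# ['Story about a cage']
```

Each extra step down the path shrinks the set you have to browse, without you needing to guess any of the words used in the items themselves.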

If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.

Aboutness
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.
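A thesaurus function can be pictured as expanding the searcher's word into a set of associated terms before matching. This is a deliberately crude sketch: the thesaurus entries, the two invented poem lines, and the function names expand and search_poems are assumptions, and real systems do far more than split text on spaces.

```python
# Hypothetical thesaurus: a term mapped to associated terms.
THESAURUS = {"wedding": {"marriage", "bride", "vows"}}

# Invented lines standing in for poem texts.
POEMS = {
    "Poem A": "we spoke our vows beneath the summer sky",
    "Poem B": "the tide goes out along the empty shore",
}

def expand(term, thesaurus):
    """Return the term together with any associated terms."""
    return {term} | thesaurus.get(term, set())

def search_poems(poems, term, thesaurus):
    """Match poems containing the term or any of its associated terms."""
    terms = expand(term, thesaurus)
    return sorted(
        title for title, text in poems.items()
        if terms & set(text.lower().split())
    )

print(search_poems(POEMS, "wedding", {}))         # [] with plain free text
print(search_poems(POEMS, "wedding", THESAURUS))  # ['Poem A']
```

A poem that mentions none of the associated words would still be missed, which is why a human-assigned class such as poems>weddings does better than any word list.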

Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)

Complex queries
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both their systems depend on highly crafted classifications (taxonomies and ontologies). Google Squared is Google’s own version.

Comprehensiveness
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its PageRank is mainly based on popularity, and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.

Sound and vision
Google still is a bit patchy in still image, video, and audio search. Technologies are improving all the time, but we still have to be patient. Most still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.

Summary
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.

The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia; they aren’t likely to disappear overnight.

The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?) shows that even the master of the free text search recognises more can be done.

ISKO UK | Google Ups its Stakes


ISKO UK’s KOnnect blog notes that at least Google is taking metadata seriously.

While I was chatting about Wolfram Alpha the other day, it was pointed out to me that specialist knowledge for a general audience is actually a very niche area, and this is the source of the hype. You need to persuade your VC funders you are revolutionary, when actually you have a very tricky business model. Serious researchers will be using specialised systems already and most people want to look up things like train times rather than atomic weights of elements, so your market is people like students and journalists, who have an intermediate level of interest. Perhaps there are enough of them in the world to generate plenty of advertising revenue, but it seems like a tough call.

I hope the funders are happy with the old reference publishing model – lots of investment up front, in the hope not that the finished product will generate huge initial profits, but that it will have a long, steady life. Wolfram Alpha employed 150 people in essentially traditional content creation roles and it will be interesting to see how they get their money back. Google doesn’t have to pay for its own content or metadata creation!

BBC NEWS | Wolfram Alpha ‘as important as Google’


BBC NEWS | Technology | Web tool ‘as important as Google’. Here’s a new search tool that will – apparently – be “like interacting with an expert, it will understand what you’re talking about, do the computation, and then present you with the results”. Dr Wolfram says: “Wolfram Alpha is like plugging into a vast electronic brain…It computes answers – it doesn’t merely look them up in a big database.”

It is clearly a very sophisticated search engine – I imagine it has a bit of natural language processing with some “mashup” algorithms – and all such developments are very exciting. I am sure it will be very, very useful in relevant contexts and will have lots of very productive applications. It just seems to me to be ironic that the experts who devote themselves to promoting knowledge and understanding are so bad at picking words to describe in a sensible way what they have achieved. Is it marketing departments gone mad? Are they all misquoted by mischievous journalists? I hope if I spoke about this to Dr Wolfram he would understand what I’m talking about…

UPDATE: There’s a New Scientist preview of Wolfram Alpha, which explains a bit more about how it works. As far as I can work out, they have built a big database and are promoting its “authoritativeness” – so back to the “quality information has been mediated by experts” model.

ANOTHER UPDATE: First impressions from BBC technology: Wolfram Alpha first impressions.

Karen Blakeman’s Blog » Blog Archive » Wolfram Alpha is out – hmmm…

More on semantics


Here’s a straightforward mini-review of the state of semantic search from the Truevert search engine, dividing semantic search techniques into four groups, with no mention of digital essences or other mysticism.

Now keyword search is dead…


I can’t help thinking the information world has become very morbid. There was Green Chameleon’s Dead KM Walking debate, CMS Watch’s Taxonomies are dead punt, and now keyword search is dead, according to the Enterprise Search Center (via Taxonomy Watch).

Stephen Arnold says “Established system vendors and newcomers promise silver bullets that will kill the werewolves plaguing enterprise search. Taxonomies resonate in some vendors’ marketing spiels. Others focus on natural language processing… ” This makes taxonomies sound like they are some newfangled techie trick, rather than the traditional sorting out we’re all used to. He then states that users expect “a search system to … Offer a web page that gives users specific suggestions and options with hotlinks to topics, categories, and key subjects … provide the user with point-and-click options … Allow the user to drill down or jump across topics.” Are those not taxonomies for navigation?

Truevert: What is semantic about semantic search?


Truevert: What is semantic about semantic search? is an easy introduction to the thinking behind the Truevert semantic search engine. I was heartened by the references to Wittgenstein and the attention Truevert have paid to the work of linguists and philosophers. So much commercial search seems to have been driven by computer scientists with little interest in philosophy, or, if they had any, they kept quiet about it (any counter-examples out there?)! Perhaps philosophers have not been so good at promoting themselves either. Perhaps the Chomskyan attempt to divide linguistics itself into “hard scientific” linguistics and “fuzzy” linguistic disciplines like sociolinguistics has not helped.

As a believer in interdisciplinary and collaborative approaches, I have always wondered why we seemed to be so bad at building these bridges, and information science has always struck me as a natural crossing point. Of course, there has been a lot of collaboration, but my impression is that academia has been rather better at this than the commercial world, with organisations like ISKO UK working hard to forge links. Herbert Roitblat at Truevert is obviously proud of their philosophical and linguistic awareness, and more interestingly, thinks it is worth broadcasting in a promotional blog post.