KO

More on mapping

When trying to integrate diverse vocabularies and repositories, the way to go is mapping – metadata crosswalks as they are known in the US. I’ve been looking for software that can handle mappings between taxonomies, of which there are a range on the market, but what is really exciting is the development of automated mapping tools to take much of the “heavy lifting” out of the work (for example Synaptica’s AutoMatch).

It seems to me that there is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. A combination of auto-mapping and RDF/OWL/SKOS should enable us to cross-navigate repositories using our own terminologies. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all and save the arguments for the nuances of the mapping process? Within organisations this has immediate benefits, in removing a lot of the politicking that surrounds information and knowledge management. However, there is also huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.

Re-intermediating research

A fine example of how much inspiration you can get from randomly talking to the people who are actually engaging with customers was given to me by our Research Guide last week.

She wants a video-tagging tool that includes chat functionality, some kind of interactive “pointing” facility, and plenty of metadata fields for adding and describing tags. When she is helping a customer to find the perfect bit of footage, she often finds herself in quite detailed discussions trying to explain why she thinks a shot meets their needs or in trying to understand what it is they don’t like about a particular scene. If they could both view the same footage in real time linked by some sort of online meeting functionality, they would be able to show each other what they meant and discuss and explain requirements far more easily and precisely.

This struck me as exactly how we, as information professionals, should be seizing new technologies to “re-intermediate” ourselves into the search process. Discussing bits of video footage is a particularly rich example, but what if an expert information professional could have a look at your search results and give you guidance via a little instant chat window? You could call up a real person to help you when you needed it without leaving your desk, in just the same way that online tech support chats work (I’ve had mixed experiences with those, but the principle is sound). I’m thinking especially of corporate settings, but wouldn’t it be a fantastic service for public libraries to offer?

It seems such a good idea that I can’t believe it’s not already being done. I would be very pleased to hear from anyone out there who is offering those sorts of services, and in particular about any tools that support real-time remote discussion around audiovisual research.

‘We Like Lists Because We Don’t Want to Die’

I heard Umberto Eco lecture on the search for a perfect language about 20 years ago and still find myself referencing him (trying to create a taxonomy that suits everyone would seem to be a similar quest). The lectures were nothing to do with my course really, so I benefited from that serendipitous knowledge discovery that just happens when you have time and space to explore ideas. So I was pleased when a few weeks ago this interview with Eco in Der Spiegel happened upon me in the twittersphere (what’s the protocol for referencing tweets?). In the interview, Eco asserts that ‘We Like Lists Because We Don’t Want to Die’.

It’s arguable that we do most things because we don’t want to die, but I was struck by the depiction of how fundamental the urge to collect and classify is to culture. At the LIKE dinner in early December, Cerys Hearsy said “we like hierarchies. We understand how they work” and she was talking about modern records management. Jan Wyllie in Taxonomies: Frameworks for Corporate Knowledge points out that taxonomies have been used for millennia (something I also reference frequently). Perhaps we like dualities because our brain has two hemispheres and we dream of a taxonomy of everything because then we would have conquered infinity and death itself, but such ideas are way beyond what I can speculate sensibly about. What I can say is that lists and taxonomies have been useful for so long that anyone who bets they are going to vanish anytime soon is facing very long odds. We will create them differently as technology advances, and we will manage without them in many situations where they would be helpful (if New Scientist had a taxonomy, I might have found the article about duality and the brain), but when we really need to be sure, we will create them.

Google is not perfect

Perhaps I am starting to suffer from “déformation professionnelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.

The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things – just like Google – they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.

Google is excellent when searching text for something specific and known: a PDF of a tube map of London; “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.

Of course most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, think up lots of alternative words if the first ones you tried didn’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source (if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, rather than the most up-to-date analysis to help you to decide whether or not to invest a large sum in a trading company). The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.

Disambiguation
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.; “budget” and “Spain” as in the Spanish economy, not budget holidays in Spain). This sort of search can produce thousands, if not millions, of irrelevant results, so classification that can provide disambiguation – sorting Spanish holiday pages from Spanish economy pages – has real value in terms of saved time. This is why enterprise search solutions – where employees’ wasted time is an expense to the company – offer classification as a fundamental aspect of the service. It is also why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, the planet, etc., or between depression in economics, meteorology, geography, psychiatry, etc., and why Wikipedia’s disambiguation pages are so useful.

Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (what’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? are there any laws about importing pet parrots from France? what was that sad music I heard on the radio last night?).

It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search, making it more likely you will find what you need (short stories>science fiction immediately means you are not searching the whole of literature; a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).
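
To make the narrowing concrete, here is a minimal sketch of what searching within a classification heading amounts to. Everything in it is my own illustration: the `Document` record, the `under` helper and the three sample titles are made up, and a real system would hold the paths in an index rather than a Python list.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Document:
    """A hypothetical catalogue record: a title plus one classification path."""
    title: str
    path: Tuple[str, ...]  # broadest heading first, e.g. ("EU", "laws", ...)


def under(docs: List[Document], *prefix: str) -> List[Document]:
    """Return only the documents filed at or below the given heading."""
    return [d for d in docs if d.path[:len(prefix)] == prefix]


# Invented sample data, purely for illustration.
DOCS = [
    Document("Pet passport rules", ("EU", "laws", "animals", "pets")),
    Document("Livestock transport rules", ("EU", "laws", "animals", "livestock")),
    Document("Cereal subsidy rules", ("EU", "laws", "agriculture", "crops")),
]

pets = under(DOCS, "EU", "laws", "animals", "pets")
animals = under(DOCS, "EU", "laws", "animals")
```

Here `pets` holds the single pet-related record and `animals` holds two, so the searcher never sees the cereal subsidies at all. The point is that the filter works on where a document was filed, not on which words it happens to contain.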

If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.

Aboutness
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.
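
A toy comparison shows the gap between the two approaches. The poems, their text and their subject headings below are all invented for the example, and `free_text` is only a crude whole-word match rather than anything resembling a real search engine.

```python
# Invented records: each poem has its text and the headings an indexer gave it.
POEMS = [
    {
        "title": "Poem A",
        "text": "two rivers meet and run on as one",
        "subjects": {"weddings"},
    },
    {
        "title": "Poem B",
        "text": "the wedding dress hung unworn in the attic",
        "subjects": {"loss"},
    },
]


def free_text(poems, word):
    """Titles of poems whose text contains the word (crude whole-word match)."""
    return [p["title"] for p in poems if word.lower() in p["text"].lower().split()]


def by_subject(poems, subject):
    """Titles of poems that an indexer classified under the subject."""
    return [p["title"] for p in poems if subject in p["subjects"]]


text_hits = free_text(POEMS, "wedding")
subject_hits = by_subject(POEMS, "weddings")
```

The free text search returns only Poem B, which mentions a wedding but is hardly one to read at the ceremony, while the subject search returns only Poem A, which never uses the word.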

Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)

Complex queries
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both their systems depend on highly crafted classifications (taxonomies and ontologies). Google Squared is Google’s own version.

Comprehensiveness
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its PageRank is mainly based on popularity, and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.

Sound and vision
Google is still a bit patchy at still image, video, and audio search. Technologies are improving all the time, but we still have to be patient. Most still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.

Summary
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.

The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia; they aren’t likely to disappear overnight.

The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. “How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?”) shows that even the master of the free text search recognises more can be done.

Many to many

A wise taxonomist once said to me “taxonomies are technology agnostic” and I’ve been thinking about why systems are not taxonomy agnostic. If you underpin a taxonomy with a thesaurus, can you use that to map one taxonomy to another, without altering either taxonomy? You can keep both taxonomies as metadata attached to your asset and expose one or the other depending on user choice. It’s just an interface issue. The mapping would enable cross navigation, so you could wander down one taxonomy, skip to another, then pop back to the first one if you wanted.

You could attach folksonomies too if you wanted to, and just store those as extra metadata.
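
Here is a rough sketch of how I imagine that working, assuming a thesaurus that records, for each concept, the label each taxonomy uses. All the names in it (`THESAURUS`, `translate`, `labels_for`, the “curatorial” and “visitor” taxonomies, the sample asset) are hypothetical, and it sidesteps the hard part, which is agreeing the concept-level mappings in the first place.

```python
# Hypothetical thesaurus: concept id -> the label each taxonomy uses for it.
THESAURUS = {
    "c1": {"curatorial": "Motor vehicles", "visitor": "Cars"},
    "c2": {"curatorial": "Avifauna", "visitor": "Birds"},
}


def build_index(thesaurus):
    """Reverse lookup: (taxonomy name, label) -> concept id."""
    index = {}
    for concept, labels in thesaurus.items():
        for taxonomy, label in labels.items():
            index[(taxonomy, label)] = concept
    return index


INDEX = build_index(THESAURUS)


def translate(label, source, target):
    """Map a label from one taxonomy to another via the shared concept."""
    concept = INDEX.get((source, label))
    if concept is None:
        return None  # the label is not in the source taxonomy
    return THESAURUS[concept].get(target)  # None if the target has no term


# The asset stores concept ids, so neither taxonomy has to be altered.
# Folksonomy tags ride along as extra, uncontrolled metadata.
ASSET = {"id": "clip-42", "concepts": {"c2"}, "folksonomy": {"garden birds"}}


def labels_for(asset, taxonomy):
    """The labels to expose for an asset in the user's chosen taxonomy."""
    return sorted(
        THESAURUS[c][taxonomy]
        for c in asset["concepts"]
        if taxonomy in THESAURUS[c]
    )
```

With this in place, `translate("Cars", "visitor", "curatorial")` gives “Motor vehicles”, and `labels_for(ASSET, "visitor")` gives `["Birds"]`, so switching viewpoint really is just a choice made at the interface. Labels with no counterpart come back as `None`, which is where the human editorial work would come in.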

I can see that there might be terminology issues that need resolving (no small task), or perhaps software or storage issues, but I can’t see why the system itself couldn’t work in theory.

I’ve spent a lot of time thinking about mediating stakeholder needs to get the best taxonomy, and that is still a valid approach when you need management and control, but I don’t see any reason not to attach other taxonomies to your core taxonomy. Those satellite taxonomies can then serve minority interests or specialised needs. As long as you collect metadata about your taxonomies and make it clear to your user the provenance of the taxonomy or folksonomy they are viewing, you can offer a range of viewpoints.

Perhaps I am missing something obvious, but it seems there is still debate about getting the best taxonomy, or choosing to implement one instead of another. That debate seems to be based on the presumption that you can only have one taxonomy at a time, but why not have lots?

From Walled Garden to Amazon Jungle

I enjoyed the LIKE dinner the other Thursday. The speaker Tim Buckley-Owen spoke on the theme “From Walled Garden to Amazon Jungle” describing the changing environment that information professionals find themselves in. He spoke of how disintermediation is often perceived as a threat in the information world, but that this is a mistake, because out in the jungle, the services of an expert guide become indispensable if you are to avoid getting completely lost and falling prey to poisonous snakes and other hazards. He pointed out that at least one other profession is facing a similarly shifting environment – the legal profession. We, however, should be in a better position than lawyers because they believe they are masters of the universe, whereas we see ourselves as merely useful. The Trafigura affair showed that information can act as a force that even the lawyers can’t contain.

Although I would never have dreamt of comparing myself to a lawyer, I could see the similarity in the way that disintermediation enabled by an online world is affecting the two professions. For lawyers, disintermediation arises out of the increasing ease of self-representation – e.g. the availability of online forms so that you can manage your own simple legal processes. As Tim pointed out, going to small claims court can already be handled online by the claimant alone. Conveyancing is becoming increasingly straightforward for non-lawyers, as it is largely a question of being able to search effectively (anybody need an information specialist – cheaper than a solicitor?). Perhaps even the processing of divorces and wills can be administered via online forms. (That might not prevent family disputes, but would certainly make them cheaper!) The smart lawyers are, of course, responding by focusing on tailor-made specialised services for unusual cases or one-off situations. This is exactly what information professionals are doing too. Librarians have always offered bespoke research services, and the value they add over and above trawling through millions of results on Google is their knowledge of which sources are best placed to answer your specific question (and figuring out the question you really want the answer to, instead of the one you actually asked, which is much harder than it sounds). In a world where information is proliferating while the quality of sources is not necessarily improving, the knowledge of where to look is increasingly rather than decreasingly valuable.

Tim described some research indicating that the people who are least likely to delegate their research are the most senior executives (middle managers are too busy and like having people do things for them). In particular, top execs like to do their own competitor research. His hot tip for the information profession was to work with software developers to produce really effective competitor research services and tools.

Virginia Henry and David Holme have also blogged about the evening.

LIKE 9 is on December 3rd.

World Audio Visual Archives Heritage day

I went to an interesting event last Monday night for UNESCO World Audio Visual Archives Heritage day, held at BAFTA in London.

Professor John Ellis (Department of Media Arts, Royal Holloway, University of London) talked about the growing use of TV archives, particularly news footage, in academia. He pointed out that over time such material becomes increasingly valuable in areas as diverse as physiology (for example, in studying the effects of ageing by analysing footage of presenters and actors who have had long careers) and town planning, as footage can reveal the buildings that previously occupied a site being considered for redevelopment.

As UK law permits academic institutions to record and keep TV and radio broadcasts for purely educational purposes, a database of material has been collected. Academia currently remains a verbal rather than visual culture, but this seems to be changing. All politicians, for example, are now so TV literate that to study them without reference to their TV appearances would be strange.

Fiona Maxwell (Director of Operations at ITV Global Entertainment) then talked about the painstaking restoration of the 1948 film The Red Shoes. She provided lots of technical details about removing mould and correcting registration errors, but also showed “before and after” clips so we could see the huge improvements.

Classification at a Crossroads

I am sorry not to be able to make it to the UDC conference Classification at a Crossroads (via ISKO UK) in The Hague, as I am trying to get a better grasp of UDC at the moment. I would especially like to hear the paper on UDC and folksonomies. One of the issues I am thinking about is how to preserve the richness of a UDC-based classification while increasing usability. I am mulling over whether harvesting folksonomic tagging is a good way to get at the terminology that users prefer, and whether attaching additional thesaurus terms to a core classification is a good way of bridging the gap between the needs of long-term indexing and instant retrievability.

Still trying to please everyone

I was very flattered to be mentioned by Bob Bater in this KOnnect post: Trying to please everyone. I wanted to spend my research time on something that would be of practical interest to taxonomy professionals, while avoiding the danger of becoming too philosophical. As Bob has such extensive experience in taxonomy work, I am delighted that he found my project interesting.