Tag Archives: taxonomy


More on mapping


When trying to integrate diverse vocabularies and repositories, the way to go is mapping – metadata crosswalks as they are known in the US. I’ve been looking for software that can handle mappings between taxonomies, and there is a range of products on the market, but what is really exciting is the development of automated mapping tools that take much of the “heavy lifting” out of the work (for example Synaptica’s AutoMatch).

It seems to me that there is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. A combination of auto-mapping and RDF/OWL/SKOS should enable us to cross-navigate repositories using our own terminologies. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all and save the arguments for the nuances of the mapping process? Within organisations this has immediate benefits, in removing a lot of the politicking that surrounds information and knowledge management. However, there is also huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.
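To make the idea of semi-automated mapping concrete, here is a minimal sketch in Python. It is not how AutoMatch or any other product works; it simply assumes that plain string similarity is a good enough first pass. The function name, the thresholds, and the status labels are all made up for illustration. The point is the division of labour: near-certain matches are accepted automatically, middling ones go to a human editor, and the rest are left unmapped.

```python
from difflib import SequenceMatcher


def suggest_mappings(source_terms, target_terms,
                     auto_threshold=0.9, review_threshold=0.6):
    """Propose one target term per source term, with a review status.

    Thresholds are illustrative assumptions, not recommended values.
    target_terms must not be empty.
    """
    suggestions = []
    for source in source_terms:
        # Score every target term by case-insensitive string similarity.
        scored = [
            (SequenceMatcher(None, source.lower(), target.lower()).ratio(), target)
            for target in target_terms
        ]
        score, best = max(scored)
        if score >= auto_threshold:
            status = "auto"
        elif score >= review_threshold:
            status = "needs review"  # a human editor decides
        else:
            status = "unmapped"
            best = None
        suggestions.append((source, best, status))
    return suggestions


for row in suggest_mappings(["Films", "Zoology"],
                            ["films", "Television", "Animals"]):
    print(row)
```

String similarity alone will never map “Zoology” to “Animals”, which is exactly why the editorial oversight mentioned above is still needed.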

Google is not perfect


Perhaps I am starting to suffer from “déformation professionnelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.

The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things – just like Google – they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.

Google is excellent when searching text for something specific and known: a pdf of a tube map of London; “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.

Of course most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, think up lots of alternative words if the first ones you tried didn’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source (if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, rather than the most up-to-date analysis to help you to decide whether or not to invest a large sum in a trading company). The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.

Disambiguation
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.; “budget” and “Spain” as in the Spanish economy, not budget holidays in Spain). This sort of search can produce thousands, if not millions of irrelevant results, so classification that can provide disambiguation – sorting Spanish holiday pages from Spanish economy pages – has real value in terms of saved time. This is why enterprise search solutions – where employees’ wasted time is an expense to the company – offer classification as a fundamental aspect of the service. This is why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, the planet, etc., depression in economics, meteorology, geography, psychiatry, etc., and is why Wikipedia’s disambiguation pages are so useful.
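A toy example of disambiguation by classification, using the “budget” and “Spain” case above. The documents, titles, and category labels are invented, and the matching is deliberately naive (whole words in the title only). It is a sketch of the principle, not of any real search engine.

```python
# Hypothetical documents, each carrying one category label.
documents = [
    {"title": "Spain sets out its national budget", "category": "economy"},
    {"title": "Budget breaks in Spain this summer", "category": "travel"},
    {"title": "Spain budget deficit narrows", "category": "economy"},
]


def search(docs, words, category=None):
    """Return titles containing every query word, optionally within one category."""
    hits = []
    for doc in docs:
        title_words = doc["title"].lower().split()
        if all(word.lower() in title_words for word in words):
            if category is None or doc["category"] == category:
                hits.append(doc["title"])
    return hits


print(search(documents, ["budget", "Spain"]))                      # free text only
print(search(documents, ["budget", "Spain"], category="economy"))  # disambiguated
```

The free text query matches all three titles; adding the category drops the holiday page without the searcher having to think up extra words.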

Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (what’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? are there any laws about importing pet parrots from France? what was that sad music I heard on the radio last night?).

It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search, making it more likely you will find what you need (short stories>science fiction immediately means you are not searching the whole of literature; a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).
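The narrowing effect of a hierarchy can be sketched in a few lines of Python. The catalogue entries below are invented, and writing each classification as a “>”-separated path is just a convenient assumption borrowed from the examples above.

```python
def under(path, branch):
    """True if a classification path sits at or below the given branch."""
    path_parts = path.split(">")
    branch_parts = branch.split(">")
    # Compare whole levels, so "EU>lawsuits" is not treated as under "EU>laws".
    return path_parts[:len(branch_parts)] == branch_parts


# Hypothetical catalogue: document title -> classification path.
catalogue = {
    "Pet travel rules": "EU>laws>animals>pets",
    "Livestock transport rules": "EU>laws>animals>livestock",
    "Crop subsidy rules": "EU>laws>agriculture>crops",
}


def narrow(items, branch):
    """Return the titles filed at or below a branch, in alphabetical order."""
    return sorted(title for title, path in items.items() if under(path, branch))


print(narrow(catalogue, "EU>laws>animals"))
print(narrow(catalogue, "EU>laws>animals>pets"))
```

Each step down the hierarchy shrinks the set to be searched before a single keyword has been typed.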

If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.

Aboutness
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.
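The wedding poem case can be shown with two made-up poems (the titles, text, and subject labels are placeholders, not real works). One is about a wedding but never says so; the other uses the word without being about one.

```python
# Hypothetical poems: free text plus subject labels assigned by an indexer.
poems = [
    {"title": "Poem A",
     "text": "two rivers join and run together to the sea",
     "subjects": {"weddings"}},
    {"title": "Poem B",
     "text": "the wedding cake fell off the table",
     "subjects": {"comic verse"}},
]


def text_search(items, word):
    """Titles whose text contains the word."""
    return [p["title"] for p in items if word in p["text"].split()]


def subject_search(items, subject):
    """Titles an indexer has classified under the subject."""
    return [p["title"] for p in items if subject in p["subjects"]]


print(text_search(poems, "wedding"))      # finds only the poem that uses the word
print(subject_search(poems, "weddings"))  # finds the poem that is about weddings
```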

Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)

Complex queries
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both their systems depend on highly crafted classifications (taxonomies and ontologies). Google Squared is Google’s own version.

Comprehensiveness
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its PageRank is mainly based on popularity, and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.

Sound and vision
Google is still a bit patchy at still image, video, and audio search. Technologies are improving all the time, but we still have to be patient. Most still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.

Summary
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.

The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia; they aren’t likely to disappear overnight.

The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. “How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?”) shows that even the master of the free text search recognises more can be done.


Classification at a Crossroads


I am sorry not to be able to make it to the UDC conference Classification at a Crossroads (via ISKO UK) in The Hague, as I am trying to get a better grasp of UDC at the moment. I would especially like to hear the paper on UDC and folksonomies. One of the issues I am thinking about is how to preserve the richness of a UDC-based classification while increasing usability. I am mulling over whether harvesting folksonomic tagging is a good way to get at the terminology that users prefer, and whether attaching additional thesaurus terms to a core classification is a good way of bridging the gap between the needs of long-term indexing and instant retrievability.

Stories, effectiveness, and efficiency


I’ve not been writing much lately, having finished my dissertation on September 1st and hours later having handed in my notice at work, to take up a new post as Taxonomy Manager for the BBC. I was delighted to be offered a role that follows on directly from my studies of taxonomy work, and I can’t wait to get started.

I have been very busy during September handing over to my successor, so inevitably I have been thinking about knowledge transfer. Handing over records management has been fairly straightforward for the most part, mainly because the nature of the business has enabled us to be reasonably efficient records managers, but I found it very hard to express my tacit knowledge well except through stories. This reminded me of a post by Ron Baker on effectiveness as opposed to efficiency.

Good records management is the “baseline efficiency” you need to keep functioning. It is hard to gain a competitive advantage simply by having decent records management, because without it you won’t even meet basic professional standards. Effectiveness, however, is a much more elusive beast – relying on slippery concepts like tacit knowledge, judgement calls based on experience and intuition, even artistry.

Storytelling in business has become popular because it is such a natural way of communicating expressively, as has the use of scenarios and personas in marketing and design. However, what surprised me was how formulaic my stories were – even though they applied to different areas of the business and different situations. The same characters (including myself) followed the same patterns of behaviour, through technology upgrades, changing customer needs, and other staff coming and going. I have been facing the same dilemmas and worrying about the same things over and over again, while at the time believing that things were changing and situations were different, probably because I focused on the differences not the similarities each time.

This reminded me that managing characters is just as important as managing situations (or technologies or products) and also how useful it would have been to have tried some storytelling earlier on. However, it takes time to see patterns, so you need storytellers to stick around long enough to be able to grasp what is a repeating dynamic and what is coincidence. The fast turnaround of knowledge managers is an obvious barrier to this. At the very least, it means the knowledge managers have to identify the people who have been around long enough to see the patterns in the stories, rather than expect to find it easy to pick up patterns themselves. In an organisation, there are many intertwined stories operating at different levels – from the stories of individual careers, single projects, to the overall corporate history. The conflicts and resolutions in these stories – how the tanking project was salvaged, the difficult client appeased, the divided team reunified – and between the levels of stories, seem to me to be where you will find the secrets of organisational effectiveness.

It is very easy to see taxonomies solely as mechanisms of efficiency – classifying documentation related to very linear processes such as stages in a project – but they also embody characters and stories, reflecting what is culturally important, for example. Taxonomies for knowledge discovery in particular are most effective when they are able to work with stories – if you are looking for paint does that suggest a story in which you also want paintbrushes, white spirit, an easel, etc.?

New browser tab concepts


I was very pleased to be sent this link: Mozilla design challenge showcases new browser tab concepts – Ars Technica. The winner is a lovely hierarchical visualisation that could work really well with concept maps/visual thesauruses/taxonomies. It preserves parent/child relationships using a radial format, which is more flexible than traditional trees, in that you can follow several pathways at once and maintain an overview.

Communities of Practice


I found Communities of Practice (CoP) by Etienne Wenger to be one of those strange books that lots of people told me I must read – and it is relevant to taxonomy work (although this post digresses) – but when I did read it, it all seemed so totally obvious I could hardly believe it had taken until the 1980s to be formulated. Barbara Rogoff and Jean Lave also pioneered the thinking, but I feel sure the ideas must date back at least to medieval trade guilds. It is one of the odd features of academia that sometimes the obvious has simply not been noticed and it is the recognition of the obvious that is revolutionary.

The core ideas are that we don’t just learn about doing something or even how to do something, we learn to be a person that does those things, and this shapes our identities. So, I can get my editorial assistants to read Judith Butcher on copy editing to teach them about editing, I can give them practical exercises so they learn how to copy edit, but it is only after they have been given real copy editing work, amongst other copy editors, that they experience how copy editors behave, and so learn how to be copy editors. Learning is therefore a continuous lifelong process.

In the UK there has traditionally been a divide between learning about (academic) and learning how (vocational), with learning to be happening outside the educational system, in workplaces (e.g. via apprenticeships). Wenger emphasises the need to encourage learning to be, and of course it is vital, but politically it worries me that too much responsibility for this is currently falling on academia and not enough on employers (I’m probably misrepresenting Wenger here). As an employer I think I ought to invest in training new staff (and in ongoing staff development), mainly because I can train staff to be exactly the way they need to be in the specific employment context. There is no practical way that a national education system could be so specific, unless it only caters to a handful of big corporations, which don’t need the help or the additional social power. On the other hand, I really don’t want to have to teach new staff lots of learning about – grammar and spelling, for example – that can be taught perfectly well in the classroom.

I think a civilised society should be willing to pay collectively for some essentially uncommercialised public spaces (e.g. universities) where people can just think in order to get better at thinking. A vocational element is great (I have personally enjoyed and benefited from the vocational aspects of my course) but part of my motivation for returning to university was to have time to explore questions and experiment with ideas without limiting myself to only those that I could show in advance would bring in some cash.

How does all this relate to taxonomy work? A taxonomy may be needed within a single community of practice, in which case recognising the user group as a CoP may help make sense of the project and the terminology required. Conversely, a taxonomy may need to be a boundary object between CoPs, perhaps even linking numerous CoPs together. By recognising and identifying different CoPs in an organisation, a taxonomist can get a picture of the different dialects and practices that exist and need to be taken into account.

A new taxonomist also needs to learn to be a taxonomist, and the taxonomy communities of practice (both specific and theoretical) already out there play a vital role in this process.

Taxonomy to be banned


The FT reports that the Local Government Association has banned use of the word “taxonomy” in public documents! “Other words recommended for omission from public documents include ‘benchmarking’, ‘place shaping’ and ‘taxonomy’.”

I know the General Public think it’s all about stuffed animals, but to classify taxonomy with “beaconicity” and “coterminous” just adds insult to injury!

An epistemological problem with folksonomies


I’m still mulling over Helen Longino’s criteria for objectivity in scientific enquiry (see previous post: Science as Social Knowledge) and it occurred to me that folksonomies are not really open and democratic, but are actually obscure and impenetrable. The “viewpoint” of any given folksonomy might be an averaged-out majority consensus, or the tags might have been aggregated in some other way, so you can’t tell whether it is skewed by a numerically small but prolifically tagging group. This is the point Judith Simon made in relation to ratings and review software systems at the ISKO conference, but it seems to me the problem for folksonomies is even worse, because of the echo chamber effect of people amplifying popular tags. Without some way of showing who is tagging what and why, the viewpoint expressed in the folksonomy is a mystery. This is not necessarily the case, but I think you’d need to collect huge amounts of data from every tagger, then database it along with the tags, then run all sorts of analyses and publish them in order to show the background assumptions driving the majority tags.
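As a small illustration of the kind of analysis that would be needed, the sketch below works out each tagger’s share of the applications of a single tag. It assumes the tagging system records who applied each tag, which many do not expose; the user names and tags are invented, and it says nothing about why people tagged as they did.

```python
from collections import Counter


def tagger_share(taggings, tag):
    """For one tag, return each tagger's share of all applications of that tag.

    taggings is a list of (user, tag) pairs, one per tagging act.
    """
    counts = Counter(user for user, applied in taggings if applied == tag)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {user: n / total for user, n in counts.items()}


# Hypothetical tagging log.
taggings = [
    ("ann", "jazz"), ("ann", "jazz"), ("ann", "jazz"),
    ("bob", "jazz"),
    ("cy", "blues"),
]

print(tagger_share(taggings, "jazz"))
```

If one user accounts for most applications of a “popular” tag, the apparent consensus is really one person’s viewpoint, which is the skew described above.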

If the folksonomic tags don’t help you find things, who could you complain to? How do you work out whether it doesn’t help you because you are a minority, or for some other reason? With a taxonomy, the structure is open – you may not like it but you can see what it is – and there will usually be someone “in charge” who you can challenge and criticise if you think your perspective has been overlooked. In many cases the process of construction will be known too. I don’t see an obvious way of challenging or criticising a folksonomy in this way, so presumably it fails Longino’s criteria for objectivity.

You can just stick your own tags into a folksonomy and use them yourself so there is some trace of your viewpoint in there, but if the rest of the folksonomy doesn’t help you search, that means you can only find things once you have tagged them yourself, which would presumably rule out large content repositories. So, you have to learn and live with the imposed system – just like with a taxonomy – but it’s never quite clear exactly what that system is.