Language, thought, categorisation, and talking to yourself

26th September, 2010 Fran

The Voice of Reason (or What’s in a name? online) is a fascinating article by David Robson in New Scientist on one of my favourite topics – how language affects the way we think. The “linguistic relativity” theory of Edward Sapir and Benjamin Whorf – one of my favourite hypotheses – is blamed for the “fall from grace” of the idea that language shapes thought. The work of Eleanor Rosch – one of my favourite psychologists – on categorisation appeared to contradict the Sapir-Whorf hypothesis, by showing that categorisation rests more on physiological characteristics of humans – how we see, what size things are, whether or not something is edible – than on the names we have for things.

Noam Chomsky’s quest for a universal grammar made the notion that language and thought were essentially common to all humanity more popular than linguistic relativity. However, psychologists have started to note that having names for categories helps infants put things into those categories. Children’s spatial reasoning also seems to be improved when you remind them of spatial vocabulary (Dedre Gentner, Northwestern University, Evanston, Illinois: Cognitive Psychology, vol 50, p 315). People instinctively teach children by reminding them of what category words like “top”, “middle” and “bottom” mean. An experiment with “aliens” indicated that when people were given names for types of aliens they categorised them more quickly and accurately than when they weren’t given the names (Gary Lupyan, University of Wisconsin, Madison: Psychological Science, vol 18, p 1077).

Although the strong version of the Sapir-Whorf hypothesis – that language dictates and constrains thought – appears unlikely to be true, on the grounds that if it were you could never have a new idea or create a new category, the “weak” version – that having words available will encourage people to think in those terms – seems very plausible. An experiment has now indicated that Russian speakers – who have two different words for shades of blue – are faster at sorting out those shades than English speakers (Lera Boroditsky, Stanford University, California: Proceedings of the National Academy of Sciences, vol 104, p 7780).

Labelling objects helps the memory take “shortcuts” so that minor details do not have to be remembered (Lupyan Journal of Experimental Psychology: General vol 137, p 348). Political activists in many areas have argued that language use encourages stereotyping – hence the attempts to break down stereotypes by changing names for groups. However, when applied to something like sets of documents, not bothering to see them all as individuals can be a useful shortcut. If you want to build a user-friendly taxonomy, using the categories people know and like will make your system quicker and easier to use. Of course they could learn other ways of categorising – they could break the stereotypes – if they spent a bit of time and effort thinking it all through – but in many contexts the job of the taxonomist is to give people what they want quickly and efficiently, not to enter into debates about whether or not they conceptualise things in the most politically appropriate way.

Language has also been shown to affect perception. If you use upwards-moving words (climb, rise, etc.) while showing people patterns of randomly moving dots, they are more likely to correctly detect the predominant direction of movement if the words match the direction (Psychological Science, vol 18 p 1007). Conversely, showing people upwardly-moving dots while saying “fall” confused them. The words seem to “prime” the visual system of the brain.

Another effect is that it is easier to see something if you say the name – so it really does help when you are looking for something to mutter the name of the object to help you find it. According to Andy Clark, a philosopher at the University of Edinburgh, language was the original form of “augmented reality” – “an overlay that changes how we think, reason and see”.

Linked Data one-day conference

18th September, 2010 Fran

I thoroughly enjoyed the Linked Data one-day conference organised by ISKO UK last week. You can find my summary of it on the ISKO UK blog.

Librarians rock! Metadata will take over the world

9th September, 2010 Fran

The last presentation at the DAM conference back in June was by a very interesting DAM specialist, Mark Davey of Cliffe Associates and Digital Asset Management.org.uk, who spoke engagingly about the increasing pervasiveness of metadata and how it is opening up a whole new world of connections and possibilities.

Internet of Things

He talked about the emerging Internet of Things and how this will – essentially – be enabled by metadata. The more sophisticated our metadata management, the more use we will be able to make of links and connections. The semantic web is a bold attempt to link information resources, but if the semantic web and the Internet of Things can be linked, the web leaps out into the real world in unprecedented ways and this, according to Davey, is why librarians rock! It is the people who understand how metadata works who will be forging the links that will create an integrated Internet of Services. A smart fridge could send you text messages telling you what you need to buy when you are in the shops (although I worry about mine scolding me for eating too many cakes!), but there are many more business-focussed applications. For example, RFID tags are being used by Hollywood prop and costume hire companies to help them keep track of and retrieve stock. The location data is useful in that case to physically find missing items, but knowing where customers take their purchases may have all sorts of interesting implications for marketing. A car rental company could plot on a map where people have driven their cars. This would be a fun image in a brochure, but could support business decisions such as where to open branches or whether to provide more small cars for city trips or bigger cars for long distances.

The current Internet is not fit for purpose, because nobody agreed any standards, but if we can start developing standards, we can set digital assets and metadata free so that they can interact with each other in a machine-brokered way. This could be incredibly powerful – everything could end up everywhere! Ways to monetise this new world could involve micro payments – you are watching a video, you see a product you like, you click on it and behind the scenes your credit card or online account is charged and the product is shipped to your door. Meanwhile, the sales team know that you bought that product because it was in that particular scene in the film, they know where you live, and they know what else you have been buying.

Hive minds

This all seems rather scary. On the one hand, people are already referring to the distributed cognition aspects of social networking sites as the hive mind – we are, perhaps, creating some kind of Borg-like merging of identities into one big digital stew. Individuals could find themselves subsumed into a digitally imposed conformity to some overriding norm due to the panoptic, big brotherish combination of AI, RFID and CCTV, especially when you factor in neuromarketing and automated classification of behaviour (which will be subjects of future posts). On the other hand, people have always worried about the homogenising effects of anything that brings people together and the misuse of surveillance.

(I think in practice, metadata alone is just a tool that can be used for good or ill, like a knife. The Internet of Services is likely to be patchy, with some very worthy projects – applications in medicine where scarce resources like organs for transplants need to be tracked and delivered and surgeons kept informed – and some just for fun – like the Tales of Things Project, where you can add memories, as metadata, to objects.)

Content Identifiers for Digital Rights Persistence

1st August, 2010 Fran

This is another write-up from the Henry Stewart DAM London conference.

Identity and identification

Robin Wilson discussed the issue of content identifiers, which are vitally important for digital rights management yet tend to be overlooked. He argued that although people become engaged in debates about titles and the language used in labels and classification systems, they overlook the need to achieve consensus on basic identification.

(I was quite surprised, as I have always thought that people would argue passionately about what something should be called and how using the wrong terminology affects usability, but that they would settle on machine-readable IDs quite happily. Perhaps it is the neutrality of such codes that makes the politics intractable. If you have invested huge amounts of money in a database that demands certain codes, you will argue that those codes should be used by everyone else, to save you the costs of translation or of acquiring a compatible system, and there are no appeals to usability, or brokerage via editorial policy, that can be made. It simply becomes a matter of whoever shouts the loudest gets to spend the least money in the short term.)

Robin argued that the only way to create an efficient digital marketplace is to have a trusted authority oversee a system of digital identifiers that are tightly bound within the digital asset, so they cannot easily be stripped out even when an asset is divided, split, shared, and copied. The authority needs to be trusted by consumers and creators/publishers in terms of political neutrality, stability, etc.

(I could understand how this system would make it easier for people who are willing to pay for content to see what rights they need to buy and who they should pay, but I couldn’t see how the system could help content owners identify plagiarism without an active search mechanism. Presumably a digital watermark would persist throughout copies of an asset, provided that it wasn’t being deliberately stripped, but if the user simply decided not to pay, I don’t see how the system would help identify rights breaches. Robin mentioned in conversation Turnitin’s plagiarism management, which has become more lucrative than their original work on content analysis, but it requires an active process instigated by the content owner to search for unauthorised use of their content. This is fine for the major publishers of the world, who can afford to pay for such services, but is less appealing to individuals, whether professional freelances or amateur content creators, who would need a cheap and easy solution that would alert them to breaches of copyright without their having to spend time searching.)

The identifiers themselves need to be independent of any specific technology. At the moment, DAM systems are often proprietary and therefore identifiers and metadata cannot easily flow from one system to another. Some systems even strip away any metadata associated with a file on import and export.

Robin described five types of identifier currently being used or developed:

Uniform Resource Name (URN)
Handle System
Digital Object Identifier (DOI)
Persistent URL (PURL)
Archival Resource Key (ARK).

He outlined three essential qualities for identifiers – that they be unique, globally registered, and locally resolved.

So why don’t we share?

Robin argued that it is easier for DAM vendors to build “safe” systems that lock all content within an enterprise environment; only those with a public service/archival remit tend to be collaborative and open. DAM vendors resist a federated approach online and prefer to use a one-to-one or directly intermediated transaction model. Federated identifier management services exist but vendors and customers don’t trust them. The problem is mainly social, not technological.

One of the problems is agreeing to share the costs of services, such as infrastructure, registration and validation, governance and development of the system, administration, and outreach and marketing.

(Efforts to standardise may well benefit the big players more than the small players and so there is a strong argument for them bearing the initial costs and offering support for smaller players to join. Once enough people opt in, the system gains critical mass: it becomes easier to join, and the costs of joining become less of an unquantifiable risk, because you can benefit from the experiences of others. The semantic web is currently attempting to acquire this “critical mass”. As marketers realise the potential of semantic web technology to make money, no doubt we will see an upsurge in interest. Facebook’s “like” button may well be heralding the advent of the ad-driven semantic web, which will probably drive uptake far faster than the worthy efforts of academics to improve the world by sharing research data!)

Procuring a Digital Asset Management system

3rd July, 2010 Fran

This is the first of a series of summaries of the Henry Stewart DAM London conference on June 30, chaired by David Lipsey. The panels (one of which included me) were a pleasing mix of very practical information and more theoretical discussion.

Classic DAM vendor “overstatements”

Theresa Regli, who does a great job as a “professional sceptic”, stressed the need for a calm and considered approach to procurement, with the most important stage being the testing stage. You wouldn’t buy a car without taking it for a test drive, but people buy software without finding out if it can handle their content. Nobody’s assets and business processes are exactly the same, and just because a system suited somebody else perfectly doesn’t mean it is right for you. Vendors will say that they can do anything, but that’s their job so don’t take their word for it. Don’t be distracted by the coolest of the cool new features or other bells and whistles. Cool costs – but may not make – money for your business. On the one hand, if the cool features don’t actually improve your specific business processes, they won’t benefit you, and on the other, vendors have become increasingly adept at marketing the same old features in new ways, so it is very important to dig beneath the surface to find out how they are doing what they claim. Surprisingly little has changed technologically in the DAM vendor landscape over the last five years. So, a wonderful new system for automatically indexing images directly may in fact just be the familiar territory of analysing textual metadata associated with images.

Speech to text

One area that has moved on is the technology to convert speech to text. This means that you can, to an extent, subtitle a film automatically (which isn’t quite the same as a system that can “watch a movie and understand what’s going on scene by scene”). This then gives you a chunk of textual metadata you can search and analyse (“understanding” what’s going on relies on sentiment analysis – looking up words in thesauruses, so, for example, if the dialogue mentions guns, shooting, and bullets a lot, the software could suggest it is a gunfight scene). However, accuracy rates are patchy and the systems require training, which could be labour intensive, so you need to make sure those training costs and the time required are included in budgets and schedules. The systems work best if you can get everything read by someone like Patrick Stewart, as he has very clear and even enunciation. Anyone with an unusual accent or who mumbles is far more difficult to process. As usual, the software is easiest to train if you are working within a specific context, so you can focus on relevant words and accents, rather than anything anyone anywhere in the world might happen to say.
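To make the "guns, shooting, bullets" idea concrete, here is a minimal sketch of keyword-based scene suggestion over a transcript. It is an illustration only, not how any vendor's product works: the `SCENE_TERMS` mini-thesaurus, the `suggest_scene` function, and the `min_hits` threshold are all hypothetical names invented for this example.

```python
from collections import Counter
import re

# Hypothetical mini-thesaurus: scene label -> related words (illustrative only).
SCENE_TERMS = {
    "gunfight": {"gun", "guns", "shooting", "shoot", "bullet", "bullets"},
    "car chase": {"car", "drive", "driving", "faster", "brakes"},
}

def suggest_scene(transcript, min_hits=2):
    """Return the label whose terms occur most often, or None if too few hits."""
    # Lower-case the transcript and split it into simple word tokens.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter()
    for label, terms in SCENE_TERMS.items():
        counts[label] = sum(1 for w in words if w in terms)
    label, hits = counts.most_common(1)[0]
    # Only make a suggestion when there is enough evidence in the dialogue.
    return label if hits >= min_hits else None

print(suggest_scene("He pulled a gun. Bullets everywhere, shooting from the roof!"))
```

A real system would need a far larger vocabulary and would inherit every transcription error from the speech-to-text step, which is one reason the training costs mentioned above matter.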

A clever use of the technology is by the car industry to save time analysing focus group interviews. They asked interviewers to “audio index” their interviews by saying a key “trigger” word when somebody in the focus group said something interesting. The technology was set to clip out a section of video a few seconds before and after the trigger word, so the interviewers could then automatically generate “edited” versions of the interviews, saving a lot of time. I can see this being a great tool for anyone processing ethnographic data or conducting UX or similar testing based on interviews.
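The clipping step can be sketched in a few lines, assuming the speech-to-text system outputs a time-ordered list of (word, seconds) pairs. That output format, the `clip_windows` function, and the five-second margins are assumptions made for this illustration; the talk did not describe the actual implementation.

```python
def clip_windows(word_times, trigger, before=5.0, after=5.0):
    """Return merged (start, end) windows in seconds around each trigger word.

    word_times is assumed to be a list of (word, time_in_seconds) pairs,
    sorted by time, as a speech-to-text tool might produce.
    """
    windows = []
    for word, t in word_times:
        if word.lower() != trigger.lower():
            continue
        start, end = max(0.0, t - before), t + after
        if windows and start <= windows[-1][1]:
            # Overlaps the previous clip, so extend it instead of adding a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows

transcript = [("hello", 1.0), ("bookmark", 3.0), ("so", 10.0),
              ("Bookmark", 11.0), ("bookmark", 40.0)]
print(clip_windows(transcript, "bookmark"))
```

The resulting list of time windows could then be handed to whatever video editing tool is in use to cut the "edited" version.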

Zooming in on the detail

Another feature Theresa demonstrated was a high definition zooming tool, so that you can see very fine detail in your digital images – lovely for museums and art galleries but costly in terms of storage space and bandwidth. I could see it working well as an in-gallery interactive guide to certain collections. It wouldn’t be so good if you were trying to access it externally from a dodgy wifi or bandwidth-limited connection.

(The British Museum’s Magnificent Maps collection – which I saw on a London IA visit – has an interesting interactive zoom feature that works entirely differently, but was very popular. It worked by using a “magnifying glass” – actually a device with some LED transmitters that send an infrared signal to a webcam to trigger a zoom response through a special display interface.)

Procurement process tips

The other panel members talked through various DAM system procurement processes, from a huge global project for Cambridge University Press that began with a list of 452 vendors, through to a very detailed process for adidas with a smaller initial list but a large number of criteria to be fulfilled. It was pleasing that the panel agreed that cultural fit can be as important as any technical specifications. A state of the art or very large vendor who just doesn’t get your world is very unlikely to provide you with a good solution, but a mid-range vendor who really understands your particular context is much more likely to find or develop something that matches your business processes.

Although the use of personas (popular in the UX world) in procurement is quite unusual, Theresa suggested that user stories could be more effective than requirements spreadsheets. Vendors are likely to tick all the boxes in the spreadsheet without getting to grips with the business processes behind them. It is also hard to explain complex interactions as sets of requirements, but telling a story can make it clear what the system as a whole should provide. For example: Sue has to research images for marketing campaigns and make sure that editors based in offices around the globe can see and approve them. Designers then need to be able to access the images remotely, and finally the images need to be output in a variety of formats for publication both in print and online.

It is also worth making sure that any arrangements with outsourced suppliers are checked. Sometimes vendors will provide case studies of a successful implementation but not mention that they have never worked with your supplier before.

I noted the emphasis panellists placed on making sure taxonomies and vocabularies are user-friendly and effective in order to get the best out of any DAM system.

Manage your metadata

Sarah Saunders of Electric Lane discussed the importance of controlled vocabularies and managed metadata for image search and management. Speech-to-text software can’t help with stills collections, or when part of your collection is video without accompanying audio (e.g. a rushes collection – the “spare” footage that wasn’t used in a broadcast and which often has no associated dialogue or voiceover script). She described advances in visual sorting software that use a combination of textual metadata and content-based image retrieval (CBIR) to refine search results. Although CBIR is still in its infancy, when running over a small image set pre-selected by text searching it can be very helpful. CBIR can identify basic features like the colour that is used the most in an image, which is not much help if you run it over a large image collection with no other metadata (i.e. “give me all the mainly red pictures” will bring up images of everything from fire engines to strawberries – fun if all you want is inspiration, not so good if you have something more specific in mind). However, if you have a set of images of the Eiffel Tower for example, it could distinguish between close-ups and shots with lots of blue sky. If you like the blue sky ones, you can click on one and ask for “more like this” and be offered other mainly blue sky ones.
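The combination of a text search followed by a colour filter can be sketched as below. This is a toy illustration under heavy assumptions: images are represented as plain lists of (r, g, b) pixel tuples, colours are bucketed very crudely, and `dominant_colour` and `more_like_this` are hypothetical names. Real CBIR software uses far richer features than this.

```python
from collections import Counter

def dominant_colour(pixels):
    """Name the most common coarse colour among (r, g, b) pixels."""
    def coarse(rgb):
        r, g, b = rgb
        top = max(r, g, b)
        if top - min(r, g, b) < 30:
            return "grey"  # channels nearly equal: no strong hue
        return "red" if top == r else "green" if top == g else "blue"
    return Counter(coarse(p) for p in pixels).most_common(1)[0][0]

def more_like_this(images, keyword, colour):
    """Filter by text keyword first, then by dominant colour."""
    return [
        name for name, (keywords, pixels) in images.items()
        if keyword in keywords and dominant_colour(pixels) == colour
    ]

# Tiny made-up collection: name -> (textual keywords, pixel sample).
images = {
    "tower_sky.jpg": ({"eiffel tower"}, [(40, 90, 220)] * 8 + [(90, 90, 90)] * 2),
    "tower_closeup.jpg": ({"eiffel tower"}, [(100, 95, 90)] * 9 + [(40, 90, 220)]),
    "fire_engine.jpg": ({"fire engine"}, [(220, 30, 30)] * 10),
    "sea.jpg": ({"seaside"}, [(30, 80, 200)] * 10),
}
print(more_like_this(images, "eiffel tower", "blue"))
```

Running the colour filter alone would also return the seaside picture; it is the text pre-selection that keeps the results on topic.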

The second panel will be the subject of my next post.

Are you a semantic romantic?

15th June, 2010 Fran

The “semantic web” is an expression that has been used for long enough now that I for one feel I ought to know what it means, but it is hard to know where to start when so much about it is presented in “techspeak”. I am trying to understand it all in my own non-technical terms, so this post is aimed at “semantic wannabes” rather than “semantic aficionados”. It suggests some ways of starting to think about the semantic web and linked open data without worrying about the technicalities.

At a very basic level, the semantic web is something that information professionals have been doing for years. We know about using common formats so that information can be exchanged electronically, from SGML to HTML and then XML. In the 90s, publishers used “field codes” to identify subject areas so that articles could be held in databases and re-used in multiple publications. In the library world, metadata standards like MARC and Dublin Core were devised to make it easier to share cataloguing data. The semantic web essentially just extends these principles.

So, why all the hype?

There is money to be made and lost on semantic web projects, and investors always want to try to predict the future so they can back winning horses. The recent Pew Report (thanks to Brendan for the link) shows the huge variety of opinions about what the semantic web will become.

On the one extreme, the semantic evangelists are hoping that we can create a highly sophisticated system that can make sense of our content by itself, with the familiar arguments that this will free humans from mundane tasks so that we can do more interesting things, be better informed and connected, and build a better and more intelligent world. They describe systems that “know” that when you book a holiday you need to get from your house to the airport, that you must remember to reschedule an appointment you made for that week, and that you need to send off your passport tomorrow to renew it in time. This is helpful and can seem spookily clever, but is no more mysterious than making sure my holiday booking system is connected to my diary. There are all sorts of commercial applications of such “convenience data management” and lots of ethical implications about privacy and data security too, but we have had these debates many times in the past.

A more business-focused example might be that a search engine will “realise” that when you search for “orange” you mean the mobile phone company, because it “knows” you are a market analyst working in telecoms. It will then work out that documents that contain the words “orange” and “fruit” are unlikely to be what you are after, and so won’t return them in search results. You will also be able to construct more complex questions, for example to query databases containing information on tantalum deposits and compare them with information about civil conflicts, to advise you on whether the price of mobile phone manufacture is likely to increase over the next five years.

Again, this sort of thing can sound almost magical, but is basically just compiling and comparing data from different data sets. This is familiar ground. The key difference is that for semantically tagged datasets much of the processing can be automated, so data crunching exercises that were simply too time-consuming to be worthwhile in the past become possible. The evangelists can make the semantic web project sound overwhelmingly revolutionary and utopian, especially when people start talking in sci-fi sounding phrases like “extended cognition” and “distributed intelligence”, but essentially this is the familiar territory of structuring content, adding metadata, and connecting databases. We have made the cost-benefit arguments for good quality metadata and efficient metadata management many times.

On the other extreme, the semantic web detractors claim that there is no point bothering with standardised metadata, because it is too difficult politically and practically to get people to co-operate and use common standards. In terms familiar to information professionals, you can’t get enough people to add enough good quality metadata to make the system work. Clay Shirky in “Ontology is overrated” argued that there is no point in trying to get commonality up front because it is just too expensive (there are no “tag police” to tidy up); you just have to let people tag randomly and then try to work out what they meant afterwards. This is a great way of harvesting cheap metadata, but doesn’t help if you need to be sure that you are getting a sensible answer to a question. It only takes one person to have mistagged something, and your dataset is polluted and your complex query will generate false results. Shirky himself declares that he is talking about the web as a whole, which is fun to think about, but how many of us (apart from Google) are actually engaged in trying to sort out the entire web? Most of us just want to sort out our own little corner.

I expect the semantic web to follow all other standardisation projects. There will always be a huge “non-semantic” web that will contain vast quantities of potentially useful information that can’t be accessed by semantic web systems, but that is no different from the situation today where there are huge amounts of content that can’t be found by search engines (the “invisible web” or “dark web”) – from proprietary databases to personal collections in unusual formats. No system has been able to include everything. No archive contains every jotting scrawled on a serviette, no bookshop stocks every photocopied fanzine, no telephone directory lists every phone number in existence. However, they contain enough to be useful for most people most of the time. No standard provides a perfect universal lingua franca, but common languages increase the number of people you can talk to easily. The adoption of XML is not universal, but for everyone who has “opted in” there are commercial benefits. Not everybody uses PDF files, but for many people they have saved hours of time previously spent converting and re-styling documents.

So, should I join in?

What you really need to ask is not “What is the future of the semantic web?” but “Is it worth my while joining in right now?”. How to answer that question depends on your particular context and circumstances. It is much easier to try to think about a project, product, or set of services that is relevant to you than to worry about what everyone else is doing. If you can build a product quickly and cheaply using what is available now, it doesn’t really matter whether the semantic web succeeds in its current form or gets superseded by something else later.

I have made a start by asking myself very basic questions like:

What sort of content/data do we have?
How much is there?
What format is it in at the moment?
What proportion of that would we like to share (is it all public domain, do we have some that is commercially sensitive, but some that isn’t, are there data protection or rights restrictions)?

If you have a lot of data in well-structured and open formats (e.g. XML), there is a good chance it will be fairly straightforward to link your own data sets to each other, and link your data to external data. If there are commercial and legal reasons why the data can’t be made public, it may still be worth using semantic web principles, but you might be limited to working with a small data set of your own that you can keep within a “walled garden” – whether or not this is a good idea is another story for another post.

A more creative approach is to ask questions like:

What content/data services are we seeking to provide?
Who are our key customers/consumers/clients and what could we offer them that we don’t offer now?
What new products or services would they like to see?
What other sources of information do they access (users usually have good suggestions for connections that wouldn’t occur to us)?

Some more concrete questions would be ones like:

What information could be presented on a map?
How can marketing data be connected to web usage statistics?
Where could we usefully add legacy content to new webpages?

It is also worth investigating what others are already providing:

What content/data out there is accessible? (e.g. recently released UK government data)
Could any of it work with our content/data?
Whose data would it be really interesting to have access to?
Who are we already working with who might be willing to share data (even if we aren’t sure yet what sort of joint products/projects we could devise)?

It’s not as scary as it seems

Don’t be put off by talk about RDF, OWL, and SPARQL, how to construct an ontology, and whether or not you need a triple store. The first questions to ask are familiar ones like who you would like to work with, what could you create if you could get your hands on their content, and what new creations might arise if you let them share yours? Once you can see the semantic web in terms of specific projects that make sense for your organisation, you can call on the technical teams to work out the details. What I have found is that the technical teams are desperate to get their hands on high quality structured content – our content – and are more than happy to sort out the practicalities. As content creators and custodians, we are the ones that understand our content and how it works, so we are the ones who ought to be seizing the initiative and starting to be imaginative about what we can create if we link our data.

A bit of further reading:
Linked Data.org
Linked Data is Blooming: Why You Should Care
What can Data.gov.uk do for me?

The power of parametadata

2nd May, 2010 Fran

First we had content, then not long after that we had metadata, although no-one called it that. Now we need parametadata – the metadata about metadata!

Neither metadata nor parametadata is anything new, but what is new is how central they have become to all sorts of business processes. People think there is something modern and techie about metadata, but ever since the first author signed their initials on a piece of work, or added a title, we have had metadata. Librarians are just one group who have been using metadata for centuries.

Thanks to technological advances, there is now a huge amount of processing that can be done with metadata, indeed that needs to be done if we are to have any idea what assets we have available. Metadata has become the active driver of numerous business processes. You couldn’t operate a computer without the metadata that tells you the name of a file, its location, when it was last saved, etc. and this sort of metadata is so ubiquitous that nobody tends to think about it too much. Now metadata is so pervasive, it is becoming increasingly important to talk about it and define different aspects and types.

One key distinction is the one between objective and subjective metadata. Subjective metadata refers to classification, tagging, taxonomies, etc. This metadata is subjective because it is always possible to argue about it. Objective metadata on the other hand is uncontroversial and typically process-driven: a file format is what it is, and the time the file was last saved might cause consternation after a PC crash but is unarguable. However, there is actually surprisingly little uncontroversial metadata. Even something like a title can be edited and changed – what do you do when some content acquires a popular or folk title that is not the same as its official title? This happens a lot with comedy sketches and songs, but can also happen to names of projects, working groups, etc.

Parametadata (or meta-metadata) is another subset of metadata – it is the metadata about the metadata, giving its provenance, date of creation, technical specifications, etc. Once you start to think about metadata as content in its own right, it becomes obvious that just as you wish to track the author, title, and so on of the core content, so too you need to track the author(s), provenance, date of creation and latest update of the metadata as well. For subjective metadata, parametadata becomes hugely useful. Because you can have multiple classifications of an asset, it is very important to track the source – distinguishing between author added keywords, indexer keywords, and folksonomic tags, for example – so that people can tell where a tag has come from.
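As a rough illustration of the idea above, a tag can be stored together with its own parametadata so that author keywords, indexer keywords, and folksonomic tags stay distinguishable. This is only a minimal sketch: the `Tag` class, its field names, the source labels, and the sample values are all invented for the example, not taken from any particular system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tag:
    """A subjective tag plus parametadata describing the tag itself."""
    term: str
    source: str        # assumed labels: "author", "indexer", "folksonomy"
    created_by: str    # who added the tag
    created_on: date   # when the tag was added


def terms_from(tags, source):
    """Return only the terms whose parametadata names the given source."""
    return [tag.term for tag in tags if tag.source == source]


# Invented sample data for one asset.
asset_tags = [
    Tag("beach holiday", "author", "writer_01", date(2010, 4, 2)),
    Tag("travel writing", "indexer", "archive_team", date(2010, 4, 9)),
    Tag("sun sea sand", "folksonomy", "visitor_77", date(2010, 5, 1)),
]

print(terms_from(asset_tags, "indexer"))
```

Because the source travels with every tag, a reader or an interface can filter on it and decide how much authority to give each term.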

As long as you know where tags have come from, you can decide whether or not you want to trust in their authority. In an increasingly muddled web, it is helpful to be told the source of a comment or an opinion in order to try to distinguish sound information from propaganda or uninformed speculation. Anecdotally, many people who were initially excited about citizen review sites – rating hotels, etc. – have now given up on them on the grounds that the people who contribute to them tend to have some kind of axe – or worse – to grind, so you can’t take them seriously. Even reviews that aim to be fair may not be relevant if the reviewer is too dissimilar to the reader. The perfect holiday for a group of teenagers is unlikely to be what a retired couple are looking for. So any review needs to carry sufficient information so that the reader can work out how relevant the content is to them. A good review site would carry a range of reviews aimed at different audiences.

Similarly, a rich navigation system needs to offer a range of tags and taxonomies, but these will only be useful when there is sufficient parametadata to tell the user where each scheme or tag came from, who created it, how up to date it is, etc. From a user perspective, being able to choose from a range of well-documented navigation systems means they can make an informed choice about whether to have fun with the randomness of folksonomic tags, to follow a specialist taxonomy in order to learn how a subject is handled by experts, or to use a guide constructed by the content curators for a general audience.

Interface designers can use the parametadata to make different sources of metadata distinct – with different visual or other cues, for example, to indicate different navigation environments. This means you can create a range of different “navigation worlds” and let your users wander to and fro while always making sure they know where – in terms of trust and authority – they are.

Web Science 2010

27th April, 2010 Fran Start a conversation
< 1 minute

There have been lots of interesting presentations at Web Science 2010 in Raleigh. My metadata meerkats were popular – hard to beat charismatic megafauna. The papers and posters are online at The Journal of Web Science.

Using taxonomies to support ontologies

4th April, 2010 Fran 9 comments
Estimated reading time 4–6 minutes

What is an ontology?
Ontologies are emerging from the techie background into the knowledge organisation foreground and – as usually happens – being touted as the new panacea to solve all problems from content management to curing headaches. As with any tool, there are circumstances where they work brilliantly and some where they aren’t right for the job.

Basically, an ontology is a knowledge model (like a taxonomy or a flow chart) that describes relationships between things. The main difference between ontologies and taxonomies is that taxonomies are restricted to broader and narrower relationships whereas ontologies can hold any kind of relationship you give them.

One way of thinking about this is to see taxonomies as vertical navigation and ontologies as horizontal. In practice, they usually work together. When you add cross references to a taxonomy, you are adding horizontal pathways and effectively specifying ontological rather than taxonomical relationships.

The flexibility in the type of relationship that can be defined is what gives ontologies their strength, but it is also their weakness: they are difficult to build well and can be time consuming to manage, because there are infinitely many relationships you could specify and, if you are not careful, you will specify ones that keep changing. Ontologies can answer far more questions than taxonomies, but if the questions you wish to ask can be answered by a taxonomy, you may find a taxonomy simpler and easier to handle.

What are the differences between taxonomies and ontologies?
A good rule of thumb is to think of taxonomies as being about narrowing down, refining, and zooming in on precise pieces of information and ontologies as being about broadening out, aggregating, and linking information. So, a typical combination of ontologies and taxonomies would be to use ontologies to aggregate content, with taxonomies overlaid to help people drill down through the mass of content you have pulled together.

Ontologies can also be used as links to join taxonomies together. So, if you have a taxonomy of regions, towns, and villages and a taxonomy of birds and their habitats you could use an ontological relationship of “lives in” to show which birds live in which places. By using a taxonomy to support the ontology, you don’t have to define a relationship between every village and the birds that live there, you can link the birds’ habitats to regions via the ontology and the taxonomy will do the work of including all the relevant villages under that region.
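The birds example can be sketched in a few lines. This is a simplified illustration under stated assumptions: the place names are invented, the bird-to-region links are made up for the example, and the “lives in” relationship is stated directly against regions rather than via a separate habitat taxonomy.

```python
# Taxonomy of places: each place points to its broader place (None at the top).
# All place names are invented for illustration.
broader = {
    "Northshire": None,
    "Marshtown": "Northshire",
    "Reed End": "Marshtown",
    "Southshire": None,
    "Hilltop": "Southshire",
}

# Ontological "lives in" relationship, stated once per bird at region level.
lives_in = {
    "bittern": {"Northshire"},
    "skylark": {"Southshire"},
}


def place_and_broader(place):
    """Walk up the taxonomy from a place to its top-level region."""
    chain = []
    while place is not None:
        chain.append(place)
        place = broader[place]
    return chain


def birds_in(place):
    """Birds linked to the place itself or to any broader place above it."""
    here = set(place_and_broader(place))
    return sorted(bird for bird, regions in lives_in.items() if regions & here)


print(birds_in("Reed End"))
```

No relationship was ever defined between the village and any bird: the single region-level link plus the broader/narrower structure is enough to answer the question for every town and village underneath.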

Programmers love ontologies, because they can envisage a world where all sorts of relationships between pieces of content can be described and these diverse relationships can be used to produce lots of interesting collections of content that can’t easily be brought together otherwise. However, they leave it to other people to provide the content and metadata. Specifying all those relationships can be complicated and time-consuming so it is important to work out in advance what you want to link up and why. A good place to start is to choose a focal point of the network of relationships you need. For example, there are numerous ways you could gather content about films. You could focus on the actors so you can bring together the films they have appeared in to create content collections describing their careers, or focus on genres and release dates to create histories of stylistic developments, or you could link films that are adaptations of books to copies of those books. The choices you make determine the metadata you will need.

Know your metadata
At the moment, in practice, ontologies are typically built to string together pre-existing metadata that has been collected for navigational or archival taxonomies, but this is just because that metadata already exists to be harvested. There is a danger in this approach that you end up making connections just because you can, not because they are useful to anybody. As with all metadata-based automated systems, you also need to be careful with the “garbage in garbage out” problem. If the metadata you are harvesting was created for a different purpose, you need to make sure that you do not build false assumptions about its meaning or quality into your ontology – for example, if genre metadata has been created according to the department the commissioning editor worked for, instead of describing the content of the actual programme itself. That may not have been a problem when the genre metadata was used only by audience research to gather ratings information, but does not translate properly when you want to use it in an ontology for content-defining purposes.

Feeding your ontology with accurate and clearly defined taxonomies is likely to give you better results than using whatever metadata just happens to be lying about. Well-defined sets of provenance metadata – parametadata – about your taxonomies and ontologies are becoming more and more valuable so that you can understand what metadata sets were built for, when they were last updated, and who manages them.

Why choose one when you can have both?
Ontologies are very powerful. They perform different work to taxonomies, but ontologies and taxonomies can support and enhance each other. Don’t throw away your taxonomies just because you are moving into the ontology space. Ontologies can be (they aren’t always – see Steve’s comment below) big, tricky, and complicated, so use your taxonomies to support them.

Taxonomy as an application for an open world

14th March, 2010 Fran 6 comments
Estimated reading time 9–15 minutes

This post is based on the notes I made for the talk I gave at the LIKE dinner on February 25th. It covers a lot of themes I have discussed elsewhere on this blog, but I hope it will be useful as an overview.

Taxonomies have been around for ages
Pretty much the oldest form of recorded human writing is the list. The Sumerian King List from ancient Sumeria, for example, is about 4,000 years old. By the time of the ancient Greeks, taxonomies were familiar. We understand that something is a part of something else, and the notion of zooming in or narrowing down on the information we want is instinctive.
I am frequently frustrated by the limitations of free text search (see my earlier post Google is not perfect). The main limitation is to knowledge discovery – you can’t browse sensibly around a topic area and get any sense of overview of the field. Following link trails can be fun, but they leave out the obscure but important, the non-commercial, the unexpected.

The very brilliant Google staff are working on refining their algorithms all the time, but Google is a big commercial organisation and they are going to follow the money, which isn’t always where we need to be going. Other free text search issues include disambiguation and misspellings (so you need hefty synonym control), “aboutness” (you can’t find something with free text search if it doesn’t mention the word you’ve searched for), and audio-visual retrieval. The killer for heritage archives (and for highly regulated companies like pharmaceutical and law firms) is comprehensiveness – we don’t just want something on the subject, we want to know that we have retrieved everything on a particular subject.

Another myth is that search engines don’t use classification – they do, they use all sorts of classifications, it’s just that you don’t tend to notice them, partly because they are constantly being updated in response to user behaviour, giving the illusion that they don’t really exist. What is Google doing when it serves you up its best guesses, if not classifying the possible search results and serving you the categories it calculates are closest to what you want?

I’m a big fan of Google, it’s a true modern cathedral of intellectual power and I use Google all the time, but I seem to be unusual in that I don’t expect it to solve all my problems.
I am also aware that the fact that we can’t look at Google’s taxonomic processes arguably makes Google more political, more manipulable, and more Big Brother-ish than traditional open library classifications. We may not totally agree with the library classifications nor the viewpoints of their creators, but at least we know what those viewpoints are!

There was a lot of fuss about the rise of folksonomies and free tagging as being able to supersede traditional information management – and in an information overloaded world we need all the help we can get – the trouble is that folksonomies expand, coalesce, and collapse into taxonomies in the end. If they are to be effective – rather than just cheap – they need to do this – and either become self-policing or very frustrating. They are a great way of gathering information, but then you need to do something with it.

Folksonomies, just as much as taxonomies, represent a process of understanding what everyone else is talking about and negotiating some common ground. It may not be easy, but it is a necessary and indispensable part of human communication – not something we can simply outsource or computerise – algorithms just won’t do that for us. Once everything has been tagged with every term associated with every viewpoint, it might as well not have been tagged at all. Folksonomies, just as much as taxonomies, collapse into giving a single viewpoint – it’s just that it is a viewpoint that is some obscure algorithmic calculation of popularity.

So, despite free text search and folksonomies, structured classification remains a very powerful and necessary part of your information strategy.

It’s an open world
Any information system – whatever retrieval methods it offers – has to meet the needs of its users. Current users can be approached, surveyed, talked to, but how do you meet the needs of future users? The business environment is not a closed, knowable, constrained domain, but an “open world” where change is the only certainty. (Open world is an expression from logic. It presumes that you can never have complete knowledge of truth or falsity. It is the opposite of the closed world, which works for constrained domains or tasks where rules can be applied – e.g. rules within a database.)

So, how do you find the balance between stability, so that your knowledge workers can learn and build on experience over time, and the ability to react rapidly to change?

Once upon a time, not much happened
The early library scientists such as Cutter, Kelley, Ranganathan, and Bliss, argued about which classification methods were the best, but they essentially presumed that it was possible to devise a system that maximised “user friendliness” and that once established, it would remain usable well into the future. By and large, that turned out to be the case, as it took many years for their assumptions about users to be seriously challenged.

Physical constraints tended to dictate the amount of updating that a system could handle. The time and labour required to re-mark books and update a card catalogue meant that it was worth making a huge effort to simply select or devise a classification and stick to it. It was easier to train staff to cope with the clunky technology of the time than adapt the technology to suit users. No doubt in the future, people will say exactly the same things about the clunky Internet and how awful it must have been to have to use keyboards to enter information.

So, it was sensible to plan your information project as one big chunk of upfront effort that would then be left largely alone. It is much easier to build systems based on the assumption that you can know everything in advance – you can have a simple linear project plan and fixed costs. However, it is very rare for this assumption to hold for very long, and the bigger the project, the messier it all gets.

Change now, change more
Everything is changing far more rapidly than it used to – from the development of new technologies to the rapid spread of ideas promoted by the emergence of social media and an “always on” culture. It’s harder than ever to stay cutting edge!

We all like to speak our own language and use our own names for things, and specialists and niche workers as well as fashionistas and trendsetters expect to be able to describe and discuss information in ways that make sense to them. The open philosophy of the Web 2.0 world means that they increasingly take this to be their right, but this is where folksonomic approaches can really help us.

What you need to do is to create a system that can include different pace layers so that you get the benefits of a stable taxonomy, with the rapid reactiveness of folksonomy as well as quick and easy free text search. You can think of your taxonomy as the centre of a coral reef, but coral is alive and grows following the currents and the behaviour of all the crazy fish and other organisms that dart about around it. It’s hard to pin down the crazy fish and other creatures, but they feed the central coral and keep it strong. In practice, this means incorporating multiple taxonomies and folksonomies and mapping them to one another, so that everyone can use the taxonomy and the terminology that they prefer. Taxonomy mapping tools require human training and human supervision, but they can lighten the load of the labour intensive process of mapping one taxonomy to another.

This means that taxonomy strategy does not have to be determined at a fixed point, but taxonomy creation is dynamic and organic. Folksonomies and new taxonomies can be harvested to feed back updates into the central taxonomy, breaking the traditional cycle of expensive major revision, gradual decline until the point of collapse, followed by subsequent expensive major revision…

There is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all? Any arguments then become part of the mapping process – which is a bit of a fudge, but within organisations has the major benefit of removing a lot of politicking that surrounds information and knowledge management. It all becomes “something technical” to do with mapping that nobody other than information professionals is very interested in. Despite this, there is huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.

Modern users demand that content is presented to them in a way that they feel comfortable with. The average search is a couple of words typed into Google, but users are willing to browse if they feel that they are closing in on what they want. To increase openness and usage means providing rich search and navigation experiences in a user-friendly way. If your repository is to be promoted to a wider audience in future, the classification that will enable the creation of a rich navigation experience needs to be put in place now.

Your users should be able to wander about through the archive collections horizontally and vertically and to leave and delve into other collections, or to arrive at and move through the archive using their own organisation’s taxonomy and to tag where they want to tag, using whatever terms they like. The link points in the mappings provide crossroads in the navigation system for the users.
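One way to picture these crossroads is as lookups through a shared core taxonomy. The sketch below is purely illustrative: the view names, the terms, and the `translate` and `detach` helpers are all assumptions made up for the example, and a real mapping would carry far more structure than a flat dictionary.

```python
# Each attached view maps its own terms onto terms in a shared core taxonomy.
# View names and terms are invented for illustration.
views = {
    "museum": {"Moving image": "film", "Ornithology": "birds"},
    "broadcaster": {"Movies": "film", "Wildlife: birds": "birds"},
}


def translate(term, from_view, to_view):
    """Translate a term between two views by passing through the core taxonomy."""
    core_term = views[from_view].get(term)
    if core_term is None:
        return []  # no link point for this term yet
    return sorted(t for t, core in views[to_view].items() if core == core_term)


def detach(view_name):
    """Remove a view; the core taxonomy and the other views are untouched."""
    views.pop(view_name, None)


print(translate("Movies", "broadcaster", "museum"))
```

Each view only ever maps to the core, so adding or removing one view never requires touching the mappings of any other.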

In this way the taxonomies are leveraged to become “hypertextual taxonomies” that provide rich links both horizontally and vertically.

Taxonomy as a spine
A core taxonomy that acts as an indexing language is the central spine to which other taxonomies can be attached and crucially – detached – as necessary. The automation of the bulk of the mapping process means that incorporating a new taxonomic view becomes a task of checking the machine output for errors. Automated mapping processes can provide statistical calculations of likelihood of accuracy and so humans only need to examine those with a low likelihood of being correct.
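The review step can be sketched as a simple triage on a confidence score. Everything here is an assumption for illustration: plain string similarity from Python's standard `difflib` module stands in for whatever statistical likelihood a real mapping tool would calculate, and the terms and the 0.8 threshold are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude stand-in for a mapping tool's likelihood score (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_mappings(source_terms, spine_terms):
    """Pair each incoming term with its most similar spine term and a score."""
    proposals = []
    for term in source_terms:
        best = max(spine_terms, key=lambda spine: similarity(term, spine))
        proposals.append((term, best, similarity(term, best)))
    return proposals


def triage(proposals, threshold=0.8):
    """Split proposals into auto-accepted ones and ones needing human review."""
    accepted = [p for p in proposals if p[2] >= threshold]
    needs_review = [p for p in proposals if p[2] < threshold]
    return accepted, needs_review


proposals = propose_mappings(
    ["Films", "Birds", "Motion pictures"],  # invented incoming taxonomy terms
    ["film", "birds", "places"],            # invented spine terms
)
accepted, needs_review = triage(proposals)
for term, target, score in needs_review:
    print(f"check by hand: {term!r} -> {target!r} ({score:.2f})")
```

String similarity misses synonyms such as “Motion pictures” and “film”, which is exactly the kind of low-scoring candidate that ends up in the human review pile.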

Mapping software has the same problems as autoclassification software, so a mapping methodology, including workflow and approval processes, has to be defined and supported. The more important it is to get a fine-grained mapping, the more effort you will need to make, but a broad level mapping is easier to achieve.

Conclusion
If you start thinking of the taxonomy as an organic system in its own right – more like an open application that you can interact with, bolting on and removing elements as you choose – you do not need to attempt to account for every user viewpoint in the creation of the taxonomy, and the omission of a viewpoint at one stage does not preclude it from being incorporated later. Conversely, the mapping process allows “outsiders” to view your assets through their own taxonomies.

Our taxonomies represent huge edifices of intellectual effort. However, we can’t preserve them in aspic – hide them away as locked silos or like grand stately homes that won’t open their doors to the public. If we want them to thrive and grow we need to open them up to the light to let them expand, change and interact with other taxonomies and take in ideas from the outside.

Once you open up your taxonomy, share it and map it to other taxonomies, it becomes stronger. Rather than an isolated knowledge system that seems like a drain on resources, it becomes an embedded part of the information infrastructure, powering interactions between multiple systems. It ceases to be a part of document management, and becomes the way that the organisation interacts with knowledge globally. This means that the taxonomy gains strength from its associations but also gains prestige.
So our taxonomies can remain our friends for a little while longer. We won’t be hand cataloguing as we did in the past because all the wonders of the Google and automated world can be harnessed to help us.
