Tag Archives: folksonomy

Online Information Conference – day two


Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle’s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett’s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)
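As a rough illustration of the sort of analysis Sarah described – using entirely invented data, with plain Python tuples standing in for real RDF triples – exposing "references" relationships makes patterns of influence trivially queryable:

```python
from collections import Counter

# (subject, predicate, object) triples in the spirit of RDF; hypothetical data
triples = [
    ("Joyce", "references", "Homer"),
    ("Eliot", "references", "Dante"),
    ("Eliot", "references", "Shakespeare"),
    ("Woolf", "references", "Shakespeare"),
    ("Auden", "references", "Shakespeare"),
]

def influence(triples):
    """Count how often each author is referenced by the others."""
    return Counter(obj for _, pred, obj in triples if pred == "references")

ranked = influence(triples).most_common()
print(ranked)  # Shakespeare emerges as the most-referenced author
```

With real Linked Data the triples would come from many publishers' datasets rather than one list, which is exactly where the interesting, unplanned connections appear.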

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. With a concern to give users more access to and pathways into the collections, Martin Malmsten asked if APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable for online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.

During the panel discussion, provenance and traceability – which the W3C is now focusing on (parametadata) – were discussed, and it was noted that letting other people link to your data often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple-pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which presents particular problems for free text search but makes it a prime source for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.
Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kind of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags is proving more useful.
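A toy sketch of that hybrid approach might look something like the following – the facet names and synonym mappings are invented for illustration, not taken from any real system:

```python
# A lightweight faceted taxonomy laid over raw folksonomic tags: free tags
# are folded into controlled facet/value pairs, and anything unrecognised
# is kept aside for a human to review (or to seed new facet values).
FACETS = {
    "format": {"pdf": "PDF", "video": "Video", "vid": "Video", "movie": "Video"},
    "topic": {"tax": "Taxonomy", "taxonomy": "Taxonomy", "folksonomy": "Folksonomy"},
}

def facet_tags(raw_tags):
    """Map free tags onto facet/value pairs; return unmapped tags for review."""
    mapped, unmapped = [], []
    for tag in raw_tags:
        t = tag.strip().lower()
        for facet, vocab in FACETS.items():
            if t in vocab:
                mapped.append((facet, vocab[t]))
                break
        else:
            unmapped.append(tag)
    return mapped, unmapped

mapped, unmapped = facet_tags(["vid", "Taxonomy", "serendipity"])
```

The unmapped remainder is itself useful: it is exactly the window into users' own search vocabulary that Tom described.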

Taxonomies remain key in providing the structure on which autocategorisation and text analytics are based, so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is essential for constructing a helpful taxonomy, and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs taxonomy to underpin it, to base contextualisation rules on. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.

He noted that the volumes of data and metadata that need to be processed are growing at a furious rate. He highlighted Comcast as a company that is highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata life cycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from Biowisdom gave a fascinating and detailed overview of Biowisdom’s use of semantic technologies, in particular ontology-driven concept extraction. Biowisdom handle huge complex databases of information to do with the biological sciences and pharmaceuticals, so face very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and “soft” descriptions of symptoms and side-effects typically given by patients.

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by Italian News agency AGI.

Taxonomy as an application for an open world


This post is based on the notes I made for the talk I gave at the LIKE dinner on February 25th. It covers a lot of themes I have discussed elsewhere on this blog, but I hope it will be useful as an overview.

Taxonomies have been around for ages
Pretty much the oldest form of recorded human writing is the list, dating back to ancient Sumer: the Sumerian King List, for example, is about 4,000 years old. By the time of the ancient Greeks, taxonomies were familiar. We understand that something is a part of something else, and the notion of zooming in or narrowing down on the information we want is instinctive.
I am frequently frustrated by the limitations of free text search (see my earlier post Google is not perfect). The main limitation concerns knowledge discovery – you can’t browse sensibly around a topic area and get any sense of an overview of the field. Following link trails can be fun, but they leave out the obscure but important, the non-commercial, the unexpected.

The very brilliant Google staff are working on refining their algorithms all the time, but Google is a big commercial organisation and they are going to follow the money, which isn’t always where we need to be going. Other free text search issues include disambiguation/misspellings – so you need hefty synonym control, “aboutness” – you can’t find something with free text search if it doesn’t mention the word you’ve searched for, and audio-visual retrieval. The killer for heritage archives (and for highly regulated companies like pharmaceutical and law firms) is comprehensiveness – we don’t just want something on the subject, we want to know that we have retrieved everything on a particular subject.

Another myth is that search engines don’t use classification – they do, they use all sorts of classifications, it’s just that you don’t tend to notice them, partly because they are constantly being updated in response to user behaviour, giving the illusion that they don’t really exist. What is Google doing when it serves you up its best guesses, if not classifying the possible search results and serving you the categories it calculates are closest to what you want?

I’m a big fan of Google, it’s a true modern cathedral of intellectual power and I use Google all the time, but I seem to be unusual in that I don’t expect it to solve all my problems.
I am also aware that the fact that we can’t look at Google’s taxonomic processes arguably makes Google more political, more manipulable, and more Big Brother-ish than traditional open library classifications. We may not totally agree with the library classifications or the viewpoints of their creators, but at least we know what those viewpoints are!

There was a lot of fuss about the rise of folksonomies and free tagging being able to supersede traditional information management – and in an information-overloaded world we need all the help we can get – but the trouble is that folksonomies expand, coalesce, and collapse into taxonomies in the end. If they are to be effective – rather than just cheap – they need to do this, and either become self-policing or end up very frustrating. They are a great way of gathering information, but then you need to do something with it.

Folksonomies, just as much as taxonomies, represent a process of understanding what everyone else is talking about and negotiating some common ground. It may not be easy, but it is a necessary and indispensable part of human communication – not something we can simply outsource or computerise – algorithms just won’t do that for us. Once everything has been tagged with every term associated with every viewpoint, nothing might as well have been tagged at all. Folksonomies, just as much as taxonomies, collapse into giving a single viewpoint – it’s just that it is a viewpoint that is some obscure algorithmic calculation of popularity.

So, despite free text search and folksonomies, structured classification remains a very powerful and necessary part of your information strategy.

It’s an open world
Any information system – whatever retrieval methods it offers – has to meet the needs of its users. Current users can be approached, surveyed, talked to, but how do you meet the needs of future users? The business environment is not a closed, knowable, constrained domain, but an “open world” where change is the only certainty. (“Open world” is an expression from logic. It presumes that you can never have complete knowledge of truth or falsity. It is the opposite of the closed world, which works for constrained domains or tasks where rules can be applied – e.g. the rules within a database.)
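The logical distinction can be shown in a few lines of Python – the "facts" here are invented, and real open-world reasoners are of course far richer than this:

```python
# Toy contrast between closed- and open-world answers. Under the closed
# world assumption, anything not in the database is false; under the open
# world assumption, it is merely unknown.
facts = {("shakespeare", "wrote", "hamlet")}

def closed_world(query):
    return query in facts                     # absent => False

def open_world(query):
    return True if query in facts else None   # absent => unknown (None)

q = ("marlowe", "wrote", "hamlet")
print(closed_world(q))  # False: the database "knows" he didn't
print(open_world(q))    # None: we simply don't know
```

A database schema bakes in the closed-world stance; an information system built for unknown future users has to live with the open-world one.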

So, how do you find the balance between stability, so your knowledge workers can learn and build on experience over time, while being able to react rapidly to changes?

Once upon a time, not much happened
The early library scientists such as Cutter, Kelley, Ranganathan, and Bliss, argued about which classification methods were the best, but they essentially presumed that it was possible to devise a system that maximised “user friendliness” and that once established, it would remain usable well into the future. By and large, that turned out to be the case, as it took many years for their assumptions about users to be seriously challenged.

Physical constraints tended to dictate the amount of updating that a system could handle. The time and labour required to re-mark books and update a card catalogue meant that it was worth making a huge effort to simply select or devise a classification and stick to it. It was easier to train staff to cope with the clunky technology of the time than adapt the technology to suit users. No doubt in the future, people will say exactly the same things about the clunky Internet and how awful it must have been to have to use keyboards to enter information.

So, it was sensible to plan your information project as one big chunk of upfront effort that would then be left largely alone. It is much easier to build systems based on the assumption that you can know everything in advance – you can have a simple linear project plan and fixed costs. However, it is very rare for this assumption to hold for very long, and the bigger the project, the messier it all gets.

Change now, change more
Everything is changing far more rapidly than it used to – from the development of new technologies to the rapid spread of ideas promoted by the emergence of social media and an “always on” culture. It’s harder than ever to stay cutting edge!

We all like to speak our own language and use our own names for things, and specialists and niche workers as well as fashionistas and trendsetters expect to be able to describe and discuss information in ways that make sense to them. The open philosophy of the Web 2.0 world means that they increasingly take this to be their right, but this is where folksonomic approaches can really help us.

What you need to do is to create a system that can include different pace layers so that you get the benefits of a stable taxonomy, with the rapid reactiveness of folksonomy as well as quick and easy free text search. You can think of your taxonomy as the centre of a coral reef, but coral is alive and grows following the currents and the behaviour of all the crazy fish and other organisms that dart about around it. It’s hard to pin down the crazy fish and other creatures, but they feed the central coral and keep it strong. In practice, this means incorporating multiple taxonomies and folksonomies and mapping them to one another, so that everyone can use the taxonomy and the terminology that they prefer. Taxonomy mapping tools require human training and human supervision, but they can lighten the load of the labour intensive process of mapping one taxonomy to another.

This means that taxonomy strategy does not have to be determined at a fixed point, but taxonomy creation is dynamic and organic. Folksonomies and new taxonomies can be harvested to feed back updates into the central taxonomy, breaking the traditional cycle of expensive major revision, gradual decline until the point of collapse, followed by subsequent expensive major revision…

There is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all? Any arguments then become part of the mapping process – which is a bit of a fudge, but within organisations has the major benefit of removing a lot of politicking that surrounds information and knowledge management. It all becomes “something technical” to do with mapping that nobody other than information professionals is very interested in. Despite this, there is huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.

Modern users demand that content is presented to them in a way that they feel comfortable with. The average search is a couple of words typed into Google, but people are willing to browse if they feel that they are closing in on what they want. Increasing openness and usage means providing rich search and navigation experiences in a user-friendly way. If your repository is to be promoted to a wider audience in future, the classification that will enable the creation of a rich navigation experience needs to be put in place now.

Your users should be able to wander about through the archive collections horizontally and vertically and to leave and delve into other collections, or to arrive at and move through the archive using their own organisation’s taxonomy and to tag where they want to tag, using whatever terms they like. The link points in the mappings provide crossroads in the navigation system for the users.

In this way the taxonomies are leveraged to become “hypertextual taxonomies” that provide rich links both horizontally and vertically.

Taxonomy as a spine
A core taxonomy that acts as an indexing language is the central spine to which other taxonomies can be attached and, crucially, detached as necessary. The automation of the bulk of the mapping process means that incorporating a new taxonomic view becomes a task of checking the machine output for errors. Automated mapping processes can provide statistical calculations of the likelihood of accuracy, so humans only need to examine those mappings with a low likelihood of being correct.
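In code, that triage step might look something like this rough sketch – the core vocabulary, similarity measure, and threshold are all illustrative assumptions, not a real mapping tool:

```python
# Confidence-scored term mapping: candidate matches above a threshold are
# auto-accepted; everything else is routed to a human review queue.
from difflib import SequenceMatcher

CORE = ["Information Retrieval", "Knowledge Management", "Metadata"]

def map_term(term, core=CORE, threshold=0.85):
    """Map an incoming term to the core taxonomy, or flag it for review."""
    def score(candidate):
        return SequenceMatcher(None, term.lower(), candidate.lower()).ratio()
    best = max(core, key=score)
    if score(best) >= threshold:
        return ("accept", best, score(best))
    return ("review", best, score(best))      # low confidence: a human checks this

print(map_term("Information retrieval"))  # differs only in case: auto-accepted
print(map_term("Info architecture"))      # weak match: routed to review
```

Real mapping software would use semantics as well as string similarity, but the workflow shape – score, threshold, human review of the doubtful remainder – is the point.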

Mapping software has the same problems as autoclassification software, so a mapping methodology, including workflow and approval processes, has to be defined and supported. The more important it is to get a fine-grained mapping, the more effort you will need to make, but a broad level mapping is easier to achieve.

If you start thinking of the taxonomy as an organic system in its own right – more like an open application that you can interact with, bolting on and removing elements as you choose – you do not need to attempt to account for every user viewpoint in the creation of the taxonomy, and the omission of a viewpoint at one stage does not preclude that collection from being incorporated later. Conversely, the mapping process allows “outsiders” to view your assets through their own taxonomies.

Our taxonomies represent huge edifices of intellectual effort. However, we can’t preserve them in aspic – hide them away as locked silos or like grand stately homes that won’t open their doors to the public. If we want them to thrive and grow we need to open them up to the light to let them expand, change and interact with other taxonomies and take in ideas from the outside.

Once you open up your taxonomy, share it and map it to other taxonomies, it becomes stronger. Rather than an isolated knowledge system that seems like a drain on resources, it becomes an embedded part of the information infrastructure, powering interactions between multiple systems. It ceases to be a part of document management, and becomes the way that the organisation interacts with knowledge globally. This means that the taxonomy gains strength from its associations but also gains prestige.

So our taxonomies can remain our friends for a little while longer. We won’t be hand cataloguing as we did in the past, because all the wonders of Google and the automated world can be harnessed to help us.


Classification at a Crossroads


I am sorry not to be able to make it to the UDC conference Classification at a Crossroads (via ISKO UK) in the Hague, as I am trying to get a better grasp of UDC at the moment. I would especially like to hear the paper on UDC and folksonomies. One of the issues I am thinking about is how to preserve the richness of a UDC-based classification while increasing usability. I am mulling over whether harvesting folksonomic tagging is a good way to get at the terminology that users prefer, and whether attaching additional thesaurus terms to a core classification is a good way of bridging the gap between the needs of long-term indexing and instant retrievability.


Vocab Control


Having spent years working as an editor fussing over consistency of style and orthography, I shouldn’t have been as surprised as I was to find that my tags, even on this little blog site written solely by me, had already become a mess. It didn’t take too long to tidy them up, but there are only a handful of articles here so far.
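For what it's worth, the mechanical part of that tidy-up can be sketched in a few lines – the synonym table here is invented for illustration:

```python
# A toy version of the "tidying up" step: normalise case and whitespace,
# then fold known variants and plurals together via a hand-maintained table.
SYNONYMS = {"folksonomies": "folksonomy", "taxonomies": "taxonomy"}

def normalise(tag):
    t = " ".join(tag.lower().split())   # case and stray whitespace
    t = SYNONYMS.get(t, t)              # known variant spellings and plurals
    return t

raw = ["Folksonomy", "folksonomies", " taxonomy ", "Taxonomies"]
cleaned = sorted({normalise(t) for t in raw})
print(cleaned)  # ['folksonomy', 'taxonomy']
```

The rules are trivial; the work is in maintaining the synonym table – which is, of course, just vocabulary control by another name.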

I worked with some extremely clever people in my first “proper” job back in the 90s, and we used to have a “90%” rule regarding algorithmic language processing (we mostly processed very well-structured text). However brilliant your program, you’d always have 10% of nonsense left over at the end that you needed to sort out by hand – mainly due to the vagaries of natural language and general human inconsistency. I’m no expert on natural language processing, but I get the impression that a lot of people still think 90% is really rather good. Certainly auto-classification software seems to run at a much lower success rate, even after manual training. It strikes me that there’s a parallel between folksonomies and this sort of software. Both process a lot of information cheaply, so they make possible processing on a scale that just couldn’t be done before, but you still need someone to tidy up around the edges if you want top quality.

I think the future of folksonomies depends on how this tidying-up process develops. There are various things happening to improve quality – like auto-complete predictive text. Google’s tag game is another approach, and ravelry.com use gentle human “shepherding” of taggers, personally suggesting tags and orthography (thanks to Elizabeth for pointing this one out to me).

I would really like to get hold of some percentages. If 75% is a decent showing for off-the-peg auto-categorisation/classification software, and we could get up to 90% with bespoke algorithms processing structured text, what percentages can you expect from a folksonomic approach?

An epistemological problem with folksonomies


I’m still mulling over Helen Longino’s criteria for objectivity in scientific enquiry (see previous post: Science as Social Knowledge) and it occurred to me that folksonomies are not really open and democratic, but are actually obscure and impenetrable. The “viewpoint” of any given folksonomy might be an averaged-out majority consensus, or some other way of aggregating tags might have been used, so you can’t tell if it is skewed by a numerically small but prolifically tagging group. This is the point Judith Simon made in relation to ratings and review software systems at the ISKO conference, but it seems to me the problem for folksonomies is even worse, because of the echo chamber effect of people amplifying popular tags. Without some way of showing who is tagging what and why, the viewpoint expressed in the folksonomy is a mystery. It needn’t stay that way, but I think you’d need to collect huge amounts of data from every tagger, then database it along with the tags, then run all sorts of analyses and publish them in order to show the background assumptions driving the majority tags.
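One simple version of that skew check can be sketched as follows – the tagging events are invented, and a real analysis would need far richer data about taggers and their motivations:

```python
# How much of the folksonomy comes from its most prolific taggers? If a
# handful of people contribute most tag applications, the "consensus"
# viewpoint may really be theirs.
from collections import Counter

# (tagger, tag) events from a hypothetical system
events = [("ann", "cats")] * 40 + [("bob", "dogs")] * 5 + [("cud", "cats")] * 5

def top_tagger_share(events, n=1):
    """Fraction of all tag applications contributed by the n busiest taggers."""
    by_tagger = Counter(tagger for tagger, _ in events)
    top = sum(count for _, count in by_tagger.most_common(n))
    return top / len(events)

share = top_tagger_share(events)
print(f"{share:.0%} of tags come from a single tagger")
```

Even this crude measure would expose the case where one prolific group is quietly setting the "majority" view.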

If the folksonomic tags don’t help you find things, who could you complain to? How do you work out whether it doesn’t help you because you are in a minority, or for some other reason? With a taxonomy, the structure is open – you may not like it but you can see what it is – and there will usually be someone “in charge” whom you can challenge and criticise if you think your perspective has been overlooked. In many cases the process of construction will be known too. I don’t see an obvious way of challenging or criticising a folksonomy in this way, so presumably it fails Longino’s criteria for objectivity.

You can just stick your own tags into a folksonomy and use them yourself so there is some trace of your viewpoint in there, but if the rest of the folksonomy doesn’t help you search, that means you can only find things once you have tagged them yourself, which would presumably rule out large content repositories. So, you have to learn and live with the imposed system – just like with a taxonomy – but it’s never quite clear exactly what that system is.

The Social Life of Information


The Social Life of Information by John Seely Brown and Paul Duguid is an info classic. It’s one of those delightful books that manages to be very erudite, cover a huge range of theory, but reads effortlessly and even had me laughing out loud from time to time. (My favourite anecdote was that BT’s answer to homeworkers’ sense of isolation was to pipe a soundtrack of canned background noise and chatter into their offices!)

Essentially, the book argues that information and information technology cannot be separated from their social context, and that ignoring the human factors in technology adoption and use leads to fundamental misunderstandings of what it can and does do. This may mean overestimating the potential of information technology to change pre-existing institutions and practices, on both a personal and collective scale, and underestimating the ability of people to adapt technology to suit their own ends rather than those envisaged by the technologists.

The authors argue that many “infoenthusiasts” miss subtleties of communication, such as the implicit social negotiations that take place in face-to-face conversations or the social meanings conveyed by a document printed on high quality paper or a book with expensive leather binding. Such nuances are easily lost when the words from such communications are removed from their original context and placed in a new environment – such as an electronic database.

Similarly, although personalisation is often touted as a great advance – you can have your own uniquely customised version of a website or a newspaper – such personalisation diminishes the power of the information source to act as a binding-point for a community. If we all have different versions of the newspaper, then we can’t assume we share common knowledge of the same stories. We then have to put additional work into reconnecting and recreating our knowledge communities, so the benefits of personalisation do not come without costs.

The importance of negotiation, collaboration, and improvisation is argued to be highly significant but extremely hard to build into automated systems. The social nature of language and the complexities of learning how to be a member of a community of practice, including knowing when to break or bend rules, are also essential to how human beings operate but extremely difficult to replicate in technological systems.

The theme of balance runs throughout the book – for example between the need to control processes while allowing freedom for innovation in companies or between the need for communication amongst companies and the need to protect intellectual property (knowledge in companies was often either seen as too “sticky” – hard to transfer and use – or too “leaky” – flowing too easily to competitors). At an institutional level, balance is needed between the importance of stability for building trust and openness to evolution (the perception of the value of a degree is bound up with the established reputation of an educational institution).

I found this very interesting, as my brother has been trying to persuade me that Daoism, with its emphasis on things moving gradually from one state to another, is a more productive way of looking at complex systems than the Aristotelian view that something can be in one category, or its opposite, but never both at once. (Here is a sisterly plug for an article he has written on the application of Daoist ideas to environmentalism.) It also fits in with the idea of balancing the stability of an ordered taxonomy with the fast-flowing nature of folksonomies and of finding a way of using social media to support rather than compete with more formalised knowledge management practice. Brown and Duguid say: “For all the advantages of fluidity, we should not forget that fixity still has its charms. Most of us prefer the checks we receive and the contracts we write to remain stable”, which seems particularly apt given the global credit crisis!


Reductiones ad absurdum


In Beneath the Metadata: Some Philosophical Problems with Folksonomy, Elaine Peterson argues that as folksonomy is underpinned by relativism, it will always be flawed as an information retrieval method. On this view, folksonomy will collapse because everything ends up tagged with every conceivable tag, so the tags all cancel each other out and you might as well not have bothered tagging anything.

On the other hand, David Weinberger in Why tagging matters claims that taxonomy will fail because taxonomists want to invent one single taxonomy to classify everything in the entire world and, in a totalitarian style, insist that the one true taxonomy is the only way to organise knowledge.

I have no idea who these mysterious megalomaniac taxonomists are. Most of the taxonomists I am aware of only advocate using a single taxonomy for fairly well-defined and limited situations (e.g. a single company, or perhaps a department in a big corporation) and are quite happy with the notion that you need lots of different taxonomies suited to context, which makes them much more like Peterson’s relativists.

Conversely, I am fairly sure you can’t actually create an infinite folksonomy with infinite tags for all possible viewpoints of all possible documents (let alone smaller knowledge units). When your taggers are a specific community with a shared purpose, they probably will hit upon a shared vocabulary that is “universal” within the boundaries of that community and so the folksonomy will be meaningful.

I think that these reductio ad absurdum arguments are interesting because they highlight how both folksonomies and taxonomies are inherently flexible and even somewhat unstable, especially when they become large and very widely used. Intervention and management of both will help improve and maintain their usefulness. No matter whether you choose one or the other or a combination of the two, you still need knowledge workers to keep them in good working order!