In the beginning was the word: the evolution of knowledge organisation


I was delighted to be introduced by Mark Davey to Leala Abbott on Monday. Leala is a smart and accomplished digital asset management consultant from the Metropolitan Museum of Art in New York and we were discussing how difficult it is to explain what we do. I told her about how I describe “the evolution of classification” to people and she asked me to write it up here. So, this is my first blog post “by commission”.

In the beginning there was the word, then words (and eventually sentences).

Then people realised words could be very useful when they were grouped into lists (and eventually controlled vocabularies, keyword lists, tag lists, and folksonomies).

But then the lists started to get a bit long and unwieldy, so people broke them up into sections, or categories, and lo and behold – the first taxonomy.

People then realised you could join related taxonomies together for richer information structuring and they made faceted taxonomies, labelling different aspects of a concept in the different facets.

Then people noticed that if you specified and defined the relationships between the facets (or terms and concepts), you could do useful things with those relationships too, which becomes especially powerful when using computers to analyse content, and so ontologies were devised.

Here is a very simple example of how these different knowledge organisation (KO) systems work:

I need some fruit – I think in words – apples, pears, bananas. Already I have a shopping list and that serves its purpose as a reminder to me of things to buy (I don’t need to build a fruit ontology expressing the relationships between apples and other foodstuffs, for example).

When I get to the shop, I want to find my way around. The shop has handy signs – a big one says “Fresh fruit”, so I know which section of the shop to head for. When I get there, a smaller sign says “Apples” and even smaller ones tell me the different types of apples (Gala, Braeburn, Granny Smith…). The shop signs form a simple taxonomy, which is very useful for helping me find my way around.

When I get home, I want to know how to cook apple pie, so I get my recipe book, but I’m not sure whether to look under “Apples” or “Pies”. Luckily, the index includes Apples: Pies, Puddings and Desserts as well as Pies, Puddings and Desserts: Apples. The book’s index has used a faceted taxonomy, so I can find the recipe in either place, whichever one I look in first.
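For readers who think in code, here is a minimal sketch of the first three stages using the examples above. The names `shop_signs`, `path_to`, `recipes` and `build_index` are made up for illustration, and the nested dictionaries are just one convenient way to model a taxonomy, not a standard.

```python
# A shopping list: a flat list of words, nothing more.
shopping_list = ["apples", "pears", "bananas"]

# The shop signs: a simple taxonomy, modelled here as nested dictionaries
# (an illustrative choice, not a standard representation).
shop_signs = {
    "Fresh fruit": {
        "Apples": {"Gala": {}, "Braeburn": {}, "Granny Smith": {}},
    },
}


def path_to(term, tree, trail=()):
    """Return the chain of signs leading to `term`, or None if it is absent."""
    for label, children in tree.items():
        here = trail + (label,)
        if label == term:
            return here
        found = path_to(term, children, here)
        if found is not None:
            return found
    return None


# The recipe index: a faceted taxonomy. Each recipe is labelled in two
# facets, and the index can be entered through either one.
recipes = [
    {
        "title": "Apple pie",
        "ingredient": "Apples",
        "dish": "Pies, Puddings and Desserts",
    },
]


def build_index(recipes, first, second):
    """Group recipe titles under the `first` facet, then the `second` facet."""
    index = {}
    for recipe in recipes:
        by_second = index.setdefault(recipe[first], {})
        by_second.setdefault(recipe[second], []).append(recipe["title"])
    return index


by_ingredient = build_index(recipes, "ingredient", "dish")
by_dish = build_index(recipes, "dish", "ingredient")
```

Looking up `path_to("Braeburn", shop_signs)` walks the signs from “Fresh fruit” down to the variety, while `by_ingredient` and `by_dish` both lead to the same recipe, whichever heading you try first.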

After dinner, I wonder about the history of apple pies, so I go online to a website about apples, where a lot of content about apple pies has been structured using ontologies. I can then search the site for “apple pie” and get suggestions for lots of articles related to apples and pies that I can browse through, based on the ideas that the people who built the ontology have linked together. If the article date has been included, I could also ask more complex questions such as “give me all the articles on apple pies written before 1910”, and if the author’s nationality has been included, I could ask for all the articles on apple pies written before 1910 by US authors.
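The apple pie questions can be sketched in a few lines of code if we treat the ontology as a set of subject, relationship, object statements. Everything here is invented for illustration: the articles, authors, dates and relationship names are hypothetical, and a real site would more likely use a dedicated ontology language and query tool than a plain list.

```python
# Hypothetical statements of the form (subject, relationship, object).
# None of these articles or authors are real; they only illustrate the idea.
triples = [
    ("article-1", "is_about", "apple pie"),
    ("article-1", "published_in", 1896),
    ("article-1", "written_by", "author-a"),
    ("article-2", "is_about", "apple pie"),
    ("article-2", "published_in", 1923),
    ("article-2", "written_by", "author-a"),
    ("article-3", "is_about", "apple pie"),
    ("article-3", "published_in", 1902),
    ("article-3", "written_by", "author-b"),
    ("author-a", "has_nationality", "US"),
    ("author-b", "has_nationality", "UK"),
]


def objects(subject, relationship):
    """Everything that `subject` is linked to by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]


def subjects(relationship, obj):
    """Everything that is linked to `obj` by `relationship`."""
    return [s for s, r, o in triples if r == relationship and o == obj]


def apple_pie_articles(before_year, nationality=None):
    """Articles about apple pie published before `before_year`,
    optionally restricted to authors of the given nationality."""
    results = []
    for article in subjects("is_about", "apple pie"):
        years = objects(article, "published_in")
        if not any(year < before_year for year in years):
            continue
        if nationality is not None:
            authors = objects(article, "written_by")
            if not any(
                nationality in objects(author, "has_nationality")
                for author in authors
            ):
                continue
        results.append(article)
    return results
```

The second question only works because the relationship between an article and its author, and between an author and a nationality, has been stated explicitly. `apple_pie_articles(1910)` follows one kind of link, while `apple_pie_articles(1910, "US")` has to follow three.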

People often ask me if a taxonomy is better than a controlled vocabulary, or if an ontology is the best of all, but the question doesn’t make sense out of context; it really depends on what you are trying to do. Ontologies are the most complex and sophisticated KO classification tools we have at the moment, but when I just want a few things from the shop, it’s a good old-fashioned list every time.

Linking classification schemes via RDF


I was delighted to read an excellent blog post by Dan Brickley about the value of opening up classifications. It is lovely to read something like this from a technical expert who understands the power of innovation and new technology, but who also recognises the value of the information contained in classifications.

Assumptions, mass data, and ghosts in the machine


Back in the summer, I was very lucky to meet Jonah Bossewitch (thanks Sam!), an inspiring social scientist, technical architect, software developer, metadatician, and futurologist. His article The Bionic Social Scientist is a call to arms for the social sciences to recognise that technological advances have led to a proliferation of data. This is assumed to be unequivocally good, but is also fuelling a shadow science of analysis that is using data but failing to challenge the underlying assumptions that went into collecting that data. As I learned from Bowker and Star, assumptions, even at the most basic stage of data collection, can skew the results obtained, so any analysis of such data may well be built on shaky (or at the very least prejudiced) foundations. When this is compounded by software that analyses data, the presuppositions of the programmers, the developers of the algorithms, etc. stack assumption on top of assumption. Jonah points out that if nobody studies this phenomenon, we are in danger of losing any possibility of transparency in our theories and analyses.

As software becomes more complex and data sets become larger, it is harder for human beings to perform “sanity checks” or apply “common sense” to the reports produced. Results that emerge from de facto “black boxes” of calculation, based on collections of information so huge that no lone unsupported human can hope to grasp them, are very hard to dispute. The only possibility of equal debate is amongst other scientists, and probably only those working in the same field. Helen Longino’s work on science as social practice emphasised the need for equality of intellectual authority, but how do we measure that if the only possible intellectual peer is another computer? The danger is that the humans in the scientific community become even more like high priests guarding the machines that utter inscrutable pronouncements than they are currently. What can we do about this? More education, of course, with the academic community needing to devise ways of exposing the underlying assumptions and the lay community needing to become more aware of how software and algorithms can “code in” biases.

This appears to be a rather obscure academic debate about subjectivity in software development, but it strikes to the heart of the nature of science itself. If science cannot be self-correcting and self-criticising, can it still claim to be science?

A more accessible example is offered by a recent article claiming that Facebook filters and selects updates. This example illustrates how easy it is to allow people to assume a system is doing one thing with massed data when in fact it is doing something quite different. Most people think that Facebook’s “Most Recent” updates provide a snapshot of the latest postings by all your friends, and if you haven’t seen updates from someone for a while, it is because they haven’t posted anything. The article claims that Facebook prioritises certain types of update over others (links take precedence over plain text) and updates from certain people. Doing this risks creating an echo chamber effect, steering you towards the people who behave how Facebook wants them to (essentially, posting a lot of monetisable links) in a way that most people would never notice.
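To see how small such a hidden choice can be, here is a toy sketch of the difference between a feed that really is “most recent” and one that quietly puts links first. It is not Facebook’s actual method, which I have no knowledge of: the posts, the `KIND_WEIGHT` values and both function names are assumptions invented purely for illustration.

```python
from datetime import datetime

# Hypothetical posts from three friends on the same day.
posts = [
    {"author": "Ann", "kind": "text", "posted": datetime(2011, 1, 1, 12, 0)},
    {"author": "Bob", "kind": "link", "posted": datetime(2011, 1, 1, 9, 0)},
    {"author": "Cat", "kind": "text", "posted": datetime(2011, 1, 1, 11, 0)},
]

# Assumed weights: links outrank plain text. These numbers are made up.
KIND_WEIGHT = {"link": 2, "text": 1}


def most_recent(posts):
    """What most people assume the feed does: newest post first."""
    return sorted(posts, key=lambda post: post["posted"], reverse=True)


def prioritised(posts):
    """A feed that ranks by kind of update first, and by time only second."""
    return sorted(
        posts,
        key=lambda post: (KIND_WEIGHT[post["kind"]], post["posted"]),
        reverse=True,
    )
```

With these made-up posts, `most_recent` lists Ann, Cat, then Bob, while `prioritised` moves Bob’s link to the top even though it is the oldest update. Both results look like a perfectly plausible feed, which is exactly why the reader is unlikely to notice the difference.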

Another familiar example is automated news aggregation, an apparently neutral process that actually involves sets of selection and prioritisation decisions. Automated aggregations used to be based on very simple algorithms, so it was easy to see why certain articles were chosen and others excluded, but very rapidly such processing has advanced to the point that it is almost impossible (and almost certainly impractical) for a reader to unpick the complex chain of choices.

In other words, there certainly is a ghost in the machine, it might not be doing what we expect, and so we really ought to be paying attention to it.