Online Information Conference – day two


Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle’s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett’s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. With a concern to give users more access to and pathways into the collections, Martin Malmsten asked if APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable for online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.
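
As a rough illustration of that last point, here is a minimal Python sketch of keeping provenance alongside each statement. It is not taken from Martin's talk: the `Assertion` structure, the identifiers, and the source names are all made up for the example.

```python
from collections import namedtuple

# A statement plus who asserted it; the "source" field is the provenance.
Assertion = namedtuple("Assertion", ["subject", "predicate", "value", "source"])

# Illustrative data only: identifiers and source names are hypothetical.
assertions = [
    Assertion("book:123", "title", "Middlemarch", "library"),
    Assertion("book:123", "author", "George Eliot", "library"),
    Assertion("book:123", "tag", "favourite novel", "external-blog"),
]

def asserted_by(statements, source):
    """Return only the statements made by the given source."""
    return [s for s in statements if s.source == source]

# What the library itself says, versus what anybody else says.
own = asserted_by(assertions, "library")
others = [s for s in assertions if s.source != "library"]
```

Because every statement carries its source, other people's links and tags can sit alongside the curated record without being mistaken for it.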

During the panel discussion, provenance and traceability (parametadata) – which the W3C is now focusing on – were discussed, and it was noted that allowing other people to link to your data does not destroy your data, and often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which presents particular problems for free-text searches, but makes it a prime source for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.

Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kind of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags, is proving more useful.
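
To make the hybrid idea concrete, here is a small Python sketch of laying a lightweight faceted taxonomy over free-form tags. It is an illustration only, not something shown in the talk: the `TAG_TO_FACET` mapping, the facet names, and the sample tags are all hypothetical.

```python
# Hypothetical mapping from free-form user tags to (facet, preferred label).
TAG_TO_FACET = {
    "nyc": ("Place", "New York"),
    "new york": ("Place", "New York"),
    "pics": ("Format", "Photograph"),
    "photo": ("Format", "Photograph"),
}

def classify_tags(user_tags):
    """Sort raw tags into facets; return unmapped tags for a taxonomist to review."""
    facets = {}
    unmapped = []
    for raw in user_tags:
        tag = raw.strip().lower()
        if tag in TAG_TO_FACET:
            facet, label = TAG_TO_FACET[tag]
            facets.setdefault(facet, set()).add(label)
        else:
            unmapped.append(tag)
    return facets, unmapped

facets, unmapped = classify_tags(["NYC", "pics", "Photo", "holiday"])
```

The unmapped tags are the useful by-product here: they show the terms people actually use that the taxonomy does not yet cover.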

Taxonomies remain key in providing the structure on which autocategorisation and text analytics are based, so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is essential for constructing a helpful taxonomy, and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs taxonomy to underpin it, to base contextualisation rules on. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.

He noted that the volumes of data and metadata that need to be processed are growing at a furious rate. He highlighted Comcast as a company that is very highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata life cycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from Biowisdom gave a fascinating and detailed overview of Biowisdom’s use of semantic technologies, in particular ontology-driven concept extraction. Biowisdom handle huge complex databases of information to do with the biological sciences and pharmaceuticals, so face very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and “soft” descriptions of symptoms and side-effects typically given by patients.

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by Italian news agency AGI.

Online Information Conference 2010


Despite the recession, tube strikes, and snow, there was a fine collection of speakers, exhibitors, and delegates at a smaller than usual Online Information Conference and Exhibition this year.

Librarians seem to be getting heavily into Linked Data, while the corporate sector is still mainly concerned with business intelligence and search.

On day one I enjoyed the practical explanations of how Linked Data principles have been made to work at The Guardian, The Press Association, the Dutch Parliament, and the ALISS health project in Scotland.

Linked Data tags are a form of metadata that can be used to automatically generate content aggregations for web pages. This means that not only can you re-use your own content, increasing its lifespan, but you can gather cheap content that is openly available online. This is very familiar territory to me, as we used to build products in the same way back in the 90s, the difference being that we didn’t have much of an external web to link to back then. In the meantime, using a linkable, interoperable format for your tags has very many benefits, and whether your focus is primarily for content within or beyond a firewall, the arguments for using standards that have the potential to link to the wider world seem very compelling. I can’t see any logical reasons not to standardise the format your metadata is held in (technical and practical issues are another matter), although standardising the semantic content of the metadata is a far more difficult problem.
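
A minimal Python sketch of tag-driven aggregation might look like the following. It is only an illustration of the principle: the items, the `source` labels, and the example.org tag URIs are placeholders, not anything the speakers showed.

```python
from collections import defaultdict

# Made-up items: each carries a set of tag URIs (example.org is a placeholder).
items = [
    {"title": "Budget debate report", "source": "own archive",
     "tags": {"http://example.org/id/parliament"}},
    {"title": "Encyclopaedia entry on parliament", "source": "open web",
     "tags": {"http://example.org/id/parliament"}},
    {"title": "Diabetes care guide", "source": "own archive",
     "tags": {"http://example.org/id/diabetes"}},
]

def aggregate_by_tag(content):
    """Group item titles under each tag URI, one list per topic page."""
    pages = defaultdict(list)
    for item in content:
        for tag in sorted(item["tags"]):
            pages[tag].append(item["title"])
    return dict(pages)

pages = aggregate_by_tag(items)
```

Because own-archive and open-web items share the same tag URI, they land on the same topic page without anyone curating that page by hand.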

It was reassuring to hear that everyone else is struggling with the problems of who mints IDs and URIs, who settles arguments about what exactly the IDs refer to – especially across domains – and who resolves and manages mappings. Such issues are difficult to resolve within a firewall; out there on the Web they become vast. The W3C is starting to work on provenance standards (the parametadata or meta-metadata), a pet worry of mine, because I am certain we need to get that layer of semantic information into our tags as soon as possible if we are going to be able to advance the semantic web beyond crunching databases together.

In the meantime, Linked Data is working very well especially for mashups and information visualisations. I particularly liked the Dutch Parliament’s “Attaquograms” – a diagram showing how often MPs were interrupted in debates and how much they interrupted others, although it doesn’t appear to have changed their behaviour yet. I also enjoyed The Guardian’s “league tables” of MPs’ performance. When MPs protested that such analyses ignored qualitative issues, The Guardian pointed out that if MPs advocate such data crunching as a way to judge schools and hospitals, then it must be good enough to apply to MPs themselves.

Andy Hyde from the ALISS project is working on ways to use Linked Data to help people manage their health, especially for patients with long term conditions such as diabetes. He stressed the importance of involving the users in any information project and addressing them on their terms, stating “The most collaborative tool we have is the cup of tea”.

My only concern about using Linked Data to pull in Wikipedia content is whether the audience will start recognising it. If every website that mentions a topic has the same Wikipedia content attached to it, won’t people get bored? Perhaps there are just so many people out there contributing, so many varieties of algorithmic aggregations, and so much content to read, it will never happen!

There is a related Guardian technology blog post.

I will post summaries of days two and three shortly.

Augmented reality


I went to a British Computer Society talk on Augmented Reality a few weeks ago. The BCS audience is typically highly technical, but the talks themselves are always accessible and entertaining. People often wonder why I am interested in augmented reality, because they assume it has nothing to do with information, but to me it is all about information. I would love to be able to serve up archive content to someone’s mobile phone using location data – a clip of a scene from an episode of their favourite programme that was filmed in that location, or an old news report about an event that took place there. Managing vast data sets containing huge amounts of content in a searchable form will form the backbone of many augmented reality tools and applications. If this isn’t an area that information scientists should be exploring, I don’t know what is!

The speakers were Professor Anthony Steed, Head of Virtual Environments and Computer Graphics at UCL, and Lester Madden, founder and director of Augmented Planet.

They explained the difference between visual search, true augmented reality, and virtual reality. Visual search is using an image as a search term (as in Google Goggles) and then returning results. Because this can be done via a camera, the image can be one that is in the searcher’s immediate environment, and the results can be returned as an overlay on the original image. True augmented reality is not just adding graphics to an unrelated camera feed, but is responsive to the real surroundings. Virtual reality is an entirely computer-generated environment.

3-D models of the world are being built, but keeping them up to date is proving a challenge, and crowdsourcing may be the only pragmatic option. Another technical challenge suggested was how to render the augmentation visually indistinguishable from “real” vision, which raises all sorts of interesting philosophical and ethical questions about how we handle the behaviour of people who become confused or cease to be able to tell the difference, either temporarily or permanently. At the moment, augmented reality is quite distinct from virtual reality, but eventually the two will presumably meet. However, nobody seems to think that is likely anytime soon.

In the meantime, there was a rather lovely video of an augmented reality audience, designed to help people who have difficulty speaking in public. Apparently, this is a particular problem for those people in the software industry who are not natural extroverts but find that their careers can only advance if they get out from behind the screen and start talking at conferences, trade shows, etc., where audiences can be quite hostile. University students are hopeless at pretending to be a hostile audience – they are too polite, apparently (this week’s events notwithstanding!) – and actors are too expensive. Avatars, however, can be programmed to look bored, chat on their mobiles, get up and walk out, etc., and real people tend to have similar emotional reactions to the behaviour of avatars as they do to other humans, making an augmented reality theatre a perfect place for practising speaking and building confidence.

Augmented reality is also finding practical applications in the construction industry, to create visualisations of buildings before they are constructed, in medicine to help surgeons, and for improved videoconferencing. There are also many ways that augmented reality can be used to sell things – show me information about the restaurant or shoe shop in this street. Amusingly, identifying unique buildings is quite easy, but for the branded chains disambiguation is proving a challenge – their outlets look the same in every town – which brings us back again to familiar information science territory.

There is also a BCS blog post about the event.


In the beginning was the word: the evolution of knowledge organisation


I was delighted to be introduced by Mark Davey to Leala Abbott on Monday. Leala is a smart and accomplished digital asset management consultant from the Metropolitan Museum of Art in New York and we were discussing how difficult it is to explain what we do. I told her about how I describe “the evolution of classification” to people and she asked me to write it up here. So, this is my first blog post “by commission”.

word
In the beginning there was the word, then words (and eventually sentences).

list
Then people realised words could be very useful when they were grouped into lists (and eventually controlled vocabularies, keyword lists, tag lists, and folksonomies).

taxonomy
But then the lists started to get a bit long and unwieldy, so people broke them up into sections, or categories, and lo and behold – the first taxonomy.

faceted taxonomy
People then realised you could join related taxonomies together for richer information structuring and they made faceted taxonomies, labelling different aspects of a concept in the different facets.

ontology
Then people noticed that if you specified and defined the relationships between the facets (or terms and concepts), you could do useful things with those relationships too, which becomes especially powerful when using computers to analyse content, and so ontologies were devised.

Here is a very simple example of how these different KO systems work:

I need some fruit – I think in words – apples, pears, bananas. Already I have a shopping list and that serves its purpose as a reminder to me of things to buy (I don’t need to build a fruit ontology expressing the relationships between apples and other foodstuffs, for example).

When I get to the shop, I want to find my way around. The shop has handy signs – a big one says “Fresh fruit”, so I know which section of the shop to head for. When I get there, a smaller sign says “Apples” and even smaller ones tell me the different types of apples (Gala, Braeburn, Granny Smith…). The shop signs form a simple taxonomy, which is very useful for helping me find my way around.

When I get home, I want to know how to cook apple pie, so I get my recipe book, but I’m not sure whether to look under “Apples” or “Pies”. Luckily, the index includes Apples: Pies, Puddings and Desserts as well as Pies, Puddings and Desserts: Apples. The book’s index has used a faceted taxonomy, so I can find the recipe in either place, whichever one I look in first.
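
The recipe index can be sketched in a few lines of Python. This is a toy illustration under made-up assumptions: the recipes, the facet names `ingredient` and `dish`, and the `build_index` helper are all invented for the example.

```python
from collections import defaultdict

# Hypothetical recipes, each labelled in two facets.
recipes = [
    {"title": "Apple pie", "ingredient": "Apples",
     "dish": "Pies, Puddings and Desserts"},
    {"title": "Apple crumble", "ingredient": "Apples",
     "dish": "Pies, Puddings and Desserts"},
    {"title": "Pear tart", "ingredient": "Pears",
     "dish": "Pies, Puddings and Desserts"},
]

def build_index(items, facets):
    """Index every item under each of its facet values."""
    index = defaultdict(list)
    for item in items:
        for facet in facets:
            index[(facet, item[facet])].append(item["title"])
    return index

index = build_index(recipes, ["ingredient", "dish"])

# The same recipe is reachable from either facet.
by_apples = index[("ingredient", "Apples")]
by_pies = index[("dish", "Pies, Puddings and Desserts")]
```

Whichever heading I try first, the apple pie recipe is listed under it.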

After dinner, I wonder about the history of apple pies, so I go online to a website about apples, where a lot of content about apple pies has been structured using ontologies. I then can search the site for “apple pie” and get suggestions for lots of articles related to apples and pies that I can browse through, based on the ideas that the people who built the ontology have linked together. For example, if the article date has been included, I could also ask more complex questions such as “give me all the articles on apple pies written before 1910”, and if the author’s nationality has been included, I could ask for all the articles on apple pies written before 1910 by US authors.
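
Those more complex questions amount to filtering on several properties at once. Here is a simplified Python sketch, assuming each article has been described with a topic, a year, and the author's nationality. The articles and the `find_articles` helper are invented for illustration, and a real site would more likely use an RDF store and a query language such as SPARQL.

```python
# Invented articles, described with the properties mentioned above.
articles = [
    {"title": "A history of apple pie", "topics": {"apple pie"},
     "year": 1905, "author_nationality": "US"},
    {"title": "Apple pie in English kitchens", "topics": {"apple pie"},
     "year": 1899, "author_nationality": "UK"},
    {"title": "Modern apple pie", "topics": {"apple pie"},
     "year": 1987, "author_nationality": "US"},
    {"title": "Orchard management", "topics": {"apples"},
     "year": 1902, "author_nationality": "US"},
]

def find_articles(items, topic, before_year=None, nationality=None):
    """Return titles on a topic, optionally limited by year and nationality."""
    results = []
    for article in items:
        if topic not in article["topics"]:
            continue
        if before_year is not None and article["year"] >= before_year:
            continue
        if nationality is not None and article["author_nationality"] != nationality:
            continue
        results.append(article["title"])
    return results

early = find_articles(articles, "apple pie", before_year=1910)
early_us = find_articles(articles, "apple pie", before_year=1910, nationality="US")
```

Each extra property that has been recorded becomes another question I can ask.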

People often ask me if a taxonomy is better than a controlled vocabulary, or if an ontology is the best of all, but the question doesn’t make sense out of context – it really depends what you are trying to do. Ontologies are the most complex and sophisticated KO classification tools we have at the moment, but when I just want a few things from the shop, it’s a good old-fashioned list every time.

Assumptions, mass data, and ghosts in the machine


Back in the summer, I was very lucky to meet Jonah Bossewitch (thanks Sam!), an inspiring social scientist, technical architect, software developer, metadatician, and futurologist. His article The Bionic Social Scientist is a call to arms for the social sciences to recognise that technological advances have led to a proliferation of data. This is assumed to be unequivocally good, but is also fuelling a shadow science of analysis that is using data but failing to challenge the underlying assumptions that went into collecting that data. As I learned from Bowker and Star, assumptions – even at the most basic stage of data collection – can skew the results obtained, so any analysis of such data may well be built on shaky (or at the very least prejudiced) foundations. When this is compounded by software that analyses data, the presuppositions of the programmers, the developers of the algorithms, etc. stack assumption on top of assumption. Jonah points out that if nobody studies this phenomenon, we are in danger of losing any possibility of transparency in our theories and analyses.

As software becomes more complex and data sets become larger, it is harder for human beings to perform “sanity checks” or apply “common sense” to the reports produced. Results that emerge from de facto “black boxes” of calculation, based on collections of information so huge that no lone unsupported human can hope to grasp them, are very hard to dispute. The only possibility of equal debate is amongst other scientists, and probably only those working in the same field. Helen Longino’s work on science as social practice emphasised the need for equality of intellectual authority, but how do we measure that if the only possible intellectual peer is another computer? The danger is that the humans in the scientific community become even more like high priests guarding the machines that utter inscrutable pronouncements than they are currently. What can we do about this? More education, of course, with the academic community needing to devise ways of exposing the underlying assumptions and the lay community needing to become more aware of how software and algorithms can “code in” biases.

This appears to be a rather obscure academic debate about subjectivity in software development, but it strikes to the heart of the nature of science itself. If science cannot be self-correcting and self-criticising, can it still claim to be science?

A more accessible example is offered by a recent article claiming that Facebook filters and selects updates. This example illustrates how easy it is to allow people to assume a system is doing one thing with massed data when in fact it is doing something quite different. Most people think that Facebook’s “Most Recent” updates provide a snapshot of the latest postings by all your friends, and if you haven’t seen updates from someone for a while, it is because they haven’t posted anything. The article claims that Facebook prioritises certain types of update over others (links take precedence over plain text) and updates from certain people. Doing this risks creating an echo chamber effect, steering you towards the people who behave how Facebook wants them to (essentially, posting a lot of monetisable links) in a way that most people would never notice.

Another familiar example is automated news aggregation – an apparently neutral process that actually involves sets of selection and prioritisation decisions. Automated aggregations used to be based on very simple algorithms, so it was easy to see why certain articles were chosen and others excluded, but very rapidly such processing has advanced to the point that it is almost impossible (and almost certainly impractical) for a reader to unpick the complex chain of choices.

In other words, there certainly is a ghost in the machine, it might not be doing what we expect, and so we really ought to be paying attention to it.

Financial sector ontologies


I went to a Semantic Web meetup event on Tuesday where Mike Bennett of the EDM introduced an ontology for managing financial sector information that he has been developing.

It is always reassuring to discover that people working in completely different industries are facing the same challenges. Handling multiple viewpoints and the need to keep provenance of terminology well defined and clear was a key theme, as terms like “equities” can mean very different things in different contexts. Mike defined his own “archetypes” and used an “upper ontology” to act as an umbrella to connect other ontologies. I was particularly interested in the discussion of solutions for managing synonyms, one of which included a quite sophisticated use of RDF.

It was also interesting to hear Mike’s explanations of his use of taxonomies within the ontology and of the ingenious ways he finds to present his ideas to business people who don’t speak OWL!

For digital asset management, search is not enough


I was very flattered to be asked by Kate Simpson to write another article – For digital asset management, search is not enough – for the excellent resource FUMSI.

The article sums up some of the latest DAM trends and technologies with the intention of demystifying some of the services that software vendors are offering. It owes much to Theresa Regli’s excellent work as a “professional cynic”.