I very much enjoyed the presentations given at the ISKO UK event on semantic analysis technologies yesterday and was particularly heartened by the emphasis placed by almost all of the speakers on the need for a human factor to train, maintain, and moderate software systems. My overall impression was that you can have complex software systems that work very well, but you need a lot of human input to set them up – feeding them carefully crafted controlled vocabularies, taxonomies, and especially ontologies – and preferably checking their output.

The first presentation, by Luca Scagliarini of Expert System, outlined their use of large-scale taxonomies to create an enhanced index that includes linked concepts and relationships. The “turbo-charged” index – a sort of faceted taxonomy – is then run against content to create a sophisticated filterable search function.
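As a rough illustration of the idea (not Expert System's actual implementation), the sketch below builds a toy index in which each matched concept also contributes its linked broader and related concepts as facets, so results can be filtered by terms that never appear verbatim in the text; the taxonomy and documents are invented.

```python
# Illustrative sketch only -- not Expert System's implementation.
# A tiny "enhanced" index: documents are indexed against taxonomy concepts,
# and each concept contributes its broader term and related concepts as
# extra facets for filtering.
from collections import defaultdict

# A toy taxonomy: concept -> broader term and related concepts (invented).
TAXONOMY = {
    "jazz":     {"broader": "music", "related": ["improvisation"]},
    "football": {"broader": "sport", "related": ["premier league"]},
}

def index_documents(docs):
    """Build a facet -> set-of-doc-ids index, expanding each matched concept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        text_lower = text.lower()
        for concept, links in TAXONOMY.items():
            if concept in text_lower:
                index[concept].add(doc_id)
                index[links["broader"]].add(doc_id)   # linked broader concept
                for rel in links["related"]:          # linked related concepts
                    index[rel].add(doc_id)
    return index

docs = {
    1: "A review of the jazz festival line-up",
    2: "Football transfer news ahead of the new season",
}
index = index_documents(docs)
# Filterable search: intersect the facets the user has chosen.
print(index["music"])                              # {1}
print(index["sport"] & index["premier league"])    # {2}
```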

Jeremy Bentley of Smartlogic described taxonomies as “semantic middleware”, adding that the notion of having one standard taxonomy has given way to a recognition of the need for multiple taxonomies to reflect multiple viewpoints. He pointed out that shopping in a supermarket would be practically impossible if none of the tins had labels and you had to look in each one to find out what it contained, but that this is essentially what search engines do – they look inside all the documents to decide what is in them. He stressed that automatic generation of metadata is essential because of the volume of content and the need for consistency, but that automated systems are not yet good enough to build well-crafted ontologies, as they cannot allow for the complexities of context-specific requirements and differences between subject domains. Nevertheless, human ontologists can be helped by automatically generated suggestions.

Rob Lee of Rattle Research then described a way of leveraging information in DBpedia – structured data extracted from Wikipedia articles and published as RDF triples – as a controlled vocabulary to generate links from documents, such as news stories, to other free online resources, such as MusicBrainz. DBpedia contains disambiguation information, which improves the relevance of the links. By adding a search engine layer (they used Lucene to pick out key words), even more interesting links can be made between resources. However, the systems were most successful when restricted to simple, identifiable entities – such as people, places, and companies. Such entities can be matched against dictionary/gazetteer-style authority files, which is harder to do for broad subject areas.
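As a minimal sketch of the entity-lookup step – assuming only DBpedia's public SPARQL endpoint (https://dbpedia.org/sparql), and not Rattle Research's actual pipeline – something along these lines resolves a name to candidate DBpedia resources, which could then be followed on to linked resources such as MusicBrainz:

```python
# Minimal sketch: look up an entity label in DBpedia via its public SPARQL
# endpoint. Illustration of the general idea only.
import requests

ENDPOINT = "https://dbpedia.org/sparql"

def dbpedia_lookup(label, lang="en"):
    """Return DBpedia resources whose rdfs:label matches the given string."""
    query = f"""
    SELECT DISTINCT ?resource WHERE {{
      ?resource rdfs:label "{label}"@{lang} .
    }} LIMIT 10
    """
    resp = requests.get(
        ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [row["resource"]["value"] for row in rows]

# e.g. dbpedia_lookup("Radiohead") should include
# "http://dbpedia.org/resource/Radiohead" among the candidates.
```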

Helen Lippell, Karen Loasby, and Silver Oliver then talked about three projects at the BBC. Helen described a joint project involving several organisations, including the BBC and the Financial Times, to generate metadata automatically to tag news stories. They looked for specific names – such as company names – but encountered problems with company names changing, different forms of names, nicknames, and so on. Other problems occurred when company names were ordinary language words (Next, thus, IF) or contained symbols or special characters (e.g. M&S, more th>n).
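A toy gazetteer matcher makes those pitfalls concrete; the company list and matching rules below are invented for illustration and are not the project's actual approach:

```python
# Toy gazetteer matcher illustrating the pitfalls described above.
import re

COMPANIES = ["Next", "M&S", "more th>n"]

def find_companies(text):
    hits = []
    for name in COMPANIES:
        # re.escape is needed because names like "M&S" and "more th>n"
        # contain characters that are awkward inside a regex.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        # Case-sensitive matching is one crude guard against ordinary-language
        # words: "Next" the retailer rather than "next" the adverb -- though
        # it still mis-fires when "Next" starts a sentence.
        if re.search(pattern, text):
            hits.append(name)
    return hits

print(find_companies("Next reported strong sales; M&S followed."))  # ['Next', 'M&S']
print(find_companies("The next results are due in May."))           # []
```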

Karen Loasby described a way of prompting journalists to add metadata to articles being added to the BBC’s content management system. Automated analysis worked best for short, factual content. The journalists themselves were often confused by the suggestions, however, and found it hard to grasp the purpose of the metadata. The system is still used in a modified form, and with some editorial supervision.

Silver Oliver then discussed projects to use statistically-based categorisation to try to pull together related content from different repositories. The method was more successful for some topics than for others, and in some cases rules-based methods worked better than statistical ones. A major strength of rules-based methods is that they tend to be less of a “black box”: when irrelevant connections are made, you can usually look at the rules, see why the system has found a link, and adjust the rule manually, whereas with a statistically-based approach it is harder to diagnose why false connections have occurred. A disadvantage of the rules-based approach is that the rules need manual updating from time to time.
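To make the “less of a black box” point concrete, here is a hypothetical keyword-rule categoriser (the rules are invented): every match carries the rule that produced it, so a bad link can be traced back to a rule and the rule edited, which is exactly what a purely statistical model makes difficult.

```python
# Sketch of an "inspectable" rules-based categoriser; rules are invented.
RULES = {
    "business": ["merger", "shares", "profit warning"],
    "sport":    ["transfer", "fixture", "kick-off"],
}

def categorise(text, rules=RULES):
    """Return (category, keyword) pairs explaining every match."""
    text_lower = text.lower()
    matches = []
    for category, keywords in rules.items():
        for kw in keywords:
            if kw in text_lower:
                matches.append((category, kw))   # the "why" is right here
    return matches

print(categorise("Shares fell after the profit warning."))
# [('business', 'shares'), ('business', 'profit warning')]
# A statistical classifier would return the category but not this trace.
```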

The presentations were followed by a panel session. Issues discussed included granularity – automated systems seem to work well with short pieces of content, but longer items, such as books, might need to be broken into smaller units. Historical archives may need different sorts of semantic analysis from ephemeral content such as news stories, and multilingual mapping may be difficult if languages do not have one-to-one correspondences between concepts, so hierarchies have to be re-constituted rather than simply translated.