There have been lots of interesting presentations at Web Science 2010 in Raleigh. My metadata meerkats were popular – hard to beat charismatic megafauna. The papers and posters are online at The Journal of Web Science.
Journalism.co.uk :: Do publications need style guides for data? That would be nice – can we tidy them all up and semantic webify everything while we’re at it?
ISKO UK’s KOnnect blog notes that at least Google is taking metadata seriously.
Chatting about Wolfram Alpha the other day, it was pointed out to me that specialist knowledge for a general audience is actually a very niche area, and that this is the source of the hype: you need to persuade your VC funders you are revolutionary, when in fact you have a very tricky business model. Serious researchers will already be using specialised systems, and most people want to look up things like train times rather than the atomic weights of elements, so your market is students, journalists, and others with an intermediate level of interest. Perhaps there are enough of them in the world to generate plenty of advertising revenue, but it seems like a tough call.
I hope the funders are happy with the old reference publishing model – lots of investment up front, in the hope not that the finished product will generate huge initial profits, but that it will enjoy a long, steady life. Wolfram Alpha employed 150 people in essentially traditional content creation roles, and it will be interesting to see how they get their money back. Google doesn’t have to pay for its own content or metadata creation!
I went to the ISKO event on Thursday. The speaker, Dave Snowden of Cognitive Edge was very entertaining. He has already blogged about the lecture himself.
He pointed out that humans are great at pattern recognition (“intuition is compressed experience”) and are great satisficers (computers are great at optimising), and that humans never read or remember the same word in quite the same way (has anyone told Autonomy this?). I suppose this is the accretion of personal context and experience affecting your own understanding of the word. I remember as a child forming very strong associations with the names of people I liked or disliked – if I disliked the person, I thought the name itself was horrible. This is clearly a dangerous process (and one I hope I have grown out of!) but presumably it is part of the way people end up with all sorts of irrational prejudices, and it also explains why “reclaiming” words like “queer” eventually works: if you keep imposing new contexts on a word, those contexts will come to dominate. This feeds into taxonomy work, as it explains the intensity people feel about how things should be named – and why they won’t all agree. It must also be connected to why language evolves (and to how outdated taxonomies start to cause rather than solve problems – like Wittgenstein’s gods becoming devils).
Snowden also talked about the importance of recognising the weak signal, and has developed a research method based on analysing narratives, using a “light touch” categorisation (to preserve fuzzy boundaries) and allowing people to categorise their own stories. He then plots the points collected from the stories to show the “cultural landscape”. If this is done repeatedly, the “landscapes” can be compared to see if anything is changing. He stressed that his methodology required the selection of the right level of detail in the narratives collected, disintermediation (letting people speak in their own words and categorise in their own way within the constraints), and distributed cognition.
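The landscape-plotting step might be sketched roughly as follows – a minimal toy, in which each storyteller self-places their story as a point on two axes and the points are bucketed into a coarse grid so that two collection rounds can be compared. The axes, coordinates, and rounds here are invented for illustration, not taken from Snowden's actual method.

```python
from collections import Counter

def landscape(stories, bins=3):
    """Bucket self-placed story points (x, y in [0, 1]) into a coarse
    grid and count stories per cell -- a crude 'cultural landscape'."""
    grid = Counter()
    for x, y in stories:
        cell = (min(int(x * bins), bins - 1), min(int(y * bins), bins - 1))
        grid[cell] += 1
    return grid

# Two hypothetical collection rounds, each point self-placed by the storyteller.
round_1 = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9)]
round_2 = [(0.1, 0.2), (0.7, 0.8), (0.8, 0.9), (0.85, 0.95)]

# Counter subtraction keeps only cells that gained stories between rounds,
# i.e. where the landscape is shifting.
shift = landscape(round_2) - landscape(round_1)
```

The coarse bucketing is deliberate – it is one way of preserving the “light touch”, fuzzy-boundary categorisation he described, rather than forcing stories into precise categories.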
I particularly liked his point that when people self-index and self-title they tend to use words that don’t occur in the text, which is a serious problem for semantic analysis algorithms (although I would comment that third party human indexers/editors will use words not in the text too – “aboutness” is a big problem!). He was also very concerned that computer scientists are not taught to see computers as tools for supporting symbiosis with humans, but as black box systems that should operate autonomously. I completely agree – as is probably quite obvious from many of my previous blog posts – get the computers to do the heavy lifting to free up the humans to sort out the anomalies, make the intuitive leaps, and be creative.
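The self-titling problem is easy to make concrete: a few lines suffice to pull out the title words that never occur in the body text – exactly the words a purely text-derived keyword extractor cannot find. The example title and text are invented.

```python
def novel_title_words(title, text):
    """Words the self-indexer used in the title that never occur in the
    text itself -- the cases keyword-extraction algorithms will miss."""
    text_words = set(text.lower().split())
    return {w for w in title.lower().split() if w not in text_words}

# A hypothetical self-titled piece: none of the title words appear in the body.
missing = novel_title_words("A cautionary tale",
                            "The server fell over during the demo again")
```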
UPDATE: Here’s an excellent post on this talk from Open Intelligence.
Truevert: What is semantic about semantic search? is an easy introduction to the thinking behind the Truevert semantic search engine. I was heartened by the references to Wittgenstein and the attention Truevert have paid to the work of linguists and philosophers. So much commercial search seems to have been driven by computer scientists with little interest in philosophy – or, if they were interested, they kept quiet about it (any counterexamples out there?). Perhaps philosophers have not been so good at promoting themselves either. Perhaps the Chomskyan attempt to divide linguistics into “hard scientific” linguistics and “fuzzy” disciplines like sociolinguistics has not helped.
As a believer in interdisciplinary and collaborative approaches, I have always wondered why we seemed to be so bad at building these bridges and information science has always struck me as a natural crossing point. Of course, there has been a lot of collaboration, but my impression is that academia has been rather better at this than the commercial world, with organisations like ISKO UK working hard to forge links. Herbert Roitblat at Truevert is obviously proud of their philosophical and linguistic awareness, and more interestingly, thinks it is worth broadcasting in a promotional blog post.
The National Centre for Text Mining is “the first publicly-funded text mining centre in the world”. It is an initiative of Manchester and Liverpool universities, working with the University of California at Berkeley and the University of Tokyo. They appear to be working mainly on biology texts at the moment, but I enjoyed the explanations of their techniques and processes, despite the technicality. There are links to events and seminars that are aimed at the scientific community but some would probably be of interest to more general semantic web enthusiasts.
I very much enjoyed the presentations given at the ISKO UK event on semantic analysis technologies yesterday and was particularly heartened by the emphasis placed by almost all of the speakers on the need for a human factor to train, maintain, and moderate software systems. My overall impression was that you can have complex software systems that work very well, but you need a lot of human input to set them up – feeding them carefully crafted controlled vocabularies, taxonomies, and especially ontologies – and preferably checking their output.
The first presentation by Luca Scagliarini of Expert System outlined their use of large-scale taxonomies to create an enhanced index that included linked concepts and relationships. The “turbo-charged” index – a sort of faceted taxonomy – is then run against content to create a sophisticated filterable search function.
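The idea of an index enhanced with linked concepts can be sketched in miniature: terms found in a document are expanded with related concepts from a taxonomy before indexing, so a search on a broader concept retrieves documents that never use the word. This is my own toy illustration of the general technique, not Expert System's actual implementation, and the taxonomy and documents are invented.

```python
# Toy taxonomy: term -> linked broader/related concepts.
TAXONOMY = {
    "meerkat": ["mammal", "animal"],
    "oak": ["tree", "plant"],
}

def build_index(docs):
    """Index each document under its own terms plus their linked concepts."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            for concept in [term] + TAXONOMY.get(term, []):
                index.setdefault(concept, set()).add(doc_id)
    return index

docs = {"d1": ["meerkat"], "d2": ["oak"]}
index = build_index(docs)
# index["animal"] now contains "d1" even though "animal" never appears in d1.
```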
Jeremy Bentley of smartlogic described taxonomies as “semantic middleware” adding that the notion of having one standard taxonomy has given way to a recognition of the need for multiple taxonomies to reflect multiple viewpoints. He pointed out that shopping in a supermarket would be practically impossible if none of the tins had labels and you had to look in each one to find out what was in it, but that this is essentially what search engines do – they look in all the documents to decide what is in them. He stressed that automatic generation of metadata is essential because of the volume and the need for consistency, but that automated systems are not yet good enough to build well-crafted ontologies, as they cannot allow for the complexities of context-specific requirements and differences between subject domains. Nevertheless, human ontologists can be helped by automatically generated suggestions.
Rob Lee of Rattle Research then described a way of leveraging information in DBpedia – the content of Wikipedia articles output as RDF triples – as a controlled vocabulary to generate links from documents, such as news stories, to other free online resources, such as MusicBrainz. DBpedia contains disambiguation information, which improves the relevance of links. By adding a search engine layer (they used Lucene to pick out keywords), even more interesting links can be made between resources. However, the systems were most successful when restricted to simple, identifiable entities – such as people, places, and companies. Such entities can be matched against dictionary/gazetteer-style authority files, which is harder to do for broad subject areas.
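The disambiguation step can be illustrated with a minimal sketch: surface forms map to candidate resources in a DBpedia-style authority file, and a crude overlap check against the surrounding document words picks between ambiguous candidates. The entries, context sets, and URIs here are invented for illustration and are not the real DBpedia data or Rattle's actual system.

```python
# Hypothetical authority file: surface form -> candidate resources,
# each with a set of disambiguating context words.
AUTHORITY = {
    "paris": [
        {"uri": "dbpedia:Paris", "context": {"france", "city", "seine"}},
        {"uri": "dbpedia:Paris,_Texas", "context": {"texas", "usa"}},
    ],
}

def link(mention, doc_words):
    """Pick the candidate whose context overlaps most with the document."""
    candidates = AUTHORITY.get(mention.lower(), [])
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: len(c["context"] & doc_words))["uri"]

uri = link("Paris", {"the", "city", "on", "the", "seine", "in", "france"})
```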
Helen Lippell, Karen Loasby, and Silver Oliver then talked about three projects at the BBC. Helen described a joint project involving several organisations, including the BBC and the Financial Times, to generate metadata automatically to tag news stories. They looked for specific names – such as company names – but encountered problems with company names changing, different forms of names, nicknames, etc. Other problems occurred when company names were ordinary language words (Next, Thus, IF) or when company names contained symbols or special characters (e.g. M&S, more th>n).
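The trouble with ordinary-word company names is easy to demonstrate: a naive gazetteer would tag every “next” in a news story, so one common workaround is to match ambiguous names only when a contextual cue word appears nearby. The company names below are the ones mentioned in the post; the cue-word heuristic and cue list are my own invented illustration, not the project's actual approach.

```python
import re

# Gazetteer of company names; True marks names that are also ordinary words
# and so need contextual evidence before being tagged.
GAZETTEER = {"M&S": False, "more th>n": False, "Next": True}
CUES = {"retailer", "shares", "profits", "plc"}

def tag_companies(text):
    """Tag company names, requiring a cue word for ambiguous names."""
    words = set(w.lower() for w in re.findall(r"\w+", text))
    found = []
    for name, ambiguous in GAZETTEER.items():
        if name in text and (not ambiguous or words & CUES):
            found.append(name)
    return found
```

Note that the case-sensitive substring match lets “M&S” and “more th>n” through despite their special characters, while the cue check keeps “see you next week” from being tagged.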
Karen Loasby described a way of prompting journalists to add metadata to articles being added to the BBC’s content management system. Automated analysis worked best for short, factual content. The journalists themselves were often confused by the suggestions, however, and found it hard to grasp the purpose of the metadata. The system is still used in a modified form, and with some editorial supervision.
Silver Oliver then discussed projects to use statistically-based categorisation to try to pull together related content from different repositories. The method was more successful for some topics than others, and in some cases rules-based methods were more successful than statistical ones. A major strength of rules-based methods is that they tend to be less of a “black box”. When irrelevant connections are made, you can usually look at the rules and see why the system has found a link, and manually adjust the rule, but with a statistically-based approach it is harder to diagnose why false connections have occurred. A disadvantage of the rules-based approach is that the rules need manual updating from time to time.
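The inspectability point can be made concrete with a minimal rules-based categoriser that returns not just the category but the rule that fired, so a false match can be traced straight back to the rule that caused it. The rules and categories here are invented for illustration.

```python
# Each rule is a (category, predicate, description) triple; the description
# doubles as the human-readable explanation of why a match was made.
RULES = [
    ("sport", lambda t: "football" in t or "cricket" in t,
     "mentions football or cricket"),
    ("finance", lambda t: "shares" in t and "bank" in t,
     "mentions both shares and bank"),
]

def categorise(text):
    """Return (category, explanation) for the first rule that fires."""
    text = text.lower()
    for category, rule, description in RULES:
        if rule(text):
            return category, description
    return None, None

category, why = categorise("Bank shares fell sharply today")
```

With a statistical classifier the equivalent of `why` is buried in model weights, which is exactly the diagnostic difficulty described above.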
The presentations were followed by a panel session. Issues discussed included granularity – automated systems seem to work well with short pieces of content, but longer items, such as books, might need to be broken into smaller units. Historical archives may need different sorts of semantic analysis from ephemeral content like news stories, and multilingual mapping may be difficult if languages do not have one-to-one correspondences between concepts, so hierarchies have to be re-constituted rather than simply translated.
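The granularity fix the panel hinted at – breaking long items into smaller units – might be sketched as a simple word-budget chunker, grouping consecutive paragraphs until a size limit is reached so each chunk is closer to the short-document size automated tagging seems to handle best. This is my own illustrative sketch, not anything presented at the event.

```python
def chunk(paragraphs, max_words=100):
    """Group consecutive paragraphs into chunks of at most max_words words
    (a paragraph longer than the budget still gets its own chunk)."""
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks

# A hypothetical long item: ten identical three-word paragraphs.
parts = chunk(["one two three"] * 10, max_words=9)
```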
How the Semantic Web Will Change Information Management: Three Predictions makes the semantic web sound so easy! Well worth reading for a very straightforward overview of what’s involved.
Many thanks to Traugott Koch for these links:
NKOS Workshop at ECDL in Aarhus.
NKOS Special Session at DC 2008 in Berlin, all in a single PDF file.
The Joint NKOS/CENDI Workshop “New Dimensions in Knowledge Organization Systems”, in Washington, DC, USA on September 11, 2008. “Thanks to the contributors, programme committees, chairs and the large and very active audiences. We invite your active participation 2009 as well. Watch the website.”
Applying Turing’s Ideas to Search – Boxes and Arrows: The design behind the design applies the Turing test to the problem of understanding searches in order to provide better results. Ferrara suggests we need to revisit the parsing approach (moving on from the pattern-matching paradigm) and to develop “social ontologies” in order to get better search results. The “social ontologies” are – if I have understood correctly – wikis of relationships that can then be accessed by search engines to make semantic inferences. The ontologies would have to be socially constructed as there is just too much information out there to put it all together any other way. It struck me that this is a bit like what SKOS is essentially hoping to do. Once upon a time I wanted to build a fully linked thesaurus of the English language where every word was linked to every related word, so you could navigate through the entire language, following pathways of meaning, with no word left out. People thought it was a daft idea, but compared with trying to build ontologies of everything, it doesn’t seem so crazy. Just shows how times have changed!
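That fully linked thesaurus idea can be sketched in a few lines: treat the vocabulary as a graph of related words and use a breadth-first search to follow a pathway of meaning between any two terms. The word graph below is a tiny invented fragment, just to show the navigation.

```python
from collections import deque

# A tiny invented fragment of a fully linked thesaurus: word -> related words.
THESAURUS = {
    "hot": ["warm"],
    "warm": ["hot", "mild"],
    "mild": ["warm", "gentle"],
    "gentle": ["mild"],
}

def pathway(start, goal):
    """Breadth-first search for the shortest pathway of meaning."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in THESAURUS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

route = pathway("hot", "gentle")  # follows hot -> warm -> mild -> gentle
```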