The first ISKO UK conference, “Content Architecture: Exploiting and Managing Diverse Resources”, was held in London on June 22-23.
The opening keynote by David Crystal was an entertaining overview of semantic analysis. He pointed out that around 25% of words in any language are culture-specific and that 25% of words in English have spelling variations. His work with adpepper media focused on targeting adverts, but also on avoiding inappropriate placements where an advert might be offensive or insensitive (linking ads for alcoholic drinks to content about alcohol addiction treatments, for example). He noted that semantic analysis can’t cope with sarcasm or poetry, but fortunately most consumer websites don’t use much of either!
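The ad-placement idea can be sketched in a few lines. This is a toy of my own invention, not adpepper media’s system: semantic analysis is reduced to naive keyword topic detection, plus a block-list of sensitive ad/page pairings (all categories and keywords below are made up for illustration).

```python
# Toy sketch (not adpepper media's actual system): refuse to place an ad
# category on pages whose detected topics make the pairing insensitive.

SENSITIVE_PAIRINGS = {
    # ad category -> page topics it should never appear alongside
    "alcoholic drinks": {"alcohol addiction", "drink driving"},
    "air travel": {"plane crash"},
}

TOPIC_KEYWORDS = {
    "alcohol addiction": ["addiction treatment", "alcoholism", "rehab"],
    "plane crash": ["crash", "fatalities"],
}

def detect_topics(page_text, topic_keywords):
    """Naive topic detection: a topic matches if any of its keywords appear."""
    text = page_text.lower()
    return {topic for topic, words in topic_keywords.items()
            if any(w in text for w in words)}

def is_safe_placement(ad_category, page_topics):
    """True unless the page's topics intersect the ad's blocked topics."""
    blocked = SENSITIVE_PAIRINGS.get(ad_category, set())
    return not (blocked & page_topics)

page = "New alcoholism rehab centre opens, offering addiction treatment."
topics = detect_topics(page, TOPIC_KEYWORDS)
print(is_safe_placement("alcoholic drinks", topics))  # False: blocked pairing
```

Real systems do far more than substring matching, of course; the point is only the shape of the check.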
I enjoyed hearing Shana Wagger and Randi Park talk about the challenges they had encountered in trying to integrate a set of World Bank publications into a single online resource. Having been involved in similar projects, I found it reassuring to hear about all the familiar problems – you can’t just scan/OCR a printed book and bung it online. For example, in print you can use the same title for different objects (a paragraph heading, the title of a graph, a photo) and it is quite clear what each refers to, but turn them into online links and they need disambiguation. They used semantic analysis to create a knowledge architecture and managed metadata using Teragram.
Ann O’Brien, Steve Probets, and Libo Eric Si talked about their use of DDC as a switch language to facilitate library portal cross-browsing. The main problem highlighted was the loss of precision, as the switch language determines the highest level of precision available.
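That precision loss is easy to see in a small sketch. The portal terms and mappings below are invented for illustration (only the DDC class 638, Insect culture, is real): each portal maps its local terms to a DDC class, and cross-browsing goes local term → DDC → the other portal’s terms, so results can only ever be as precise as the shared DDC class.

```python
# Illustrative sketch (invented mappings): two portals mapped to DDC as a
# switch language. Distinct local terms that share a DDC class become
# indistinguishable when cross-browsing, so precision is capped by the
# granularity of the switch.

PORTAL_A_TO_DDC = {
    "beekeeping": "638",        # DDC 638: Insect culture
    "silkworm rearing": "638",  # same class as beekeeping
}
DDC_TO_PORTAL_B = {
    "638": ["apiculture", "sericulture"],  # both live under the one class
}

def cross_browse(term_a):
    """Route a Portal A term through DDC to Portal B's terms."""
    ddc = PORTAL_A_TO_DDC[term_a]
    return DDC_TO_PORTAL_B[ddc]

# A query about beekeeping drags in sericulture too - broader than asked.
print(cross_browse("beekeeping"))  # ['apiculture', 'sericulture']
```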
Ian Davis of Dow Jones outlined the three approaches to image retrieval: free text tagging, controlled vocabulary tagging, and content-based image retrieval (CBIR). The layers of information in any image make text tagging a bit hit and miss, even for experienced indexers, and while CBIR is fun, it is still in its infancy. CBIR is fine if you want “something purple in the middle of something green”, for example, but I couldn’t think of many contexts in which that would be useful. It might be handy if you are creating artworks and are prepared to browse until you find a suitable subject.
One very good suggestion was a “trendy tag” feature that would let you add buzzwords and then remove them later, when they dropped out of fashion. He also explained that there are no fixed correlations between objects and concepts, using the example of a lightbulb representing ideas. A bright shiny lightbulb in western cultures represents ideas and innovation, but he showed a picture of a very sad dead lightbulb on its side, which didn’t suggest bright ideas at all.
People are much more critical of irrelevant image retrieval than of irrelevant text in search results (presumably because irrelevant text is easier to ignore), and controlled vocabularies are better at reducing irrelevance than other methods at the moment. Creative people often want to browse around in search of inspiration, and so don’t want to use precise keyword searching.
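For fun, the “purple in the middle of green” query can be sketched directly. This is a pure-Python toy on a pixel grid, nothing like a production CBIR engine, and the colour thresholds are arbitrary:

```python
# Minimal CBIR-style sketch (invented): match images whose centre region is
# predominantly purple and whose border is predominantly green.

def is_purple(pixel):
    r, g, b = pixel
    return r > 120 and b > 120 and g < 100

def is_green(pixel):
    r, g, b = pixel
    return g > 120 and r < 100 and b < 100

def matches_query(pixels):
    """pixels: square grid (list of rows) of (r, g, b) tuples."""
    n = len(pixels)
    third = n // 3
    centre = [pixels[y][x] for y in range(third, 2 * third)
                           for x in range(third, 2 * third)]
    border = [pixels[y][x] for y in range(n) for x in range(n)
              if not (third <= y < 2 * third and third <= x < 2 * third)]
    frac = lambda test, region: sum(map(test, region)) / len(region)
    return frac(is_purple, centre) > 0.5 and frac(is_green, border) > 0.5

PURPLE, GREEN = (160, 40, 200), (30, 180, 40)
image = [[PURPLE if 3 <= y < 6 and 3 <= x < 6 else GREEN
          for x in range(9)] for y in range(9)]
print(matches_query(image))  # True
```

A real engine would work on colour histograms, texture, and shape features rather than per-pixel thresholds, but the query semantics are the same.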
Chris Town described an ontology-based approach to image retrieval.
Danny Budzak’s presentation on “history 2.0” and differing classifications according to different social perspectives (dockworkers and management, for example) was hugely entertaining and interesting. He was fascinated by the way we are all classifiers now, as the web has made us not only publishers, but librarians too.
Matthew Hodgson pointed out that however clever our computer systems get, information still has to be produced and interpreted by humans. He used semantic analysis to simplify complex drug prescription regulations to construct an online system that would allow doctors to access the information they needed quickly, easily, and accurately.
The second day’s keynote speech was by Clifford Lynch of the Coalition for Networked Information (CNI). He enthused about how the availability of data allows everyone to contribute to knowledge creation, especially in certain branches of science. For example, amateur astronomers and ecologists – citizen scientists – are measuring, recording, and submitting data that is hugely useful to the “professional” scientific community.
I was very amused by his description of trying to provide resources for scientists and humanities scholars: the scientists had a very specific set of demands, limited to their particular domain, but the humanists wanted a database of everything anybody had ever said, created, done, thought, or observed, throughout the whole of history, and from all over the world!
He was interested in how “citizen humanities” could be constructed and the problems of authority, mediating disagreements, and where the distinctions between public and private should be drawn. Should everybody be in the Dictionary of National Biography, for example? How do traditional classifications break down when everybody links to everything else?
Tom Scott explained the BBC’s approach to using Linked Data – essentially URIs and metadata. He stressed that it is a mistake to try to make people computer literate; we should instead be trying to make computers human literate, as increasing levels of abstraction make technology more human-centric. The BBC has built up content as microsites, but this frustrates browsing, as it is not possible to follow a semantic thread.
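The semantic-thread idea reduces to a neat sketch. The URIs below are invented (loosely in the style of bbc.co.uk/programmes paths, not real BBC identifiers): once content is published as URI-keyed triples, a browser can hop from one resource to the next instead of being stuck inside a single microsite.

```python
# Hedged sketch (invented URIs and predicates): Linked Data as a set of
# (subject, predicate, object) triples. Following a semantic thread is just
# repeatedly looking up the triples for the URI you are currently on.

TRIPLES = [
    ("/programmes/b007", "features", "/species/lion"),
    ("/species/lion", "livesIn", "/habitats/savannah"),
    ("/habitats/savannah", "featuredIn", "/programmes/b009"),
]

def follow(subject):
    """One hop of the thread: every (predicate, object) for this subject."""
    return [(p, o) for s, p, o in TRIPLES if s == subject]

# Thread: programme -> species -> habitat -> another programme.
print(follow("/programmes/b007"))     # [('features', '/species/lion')]
print(follow("/species/lion"))        # [('livesIn', '/habitats/savannah')]
print(follow("/habitats/savannah"))   # [('featuredIn', '/programmes/b009')]
```

With microsites, each of those hops would dead-end at a site boundary; with shared URIs, the links compose into one browsable graph.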
There were also presentations on improving folksonomic tagging by use of controlled suggestions, integrating biodiversity data sets, semantic metadata annotation, lots about cloud computing, the FRSAD model, semantic interoperability in an international KOS, and between distributed digital repositories.