Information Retrieval

29th June, 2008 Fran 4 comments

The ISKO event at UCL on Thursday was fascinating. It was a real treat to hear the eminent Brian Vickery summarise the last 75 years of information retrieval developments, setting out the key questions to be answered and the challenges still to be overcome. At 90 years old he has a unique overview, having been a key member of the Classification Research Group and director of SLAIS. He pointed out that most retrieval systems have a particular user community in mind and that this affects the choice of information collected as well as the way the collection is structured. He also argued that being accepted as part of a specialist community involves use of the specialist terminology. I am very interested in the reverse of this – that lack of access to the “right” terminology is exclusionary. It’s all about shibboleths! He said that key questions at the moment include whether the costs and effort of building expensive retrieval systems like taxonomies are justified, whether the need for harmonisation is increasing, what the future is for general ontologies, and what needs to be done to improve statistical retrieval systems.

Stephen Robertson from Microsoft Research, who developed search algorithms that still power most of the big search engines today, talked about the TREC competition, which has almost always been won by statistically based searches. He drew a distinction between general purpose search and specialised search for highly specific contexts – such as individual organisations – adding that in general specialist search is lagging behind. He also said that we need to find ways of feeding other sources of knowledge – such as taxonomies – into statistical searching because only by yoking the power of both will we get marked improvements.
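One way to picture feeding a taxonomy into a statistical search is query expansion: the taxonomy supplies related terms and a statistical ranker scores documents on the expanded query. The sketch below is only an illustration of that general idea, not anything presented at the event. The tiny taxonomy, the sample documents, and the function names `expand_query` and `tfidf_scores` are all made up for the example, and the scoring is a plain TF-IDF sum rather than any production ranking function.

```python
import math
from collections import Counter

# Hypothetical mini-taxonomy: a term mapped to narrower or related terms.
TAXONOMY = {
    "dog": ["puppy", "hound"],
}


def expand_query(terms, taxonomy):
    """Return the query terms plus any taxonomy terms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + taxonomy.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def tfidf_scores(query_terms, docs):
    """Score each document by the summed TF-IDF weight of the query terms."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenised if term in other)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)
            score += counts[term] * idf
        scores.append(score)
    return scores


# Illustrative documents only.
docs = [
    "the puppy chased the ball",
    "stock markets fell sharply",
    "a dog and a hound slept",
]

plain = tfidf_scores(["dog"], docs)
expanded = tfidf_scores(expand_query(["dog"], TAXONOMY), docs)
```

With the plain query the first document scores zero because it never uses the word “dog”; with the expanded query it is found through “puppy”. The statistical scoring is unchanged, and the taxonomy only widens what it gets to score.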

Ian Rowlands then talked about the much-publicised JISC survey on the “Google generation”, concluding that they are much the same as other generations. In all age groups about 20% are expert users of technology and 20% technophobes, with everyone else muddling along in the middle. The JISC project team observed that some people spend a long time looking at online navigation systems, sometimes without accessing any articles at all. It is hard to know whether this counts as success or failure. I can think of scenarios either way – often I just want to know what’s there and will return later, sometimes it means I can rule out a source as useless (which might be a good thing if it has saved me the time of reading through irrelevant articles or might be a bad thing if it means I can’t find what I need).

There was then a very interesting discussion in which people expressed concerns about information overload and the way that students find it hard to distinguish between authoritative and trivial sources. Ian lamented the fact that online you don’t have the visual clues that you had in physical libraries – big chunky leather-bound books have an obvious “weight” and authority. Personally, I wonder how much this has been driven by the desire of publishers and teachers to make educational resources “fun”. If all your textbooks look like adverts and all your online learning resources look like pop videos, how are you going to learn which is which? It is perfectly possible to have an authoritative online style and publishers will produce it if that is what sells best. Throughout my career I have urged “authoritativeness” in design and been told by marketing departments that it isn’t what parents, teachers and kids want – they’ll only buy it if it looks flashy and fluffy! Another issue is the lack of a canon in a post-modern world – but that’s another story!

Here’s a post on the event on Madi Solomon’s Taxonomy Society blog.

Language and Social Identity

28th June, 2008 Fran

Language and Social Identity is a collection of fascinating sociolinguistic papers. Dealing with gender and ethnicity, the researchers seek to show how stereotypes often arise from simple linguistic misunderstandings. For example, one paper argues that speakers of Indian English tend to use pronouns, conjunctions, and intonation very differently to speakers of UK English. UK speakers typically fail to pick up on the Indian English speakers’ cues and assume that what they are saying is confused or incoherent. Conversely, Indian English speakers think the UK English speakers must be either daft or extremely patronising because of their apparent failure to understand very simple logic. Another paper claims that men and women typically use utterances like “mm hmm” to mean different things. Women mean simply “I’m listening”, whereas men mean emphatically “I agree”. Men then think that women keep changing their minds and women think men just aren’t listening!

The most relevant paper from a taxonomic point of view was one on the highly charged political nature of language use in Montreal. The need to cut across language differences and negotiate norms of communication when diverse groups feel they have something to lose through compromise mirrors the inter-departmental language mediation that usually needs to happen in taxonomy projects.

NKOS events

25th June, 2008 Fran

The 7th European Networked Knowledge Organization Systems (NKOS) Workshop at the European Conference on Research and Advanced Technology for Digital Libraries is taking place in Aarhus in September. Topics of contributions include:

* Concepts of Digital Libraries and digital content
* Collection building, management and integration
* System architectures, integration and interoperability
* Information organisation, search and usage
* Multilingual information access and multimedia content management
* User interfaces for digital libraries
* User studies and system evaluation
* Digital archiving and preservation: methodological, technical and legal issues
* Digital Library applications in e-science, e-learning, e-government, cultural heritage, etc.
* Web 2.0 and associated technologies

The International Conference on Dublin Core and Metadata Applications 2008 is taking place in Berlin in September, covering the intersection between NKOS and metadata, especially the main topic of the conference: Metadata for Semantic and Social Applications.

There is an NKOS events 2008 Wiki with details about what will be discussed, with the option to contribute.

SAGE journals free trials

21st June, 2008 Fran

SAGE will be running a free trial to its entire portfolio of Information Science journals throughout July and August. To sign up (for access to journals such as the IFLA Journal, Journal of Information Science and Information Development) go to http://online.sagepub.com/cgi/freetrial (from the 1st of July). Alternatively email infoscience@sagepub.co.uk to be informed when the trial goes live.

The Mundaneum

18th June, 2008 Fran

“The Mundaneum Museum Honors the First Concept of the World Wide Web”: a New York Times review of the Mundaneum Museum.

Sorting Things Out

15th June, 2008 Fran 1 comment

Sorting Things Out – Classification and its Consequences is a joy of a book, crammed with research and insights. It is very well written but is aimed at a serious academic audience, so it is pretty dense and packed with references. Bowker and Star examine in depth the development of the International Classification of Causes of Death, going back to 17th century archives and considering how something as apparently obvious and clear-cut as death is in fact mired in political, religious, and economic biases. They go on to discuss the treatment of TB patients and the development of the Nursing Interventions Classification, both of which would appear to be “objectively measurable” but are revealed to be complex intertwinings of various pressures. They then assess South Africa’s system of apartheid from the point of view of classification, showing how the arbitrary categorisation of people added to the brutality and cruelty of the regime. The book is not just a stark warning of how dominant regimes can use classification as a tool of oppression, but is also an important investigation of the power plays involved in all categorisations.

Essential Classification

14th June, 2008 Fran

Here’s a review of Essential Classification by Vanda Broughton, a core Library Studies textbook and a very easy read. It’s a sound introduction to classification – very practical and really aimed at trainee librarians, but it includes enough background and theory to keep me interested, including some pointers to the biases in the big classification systems. I was also intrigued by the assertion that people find it easier to remember numbers, so numerical shelfmarks are generally more popular than those based on letters. I always thought it was easier to remember letters, because you can make them into little phrases, but perhaps that’s just me!

ISKO UK – Open Meeting

10th June, 2008 Fran

The ISKO UK (International Society for Knowledge Organization) event is on June 26th. Three eminent speakers, Brian Vickery, Stephen Robertson and Ian Rowlands, will address the issues that have dominated the information retrieval agenda since the 1950s, and still present challenges and opportunities for the future.

Text mining

5th June, 2008 Fran 3 comments

Last night I went to the ERBI IT special interest group meeting on text mining. It was a real treat. Richard Kidd from the Royal Society of Chemistry opened by describing their award-winning Prospect project which applies semantic web technologies to primary research publishing. Essentially, along with the Sciborg project they have developed software to identify chemical entities using text mining and ontologies, which provides rich sources of links and metadata and helps their editors validate texts. There is a fantastic tool called OSCAR that can extract all sorts of information from chemistry texts. Taxonomies and ontologies plug in to these tools and systems to provide the base data. Richard stressed the need for a taxonomy to be a living thing that keeps up with terminology changes, and also talked about the way the RSC use “Tiny Ontologies All Strung Together (TOAST)” as there is no over-arching comprehensive chemistry ontology.

Phil Hastings then gave a summary of the work of Linguamatics, who have developed text-mining software for life sciences. They use natural language processing to allow “relationship searching” and the construction of complex queries, offering more sophisticated answers than can be provided by keyword searches across flat text by conventional search engines. They too use “bolt-on” taxonomies and ontologies that provide a sort of deep reference layer.

Julie Barnes from Biowisdom provided some practical examples of how “assertional metadata” can be used to help drug developers and clinicians assess the likely toxicity of certain compounds, side effects, etc. By focusing on creating high-quality metadata containing information about relationships, rather than just about the item itself, relationships and associations can easily be highlighted, helping pharmacologists to pick out key correlations from the huge oceans of data available. I particularly liked her contention that “the name or label we give something sometimes holds us in a dogma that stops us seeing something new” and that using metadata to surface relationships can bring up unexpected links and so lead to shifts in thinking and new discoveries.

Finally, the esteemed Dr Peter Murray-Rust from the University of Cambridge talked in more detail about his development of OSCAR and Chemical Markup Language (CML), an extension of XML. He stressed the need for annotation standards in markup to minimise ambiguity, and argued that, as humans rarely reach more than 90% agreement over ontological issues, software vendors claiming their products can do better are unlikely to meet the challenge. However, he also made the point that “if we can communicate well, we can communicate both to humans and machines”.
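For anyone wondering how an agreement figure like that 90% might be arrived at, here is a minimal sketch. It is my own illustration rather than anything shown in the talk: the labels are invented, and alongside raw percent agreement it computes Cohen's kappa, a standard statistic that corrects for the agreement two annotators would reach by chance.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Invented annotations: is each of ten text spans a chemical entity or not?
annotator_a = ["chem", "chem", "other", "chem", "other",
               "chem", "chem", "other", "chem", "chem"]
annotator_b = ["chem", "chem", "other", "chem", "other",
               "chem", "chem", "other", "chem", "other"]

agreement = percent_agreement(annotator_a, annotator_b)
kappa = cohens_kappa(annotator_a, annotator_b)
```

The two invented annotators disagree on one span out of ten, so raw agreement is 90%, while kappa comes out lower because some of that agreement would be expected by chance alone.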

This bears out my experience in reference publishing. We always used a mixture of automated and human processing, with the software doing the “heavy lifting” and the editors tidying up the anomalies and absurdities by hand afterwards. I think it will be a long time before we find something better than this “best of both worlds” approach. We also aim for consistent modes of expression to facilitate searching, databasing, and comparability. It is possible to use a rules-based approach to writing and still produce something that sounds natural and is easy to read. Classic formats, such as methodologies for writing up experiments, are a typical example of consistent structuring.

Monthly Archives: June 2008