Last night I went to the ERBI IT special interest group meeting on text mining. It was a real treat. Richard Kidd from the Royal Society of Chemistry opened by describing their award-winning Project Prospect, which applies semantic web technologies to primary research publishing. Essentially, together with the SciBorg project, they have developed software that identifies chemical entities using text mining and ontologies. This provides rich sources of links and metadata and helps their editors validate texts. There is a fantastic tool called OSCAR that can extract all sorts of information from chemistry texts. Taxonomies and ontologies plug into these tools and systems to provide the base data. Richard stressed the need for a taxonomy to be a living thing that keeps up with terminology changes, and also talked about the way the RSC use “Tiny Ontologies All Strung Together (TOAST)”, as there is no over-arching comprehensive chemistry ontology.

Phil Hastings then gave a summary of the work of Linguamatics, who have developed text-mining software for life sciences. They use natural language processing to allow “relationship searching” and the construction of complex queries, offering more sophisticated answers than conventional search engines can provide with keyword searches across flat text. They too use “bolt-on” taxonomies and ontologies that provide a sort of deep reference layer.

Julie Barnes from BioWisdom provided some practical examples of how “assertional metadata” can be used to help drug developers and clinicians assess the likely toxicity and side effects of certain compounds. Creating high-quality metadata that describes how items relate to one another, rather than just the items themselves, makes associations easy to highlight and helps pharmacologists pick out key correlations from the huge oceans of data available. I particularly liked her contention that “the name or label we give something sometimes holds us in a dogma that stops us seeing something new” and that using metadata to surface relationships can bring up unexpected links and so lead to shifts in thinking and new discoveries.
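To make the idea of relationship-centred metadata a little more concrete, here is a minimal sketch of my own, not anything BioWisdom showed. It assumes assertions are stored as simple (subject, relation, object) triples; the compound, enzyme, and relation names are invented placeholders, and `two_step_links` is a hypothetical helper that surfaces indirect links through one intermediate entity.

```python
from collections import defaultdict

# Hypothetical assertions as (subject, relation, object) triples.
# All names here are invented for illustration only.
ASSERTIONS = [
    ("compound_A", "inhibits", "enzyme_X"),
    ("compound_B", "inhibits", "enzyme_X"),
    ("enzyme_X", "associated_with", "liver_toxicity"),
    ("compound_C", "binds", "receptor_Y"),
]


def build_index(assertions):
    """Group outgoing (relation, object) pairs by subject."""
    index = defaultdict(list)
    for subject, relation, obj in assertions:
        index[subject].append((relation, obj))
    return index


def two_step_links(assertions, target):
    """Return chains subject -> middle -> target, sorted for stable output."""
    index = build_index(assertions)
    found = []
    for subject, edges in index.items():
        for relation, middle in edges:
            # .get avoids adding new keys to the defaultdict while iterating
            for next_relation, obj in index.get(middle, []):
                if obj == target:
                    found.append((subject, relation, middle, next_relation, obj))
    return sorted(found)


if __name__ == "__main__":
    for chain in two_step_links(ASSERTIONS, "liver_toxicity"):
        print(" -> ".join(chain))
```

Neither compound is labelled as toxic in its own record; the link only appears once the relationships are followed, which is the kind of unexpected association the talk was describing.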

Finally, the esteemed Dr Peter Murray-Rust from the University of Cambridge talked in more detail about his development of OSCAR and Chemical Markup Language (CML), an XML-based language. He stressed the need for annotation standards in markup to minimise ambiguity. He also argued that, as humans rarely reach more than 90% agreement over ontological issues, any software vendors claiming their product can do better are unlikely to meet the challenge. However, he also made the point that “if we can communicate well, we can communicate both to humans and machines”.
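As a rough illustration of what “agreement” between human annotators means, here is a small sketch of my own. It assumes two annotators have each labelled the same ten terms, and it computes simple percent agreement; the labels are invented, and real annotation studies often use chance-corrected measures instead.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items")
    if not labels_a:
        raise ValueError("At least one labelled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Invented labels for ten terms; the annotators differ on the fourth term only.
ANNOTATOR_1 = ["chemical", "chemical", "other", "enzyme", "chemical",
               "other", "enzyme", "chemical", "other", "chemical"]
ANNOTATOR_2 = ["chemical", "chemical", "other", "chemical", "chemical",
               "other", "enzyme", "chemical", "other", "chemical"]

if __name__ == "__main__":
    score = percent_agreement(ANNOTATOR_1, ANNOTATOR_2)
    print(f"Agreement: {score:.0%}")
```

If two careful humans only agree nine times out of ten, there is no single “right answer” for software to be scored against on the remaining item, which is why claims of better-than-human accuracy deserve scepticism.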

This bears out my experience in reference publishing. We always used a mixture of automated and human processing, with the software doing the “heavy lifting” and the editors tidying up the anomalies and absurdities by hand afterwards. I think it will be a long time before we find something better than this “best of both worlds” approach. We also aimed for consistent modes of expression to make searching, databasing, and comparison easier. It is possible to use a rules-based approach to writing and still produce something that sounds natural and is easy to read. Classic formats, such as the standard structure for writing up experiments, are a typical example of consistent structuring.