Category Archives: information architecture

Making KO Work: integrating taxonomies into technology

Estimated reading time 6–10 minutes

The recent ISKO UK event Making KO Work: integrating taxonomies into technology offered four very different but complementary talks, followed by a panel session. These provided a good overview of current practice and largely concluded that although technology has advanced, there is still a need for human intervention in KO work.

Can You Really Implement Taxonomies in Native SharePoint?

Marc Stephenson from Metataxis gave a clear and helpful overview of the key terms and principles you need to know when using taxonomies and folksonomies in SharePoint. SharePoint is very widely used as an enterprise document repository, and although its taxonomy management capabilities are limited, when combined with an external taxonomy management solution, it can enable very effective metadata capture.

The first step is to become familiar with the specialised terminology that SharePoint uses. Metadata in SharePoint is held as “Columns”, which can be System Columns that are fixed and integral to SharePoint functionality, or Custom Columns, which can be changed and which need to be managed by an information architecture role. For example, Columns can be set as “Mandatory” to ensure users fill them in. Columns can be configured to provide picklists or lookups, as well as being free text, and can be specified as “numeric”, “date” etc. Taxonomies can be included as “Managed Metadata”.
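By way of illustration, here is a minimal Python sketch (my own, not Metataxis's code) of what adding a mandatory Custom Column to a document library might look like through SharePoint's REST API. The site URL, list name, and column name are invented, and authentication and error handling are omitted; in practice much of this would be done through the SharePoint admin UI or a client library.

```python
import requests

SITE = "https://example.sharepoint.com/sites/records"   # hypothetical site
LIST = "Project Documents"                               # hypothetical library

def create_required_text_column(session: requests.Session, digest: str) -> None:
    """Add a mandatory free-text Custom Column to a document library.

    FieldTypeKind 2 is a single line of text; setting 'Required' to True
    makes the Column mandatory, so users must fill it in before saving.
    """
    url = f"{SITE}/_api/web/lists/getbytitle('{LIST}')/fields"
    payload = {
        "__metadata": {"type": "SP.Field"},
        "Title": "ProjectCode",   # illustrative Custom Column name
        "FieldTypeKind": 2,       # other kinds cover dates, numbers, choices, etc.
        "Required": True,
    }
    headers = {
        "Accept": "application/json;odata=verbose",
        "Content-Type": "application/json;odata=verbose",
        "X-RequestDigest": digest,  # form digest obtained from /_api/contextinfo
    }
    session.post(url, json=payload, headers=headers).raise_for_status()
```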

Different “Content Types” can be defined, for example to apply standardised headers and footers to documents, enforce workflow, or apply a retention/disposal policy, and many different pre-defined Content Types are available. Taxonomies are referred to as “Managed Term Sets”, and these can be controlled by a taxonomist role. “Managed Keywords” are essentially folksonomic tags, but SharePoint allows these to be transferred into Managed Term Sets, enabling a taxonomist to choose folksonomic tags to become part of more formal taxonomies.

The “Term Store Manager” provides some functionality for taxonomy management, such as adding synonyms (“Other Labels”), or deprecating terms so that they can no longer be found by users when tagging (but remain available for search). Terms can also be deleted, but that should only be done if there is a process for re-tagging documents, because a deleted tag will generate a metadata error the next time someone tries to save the document. Limited polyhierarchy is possible, because the same term can exist in more than one “Managed Term Set”.

“Term Groups” can be defined, which can be useful if different departments want to manage their own taxonomies.

There are various limitations – such as a maximum number of Managed Terms in a Term Set (30,000) and if SharePoint is deployed online across a large organisation, changes can take some time to propagate throughout the system. The process of importing taxonomies needs to be managed carefully, as there is no way to re-import or over-write Term Sets (you would end up with duplicate sets) and there is no easy way to export taxonomies. There is no provision for term history or scope notes, and no analytics, so SharePoint lacks full taxonomy management functionality.

There are companion taxonomy management products (e.g. SmartLogic’s Semaphore, or Concept Searching) and it is possible to use other taxonomy management tools (such as PoolParty, Synaptica, or MultiTes) but an additional import/export process would need to be built.
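To give a feel for what building such a process involves, here is a small sketch, assuming the flat, level-per-column CSV layout that the Term Store import dialogue accepts, which writes a toy two-level taxonomy out as an import file. The term set and terms are invented, and a real export/import round trip would also need to handle synonyms, GUIDs, and re-imports, which is exactly where the built-in tooling falls short.

```python
import csv

# A toy taxonomy: top-level terms mapped to lists of narrower terms.
TAXONOMY = {
    "Buildings": ["Cathedrals", "Castles"],
    "Monuments": ["Standing stones"],
}

HEADER = [
    "Term Set Name", "Term Set Description", "LCID", "Available for Tagging",
    "Term Description", "Level 1 Term", "Level 2 Term", "Level 3 Term",
    "Level 4 Term", "Level 5 Term", "Level 6 Term", "Level 7 Term",
]

def write_import_csv(term_set_name: str, path: str) -> None:
    """Write the taxonomy in the flat CSV layout used for Term Store imports."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        # The first data row names the Managed Term Set itself (LCID 1033 = English).
        writer.writerow([term_set_name, "", "1033", "TRUE"] + [""] * 8)
        for level1, children in TAXONOMY.items():
            writer.writerow(["", "", "", "TRUE", "", level1] + [""] * 6)
            for level2 in children:
                writer.writerow(["", "", "", "TRUE", "", level1, level2] + [""] * 5)

write_import_csv("Heritage Assets", "heritage_assets.csv")
```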

So, SharePoint offers a lot of options for metadata management, but is better as a taxonomy deployment tool than a master taxonomy management tool.

Integrating Taxonomy with Easy, Semantic Authoring

Joe Pairman of Mekon Ltd demonstrated a very user-friendly, lightweight set of tagging tools that allows non-expert users to add rich metadata to content as they work. This addresses a key problem for taxonomists – how to ensure that subject matter experts or authors, who are more focused on content than metadata, can tag consistently, quickly, and easily. By taking a form-based approach to content creation, authors can add structural metadata as they work, and add tags to specific words with a couple of clicks. This is particularly effective with a pre-defined controlled vocabulary.

The example Joe showed us was a very clear commercial use case of Linked Data, because the controlled vocabulary was very specific – products for sale. Each product was associated with a DBPedia concept, which provided the URI, and where a match to the text was detected the relevant word was highlighted. The user could then click on that word, see the suggested DBPedia concept, and click to tag. The tool (using FontoXML and Congility technology) then applied the relevant RDF to the underlying XML document “behind the scenes”, in a process of “inline semantic enrichment”. This approach enables accurate, author-mediated tagging at a very granular level. Customers reading the content online could then click on the highlighted text and the relevant products would be displayed with an “add to cart” function, with the aim of increasing sales. As an added bonus, the tags are also available to search engines, helping to surface accurately relevant content in search results. (Schema.org tags could also be included.)
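The FontoXML/Congility implementation itself was not shown in code, but the general suggest-then-confirm pattern can be sketched in a few lines of Python. The vocabulary, the product names, and the use of a schema.org "about" property in RDFa-style spans are all my own illustrative assumptions, not the actual tool.

```python
import re

# Hypothetical controlled vocabulary: product names mapped to DBpedia URIs.
VOCABULARY = {
    "espresso machine": "http://dbpedia.org/resource/Espresso_machine",
    "coffee grinder": "http://dbpedia.org/resource/Coffee_grinder",
}

def suggest_annotations(text: str):
    """Return (matched text, URI) pairs for vocabulary terms found in the text.

    In an authoring tool these would appear as highlights for the author to
    confirm with a click, rather than being applied automatically.
    """
    suggestions = []
    for term, uri in VOCABULARY.items():
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            suggestions.append((match.group(0), uri))
    return suggestions

def enrich(text: str) -> str:
    """Wrap each confirmed match in an RDFa-style span 'behind the scenes'."""
    for matched, uri in suggest_annotations(text):
        text = text.replace(
            matched,
            f'<span property="schema:about" resource="{uri}">{matched}</span>',
            1,
        )
    return text

print(enrich("Our new espresso machine pairs well with a coffee grinder."))
```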

Enhancement of User Journeys with SOLR at Historic England

Richard Worthington of Historic England described the problems they had when deploying a SOLR/Lucene search over their documents without any taxonomy or thesaurus support. They soon found that simple keyword searches were too blunt an instrument to provide useful results – for example, searching for “Grant” at first brought up the page about the grants on offer, but as soon as more data sets were added, this frequently searched-for page became buried under references to Grantchester, Grantham, and so on.

Although they could manage relevancy to a certain extent at the data-set level and by selecting “top results” for specific searches, the search team realised that this would be a painstaking and rigid process. It would also not address the problem that many terms used by the subject matter expert authors were not the same as the terms general users were searching for. For example, general users would search for “Lincoln Cathedral” rather than “Cathedral Church of St Mary of Lincoln”. So there is still plenty of work for human taxonomists and thesaurus editors to do.
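To make the idea concrete, here is a minimal sketch of the kind of thesaurus-supported search this implies, with a small synonym mapping expanded into the query sent to SOLR. The core name, field weights, and thesaurus entries are invented; in a production setup the same effect would more typically be achieved with a synonyms file wired into the SOLR analysis chain rather than client-side expansion.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/heritage/select"  # hypothetical core

# A tiny thesaurus mapping general users' terms to the experts' preferred terms.
THESAURUS = {
    "lincoln cathedral": ["Cathedral Church of St Mary of Lincoln"],
    "grants": ["funding", "grant aid"],
}

def expanded_query(user_query: str) -> str:
    """Build an OR query covering the user's term plus any known synonyms."""
    variants = [user_query] + THESAURUS.get(user_query.lower(), [])
    return " OR ".join(f'"{v}"' for v in variants)

def search(user_query: str) -> dict:
    params = {"q": expanded_query(user_query), "defType": "edismax", "qf": "title^4 body"}
    return requests.get(SOLR_URL, params=params).json()

results = search("Lincoln Cathedral")
```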

Applied Taxonomy Frameworks: Your Mileage May Vary

Alan Flett of SmartLogic took us through the latest enhancements to their products, showcasing a new feature called “Fact Extraction”. This works by identifying the context around specific data and information, in order to drive Business Intelligence and Analytics. The tool is essentially a user-friendly, simplified algorithm builder that allows very specific searches to be constructed from pre-defined “building blocks”, such as “Facts”, “Entities”, and “Skips”. In other words, a specific piece of information, typed entities such as a number or a date, and words to ignore can be combined into a complex search query. This allows search results to be defined by context and returned in context, and is especially effective for well-structured data sets. It also means that results are framed in a standardised format, which is useful for analytics.
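SmartLogic's own implementation was only demonstrated, not dissected, so the following is purely an illustrative Python sketch of the building-block idea: an anchor phrase (the “fact” context), typed entities, and skippable filler words combined into a single extraction pattern. The pattern and example sentence are invented, and this is not SmartLogic's syntax.

```python
import re

# A toy "fact" pattern: an anchor phrase, an optional run of skip words,
# then typed entities (here a monetary amount and a year).
FACT_PATTERN = re.compile(
    r"grant of\s+"                  # anchor: the "Fact" context
    r"(?:\w+\s+){0,3}?"             # "Skips": up to three filler words
    r"£(?P<amount>[\d,]+)"          # "Entity": a number
    r".{0,40}?"
    r"(?P<year>\b(?:19|20)\d{2}\b)" # "Entity": a date (year)
)

def extract_facts(text: str):
    """Return structured (amount, year) facts found in free text."""
    return [
        {"amount": m.group("amount"), "year": m.group("year")}
        for m in FACT_PATTERN.finditer(text)
    ]

print(extract_facts("The trust awarded a grant of approximately £250,000 in March 2015."))
```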

Concluding Panel

Although techniques such as automated classification, machine learning, and AI are progressing all the time, they still work best when combined with a well-structured knowledge base. Creating that knowledge base relies on human intelligence, especially for the familiar problems of disambiguation and synonym collection, particularly where content authors have a different approach or level of domain expertise from the end users of the search systems. The panel agreed that both for the creation of thesauruses, taxonomies, and ontologies and for their deployment in tagging, semi-automated approaches remain necessary, so there is still much for human taxonomists, ontologists, and information architects to do in order to make knowledge organisation work.

Image: Lincoln Cathedral. Photo by Zaphad1

Aggregations and basic categories

Estimated reading time 2–3 minutes

I recently enjoyed reading about the work Safari are currently doing to create a controlled vocabulary and topic aggregation pages to underpin navigation and discovery of their content.

Iterate, again

I very much liked the mix of manual and automated techniques the team used to maximise the value captured from existing resources, while using machine processing to help and support the human editorial curation work. Lightweight iterative approaches have become standard in some areas of design, but achieving high-quality information structures also usually requires several stages of revision and refinement. It is not always possible to predict what will happen in attempts to index or repurpose existing content, nor how users will respond to different information structures, so the ability to iterate, correct, re-index, correct, adjust indexing methods, re-index, correct… is vital. Small samples of content are often not sufficient to reveal all the potential issues and challenges, so it is always worth being prepared for surprises once you scale up.

Basics, as always

The Safari team identified the huge intellectual value locked into the existing human-created indexes, and it is great to see them being able to extract some of that value and then augment it using automated techniques. I was very interested to read about how the level of granularity in the individual indexes was too fine for overall aggregation. The team realised that there were “missing subtopics” – key topics that tended to be the subjects of entire books. These “missing subtopics” were found at the level of book titles, and it struck me that this vital level of conceptualization aligns directly with Eleanor Rosch’s work on basic categories and prototype theory. It is not surprising that the concepts that are “basic categories” to the likely readership would be found at book-title level, rather than index level.

This is further illustrated by the fact that the very broad high level topics such as “business” did not work well either. These needed not to be “clustered up”, but broken down and refined to the level of the “basic categories” that people naturally think of first.

So, the Safari team’s work is a very clear illustration of not only how to combine manual and automated techniques but also how to find the “basic categories” that match users’ natural level of thinking about the subject area.


Semantic Search – Call for Papers for Special Issue on Semantic Search for Aslib Journal

Estimated reading time 4–6 minutes

This special issue aims to explore the possibilities and limitations of Semantic Search. We are particularly interested in papers that place carefully conducted studies within the wider framework of current Semantic Search research and the broader context of Linked Open Data.

Research into Semantic Search and its applications has been gaining momentum over the last few years, with an increasing number of studies on general principles, proofs of concept, and prototypical applications. The market for Semantic Search applications, its role within the general development of (internet) technologies, and its impact on different areas of private and public life have all attracted attention. At the same time, many publicly funded projects in the field of cultural heritage have been initiated. Researchers in many disciplines have been making progress in establishing both theories and methods for Semantic Search. However, there is still a lack of comparison across individual studies, as well as a need for standardisation regarding how Semantic Search is distinguished from other search solutions, agreed-upon definitions, and technologies and interfaces.

Semantic Search research is often based on large and rich data sets and a combination of techniques, ranging from statistical bag-of-words approaches and natural language processing enriched by subtle use of metadata, through classificatory approaches, right up to ontological reasoning. Over the last 10 years, many of the initial technical and conceptual obstacles in the field of Semantic Search have been overcome. After the initial euphoria for Semantic Search, which resulted in a technically driven supply of search solutions, an appraisal of the more and less successful approaches is needed. Amongst other things, the limitations of open-world solutions working on only apparently comprehensive Linked Open Data sets, compared with small domain-specific solutions, need to be determined.
One ongoing challenge for Semantic Search solutions is their usability and user acceptance, as only highly usable walk-up-and-use approaches stand a chance in the field of general search.

For this special issue, we invite articles which address the opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives.

Topics of interest include but are not restricted to:

  • The history of semantic search – how the latest techniques and technologies have come out of developments over the last 5, 10, 20, 100, 2000… years
  • Technical approaches to semantic search: linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological …
  • Current trends in Semantic Search
  • Best practice – how far along the road from ‘early adopters’ to ‘mainstream users’ has semantic search gone so far?
  • Semantic Search and cultural heritage
  • Usability and user experience of Semantic Search
  • Visualisation and Semantic Search
  • Quality criteria for Semantic Search
  • The impact of norms and standards (such as ISO 25964 “Thesauri for information retrieval”) on the potential of Semantic Search
  • How are semantic technologies fostering a need for cross-industry collaboration and standardisation?
  • How are Semantic Search techniques and technologies being used in practice?
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc.
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions
  • Comparison of data collection approaches
  • User behaviour and the evolution of norms and conventions
  • Information behaviour and information literacy
  • User surveys
  • Usage scenarios and case studies

Submissions

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

Papers should be 4,000 to 6,000 words in length (including references). Citations and references should be in our journal style.

Please see the author guidelines at http://www.emeraldinsight.com/products/journals/author_guidelines.htm?id=ap for more details and submission instructions.
Submissions to Aslib Proceedings are made using ScholarOne Manuscripts, the online submission and peer review system. Registration and access is available at http://mc.manuscriptcentral.com/ap.

Important Dates

Paper submission: 15.12.2013
Notice of review results: 15.02.2014
Revisions due: 31.03.2014
Publication: Aslib Proceedings, issue 5, 2014.

About the Journal

Aslib Proceedings (ISSN: 0001-253X) is a peer-reviewed high-quality journal covering international research and practice in library and information science, and information management. The journal is the major publication for ASLIB – the Association for Information Management in the United Kingdom – a membership association for people who manage information and knowledge in organisations and the information industry.
Information about the journal can be found at
http://www.emeraldinsight.com/products/journals/journals.htm?id=ap

Contact the guest editors

Prof. Dr. Ulrike Spree
– Hamburg University of Applied Sciences –
Faculty of Design, Media and Information
Department of Information
Finkenau 35
20081 Hamburg
Phone: +49/40/42875/3607
Email: ulrike.spree@haw-hamburg.de

Fran Alexander
Information Architect, BCA Research (2013- )
Taxonomy Manager, BBC Information and Archives (2009-13)
Email: fran@vocabcontrol.com
Twitter: @frangle

Libraries, Media, and the Semantic Web meetup at the BBC

Estimated reading time 3–4 minutes

In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year, but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. The videos of the event are available on YouTube.

Historypin

Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.

MIMAS

Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.

rNews

Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.

schema.org

Dan Brickley’s “compare and contrast” of Universal Decimal Classification with schema.org was wonderful, and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems long before the invention of computers. He showed an example of “search log analysis” from 1912 – queries sent to the Belgian international bibliographic service, an early “semantic question answering service”. The “search terms” were fascinating and not so very different from the sort of things you would expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.

BBC Olympics online

Silver Oliver described how BBC Future Media is pioneering semantic technologies and using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic rich websites can be built and kept up to the minute, even once results start to pour in.

World Service audio archives

Yves Raimond talked about a BBC Research & Development project to automatically index the World Service audio archives. The World Service, having historically been a separate organisation from the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed with speech-to-text software, and automated indexing was then applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort the good tags from the bad (and occasionally offensive!) ones, but this is still a lot easier than tagging everything from scratch.
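As a purely illustrative sketch of that semi-automated step (the tag names, programme identifiers, and 0.85 threshold are all invented, not the BBC R&D pipeline), machine-suggested tags might be triaged by confidence, with only the uncertain ones routed to a human reviewer:

```python
# Machine-suggested tags with confidence scores; low-confidence suggestions
# are queued for a person to accept or reject rather than published directly.
SUGGESTED_TAGS = [
    {"programme": "ws-1978-04-03", "tag": "Apollo programme", "confidence": 0.91},
    {"programme": "ws-1978-04-03", "tag": "Apollo Creed", "confidence": 0.42},
]

ACCEPT_THRESHOLD = 0.85  # assumed cut-off; in practice this would be tuned

def triage(suggestions):
    """Split suggestions into auto-accepted tags and a human review queue."""
    auto_accepted, needs_review = [], []
    for s in suggestions:
        (auto_accepted if s["confidence"] >= ACCEPT_THRESHOLD else needs_review).append(s)
    return auto_accepted, needs_review

accepted, review_queue = triage(SUGGESTED_TAGS)
```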

The Shape of Knowledge – review of ISKOUK event

Estimated reading time 1–2 minutes

On Tuesday I attended a very interesting event about information visualization and I have written a review for the ISKO UK blog.

I was particularly fascinated by the ideas suggested by Martin Dodge of mapping areas that are not “space” and what this means for the definition of a “map”. So, the idea of following the “path” of a device such as a phone through the electromagnetic spectrum brings a geographical metaphor into a non-tangible “world”. Conversely, is the software and code that devices such as robots use to navigate the world a new form of “map”? Previously, I have thought of code as “instructions” and “graphs” but have always thought of the “graph” as a representation of coded instructions, visualized for the benefit of humans, rather than the machines. However, now that machines are responding more directly to visual cues, perhaps the gap between their “maps” and our “maps” is vanishing.

SLA Conference in Chicago

Estimated reading time 3–5 minutes

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians” and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself as one of a crowd! The conference has a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially the editorially curated, spam-free search engine Blekko, the legal information vendors FastCase and Law360, and the EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes – Dublin Core, FOAF, SKOS, which provide structures but no meanings – and semantic schemes – taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.
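As a small illustration of that distinction (my own sketch, not from the talk), the Python rdflib library can express both kinds of claim in SKOS: preferred and alternative labels and a loose “related” association are easy to assert, whereas an exactMatch to an external concept is the kind of equivalence that needs careful checking first. The vocabulary namespace below is invented.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary namespace
g = Graph()
g.bind("skos", SKOS)

cathedral = EX["lincoln-cathedral"]
g.add((cathedral, RDF.type, SKOS.Concept))
g.add((cathedral, SKOS.prefLabel, Literal("Cathedral Church of St Mary of Lincoln", lang="en")))
g.add((cathedral, SKOS.altLabel, Literal("Lincoln Cathedral", lang="en")))

# Association is the easy claim: the two concepts are related in some way.
g.add((cathedral, SKOS.related, EX["lincoln-castle"]))

# Equivalence is the hard claim: this concept is the *same* as an external one,
# so it should only be asserted after careful checking.
from rdflib import URIRef
g.add((cathedral, SKOS.exactMatch, URIRef("http://dbpedia.org/resource/Lincoln_Cathedral")))

print(g.serialize(format="turtle"))
```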

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted the Smithsonian permission to release the work on the web under an open licence.

The books were scanned using OCR, which achieved 99.97% accuracy. That sounds impressive, but across tens of millions of characters it still amounts to roughly 5,000-12,000 errors – far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage: they are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.

Building bridges: Linking diverse classification schemes as part of a technology change project

Estimated reading time less than 1 minute

My paper about my work on the linking and migration of legacy classification schemes, taxonomies, and controlled vocabularies has been published in Business Information Review.

Building, visualising and deploying taxonomies and ontologies; the reality – Content Intelligence Forum event

Estimated reading time 1–2 minutes

I have been trying to get to the Content Intelligence Forum meetups for some time, as they always seem to offer excellent speakers on key topics that don’t get the attention they deserve, so I was delighted to be able to attend Stephen D’Arcy’s talk on taxonomies and ontologies a little while ago.

Stephen has many years of experience designing semantic information systems for large organisations, ranging from health care providers, to banks, to media companies. His career illustrates the transferability and wide demand for information skills.

His 8-point checklist for a taxonomy project was extremely helpful – Define, Audit, Tools, Plan, Build, Deploy, Governance, Documentation – as were his tips for managing stakeholders, IT departments in particular. He warned against the pitfalls of not including taxonomy management early enough in search systems design, and the problems that you can be left with if you do not have a flexible and dynamic way of managing your taxonomy and ontology structures. He also included a lot of examples that illustrated the fun aspects of ontologies when used to create interesting pathways through entertainment content in particular.

The conversation after the talk was very engaging and I enjoyed finding out about common problems that information professionals face, including how best to define terms, how to encourage clear thinking, and how to communicate good research techniques.