Category Archives: semantic web

Making KO Work: integrating taxonomies into technology


The recent ISKO UK event Making KO Work: integrating taxonomies into technology offered four very different but complementary talks, followed by a panel session. These provided a good overview of current practice and largely concluded that although technology has advanced, there is still a need for human intervention in KO work.

Can You Really Implement Taxonomies in Native SharePoint?

Marc Stephenson from Metataxis gave a clear and helpful overview of the key terms and principles you need to know when using taxonomies and folksonomies in SharePoint. SharePoint is very widely used as an enterprise document repository, and although its taxonomy management capabilities are limited, when combined with an external taxonomy management solution, it can enable very effective metadata capture.

The first step is to become familiar with the specialised terminology that SharePoint uses. Metadata in SharePoint is held as “Columns”, which can be System Columns that are fixed and integral to SharePoint functionality, or Custom Columns, which can be changed and which need to be managed by an information architecture role. For example, Columns can be set as “Mandatory” to ensure users fill them in. Columns can be configured to provide picklists or lookups, as well as being free text, and can be specified as “numeric”, “date” etc. Taxonomies can be included as “Managed Metadata”.

Different “Content Types” can be defined, for example to apply standardised headers and footers to documents, enforce workflow, or apply a retention/disposal policy, and many different pre-defined Content Types are available. Taxonomies are referred to as “Managed Term Sets”, and these can be controlled by a taxonomist role. “Managed Keywords” are essentially folksonomic tags, but SharePoint allows these to be transferred into Managed Term Sets, enabling a taxonomist to choose folksonomic tags to become part of more formal taxonomies.

The “Term Store Manager” provides some functionality for taxonomy management, such as adding synonyms (“Other Labels”), or deprecating terms so that they can no longer be found by users when tagging (but remain available for search). Terms can also be deleted, but that should only be done if there is a process for re-tagging documents, because a deleted tag will generate a metadata error the next time someone tries to save the document. Limited polyhierarchy is possible, because the same term can exist in more than one “Managed Term Set”.
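The difference between deprecating and deleting a term can be sketched in a few lines of Python. This is a toy model only: the class and method names are hypothetical and are not the SharePoint API, and the example terms are made up. It assumes that a deprecated term is hidden from taggers but still resolves for search, as described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One term in a term set (hypothetical model, not the SharePoint API)."""
    label: str
    other_labels: set[str] = field(default_factory=set)  # synonyms
    deprecated: bool = False


class TermSet:
    def __init__(self, name: str) -> None:
        self.name = name
        self._terms: dict[str, Term] = {}

    def add(self, label: str, *other_labels: str) -> Term:
        term = Term(label, set(other_labels))
        self._terms[label.lower()] = term
        return term

    def deprecate(self, label: str) -> None:
        self._terms[label.lower()].deprecated = True

    def _match(self, text: str) -> list[Term]:
        # Match on the preferred label or on any synonym, ignoring case.
        wanted = text.lower()
        return [
            t for t in self._terms.values()
            if wanted == t.label.lower()
            or wanted in {o.lower() for o in t.other_labels}
        ]

    def suggest_for_tagging(self, text: str) -> list[str]:
        """Deprecated terms are hidden from people applying tags."""
        return [t.label for t in self._match(text) if not t.deprecated]

    def resolve_for_search(self, text: str) -> list[str]:
        """Deprecated terms still resolve, so tagged documents stay findable."""
        return [t.label for t in self._match(text)]


# Made-up example terms.
docs = TermSet("Document types")
docs.add("Invoice", "Bill")
docs.add("Fax")
docs.deprecate("Fax")

print(docs.suggest_for_tagging("bill"))  # ['Invoice']
print(docs.suggest_for_tagging("Fax"))   # []
print(docs.resolve_for_search("Fax"))    # ['Fax']
```

Deleting "Fax" outright would instead remove it from both lookups, which is why a re-tagging process is needed first.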

“Term Groups” can be defined, which can be useful if different departments want to manage their own taxonomies.

There are various limitations. For example, a Term Set can hold a maximum of 30,000 Managed Terms, and if SharePoint is deployed online across a large organisation, changes can take some time to propagate throughout the system. The process of importing taxonomies needs to be managed carefully, as there is no way to re-import or over-write Term Sets (you would end up with duplicate sets) and there is no easy way to export taxonomies. There is no provision for term history or scope notes, and no analytics, so SharePoint lacks full taxonomy management functionality.

There are companion taxonomy management products (e.g. SmartLogic’s Semaphore, or Concept Searching) and it is possible to use other taxonomy management tools (such as PoolParty, Synaptica, or MultiTes) but an additional import/export process would need to be built.

So, SharePoint offers a lot of options for metadata management, but is better as a taxonomy deployment tool than a master taxonomy management tool.

Integrating Taxonomy with Easy, Semantic Authoring

Joe Pairman of Mekon Ltd demonstrated a very user-friendly, lightweight set of tagging tools that allow non-expert users to add rich metadata to content as they work. This addresses a key problem for taxonomists – how to ensure subject matter experts or authors who are more focused on content than metadata are able to tag consistently, quickly, and easily. By taking a form-based approach to content creation, authors are able to add structural metadata as they work, and add tags to specific words with a couple of clicks. This is particularly effective with a pre-defined controlled vocabulary.

The example Joe showed us was a very clear commercial use case of Linked Data, because the controlled vocabulary was very specific – products for sale. Each product was associated with a DBpedia concept, which provided the URI, and where a match to the text was detected the relevant word was highlighted. The user could then click on that word, see the suggested DBpedia concept, and click to tag. The tool (using FontoXML and Congility technology) then applied the relevant RDF to the underlying XML document “behind the scenes”, in a process of “inline semantic enrichment”. This approach enables accurate, author-mediated tagging at a very granular level. The customers reading the content online could then click on the highlighted text and the relevant products could be displayed with an “add to cart” function, with the aim of increasing sales. As an added bonus, the tags are also available for search engines, helping to surface highly relevant content in search results. (Schema.org tags could also be included.)
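The suggest-then-confirm workflow can be sketched as follows. This is a minimal illustration in plain Python, not the FontoXML or Congility implementation: the vocabulary, the URIs, and the RDFa-style output markup are all assumptions made for the example, and overlapping labels are not handled.

```python
import re
from html import escape

# Hypothetical product vocabulary: label -> concept URI (illustrative only).
PRODUCT_CONCEPTS = {
    "espresso machine": "http://dbpedia.org/resource/Espresso_machine",
    "kettle": "http://dbpedia.org/resource/Kettle",
}


def suggest_matches(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, uri) for each vocabulary label found in the text."""
    matches = []
    for label, uri in PRODUCT_CONCEPTS.items():
        pattern = re.compile(rf"\b{re.escape(label)}\b", re.IGNORECASE)
        matches.extend((m.start(), m.end(), uri) for m in pattern.finditer(text))
    return sorted(matches)


def apply_tags(text: str, accepted: list[tuple[int, int, str]]) -> str:
    """Wrap each span the author accepted in an element carrying the URI."""
    out, cursor = [], 0
    for start, end, uri in sorted(accepted):
        out.append(escape(text[cursor:start]))
        out.append(
            f'<span resource="{escape(uri)}">{escape(text[start:end])}</span>'
        )
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


text = "Descale your kettle before using the espresso machine."
suggestions = suggest_matches(text)

# The author reviews the suggestions and accepts only the first one.
print(apply_tags(text, suggestions[:1]))
```

Keeping suggestion and tagging as separate steps mirrors the author-mediated approach: nothing is written into the document until a person has confirmed the match.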

Enhancement of User Journeys with SOLR at Historic England

Richard Worthington of Historic England described the problems they had when deploying a SOLR/Lucene search to their documents without any taxonomy or thesaurus support for searching. They soon found that SQL searches were too blunt an instrument to provide useful results – for example, searching for “Grant” at first would bring up the page about the grants that were offered, but as soon as they added more data sets, this frequently searched-for page became buried under references to Grantchester, Grantham, etc.

Although they could manage relevancy to a certain extent at the data set level and by selecting “top results” for specific searches, the search team realised that this would be a painstaking and rigid process. It would also not address the problem that many terms used by the subject matter expert authors were not the same as the terms general users were searching for. For example, general users would search for “Lincoln Cathedral” rather than “Cathedral Church of St Mary of Lincoln”. So, they have much work for human taxonomists and thesaurus editors to do.
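Both problems, and the two kinds of fix, can be illustrated with a toy search function. This is plain Python rather than SOLR configuration, and the page titles, thesaurus entry, and pinned result are invented around the examples from the talk.

```python
# Made-up page titles, using the examples from the talk.
PAGES = [
    "Grantham town centre survey",
    "Grantchester conservation area",
    "Cathedral Church of St Mary of Lincoln",
    "Grants available for historic buildings",
]

# Hypothetical thesaurus: what users type -> the term the authors use.
THESAURUS = {
    "lincoln cathedral": "cathedral church of st mary of lincoln",
}

# Hypothetical editor-chosen "top results" pinned for specific queries.
PINNED = {"grant": "Grants available for historic buildings"}


def search(query: str) -> list[str]:
    typed = query.strip().lower()
    # Swap a user's term for the authors' preferred term, if one is known.
    wanted = THESAURUS.get(typed, typed)
    hits = [page for page in PAGES if wanted in page.lower()]
    # Move any pinned page for this query to the top of the results.
    pinned = PINNED.get(typed)
    if pinned in hits:
        hits.remove(pinned)
        hits.insert(0, pinned)
    return hits


print(search("Grant"))
print(search("Lincoln Cathedral"))
```

Every entry in `THESAURUS` and `PINNED` has to be chosen and maintained by a person, which is the point the search team arrived at.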

Applied Taxonomy Frameworks: Your Mileage May Vary

Alan Flett of SmartLogic took us through the latest enhancements to their products, showcasing a new feature called “Fact Extraction”. This works by identifying the context around specific data and information, in order to drive Business Intelligence and Analytics. The tool is essentially a user-friendly simplified algorithm builder that allows very specific searches to be constructed using pre-defined “building blocks”, such as “Facts”, “Entities”, and “Skips”. This means a specific piece of information, words to ignore, and entities such as a number or a date can be specified to construct a complex search query. This allows the search results to be defined by context and returned in context, and is especially effective for well-structured data sets. It also means that results are framed in a standardized format, which is useful for analytics.
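The general idea of composing a query from building blocks can be sketched with regular expressions. This is only an analogy under stated assumptions: the function names borrow the block names from the talk, but the code is not SmartLogic's product or API, and the sample sentence is invented.

```python
import re


def fact(words: str) -> str:
    """A specific piece of information: literal words that must appear."""
    return r"\b" + re.escape(words) + r"\b"


def skip(max_words: int) -> str:
    """Words to ignore: up to max_words arbitrary words between blocks."""
    return rf"(?:\W+\w+){{0,{max_words}}}?\W+"


def entity(name: str, kind: str) -> str:
    """A typed value to capture, such as a number or a date."""
    patterns = {
        "number": r"\d[\d,]*(?:\.\d+)?",
        "date": r"\d{1,2} [A-Z][a-z]+ \d{4}",
    }
    return f"(?P<{name}>{patterns[kind]})"


def build(*blocks: str) -> re.Pattern:
    """Join the blocks, in order, into one compiled pattern."""
    return re.compile("".join(blocks))


grant_awarded = build(
    fact("grant of"), skip(1), entity("amount", "number"),
    skip(4), entity("awarded_on", "date"),
)

sentence = "The trust received a grant of £25,000 for roof repairs on 3 March 2015."
match = grant_awarded.search(sentence)
print(match.groupdict())  # {'amount': '25,000', 'awarded_on': '3 March 2015'}
```

Because every match comes back as the same named fields, the results are already in a standard shape for analytics.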

Concluding Panel

Although techniques such as automated classification, machine learning, and AI are progressing all the time, these still work best when combined with a well-structured knowledge base. Creating that knowledge base relies on human intelligence, especially for the familiar problems of disambiguation and synonym collection, in particular where the content authors have a different approach or level of domain expertise to the end users of the search systems. The panel agreed that for both the creation of thesauruses, taxonomies, and ontologies and for the deployment of these in tagging, semi-automated approaches remain necessary, and so there is still much to be done by human taxonomists, ontologists, and information architects in order to make knowledge organisation work.

Image: Lincoln Cathedral. Photo by Zaphad1

25th International World Wide Web Conference in Montreal


The 25th International World Wide Web Conference was held in Montreal, which meant I was able to attend the whole week. Nevertheless, with 118 papers, 72 posters, 30 demos, 21 workshops, and 7 tutorials, I still had to miss plenty of intriguing-looking sessions. One in particular I was sorry to miss was the presentation of a paper on how the qwerty arrangement of letters on a keyboard affects perceptions of the positivity of words – words with more “right hand” letters are considered more positive than those with more “left hand” letters. So, presumably, anything with an “-ing” ending is more positive than anything with an “-ed” ending.

Linked Data Workshop

I thoroughly enjoyed the Linked Data day and it seems we are moving closer to tools and apps that will help people publish and manage their data as Linked Data. As the opening speaker noted, it is hard to evangelize the use of Linked Data when many end users just want a familiar Excel spreadsheet. It is also hard for Linked Data enthusiasts who don’t have a development team familiar with RDF on hand to publish their data, even if they want to. Publishing Linked Data should be as easy as posting to social media!

A step in this direction is Annalist – A practical tool for creating, managing and sharing evolving linked data.

A very interesting paper was on semantic annotation of scholarly work on the Qur’an. Islamic scholarship is particularly well suited to a linked data approach, as so many verses of the Qur’an need to be read alongside other verses, as well as tracing the history of interpretations over the centuries.

Keynote Addresses

There were three very impressive keynote addresses, by Sir Tim Berners-Lee, Mary Ellen Zurko, and Peter Norvig.

Sir Tim’s opening keynote called for re-decentralization of the web, pointing out that the problem with all the effort of contributing photos, news, etc. to sites such as Facebook is that these contributions are not being made in an open way that will benefit humanity, but are locked in, so that they only benefit Facebook’s shareholders.

Mary Ellen Zurko talked about security on the web and how it remains vitally important as a socio-political as well as technical concern, and said that if she could turn back time she would find an alternative to passwords!

Peter Norvig described machine learning algorithms, which can find correlations in datasets but still need human input to give context to the results.

Pataphysics

I was delighted to be introduced to pataphysics – the science of imaginary solutions – and patadata. It is pleasing that there are still people trying to defend the vision of the web as a playground for creativity, against the push to make it no more than an efficient commercialized virtual high street for shopping.

The spirit of surfing the web just to wander about in the hope of encountering unexpected juxtapositions and spontaneous serendipities – in contradiction to Google’s drive to give you exactly what you were looking for – is captured in the Syzygy Surfer. The Syzygy Surfer aims to delight you with its search results, without trying to match your preconceptions of what a “correct result” might be!


Semantic Theatre gets practical


I have started to look into the CIDOC Conceptual Reference Model for cultural heritage metadata as part of my investigation of the concept of Semantic Theatre.

An events-based approach is used in a lot of ontological modelling. Thanks to Athanasios Velios, I learned that bookbinding can be broken down into a sequence of events, and this is an obvious route to try when thinking about how to model a performance event.

I think there is potential for relating the “objects” in the play – the performers as well as set, props, etc. – to concepts within the play. So far, I have been focusing mainly on modelling relationships between ideas within the script (e.g. lines where this character uses the ocean as a metaphor for life) and possibly comparing across scripts (e.g. which lines reference King Lear) but it would be interesting to include props and actors as well (e.g. in which scenes is a clock used as a reference to death). The use of a prop could easily be modelled as a distinct event within a play, and this would facilitate relating literary and metaphorical ideas to the object rather than just to the words in the script.
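A rough sketch of what "the use of a prop as an event" might look like as data follows. The class and field names are my own placeholders rather than CIDOC CRM classes or properties, and the events are invented for illustration, not taken from any real script.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PropUse:
    """One event in a performance: a prop is used in a scene."""
    scene: str
    prop: str
    used_by: str
    refers_to: frozenset[str]  # concepts that this use of the prop evokes


# Invented events for illustration only.
EVENTS = [
    PropUse("Act 1, Scene 2", "clock", "Captain", frozenset({"death"})),
    PropUse("Act 1, Scene 3", "clock", "Moon", frozenset({"time"})),
    PropUse("Act 2, Scene 1", "lantern", "Captain", frozenset({"hope"})),
    PropUse("Act 2, Scene 4", "clock", "Moon", frozenset({"death", "time"})),
]


def scenes_where(prop: str, concept: str) -> list[str]:
    """In which scenes is this prop used as a reference to this concept?"""
    return [
        event.scene
        for event in EVENTS
        if event.prop == prop and concept in event.refers_to
    ]


print(scenes_where("clock", "death"))  # ['Act 1, Scene 2', 'Act 2, Scene 4']
```

Attaching the concept to the event, rather than to the prop itself, lets the same clock mean different things in different scenes.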

The play itself – Ocean Opera – will be performed at the Montreal Fringe Festival in June.

Adventures in Semantic Theatre


I have been investigating the idea of using semantic techniques and technologies to enhance plays, along with the Montreal Semantic Web meetup group. There have been far fewer Semantic Web projects for the humanities than the sciences and even fewer that have examined the literary aspects of the theatre. Linked Open Data sets associated with the theatre are mostly bibliographic, library catalogue metadata, which treat plays from the point of view of simple objective properties of the artefact of a play, not its content: a play has an author, a publisher, a publication date, etc. Sometimes a nod towards the content is made by including genre, and there has been work on markup of scripts from a structural perspective – acts, characters, etc. There are obvious and sound reasons for these kinds of approaches, meeting bibliographic and structural use cases (e.g. “give me all the plays written by French authors between 1850 and 1890”; “give me the act, scene, and line references for all the speeches over ten lines long by a particular character”; “give me all the scenes in which more than three characters appear on stage at once”).

Modelling literary rather than physical connections

Once we started discussing at the meetups how we could model the content itself, especially in a qualitative manner, we quickly became embroiled in questions of whether or not we needed to create entire worldviews for each play and how we could relate things in the play to their real world counterparts.

One of the plays we are working on – Ocean Opera by Alex Gelfand (to be performed at the Montreal Fringe Festival this June) – included the Moon as a character. How and by what relationships could we link the Moon of the play to the Moon in the sky, and then how could we link it to other fictional and literary Moons?

Another play we analysed – Going Back Home by Rachel Jury – was a dramatization based on real people and historical events. It seemed obvious these should be linked to their real counterparts, but would a simple “is a fictional representation of” suffice? How should we relate depictions of historical events in the play to eyewitness accounts from the time or to newspaper reports?

Should we define the world view of each play? Would it matter when defining relationships if there were events in the play that were counterfactual or scientifically impossible?

How could we capture intertextuality and references to other plays? Should there be a differentiation between quotations and overt references by the author to other texts and less explicit allusions and shared cultural influences?

Artistic Use Cases

One of the most appealing aspects of this project to me is that we have no strict commercial or business requirements to meet. A starting point was the idea of a “literary search engine” that ranked relevance not according to information retrieval best practice, but under its own terms as art, or perhaps even defined its own “relevance within the world of the play”. In other words, we would be trying to produce results that were beautiful rather than results that best answered a query.

However, there are also a number of very practical use cases for modelling the literary world of a play, rather than just modelling a play as an object.

Querying within a play

One set of use cases is navigating within the text, by answering such queries as ‘in which scenes do these two characters appear together’. The BBC’s Mythology Engine was designed to help users find their way around a large number of brands, series, and episodes, with characters and events modelled as central.

An equivalent set of queries for literary aspects would be “how many scenes feature metaphors for anger and ambition” or “which monologues include references to Milton”.

Querying across many plays

If you extend such use cases across a body of plays, recommendation scenarios become possible. For example, “if you liked this play which frequently references Voltaire and includes nautical metaphors, then you might also like this play…” and there are clear commercial implications for the arts in terms of marketing and promotion, finding new audiences, and even in planning new work.
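One simple way to sketch such a recommendation is to treat each play as a set of literary features and rank other plays by how much they overlap. The plays and features below are invented, and Jaccard similarity is just one convenient measure I have chosen for the illustration, not something the project has settled on.

```python
# Invented literary features for four made-up plays.
PLAYS = {
    "Play A": {"references:Voltaire", "metaphor:nautical", "metaphor:anger"},
    "Play B": {"references:Voltaire", "metaphor:nautical"},
    "Play C": {"references:Milton", "metaphor:ambition"},
    "Play D": {"metaphor:nautical", "references:Milton"},
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared features divided by all features across both plays."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked: str, top_n: int = 2) -> list[str]:
    """Rank the other plays by overlap with the one the user liked."""
    scores = {
        title: jaccard(PLAYS[liked], features)
        for title, features in PLAYS.items()
        if title != liked
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda title: (-scores[title], title))
    return [title for title in ranked if scores[title] > 0][:top_n]


print(recommend("Play A"))  # ['Play B', 'Play D']
```

A ranking like this only answers "most similar"; the more artistic idea of results that are beautiful rather than relevant would need a different scoring function altogether.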

These kinds of “metaphorical use cases” could also serve as a rich seam for generating interesting user journeys through a literary archive and as a way of promoting serendipitous discovery for students and researchers.

Storyline use cases

A lot of work that has been done at the BBC has been based around the concept of an ‘event’, and the relationship of events to storylines. This is particularly relevant for many practical and creative aspects of writing, compiling, broadcasting, archiving, and re-using content. For example, it is useful to be able to distinguish the name of the journalist from the names of people who are mentioned within the story, and to distinguish between more and less significant people within a story according to whether they are mentioned as part of the main event or only in association with consequent or secondary events.

Literary and metaphorical use cases might take a similar approach but decompose the events in a story in terms of the emotional development of the characters.

Fictional worlds use cases

One of the ideas that I find the most appealing, but is the hardest to pin down, is the idea of modelling the internal ontological world of a work of fiction. In a fictional ontology, you can have relationships that make no sense in the ‘real’ world, so modelling them cannot rely on the kind of sense-testing and meeting of requirements that we use so much in commercial contexts.

In discussions, some people reacted very strongly against the idea of even attempting to model fictional worlds, which I found fascinating, while others immediately saw the idea as just another aspect of literary creation – an artistic endeavour in its own right.

There is an epistemological tangent in ontological thinking that goes into a debate about realism versus anti-realism that I haven’t fully got to grips with yet.

Where next?

I am at the very early stages of thinking through all this, and not sure where it will go, but am enjoying starting to gather a community of interest. If you would like to know more, I have written in more detail about it all on the project blog: http://www.semantictheatre.org.

How semantic search helps girls and boys, but in different ways


While researching something else, I happened upon this rather cheering paper: The effect of semantic technologies on the exploration of the web of knowledge by female and male users. Gender issues only tangentially affect my core research, as I generally focus on linguistic communities that are defined by organizational or professional context, so gender effects are rather diluted by that point. I also prefer collapsing false dichotomies rather than emphasizing difference and division, and so I was very heartened that this article shows how semantic techniques can be unifying.

The study is based on observing the search strategies of a group of male and a group of female students in Taiwan. Given non-semantic search systems to use, the male students tended to search very broadly and shallowly, skimming large numbers of results and following links and going off on tangents to find other results. This enabled them to cover a lot of ground, often finding something useful, but also often left them with a rather chaotic collection of results and references. The female students tended to search very deeply and narrowly, often stopping to read in depth a paper that they had found, and trying to fully grasp the nature of the results that had been returned. This meant they tended to collect fewer results overall, the results tended to be clustered around a single concept, and they risked falling into the rather wonderfully named “similarity holes”. These “similarity holes” are search traps where a single search term or collection of terms leads to a small set of results and are essentially “dead ends”.

How did semantic search help?

When the students were given semantic search tools, the male students continued to search broadly and shallowly but the semantic associations helped them to conceptualize and organize what they were doing. This meant that they ended up with a far more coherent, relevant, and useful set of search results and references. In contrast, the female students, using the semantic associations offered, found it far easier to broaden their searches and to come up with alternative search terms and approaches, enabling them to avoid and break out of any “similarity holes” they fell into.

Gender effects dissipate

I was very heartened that improvements in technology can be gender-neutral: they can simply be improvements that benefit everyone in different ways, without deliberately trying to account for gender difference. I was also very heartened to note that the researchers found that gender differences in search strategies dissipated once students were taught advanced information seeking and knowledge management strategies. Gender differences were only apparent in novice, inexperienced searchers. So, in information seeking work at least, any biological or socially created gender differences are weak and easily overcome with some well directed instruction, and semantic techniques are a help rather than a hindrance.

Semantic Search – Call for Papers for Special Issue on Semantic Search for Aslib Journal


This special issue aims to explore the possibilities and limitations of Semantic Search. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data.

Research into Semantic Search and its applications has been gaining momentum over the last few years, with an increasing number of studies on general principles, proof of concept and prototypical applications. The market for Semantic Search applications, its role within the general development of (internet) technologies, and its impact on different areas of private and public life have attracted attention. Simultaneously, many publicly funded projects in the field of cultural heritage have been initiated. Researchers in many disciplines have been making progress in the establishment of both theories and methods for Semantic Search. However, there is still a lack of comparison across individual studies, as well as a need for standardisation regarding the differentiation of Semantic Search from other search solutions, agreed-upon definitions, and technologies and interfaces.

Semantic Search research is often based on large and rich data sets and a combination of techniques ranging from statistical bag of words approaches and natural-language-processing enriched via a subtle utilisation of metadata over classificatory approaches right up to ontological reasoning. Over the last 10 years a lot of initial technical and conceptual obstacles in the field of Semantic Search have been overcome. After the initial euphoria for Semantic Search that resulted in a technically driven supply of search solutions, appraisal of successful and less successful approaches is needed. Amongst other things the limitations of working with open world solutions on – only apparently comprehensive – linked open data sets compared to small domain specific solutions need to be determined.
One ongoing challenge for semantic search solutions is their usability and user acceptance, as only highly usable walk-up-and-use-approaches stand a chance in the field of general search.

For this special issue, we invite articles which address the opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives.

Topics of interest include but are not restricted to:

  • The history of semantic search – how the latest techniques and technologies have come out of developments over the last 5, 10, 20, 100, 2000… years
  • Technical approaches to semantic search: linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological …
  • Current trends in Semantic Search
  • Best practice – how far along the road from ‘early adopters’ to ‘mainstream users’ has semantic search gone so far?
  • Semantic Search and cultural heritage
  • Usability and user experience of Semantic Search
  • Visualisation and Semantic Search
  • Quality criteria for Semantic Search
  • Impact of norms and standardisation (for instance ISO 25964 “Thesauri for information retrieval”) and the potential of Semantic Search
  • How are semantic technologies fostering a need for cross-industry collaboration and standardisation?
  • How are Semantic Search techniques and technologies being used in practice?
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc.
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions
  • Comparison of data collection approaches
  • User behaviour and the evolution of norms and conventions
  • Information behaviour and information literacy
  • User surveys
  • Usage scenarios and case studies

Submissions

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

Papers should be 4,000 to 6,000 words in length (including references). Citations and references should be in our journal style.

Please see the author guidelines at http://www.emeraldinsight.com/products/journals/author_guidelines.htm?id=ap for more details and submission instructions.
Submissions to Aslib Proceedings are made using ScholarOne Manuscripts, the online submission and peer review system. Registration and access is available at http://mc.manuscriptcentral.com/ap.

Important Dates

Paper submission: 15.12.2013
Notice of review results: 15.02.2014
Revisions due: 31.03.2014
Publication: Aslib Proceedings, issue 5, 2014.

About the Journal

Aslib Proceedings (ISSN: 0001-253X) is a peer-reviewed high-quality journal covering international research and practice in library and information science, and information management. The journal is the major publication for ASLIB – the Association for Information Management in the United Kingdom – a membership association for people who manage information and knowledge in organisations and the information industry.
Information about the journal can be found at
http://www.emeraldinsight.com/products/journals/journals.htm?id=ap

Contact the guest editors

Prof. Dr. Ulrike Spree
- Hamburg University of Applied Sciences -
Faculty Design, Medien and Information
Department Information
Finkenau 35
20081 Hamburg
Phone: +49/40/42875/3607
Email: ulrike.spree@haw-hamburg.de

Fran Alexander
Information Architect, BCA Research (2013- )
Taxonomy Manager, BBC Information and Archives (2009-13)
Email: fran@vocabcontrol.com
Twitter: @frangle

This time it’s personal data – Indiverses and Personal APIs


Sooner or later I was bound to find some other Semanticists in Canada and on Thursday I attended a Semantic Web meetup in Montreal. The audience was small, but that led to more of a group discussion atmosphere than a formal talk. The presenter, Dr Joan Yess Kahn, has coined the term Indiverse – Individual Information Universe – to facilitate her thinking about the set of personal information and data that we accumulate through our lives.

She pointed out that some of this information is created by us, some about us, some with our knowledge and consent, some without, and our entire digital lives can be stolen and abused. She made some interesting observations about how our personal and public information spaces were essentially one and the same before the industrial revolution, when most people’s work and home lives were intertwined (e.g. artisans living in their workshops), and that changes such as the industrial revolution and public education split those apart as people left home to work somewhere else. However, in the information age more people are returning to working from home while others are increasingly using their computers at work to carry out personal tasks, such as online shopping.

This blurring of the public and private has many social and commercial implications. We discussed the potential monetary value of personal attention and intention data to advertisers, and implications for surveillance of individuals by governments and other organizations.

We also talked about information overload and information anxiety. Joan has written about ways of categorizing, indexing, and managing our personal information – our address books, calendars, to do lists, etc. – and this led us to consider ideas of how to construct sharable, standardized Personal Data Lockers (for example The Locker Project) and to take back control of our online identity and information management, for example in shifting from Customer Relationship Management (CRM) to Vendor Relationship Management (VRM).

In previous posts I have talked about our need to become our own personal digital archivists as well and I was sent a link by Mark to a Personal API developed by Naveen. This takes personal information curation to the data level, as Naveen is seeking an easy way to manage the huge amounts of data that he generates simply by being a person in the world – his fitness routines, diet, etc.

There is a clear convergence here with the work done by such medical innovators as Patients Know Best electronic patient health records. Moral and social implications of who is responsible for curating and protecting such data are huge and wide-ranging. At the moment doting parents using apps to monitor their babies or fitness enthusiasts using apps (such as map my run etc.) are doing this for fun, but will we start seeing this as a social duty? Will we have right-wing campaigns to deny treatment to people who have failed to look after their health data or mass class actions to sue hospitals that get hacked? If you think biometric passports are information dense, just wait until every heartbeat from ultrasound to grave is encoded somewhere in your Indiverse.

ISKO UK 2013 – provisional programme


I will probably be on the other side of the Atlantic when the ISKO UK conference takes place in July in London, UK. I will be sorry to miss it, because the committee have brought together a diverse, topical, and fascinating collection of speakers.

ISKO UK excels in unifying academic and practitioner communities, and the conference promises to investigate the barriers that separate research from practice and to seek out boundary objects that can bring the communities together.

This is demonstrated in person by the keynote speakers Patrick Lambe of Straits Knowledge and Martin White of Intranet Focus Ltd – both respected for their commercial as well as academic contributions to the field of Knowledge Organization.

Amidst what is already shaping up to be a very full and varied programme, the presentations by Jeremy Tarling and Matt Shearer (BBC News) and Jarred McGinnis and Helen Lippell (Press Association) will show how research in semantic techniques is now being put to practical use in managing the fast-flowing oceans of information that news organizations handle.

The programme also includes a whole session on combining ontologies with other tools, as well as papers on facet analysis and construction of controlled vocabularies. There’s even some epistemology to please pure theoreticians.