The call for papers for the special issue of Aslib Proceedings on semantic search is now up on the publisher’s website. The deadline for submissions is December 15th.
While researching something else, I happened upon this rather cheering paper: The effect of semantic technologies on the exploration of the web of knowledge by female and male users. Gender issues only tangentially affect my core research, as I generally focus on linguistic communities that are defined by organizational or professional context, so gender effects are rather diluted by that point. I also prefer collapsing false dichotomies rather than emphasizing difference and division, and so I was very heartened that this article shows how semantic techniques can be unifying.
The study is based on observing the search strategies of a group of male and a group of female students in Taiwan. Given non-semantic search systems to use, the male students tended to search very broadly and shallowly, skimming large numbers of results and following links and going off on tangents to find other results. This enabled them to cover a lot of ground, often finding something useful, but also often left them with a rather chaotic collection of results and references. The female students tended to search very deeply and narrowly, often stopping to read in depth a paper that they had found, and trying to fully grasp the nature of the results that had been returned. This meant they tended to collect fewer results overall, the results tended to be clustered around a single concept, and they risked falling into the rather wonderfully named “similarity holes”. These “similarity holes” are search traps where a single search term or collection of terms leads to a small set of results and are essentially “dead ends”.
When the students were given semantic search tools, the male students continued to search broadly and shallowly but the semantic associations helped them to conceptualize and organize what they were doing. This meant that they ended up with a far more coherent, relevant, and useful set of search results and references. In contrast, the female students, using the semantic associations offered, found it far easier to broaden their searches and to come up with alternative search terms and approaches, enabling them to avoid and break out of any “similarity holes” they fell into.
I was very heartened that improvements in technology can be gender-neutral: they can simply be improvements that benefit everyone in different ways, and they don’t have to deliberately try to account for gender difference. I was also very heartened to note that the researchers found that gender differences in search strategies dissipated once students were taught advanced information seeking and knowledge management strategies. Gender differences were only apparent in novice, inexperienced searchers. So, in information seeking work at least, any biological or socially created gender differences are weak and easily overcome with some well directed instruction, and semantic techniques are a help rather than a hindrance.
This special issue aims to explore the possibilities and limitations of Semantic Search. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data.
Research into Semantic Search and its applications has been gaining momentum over the last few years, with an increasing number of studies on general principles, proof of concept and prototypical applications. The market for Semantic Search applications and its role within the general development of (internet) technologies and its impact on different areas of private and public life have attracted attention. Simultaneously, many publicly funded projects in the field of cultural heritage were initiated. Researchers in many disciplines have been making progress in the establishment of both theories and methods for Semantic Search. However, there is still a lack of comparison across individual studies as well as a need for standardisation regarding the dissociation of Semantic Search from other search solutions, agreed upon definitions as well as technologies and interfaces.
Semantic Search research is often based on large and rich data sets and a combination of techniques ranging from statistical bag of words approaches and natural-language-processing enriched via a subtle utilisation of metadata over classificatory approaches right up to ontological reasoning. Over the last 10 years a lot of initial technical and conceptual obstacles in the field of Semantic Search have been overcome. After the initial euphoria for Semantic Search that resulted in a technically driven supply of search solutions, appraisal of successful and less successful approaches is needed. Amongst other things the limitations of working with open world solutions on – only apparently comprehensive – linked open data sets compared to small domain specific solutions need to be determined.
One ongoing challenge for semantic search solutions is their usability and user acceptance, as only highly usable walk-up-and-use-approaches stand a chance in the field of general search.
For this special issue, we invite articles which address the opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives.
Topics of interest include but are not restricted to:
Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.
Papers should be 4,000 to 6,000 words in length (including references). Citations and references should be in our journal style.
Please see the author guidelines at http://www.emeraldinsight.com/products/journals/author_guidelines.htm?id=ap for more details and submission instructions.
Submissions to Aslib Proceedings are made using ScholarOne Manuscripts, the online submission and peer review system. Registration and access is available at http://mc.manuscriptcentral.com/ap.
Paper submission: 15.12.2013
Notice of review results: 15.02.2014
Revisions due: 31.03.2014
Publication: Aslib Proceedings, issue 5, 2014.
Aslib Proceedings (ISSN: 0001-253X) is a peer-reviewed high-quality journal covering international research and practice in library and information science, and information management. The journal is the major publication for ASLIB – the Association for Information Management in the United Kingdom – a membership association for people who manage information and knowledge in organisations and the information industry.
Information about the journal can be found at
Prof. Dr. Ulrike Spree
- Hamburg University of Applied Sciences -
Faculty Design, Medien and Information
Information Architect, BCA Research (2013- )
Taxonomy Manager, BBC Information and Archives (2009-13)
Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are lots of different types of tag, from HTML tags for marking up web pages to labels in databases, and this can lead to all sorts of confusion and problems in projects.
Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.
Free text tags are usually uncontrolled and unstructured (folksonomic) simple strings of characters. Free text tagging functionality is usually no more than a simple text field in a database, so it is very easy to implement technically. For limited collections, collections with low research value, user-generated collections, and collections that are not otherwise catalogued, free text tags provide the ability to do at least some searching (e.g. if you have a small collection of still images that have no other metadata attached, any subject keyword tags are better than none).
Folksonomic tagging was hailed as revolutionary a few years ago because it is cheap. However, it fails to solve numerous information retrieval problems. Most significantly, if you use free text tags, you need to do additional work later on to disambiguate them (apple, apple, or apple – company, record label, fruit?) or add any structure to them, including grouping synonyms to provide a more complete search (a search for “automobile” can’t retrieve items tagged “car” unless you can associate these synonyms in a synset, synonym ring, or thesaurus).
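To make the synonym problem concrete, here is a minimal sketch of a synonym ring sitting on top of free text tags. The asset ids, the tags, and the function names are all made up for illustration; a real system would hold these in a database or thesaurus tool rather than in Python sets.

```python
# Hypothetical synonym rings: each set groups tags that should be treated as equivalent.
SYNONYM_RINGS = [
    {"car", "automobile", "motor car"},
    {"film", "movie"},
]

# Hypothetical assets carrying uncontrolled free text tags.
ASSETS = {
    "img-001": {"car", "red"},
    "img-002": {"automobile"},
    "img-003": {"apple"},
}


def expand_term(term, rings=SYNONYM_RINGS):
    """Return the search term plus any synonyms that share a ring with it."""
    term = term.strip().lower()
    expanded = {term}
    for ring in rings:
        if term in ring:
            expanded |= ring
    return expanded


def search(term, assets=ASSETS):
    """Return the sorted ids of assets tagged with the term or one of its synonyms."""
    wanted = expand_term(term)
    return sorted(asset_id for asset_id, tags in assets.items() if tags & wanted)


print(search("automobile"))
```

Note that the ring only solves the synonym half of the problem. The tag "apple" on img-003 is still ambiguous, because a plain string cannot say whether the fruit, the company, or the record label was meant.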
Controlled keywords are more useful than free text tags because they reduce the problems of synonyms and disambiguation by making the person applying the tag choose from a limited set of terms. It is easier to build a thesaurus containing all the controlled keywords, as you are not trying to encompass every possible word in the language (or indeed any string of characters that somebody might make up). Controlled vocabularies also avoid apparently trivial but practically problematic issues such as spelling variants and errors and use of abbreviations. However, flat controlled vocabularies become very unwieldy once you have more than about 50 terms. There may be a numeric identifier associated with a controlled vocabulary keyword, but it is usually only some kind of local internal system identifier.
Tags taken from controlled lists are often used for process-driven functions, as opposed to search or browse functions. So, someone might apply a tag from a controlled list to designate a workflow status of an asset. For such processes, it is usually fairly straightforward to control the vocabulary options available, so that only a few labels are available. Linguistic nuances are not so important in such contexts – people are just taught what the options are and usually it doesn’t occur to them to try to use other terms. If the available terms are inadequate, this often means there is something wrong with the business process or the system design (e.g. we need a workflow state of “pending approval” but we only have the labels “created” and “approved”).
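As a rough sketch of a controlled list used for a workflow status, the following rejects any label that is not in the vocabulary. The state names follow the example above; the function name and the asset structure are assumptions made purely for illustration.

```python
# Hypothetical controlled list of workflow states, deliberately missing "pending approval".
WORKFLOW_STATES = ("created", "approved")


def apply_status(asset, status, allowed=WORKFLOW_STATES):
    """Set the workflow status on an asset, refusing labels outside the controlled list."""
    if status not in allowed:
        raise ValueError(
            f"'{status}' is not a permitted status; choose one of {list(allowed)}"
        )
    asset["status"] = status
    return asset


asset = apply_status({"id": "doc-17"}, "approved")
print(asset)

try:
    apply_status({"id": "doc-18"}, "pending approval")
except ValueError as error:
    # A rejection like this is a hint that the vocabulary or the process needs revisiting.
    print(error)
```

The useful part is the failure: if people keep hitting the error, that is evidence the list of states no longer matches the business process.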
Once a controlled vocabulary becomes too long to be easy to navigate, it can be “chunked up” or “broken down” into a taxonomy.
Keywords in taxonomies are more useful than keywords in flat controlled vocabularies because the taxonomy holds information about the relationships between terms. The simplest relationship is broader>narrower (parent>child). This means you can “chunk up” your flat vocabulary list into sections, e.g. to make it easier to navigate, to offer ways a researcher can modify their search (didn’t find what you wanted – try a broader search, too many results – try a narrower search). Usually internal IDs are used to connect the label displayed in the UI with the graph that contains the relationships between the concepts.
Often a taxonomy will also hold associative (“see also”) relationships, effectively extending the taxonomy to be a taxonomy-with-thesaurus.
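The broader and narrower relationships described above can be sketched with internal IDs and a simple parent lookup. The concepts, IDs, and function names here are invented for the example, and a production taxonomy tool would store far more than this (associative links, alternative labels, and so on).

```python
# Hypothetical internal ids mapped to the labels shown in the UI.
LABELS = {1: "vehicles", 2: "cars", 3: "bicycles", 4: "sports cars"}

# Broader relationships stored as child id -> parent id.
BROADER = {2: 1, 3: 1, 4: 2}


def broader(concept_id):
    """Return the parent concept id, or None for a top-level concept."""
    return BROADER.get(concept_id)


def narrower(concept_id):
    """Return the ids of the direct children of a concept."""
    return sorted(child for child, parent in BROADER.items() if parent == concept_id)


def descendants(concept_id):
    """Return every concept id below the given one, for a 'search everything under X' query."""
    found = []
    stack = narrower(concept_id)
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(narrower(current))
    return sorted(found)


# Too few results for "sports cars"? Offer the broader concept.
print(LABELS[broader(4)])
# Too many results for "vehicles"? Offer the narrower concepts.
print([LABELS[c] for c in narrower(1)])
```

Keeping the labels separate from the relationships means a label can be reworded in the UI without touching the structure, which is one reason internal IDs are used.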
This is the Linked Open Data approach. There are important differences between tag URIs and other types of tag. URI tags have to conform to various technical conventions and standards that support interoperability. In Linked Open Data contexts, URI tags are usually public and shared, rather than being private IDs. Relationships between URIs are usually expressed in an ontology, rather than a taxonomy (although the ontology may associate taxonomies or the ontology may be derived from pre-existing taxonomies).
It is possible to add any of the above types of tag to a web page (you can say a web page is just another type of asset). Differences between tags on assets and tags on web pages are usually to do with the ways those tags are stored and how they are used by other systems (i.e. a stock management system will need different information to a search engine). Search engine optimisation (SEO) bad practices led to a decline in the use of keyword tagging for search engine indexing, although the Semantic Web returns to the principle that content creators are the best people to index their content (see next section).
For web pages, the tags are often added in the header information, along with other instructions to the browser. On individual assets (e.g. photos, videos) in content or asset management systems, the tags are often held in a particular field in a database. For Linked Open Data systems (whether managing web pages, traditional assets, or combinations of both), the tag URIs and their relationships (triples) are usually stored in a triple store, rather than a conventional database.
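For readers who have not met triples before, here is a toy illustration of URI tags held as subject, predicate, object statements. It uses a plain Python set rather than a real triple store, the example.org URIs are hypothetical, and the two predicates are borrowed from the SKOS and Dublin Core vocabularies simply as plausible choices.

```python
EX = "http://example.org/"  # hypothetical namespace for this sketch
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Each triple is (subject, predicate, object), all identified by URIs.
TRIPLES = {
    (EX + "photo/42", DCT_SUBJECT, EX + "concept/apple-fruit"),
    (EX + "photo/43", DCT_SUBJECT, EX + "concept/apple-inc"),
    (EX + "concept/apple-fruit", SKOS_BROADER, EX + "concept/fruit"),
}


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which assets are tagged with the fruit sense of "apple"?
for subject, _, _ in match(p=DCT_SUBJECT, o=EX + "concept/apple-fruit"):
    print(subject)
```

Because the two senses of "apple" have different URIs, the disambiguation problem that plagues free text tags does not arise here.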
With web pages, tagging can become very complex, as there might be a mixture of URI tags and basic labels, and a web page can be a complex information system in its own right, containing sub-elements such as audio and video content that itself might have various tags.
The language that is used to write web pages (HTML) is often described as comprising tags. So, you tag up flat text with instructions that tell the browser “this is a heading”, “this is a paragraph” etc. With the advent of HTML5 and vocabularies such as schema.org, more and more semantic information is being included in these tags. Search engines can use this information, for example to create more specific indexes.
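As a small illustration of semantic information carried in HTML tags, the sketch below pulls schema.org-style itemprop values out of a fragment of markup using Python's standard html.parser module. The page fragment, the class name, and the function name are invented for the example, and the parsing is deliberately naive: it only handles properties whose value is the text directly inside the tagged element.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with schema.org microdata.
PAGE = """
<article itemscope itemtype="https://schema.org/NewsArticle">
  <h1 itemprop="headline">Semantic search explained</h1>
  <p>By <span itemprop="author">A. Writer</span></p>
</article>
"""


class ItempropCollector(HTMLParser):
    """Collect the text content of elements that carry an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember the property name, if this element declares one.
        self._current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._current and data.strip():
            self.properties[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_properties(html):
    """Return a dict of itemprop name -> text for a fragment of HTML."""
    collector = ItempropCollector()
    collector.feed(html)
    return collector.properties


print(extract_properties(PAGE))
```

The same text without the itemprop attributes would still display identically in a browser; the extra attributes are there for machines such as search engine indexers.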
So, when you ask someone if the content is tagged, and they say yes, it is always worth checking you both actually mean the same thing!
In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. Videos of the event are available on YouTube.
Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.
Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.
Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.
Dan Brickley’s “compare and contrast” of Universal Decimal Classification with schema.org was wonderful and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems far in advance of the invention of computers. He showed an example of “search log analysis” from 1912, queries sent to the Belgian international bibliographic service – an early “semantic question answering service”. The “search terms” were fascinating and not so very different to the sort of things you’d expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.
Silver Oliver described how BBC Future Media is pioneering semantic technologies and using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic rich websites can be built and kept up to the minute, even once results start to pour in.
Yves Raimond talked about a BBC Research & Development project to automatically index World Service audio archives. The World Service, having been a separate organisation to the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed through speech-to-text software, and then automated indexing applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort out the good tags from the bad (and occasionally offensive!) tags, but this is still a lot easier than tagging everything from scratch.
My paper about my work on the linking and migration of legacy classification schemes, taxonomies, and controlled vocabularies has been published in the Journal for Business Information Review.
A happy week for ontologists, taxonomists, and other knowledge organisers as Google reveals its knowledge graph.
Patrick Lambe sums it up wonderfully:
Google Finally Comes Out of the Closet on Taxonomies.
Here’s a great post by Seth Earley:
Google Knowledge Graph and Taxonomy – It’s in There.
I nearly wept when one of our young rising IT stars queried in a meeting why we had separated “search” and “browse” as headings for our discussions on archive navigation functionality. So, to spare me further tears, here are some distinctions and similarities. There won’t be anything new for information professionals, but I hope it will be useful if any of your colleagues in IT need a little help. I am sure this is far from comprehensive, so please leave additions and comments!
Search is making a beeline to a known target, browse is wandering around and exploring.
Search is for when you know what you are looking for, browse is for when you don’t.
Search is for when you know what you are looking for exists, browse is for when you don’t.
Search expects you to look for something that is findable, browse shows you the sort of thing you can find.
Search is for when you already know what is available in a collection or repository, browse is how you find out what is there, especially if you are a newcomer.
Search is difficult when you don’t know the right words to use, browse offers suggestions.
Search is a quickfire answer, browse is educative.
Search is about one-off actions, browse is about establishing familiar pathways that can be followed again or varied with predictable results.
Search relies on the seeker to do all the thinking, browse offers suggestions.
Search is a tricky way of finding content on related topics, browse is an easy way of finding related content.
Search is difficult when you are trying to distinguish between almost identical content, browse can highlight subtle distinctions.
Search rarely offers completeness, browse often offers completeness.
Search is pretty much a “black box” to most people, so it is hard to tell how well it has worked, browse systems are visible so it is easy to judge them.
Search uses complex processing that most people don’t want to see, browse uses links and connections that most people like to see.
Search is based on calculations and assumptions that are under the surface, browse systems offer frameworks that are more open.
Search works well on the web, because the web is so big no-one has had time to build an easy way to browse it, browse works well on smaller structured collections.
Search can run across vast collections, browse needs to be offered at human-readable scales.
Search does not usually give an indication of the size or scope of a collection, browse can be designed to indicate scale.
Search and browse are both ways of finding content.
Search and browse can both be configured in a huge variety of ways.
Search and browse both have many different mechanisms and implementations.
Search and browse should both be tailored to users’ needs.
Search and browse systems both require thought and editorial judgement in their creation so that they work effectively for any particular collection.
Search and browse systems can often both be created largely automatically.
Search and browse often both involve metadata.
Search and browse behaviours may be intertwined, with users switching from one to the other.
Search and browse may be used by the same users for different tasks at different times.
Search and browse both offer serendipity, although serendipitous opportunities are often hidden by interface design.
Almost always, you should offer both, unless you are very sure that your users will always be performing the same kind of task and have the same level of familiarity with your content. With small static collections of content, it may not matter too much, but for most content collections users will probably want both; which you make your main focus depends on the context and collection.
Shops might have lots of images and very little text, so a beautifully designed navigation system will help customers find – and buy – products they might not know about, while only a simple search system might be needed to cover searches for product names. A library will need to support lots of searches for titles and across catalogue text with a good search system, but will also need to help educate and inform users with a clear user-friendly browsable navigation system. A large incoherent collection of unstructured text with no particular purpose is likely to be difficult to navigate no matter what you design, so will need good search, but – apart from the web itself – such unbounded and unmanaged collections tend to be quite unusual.
Many people find it very difficult to understand why search within an organization can’t “just be like Google”. This is often because they haven’t thought about the differences between an organization and the Internet.
Search engines like Google work because they have access to big data. Google gets billions of searches to process, from billions of users. Even if your organization is a large one, it won’t have that many users either searching or contributing content, so it cannot number crunch on the same scale as Google. Your IT department is probably a lot smaller than Google’s and your enterprise search team’s daily budget is unlikely to cover more than the tiniest fraction of what Google spends. Last, but by no means least, your organization doesn’t have as much content as the Internet, so it probably needs to be far more careful about not losing any that is valuable.
There are important differences between how and why people search when they are at work and when they are not, and between how and why they search the Internet and their organization’s Intranet or archives. People rarely surf their organization’s Intranet for fun, to be entertained, or to while away the time. The differences in serious research behaviour and leisure searching are well documented, so I am going to write about another aspect of differences between the Internet and organizations that is often overlooked.
There are vast differences in the ways that people create and curate content on the Internet and within an organization. These differences have a significant effect on the way search functions. The key difference is in how much they link their content to that of others. Of course, there are people whose jobs are to create and curate online content – all the web editors, content strategists, copywriters, social media marketers, etc. – but they will be the first to explain that they have a very specialised set of skills focused on making their content searchable, commercial, or otherwise user friendly. They do a whole lot of things that most people as part of the day job neither know how nor have the time to do.
One of the key things that web professionals and unpaid web enthusiasts do with their content is to add and manage links. Links are what organize the web. Links are what group sites into clusters by content. Links are the web’s classification scheme. Clay Shirky back in 2005 said “there is no shelf” but it makes just as much sense to think of millions of shelves – infinite shelves going off in all directions, with new ones being created and old ones being discarded. The web is not linear – like a shelf – but it is not without structure. Google effectively picks one of the near infinity of shelves and offers it up as a linear list whenever you do a search. It chooses the shelf that seems to be the most popular, or that fits its commercial model. First on the shelf is often a paid-for advertisement or a Wikipedia entry, followed by other big well-established commercial sites. Out there on the Internet, people do an awful lot of shopping, and not much work, so that’s fine. (If they are doing more shopping than work when they are at work, your organization probably has bigger problems than search to deal with.)
For many other searches, especially more thematic research, people would be disappointed with the results, were it not for the magic of the way the web works – the links. As long as Google slings a site at you that has lots of links to other sites, it doesn’t have to take you straight to what you want, it lets you and the links do the rest of the work. Links gather together similar content, so they function like a classification scheme. The links associate content that is aimed at similar audiences, is on similar topics, is of a similar age. The links represent a huge amount of sorting, cataloguing, and classification work. Google did not have to pay for this work (genius business model). People do this work for Google for free. They do this work as part of creating and curating their content.
Many of Google’s volunteer librarians do this work for fun. They create fan sites, they write Wikipedia articles, they produce lists and generate indexes to their favourite content. They provide cataloguing descriptions and context. They do all this work partly because they enjoy it and partly because they hope to get “repaid” by their site becoming popular. They hope this will either lead to monetary reward (their band will get signed, they’ll get a better job, they’ll sell advertising) or social reward (they’ll make online “friends”, get positive feedback from comments, etc.).
From the commercial angle, people do this work because they expect to gain financial reward. They want to sell more products and make money. This is why there are howls of pain whenever Google tweaks its algorithms. Companies that balk at investing in internal search systems will spend fortunes chasing SEO.
If you want your organization’s search to be “just like Google” you need to think about how linked your content is. Do people who create content in your organization do so for the same reasons and with the same motivations as people who create and link content on the web? It is very unlikely that you have lots of “fans” who will spend their free time creating lists of your company’s best information resources, or collecting and rating and reviewing reports and documents. Most employees are too busy getting on with their day jobs to spend office hours pursuing their “fan” projects. Even if your staff have plenty of spare time, how many of them are big enough fans of some aspect of work to treat it like a hobby? If you want people to start looking out for similar documents on your Intranet and linking their own documents to them, you will probably have to find ways of motivating them to do this as a special initiative. It is not likely to come “for free”, like it does for the web search engines.
For some organizations, encouraging and incentivising “fan”-type behaviour may work. If the organization already has a strong collaborative culture, with people sharing ideas and using social media, it may be a small step to get them to think of their documents and presentations as blog posts. Including content creation and curation in people’s job roles and rewarding those who do well will foster a link-rich Intranet. By recognising and rewarding people who promote useful links and lists and get them to rank highly in your enterprise searches, you could bring an element of gamification to encourage this sort of behaviour. For other organisations, the culture may support this kind of web-style content creation, but people are generally too busy, have skill sets too far from what is required, or need training and encouragement. In such organizations it may make sense to have the equivalent of web editors, content strategists, user experience specialists, search engine optimizers, etc. working with the organization’s internal content to promote the most valuable resources. In other words, a layer of “linkers” who work alongside the content originators.
For other organizations, where it would be inappropriate, too time consuming, or too far from established culture to encourage web-like information behaviour, enterprise search will never work “just like Google”. More formal and standardized metadata management processes are likely to be needed. Organizations that generate a lot of very specific content that is unlikely to be useful in broader contexts, confidential content, or large volumes of very similar structured content are likely to find it hard to move away from directed and standardised searching.
Many organizations will have a “mixed economy” with different types of content and different departments operating with different styles (e.g. what works in a marketing department is unlikely to work in the same way in a finance department).
Without links, each search result is isolated. This stops the searcher in their tracks and means they cannot surf in the way they do on the Internet. They will have to check search results one after another in a linear fashion. If your search engine is not getting the most relevant results to the top of that list, your staff will be spending a huge amount of time working their way through that list. They cannot plump for one likely looking result then follow the trail of links, as they do on the web. The links as a form of classification do not exist, so you need another mechanism (taxonomy, ontology, index, directory) to help people find groups of related content and browse through from one document to another.
So, even though you may have the technology and the budget to match Google’s, unless your content creators are linking freely, you will never completely succeed in turning your Intranet into a mini-Internet.
Catching up on my reading, I found this post by Jonah Bossewitch: Pick a Corpus, Any Corpus and was particularly struck by his clear articulation of the growing information gulf between organizations and individuals.
I have since been thinking about the contrast between our localised knowledge organization systems and the semantic super-trawlers of the information oceans that are only affordable – let alone accessible – to the megawealthy. It is hard not to see this as a huge disempowerment of ordinary people, swamping the democratizing promise of the web as a connector of individuals. The theme has also cropped up in KIDMM discussions about the fragmentation of the information professions. The problem goes far beyond the familiar digital divide, beyond just keeping our personal data safe, to how we can render such meta-industrial scale technologies open for ordinary people to use. Perhaps we need public data mines to replace public libraries? It seems particularly bad timing that our public institutions – our libraries and universities – are under political and financial attack just at the point when we need them to be at the technological (and expensive) cutting edge.
We rely on scientists and experts to advise us on how to use, store and transport potentially hazardous but generally useful chemicals, radioactive substances, even weapons, and information professionals need to step up to the challenges of handling our new potentially hazardous data and data analysis tools and systems. I am reassured that there are smart people like Jonah rising to the call, but we all need to engage with the issues.