To index is to translate


Living in Montreal means I am trying to improve my very limited French. In trying to communicate with my Francophone neighbours, I have become aware of a process of simplifying my thoughts and expressing them using the limited vocabulary and grammar that I have available. So far I only have a few nouns, fewer verbs, and a couple of conjunctions, so talking to people is not so much a process of thinking in English and translating that into French as of considering the core concepts that I need to convey and finding the simplest ways of expressing relationships. So I will say something like “The sun shone. It was big. People were happy” because I can’t properly translate “We all loved the great weather today”.

This made me realise how similar this is to the process of breaking down content into key concepts for indexing. My limited vocabulary is much like the controlled vocabulary of an indexing system, forcing me to analyse and decompose my ideas into simple components and basic relationships. This means I am doing quite well at fact-based communication, but my storytelling has suffered as I have only one very simple emotional register to work with. The best I can offer is a rather laconic style with some simple metaphors: “It was like a horror movie.”

It is regularly noted that ontology work in the sciences has forged ahead of that in the humanities, and the parallel with my ability to express facts but not tell stories struck me. When I tell my simplified stories I rely on shared understanding of a broad cultural context that provides the emotional aspect – I can use the simple expression “horror movie” because the concept has rich emotional associations, connotations, and resonances for people. The concept itself is rather vague, broad, and open to interpretation, so the shared understanding is rather thin. The opposite is true of scientific concepts, which are honed into precision and a very constrained definitive shared understanding. So, I wonder how much of my sense that I can express facts well is actually an illusion, and it is just that those factual concepts have few emotional resonances.

A major aspect of poetry is about extending the meanings of words to their limits, to allow for the maximum emotional resonance and personal interpretation. Perhaps poetry speaks to individuals precisely because it doesn’t evoke a shared understanding but calls out new meanings and challenges the reader to think differently, to find new meanings? This is the opposite of indexing, which is about simplifying and constraining to the point at which all the fuzziness is driven away and you are left with nothing but “dead metaphors”. The only reason indexing the sciences seems easier is because so many scientific concepts have been analyzed and defined to this point already, doing much of the indexer’s work for them.

I am not sure if these musings have any practical applications. People sometimes ask me if I think my previous studies of languages and literature have helped in my current work. I have known many excellent monolingual indexers but am also aware that many people who are good at semantics speak more than one language. However, I am sure it is helpful to think of the process of indexing as a form of translation, even if the idea of removing all the poetry from language in order to create a usable, useful index is not at all romantic!

Can you use statistics to find meaning?


I enjoyed this article in New Scientist about using statistical analysis on the Voynich manuscript to try to work out whether it is a meaningful but secret code or just gibberish.

Ultimately, I remain puzzled as to what the statistics actually tell us. They identify patterns, but meaning is more than simply patterns. However, the fact that certain sets of symbols in the Voynich text appear to cluster in sections with common illustrations suggests it is code. The counter-argument that you could deliberately fake such clustering by mechanical means is intriguing. Without far larger samples, and an understanding of random clusterings, I have no idea whether this sort of faking would produce the same patterns as natural language. I am sure clusters must appear all over the place, without bearing any meaning whatsoever.

I also thought it was interesting that one of the arguments in favour of gibberish was that there were no mistakes. It strikes me there could be many reasons for the lack of proofing and correction and I would want to know more about the rate of correction in similar works before I could assess that argument. I know that standardization of spelling came relatively late; presumably before then far more “mistakes” would have been tolerated.

Nevertheless, a fascinating mystery and one that perhaps cannot be resolved by analysis but by coincidental discovery of the key (if it exists!) – if it is gibberish, perhaps we will never know. Either way, I am sure it would have amused the author to know that their work would still be a controversial topic hundreds of years after it was written!

This time it’s personal data – Indiverses and Personal APIs


Sooner or later I was bound to find some other Semanticists in Canada and on Thursday I attended a Semantic Web meetup in Montreal. The audience was small, but that led to more of a group discussion atmosphere than a formal talk. The presenter, Dr Joan Yess Kahn, has coined the term Indiverse – Individual Information Universe – to facilitate her thinking about the set of personal information and data that we accumulate through our lives.

She pointed out that some of this information is created by us, some about us, some with our knowledge and consent, some without, and our entire digital lives can be stolen and abused. She made some interesting observations about how our personal and public information spaces were essentially one and the same before the industrial revolution, when most people’s work and home lives were intertwined (e.g. artisans living in their workshops), and that changes such as the industrial revolution and public education split those apart as people left home to work somewhere else. However, in the information age more people are returning to working from home while others are increasingly using their computers at work to carry out personal tasks, such as online shopping.

This blurring of the public and private has many social and commercial implications. We discussed the potential monetary value of personal attention and intention data to advertisers, and implications for surveillance of individuals by governments and other organizations.

We also talked about information overload and information anxiety. Joan has written about ways of categorizing, indexing, and managing our personal information – our address books, calendars, to do lists, etc. – and this led us to consider ideas of how to construct sharable, standardized Personal Data Lockers (for example The Locker Project) and to take back control of our online identity and information management, for example in shifting from Customer Relations Management (CRM) to Vendor Relations Management (VRM).

In previous posts I have talked about our need to become our own personal digital archivists as well and I was sent a link by Mark to a Personal API developed by Naveen. This takes personal information curation to the data level, as Naveen is seeking an easy way to manage the huge amounts of data that he generates simply by being a person in the world – his fitness routines, diet, etc.

There is a clear convergence here with the work done by such medical innovators as Patients Know Best electronic patient health records. Moral and social implications of who is responsible for curating and protecting such data are huge and wide-ranging. At the moment doting parents using apps to monitor their babies or fitness enthusiasts using apps (such as map my run etc.) are doing this for fun, but will we start seeing this as a social duty? Will we have right-wing campaigns to deny treatment to people who have failed to look after their health data or mass class actions to sue hospitals that get hacked? If you think biometric passports are information dense, just wait until every heartbeat from ultrasound to grave is encoded somewhere in your Indiverse.

Tagging the cart before the horse – Getting your project plan in order


When people launch search improvement or information organization projects, one of the commonest mistakes is to be over-eager to “just get the content indexed or tagged” without spending enough time and thought on the structure of an index, what should be tagged, and how the tags themselves should be structured.

This typically happens for two reasons:
1. The project managers – often encouraged by service providers who just want to get their hands on the cheque – simply underestimate the amount of preparatory work involved, whether it is structuring and testing a taxonomy, setting up and checking automated concept extraction rules, or developing a comprehensive domain model and tag set, so they fail to include enough – if any – of a development and testing stage in the plan. This often happens when the project is led by people who do not work closely with the content itself. Projects led by marketing or IT departments often fall into this trap.

2. The project managers include development and testing, with iterative correction and improvement phases, but are put under pressure to cut corners, or to compress deadlines. This tends to happen when external forces affect timescales – for example local government projects that have to spend the budget before the end of the financial year. It can also happen when stakeholder power is unevenly distributed: the advice of information professionals is sought but then over-ruled by more powerful stakeholders who have a fixed deadline in mind, such as launching a new website in time for the Christmas market.

Forewarned is forearmed

Prevention is better than cure in both these scenarios, but easier said than done. Your best defence is to understand organizational culture, politics, and history and to evangelize the role and importance of information work and your department. Find out which departments have initiated information projects in the past, which have the biggest budgets, which have the most proactive leadership teams, then actively seek allies in those departments. Find out if there are meetings on information issues you could attend, offer to help, or even do something like conduct a survey on information use and needs and ask for volunteers to be interviewed.  Simply by talking to people at any level in those departments you will start to find out what is going on, and you will remind people in those departments of your existence and areas of expertise.

On a more formal level, you can look at organizational structures and hierarchies and make sure that you have effective chains of communication that follow chains of command. This may mean supporting your boss in promoting the work of your department to their boss. This is especially important in organizations with lots of layers of middle management, as middle managers can get so caught up in day to day work that longer term strategy can get put on the back burner, so offer support.

If you find out about projects early enough, you have a chance of influencing the project planning stages to make sure information and content issues are given the attention they need, right from the start.

Shutting the stable door…

Sometimes despite our best efforts we end up in a project that is already tripping over itself. A common scenario is for tagging work to be presented as a fait accompli. This is particularly likely with fully automated tagging work, as processing can be done far faster than any manual tagging effort. However, it is highly unusual for any project to be undertaken without its being intended to offer some sort of service or solve some recognized problem.

Firstly, assess how well it achieves its intended goals. If you have only been called into the project at the late stage, is this because it is going off the rails and the team want a salvage solution, or is it because it works well in one context and the team want to see if it can be used more widely? If it is the latter, that’s great – you can enjoy coming up with lots of positive and creative proposals. However, the core business planning principles are pretty much the same whether you are proposing to extend a successful project or corralling one that is running out of control.

Once you know what the project was meant to achieve, assess how much budget and time you have left, as that will determine the scope to make changes and improvements. Work out what sort of changes are feasible. Can you get an additional set of tags applied for example? Can you get sets of tags deleted? Are you only able to make manual adjustments or can you re-run automated processes? How labour intensive are the adjustment processes? Is chronology a factor – in other words can you keep the first run for legacy content but evolve the processes for future content?

These assessments are especially valuable for projects that are at an intermediate stage as there is much more scope to alter their direction. In these cases it is vital to prioritize and focus on what can be changed in a pragmatic way. For example, if the team are working chronologically through a set of documents, you may have time to undertake planning and assessment work focused on the most recent and have that ready before they get to a logical break point. So, you prioritize developing a schema relevant to the current year, and make a clean break on a logical date, such as January 1. If they have been working topic by topic, is there a new search facet you could introduce and get a really good set for that run as a fresh iteration?

If there are no clean breakpoints or clear sets of changes to be made, focus on anything that is likely to cause user problems or confusion or serious information management problems in future. What are likely to cause real pain points? What are the worst of those?

Once you have identified the worst issues and clarified the resources you have for making the changes, you have the basis for working up the time and money you need to carry them out. This can form the basis of your business case and project plan either to improve a faltering project and pull it back on track or to add scope to a project that is going well.

…after the horse has bolted

If there is limited scope to make changes, and the project is presented as already complete, it is still worth assessing how well it meets its goals as this will help you work out how you can best use and present the work that has been done. For example, can it be offered as an “optional extra” to existing search systems?

It is also worth assessing the costs and resource involved in order to make changes you would recommend even if it seems there is no immediate prospect of getting that work done. It is likely that sooner or later someone will want to re-visit the work, especially if it is not meeting its goals. Then it will be useful to know whether it can be fixed with a small injection of resource or whether it requires a major re-working, or even abandoning and starting afresh. Such a prospect may seem daunting, but if you can learn lessons and avoid repeating mistakes the next time around, then that can be seen as a positive. If one of the problems with the project was the lack of input from the information team early on, then it is worth making sure for the sake of the information department and the organization as a whole that the same mistake does not happen again. If you demonstrate well enough how you would have done things differently, you might even get to be in charge next time!

ISKO UK 2013 – provisional programme


I will probably be on the other side of the Atlantic when the ISKO UK conference takes place in July in London, UK. I will be sorry to miss it, because the committee have brought together a diverse, topical, and fascinating collection of speakers.

ISKO UK excels in unifying academic and practitioner communities, and the conference promises to investigate the barriers that separate research from practice and to seek out boundary objects that can bring the communities together.

This is demonstrated in person by the keynote speakers Patrick Lambe of Straits Knowledge and Martin White of Intranet Focus Ltd – both respected for their commercial as well as academic contributions to the field of Knowledge Organization.

Amidst what is already shaping up to be a very full and varied programme, the presentations by Jeremy Tarling and Matt Shearer (BBC News) and Jarred McGinnis and Helen Lippell (Press Association) will show how research in semantic techniques is now being put to practical use in managing the fast-flowing oceans of information that news organizations handle.

The programme also includes a whole session on combining ontologies with other tools, as well as papers on facet analysis and construction of controlled vocabularies. There’s even some epistemology to please pure theoreticians.

Now you tell me! – Information at the right time


My posts have been a bit thin on the ground this year because I have just relocated from the UK to Canada, to start a new job. This has been a lot of fun, but hard work.

The process has made me acutely aware of the problems of delivering information at the time when it is the most useful. This is often overlooked in information and knowledge management, but getting the timing wrong can render extremely valuable information almost worthless if it is forgotten, lost, or misunderstood.

Knowledge takes time

When I started my previous job, it was a new role, so I had to spend time researching, collecting, and collating relevant knowledge and information from existing employees and previous projects. There was no one person I could talk to in order to get an overview and there was no single story for me to join. I enjoyed the research process, but from an organizational learning point of view, the less time and energy each new employee has to spend merely to find out what they need to do to be effective in their job, the better. I find corporate storytelling, as advocated by knowledge managers such as David Gurteen, very useful, especially when forging connections between “technical” and “business” teams and transferring tacit knowledge. Telling helpful stories is easier said than done for many reasons, but in this post I will focus on the time factor.

When I came to hand over my responsibilities to my successor, I found I could construct “histories” of projects explaining how we had arrived where we are reasonably easily, because time in the past could be compressed into simple linear sequences – we spent a year debating this, then we received a directive to do that, then we reassessed where we were, etc. Such a narrative can answer many questions, explaining scope change, for example. However, when trying to explain what must be done next, it is important not simply to explain sequence, but also timing and conditionality – when judgement calls need to be made. The “history” has to become a “prophecy” and prophecies tend to be vague and enigmatic for a reason!

If you are handing over a project that has very clearly defined next steps, you can produce a set of instructions that say “when this happens, do this, which will result in something else, and then you will have to do the next thing”. You can even add expected timings. For example “when the technical team deliver the next software release (scheduled for April), you will have one week to check that the data has been transferred correctly, and if there are any errors, inform the technical team before signing off the release”. However, if the project is broader in scope, or less well defined, it is far harder. For example, “following the release of the new search system, you will need to manage user satisfaction with the results, which you could do by statistical analysis, customer satisfaction surveys, interviews, liaison with the UX teams, etc., but you will need to put forward a proposal for funding to do such work and your proposal will be more successful if it is tailored in response to circumstances – available resource or budgets, departmental politics, etc. So, it may be better to conduct some small-scale interviews with a handful of users or it may be worth pushing for an expensive technical solution.” Even this simple example of a potential future judgement call has become extremely difficult to turn into a coherent story. Perhaps you could create a series of “time capsule” handover notes – if this happens, open the red envelope (or file or video), if the opposite, open the blue one. Of course, this is hugely time consuming to set up, so not very practical, and you are never going to be able to include all possible future universes.

Successful succession planning is about training someone to make good decisions by themselves so that they can take over from you without instructions covering every eventuality. Again, this is about timing. You have to train them over time, by involving them in decisions and teaching them how to make good decisions well ahead of when you leave. Try to do it on the afternoon before you head out of the office and both you and they will struggle!

Information overload is just information arriving at the wrong time

I have received huge amounts of information, advice, and instructions over the past few months about the move. Instructions arriving too soon can be reassuring as you can “practice” what you need to do, but this may be very inefficient in workplace settings, as it is unlikely people will remember clearly enough over time, so they will end up re-reading the same instructions – possibly even several times – and may even become over-confident that they know exactly what to do, fail to double-check at the appropriate moment, and then make unnecessary mistakes. So, it is worth thinking about the nature of the instructions and planning the delivery of the information as a project in its own right.

It is very tempting to offload everything onto somebody as soon as you think of it, but what you are actually doing is handing over the management of the delivery of that information to them. If you are doing this as a deliberate delegation to a junior member of staff, it may be a valid part of their job, but then it should be recognised that they are taking on those extra tasks as part of their workload. For example, if your employee is going to have to fill in a set of project completion forms, assemble paperwork, and write the report before a certain date, it is not necessarily helping them if you send them all the project completion forms before they have even started the project, as that burdens them with both the records management of those documents and the project management aspect of remembering that the forms need to be submitted a week after the report itself is completed and submitted. If there is some way you can set up a business process that automatically triggers the sending of the project completion forms after submission of the project, the employee can think about other things. It may be helpful to tell them about the forms, and give them a list of the key documents that they will need (monthly budget statements, for example) to keep throughout the project, but you probably don’t need to make them read through every single question in detail at the start of the project, when they are absorbing huge amounts of other new – and significant – information.

Business process efficiency requires thinking carefully about the timing of information delivery. Considering potential “information overload” points from a human perspective is too often forgotten. It may seem simpler and cheaper to lump all the instructions together and deliver them at once from a technical point of view, but it is worth weighing up whether the savings on the technical side are really worth more than the savings on the human cognitive efficiency side.

Interaction design is about timing

The same principles remain relevant on the “small scale” timings of interactions with a particular website or business system. Good interaction designers pay attention to the limits of human memory and concentration to make sure that users are delivered information in digestible chunks and in easy to follow steps and stages. Poor interaction design leaves the user clicking to and fro or opening multiple windows to try to remember something that was on a different screen.

If you have ever felt the need to make notes on a piece of paper while using a particular interface, tell the designer! Deciding which information the user needs to see early in an interaction and which can be left to the end can be controversial. For example, quoting a base price on a home page and only adding compulsory additions, such as taxes, supplements, and other charges at the last stage of booking could be described as a mis-use of the notion that users need to be delivered information in small manageable chunks. However, providing detailed directions about how to contact the delivery company in case of a query right at the end of a purchasing process is likely to be a good use of information timing, as the user has finished thinking about everything else by that stage.

An example of “perfect timing” I encountered recently was when I travelled into central London to go to a particular shop. I had already made the decision to go there and knew the address, but I did not know whether to turn left or right when leaving the tube station. To my relief, there on the wall in front of me at the top of the escalators was a large advertisement for the shop, telling me the exit to take, the direction to turn, and the number of metres to walk. I did not care about the metre-precise location when I decided to visit the shop, and I did not want to be burdened with that level of detail when I planned my route to the tube station, but getting the information at the exact time that I needed it was extremely helpful.

Same problem – different scales

Balancing the urge to deliver all available information at once against the information overload and cognitive burden this can cause applies whether dealing with long-term knowledge management (typically years to decades), medium-term business process management (months to years), or short-term interactions with an interface (minutes). There is no simple formula for getting this right, but these different information delivery situations have much in common. Thinking about your knowledge management as a form of interaction or your interaction design as a business process may help offer a new perspective on tricky problems of creating well-timed narratives and telling people not just what they need to know, but telling them what they need to know when they need to know it.

Tag you’re it – but is your tag the same as my tag?


Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are lots of different types of tag, from HTML tags for marking up web pages to labels in databases, and this can lead to all sorts of confusion and problems in projects.

Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.

 1) A tag is a free text keyword you add as part of the metadata of something to help search

Free text tags are usually uncontrolled and unstructured (folksonomic) simple strings of characters. Free text tagging functionality is usually no more than a simple text field in a database, so it is very easy to implement technically. For limited collections, collections with low research value, user-generated collections, and collections that are not otherwise catalogued, free text tags provide the ability to do at least some searching (e.g. if you have a small collection of still images that have no other metadata attached, any subject keyword tags are better than none).

Folksonomic tagging was hailed as revolutionary a few years ago because it is cheap. However, it fails to solve numerous information retrieval problems. Most significantly, if you use free text tags, you need to do additional work later on to disambiguate them (apple, apple, or apple – company, record label, fruit?) or add any structure to them, including grouping synonyms to provide a more complete search (a search for “automobile” can’t retrieve items tagged “car” unless you can associate these synonyms in a synset, synonym ring, or thesaurus).
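To make the synonym problem concrete, here is a minimal Python sketch of a synonym ring sitting in front of a free text tag search. The ring contents, item IDs, and function names are all invented for illustration; a real system would usually hold the rings in a thesaurus tool or search engine configuration rather than in code.

```python
# Hypothetical synonym rings: each set groups terms treated as equivalent for search.
SYNONYM_RINGS = [
    {"car", "automobile", "motor car"},
    {"film", "movie"},
]

# Hypothetical items, each carrying a set of free text tags.
ITEMS = {
    "photo-001": {"car", "street"},
    "photo-002": {"automobile"},
    "photo-003": {"apple"},
}


def expand_query(term):
    """Return the search term plus any synonyms from the ring it belongs to."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return set(ring)
    return {term}


def search(term):
    """Return the sorted IDs of items tagged with the term or one of its synonyms."""
    wanted = expand_query(term)
    return sorted(item_id for item_id, tags in ITEMS.items() if tags & wanted)


print(search("automobile"))  # ['photo-001', 'photo-002']
```

Without the ring, the search for “automobile” would only find photo-002. Note that the ring does nothing for disambiguation: the tag “apple” on photo-003 still does not say whether it means the company, the record label, or the fruit.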

 2) A tag is a keyword that is selected from a controlled vocabulary or authority list

Controlled keywords are more useful than free text tags because they reduce the problems of synonyms and disambiguation by making the person applying the tag choose from a limited set of terms. It is easier to build a thesaurus containing all the controlled keywords, as you are not trying to encompass every possible word in the language (or indeed any string of characters that somebody might make up). Controlled vocabularies also avoid apparently trivial but practically problematic issues such as spelling variants and errors and use of abbreviations. However, flat controlled vocabularies become very unwieldy once you have more than about 50 terms. There may be a numeric identifier associated with a controlled vocabulary keyword, but it is usually only some kind of local internal system identifier.

Tags taken from controlled lists are often used for process-driven functions, as opposed to search or browse functions. So, someone might apply a tag from a controlled list to designate a workflow status of an asset. For such processes, it is usually fairly straightforward to control the vocabulary options available, so that only a few labels are available. Linguistic nuances are not so important in such contexts – people are just taught what the options are and usually it doesn’t occur to them to try to use other terms. If the available terms are inadequate, this often means there is something wrong with the business process or the system design (e.g. we need a workflow state of “pending approval” but we only have the labels “created” and “approved”).
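The workflow example could be sketched in Python along these lines. The list of states and the function name are assumptions made for illustration, not taken from any particular system; the point is simply that anything outside the controlled list is rejected rather than stored.

```python
# Hypothetical controlled list of workflow states, including "pending approval".
WORKFLOW_STATES = ("created", "pending approval", "approved")


def normalise_status(raw):
    """Return the controlled term for the raw input, or raise if it is not in the list."""
    # Tidy case and spacing so trivial variants map to the same controlled term.
    candidate = " ".join(raw.lower().split())
    if candidate not in WORKFLOW_STATES:
        raise ValueError(f"{raw!r} is not in the controlled list: {WORKFLOW_STATES}")
    return candidate


print(normalise_status("  Pending   Approval "))  # pending approval
```

A misspelling such as “aproved” raises an error instead of quietly creating a new workflow state, which is exactly the spelling-variant problem that free text tags leave open.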

 3) A tag is a keyword that is selected from a taxonomy

Once a controlled vocabulary becomes too long to be easy to navigate, it can be “chunked up” or “broken down” into a taxonomy.

Keywords in taxonomies are more useful than keywords in flat controlled vocabularies because the taxonomy holds information about the relationships between terms. The simplest relationship is broader>narrower (parent>child). This means you can “chunk up” your flat vocabulary list into sections, for example to make it easier to navigate or to offer ways a researcher can modify their search (didn’t find what you wanted – try a broader search, too many results – try a narrower search). Usually internal IDs are used to connect the label displayed in the UI with the graph that contains the relationships between the concepts.

Often a taxonomy will also hold associative (“see also”) relationships, effectively extending the taxonomy to be a taxonomy-with-thesaurus.
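As a rough sketch of how broader and narrower relationships support “try a broader search” and “try a narrower search”, the following Python keeps a tiny taxonomy keyed by internal ID. The IDs, labels, and function names are hypothetical, and associative (“see also”) links are left out to keep it short.

```python
# Hypothetical taxonomy: each internal concept ID maps to a display label
# and the ID of its broader (parent) concept.
TAXONOMY = {
    "c1": {"label": "vehicles", "broader": None},
    "c2": {"label": "cars", "broader": "c1"},
    "c3": {"label": "sports cars", "broader": "c2"},
    "c4": {"label": "bicycles", "broader": "c1"},
}


def broader(concept_id):
    """Return the ID of the broader concept, or None at the top of the tree."""
    return TAXONOMY[concept_id]["broader"]


def narrower(concept_id):
    """Return the sorted IDs of the immediate narrower (child) concepts."""
    return sorted(cid for cid, c in TAXONOMY.items() if c["broader"] == concept_id)


def descendants(concept_id):
    """Return every narrower concept ID beneath the given one, depth first."""
    found = []
    for child in narrower(concept_id):
        found.append(child)
        found.extend(descendants(child))
    return found


def labels(concept_ids):
    """Translate internal IDs into the labels shown in the UI."""
    return [TAXONOMY[cid]["label"] for cid in concept_ids]


print(labels(descendants("c1")))  # ['cars', 'sports cars', 'bicycles']
```

A search on “vehicles” can be widened to include everything returned by `descendants("c1")`, while a user looking at “sports cars” can be offered `broader("c3")` as a wider search.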

 4) A tag is a type of Uniform Resource Identifier (URI)

This is the Linked Open Data approach. There are important differences between tag URIs and other types of tag. URI tags have to conform to various technical conventions and standards that support interoperability. In Linked Open Data contexts, URI tags are usually public and shared, rather than being private IDs. Relationships between URIs are usually expressed in an ontology, rather than a taxonomy (although the ontology may associate taxonomies or the ontology may be derived from pre-existing taxonomies).
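
A very reduced sketch of the idea, in Python: tags are URIs, and statements about them are subject, predicate, object triples. The asset and concept URIs are hypothetical (they use the reserved example.org domain), and the in-memory set stands in for what would normally be a proper triple store. The two predicate URIs are the published Dublin Core “subject” and SKOS “broader” properties.

```python
# Hypothetical namespace for the example assets and concepts.
EX = "http://example.org/"

# Published vocabulary terms used as predicates.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Each statement is a (subject, predicate, object) triple of URIs.
TRIPLES = {
    (EX + "asset/photo-42", DCTERMS_SUBJECT, EX + "concept/dogs"),
    (EX + "asset/video-7", DCTERMS_SUBJECT, EX + "concept/mammals"),
    (EX + "concept/dogs", SKOS_BROADER, EX + "concept/mammals"),
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which assets are tagged with the "dogs" concept?
for s, _, _ in match(predicate=DCTERMS_SUBJECT, obj=EX + "concept/dogs"):
    print(s)
```

Because the tag is a shared URI rather than a private ID, anyone else who uses the same URI is, by construction, talking about the same concept.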

 5) A tag is metadata added to a web page for search engines to index

It is possible to add any of the above types of tag to a web page (you can say a web page is just another type of asset). Differences between tags on assets and tags on web pages are usually to do with the ways those tags are stored and how they are used by other systems (i.e. a stock management system will need different information to a search engine). Search engine optimisation (SEO) bad practices led to a decline in the use of keyword tagging for search engine indexing, although the Semantic Web returns to the principle that content creators are the best people to index their content (see next section).

For web pages, the tags are often added in the header information, along with other instructions to the browser. On individual assets (e.g. photos, videos) in content or asset management systems, the tags are often held in a particular field in a database. For Linked Open Data systems (whether managing web pages, traditional assets, or combinations of both), the tag URIs and their relationships (triples) are usually stored in a triple store, rather than a conventional database.
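
For the web page case, a short Python sketch using the standard library’s HTML parser shows roughly how an indexer might pull keyword tags out of the header. The sample page and the class and function names are made up for the example, and it assumes the keywords are held in a comma-separated `<meta name="keywords">` element.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords"> elements."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


def extract_keywords(html):
    """Return the keyword tags found in a page's meta elements."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# A made-up page, purely for illustration.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta name="keywords" content="taxonomy, tagging, metadata">
</head><body><p>Hello</p></body></html>"""

print(extract_keywords(SAMPLE_PAGE))
```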

With web pages, tagging can become very complex, as there might be a mixture of URI tags and basic labels, and a web page can be a complex information system in its own right, containing sub-elements such as audio and video content that itself might have various tags.

 6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing

The language that is used to write web pages (HTML) is often described as comprising tags. So, you tag up flat text with instructions that tell the browser “this is a heading”, “this is a paragraph” etc. With the advent of HTML5 and shared markup vocabularies, more and more semantic information is being included in these tags. Search engines can use this information, for example to create more specific indexes.

So, when you ask someone if the content is tagged, and they say yes, it is always worth checking you both actually mean the same thing!

Libraries, Media, and the Semantic Web meetup at the BBC


In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. Videos of the event are available on YouTube.


Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.


Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.


Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.

Dan Brickley’s “compare and contrast” involving Universal Decimal Classification was wonderful, and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems far in advance of the invention of computers. He showed an example of “search log analysis” from 1912: queries sent to the Belgian international bibliographic service, an early “semantic question answering service”. The “search terms” were fascinating and not so very different to the sort of things you’d expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.

BBC Olympics online

Silver Oliver described how BBC Future Media is pioneering semantic technologies and using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic rich websites can be built and kept up to the minute, even once results start to pour in.

World Service audio archives

Yves Raimond talked about a BBC Research & Development project to automatically index World Service audio archives. The World Service, having been a separate organisation to the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed through speech-to-text software, and then automated indexing was applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort out the good tags from the bad (and occasionally offensive!) tags, but this is still a lot easier than tagging everything from scratch.

Local is the new social – location data startups


A few weeks ago I attended an event by Dreamstake featuring a collection of startup companies that are using open geographical data, such as the data released by Ordnance Survey. There was much championing of the money to be made by using data that organisations release for free. This seems obvious to me – someone else has paid to do all the preparatory work so others can cash in. No-one seems concerned about the ethics of this. If UK taxpayers have paid for the OS work to be done, should they not automatically be shareholders in any company that profits from the fruits of this investment?

The companies showcased all had new twists on using location data. What I found especially interesting was the emphasis on context. When selling services, place alone is not enough. Time is important and so are the circumstances. So, a businesswoman on a work trip will probably want different products and services from those she wants when she is out with her family.

The speakers were:
James Pursey of Sortedapp
Sadiq Qasim of LoYakk
Craig Wareham of Viewranger
Tim Buick of Streetpin

Location-based marketing

James Pursey opened by giving a brief history of location-based marketing, pointing out that this was pioneered by the Yellow Pages. His company attempts to match time, place, and location and makes the consumer the advertiser and the service provider the respondent. He explained this as a “reverse eBay”. Instead of providers advertising their products and services, consumers post details of what they want, e.g. “I need someone to clean my flat before my wife gets home” (the data game still seems to be a man’s world!). The message is then pushed to local cleaners, who have a window of time in which to respond. The app works on the location of your mobile phone, but you can alter that on a map so that you can be at home but arrange a service to be provided near your workplace, etc.
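
The matching described here can be sketched in a few lines of Python. This is not how Sortedapp actually works, which was not explained in that detail; the providers, coordinates, radius, and response window are all invented, and the distance calculation is the standard haversine formula.

```python
import math
from datetime import datetime, timedelta


def distance_km(a, b):
    """Great-circle distance in km between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def providers_to_notify(request, providers, radius_km=3.0):
    """Return names of providers offering the requested service within the radius."""
    return [
        p["name"] for p in providers
        if p["service"] == request["service"]
        and distance_km(p["location"], request["location"]) <= radius_km
    ]


def can_respond(request, now):
    """A provider may only respond while the request's time window is open."""
    return now <= request["posted_at"] + request["response_window"]


# Hypothetical request: the location can be the phone's position or a point picked on a map.
REQUEST = {
    "service": "cleaning",
    "location": (51.5074, -0.1278),
    "posted_at": datetime(2013, 1, 15, 9, 0),
    "response_window": timedelta(hours=1),
}

# Hypothetical providers.
PROVIDERS = [
    {"name": "Sparkle Cleaners", "service": "cleaning", "location": (51.5155, -0.1410)},
    {"name": "Far Away Cleaners", "service": "cleaning", "location": (51.4545, -2.5879)},
    {"name": "Green Fingers", "service": "gardening", "location": (51.5080, -0.1280)},
]

print(providers_to_notify(REQUEST, PROVIDERS))
print(can_respond(REQUEST, datetime(2013, 1, 15, 9, 30)))
```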

Chatting about a shared experience

Sadiq Qasim explained that LoYakk – local yakking – recognises that conversations are often focused around specific places and events. Social media links tend to be based on static lists of friends, with very little contextualisation. However, social relationships and conversations are often transient. You might want to chat to someone at a conference, but that doesn’t mean you want to become lifelong friends. By creating an app that mirrors the real world nature of such connections, people can drop in, chat to people in the vicinity and leave again. Events such as conferences, arts and sporting events, and holiday destinations are particularly well suited to this approach.

Mobile is local

Craig Wareham described Viewranger, which is an app for outdoorsy people. It combines guidebook information, a social community, and a marketplace, all based around location, and has become popular with search and rescue teams.

Tim Buick of Streetpin emphasised that about half of searches on mobiles are, perhaps unsurprisingly, for something local. However, time is very relevant. He might be near a great pub that has a special offer on beer, but he doesn’t want to be told about it at 8 in the morning when he has just dropped the kids off at nursery. In the same location 12 hours later with his mates, the offer might be just what they want. The right information, to the right person, at the right place and at the right time is what matters.
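
The pub example boils down to filtering on time as well as place. A minimal Python sketch, with made-up offers and time windows (nothing here reflects how Streetpin itself is built):

```python
from datetime import time

# Hypothetical offers, each tied to a venue and a window of the day when it is relevant.
OFFERS = [
    {"venue": "pub", "text": "Special offer on beer", "start": time(17, 0), "end": time(23, 0)},
    {"venue": "cafe", "text": "Coffee and pastry deal", "start": time(7, 0), "end": time(10, 30)},
]


def relevant_offers(nearby_venues, now, offers=OFFERS):
    """Return offers for venues that are nearby and whose time window includes now."""
    return [
        o["text"] for o in offers
        if o["venue"] in nearby_venues and o["start"] <= now <= o["end"]
    ]


# Same place, twelve hours apart.
print(relevant_offers({"pub", "cafe"}, time(8, 0)))
print(relevant_offers({"pub", "cafe"}, time(20, 0)))
```

The location is identical in both calls; only the time changes what is worth showing. Knowing who the person is with would be a further filter of the same kind.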

The distinction between what is useful information and what is marketing becomes very blurred.

Place, space, maps

Thinking about this event along with the Shape of Knowledge event’s discussions of maps of cyberspace, and the Superhuman exhibition’s raising the question of the potential of transhumans to relate to space in a different way to current humans, made me wonder how location-based services will change in future. The technologically enhanced human will, presumably, need maps that make sense to computers as well as maps that make sense in real space and time. Navigation and location are most likely going to change beyond all recognition.