The “semantic web” is an expression that has been used for long enough now that I for one feel I ought to know what it means, but it is hard to know where to start when so much about it is presented in “techspeak”. I am trying to understand it all in my own non-technical terms, so this post is aimed at “semantic wannabes” rather than “semantic aficionados”. It suggests some ways of starting to think about the semantic web and linked open data without worrying about the technicalities.

At a very basic level, the semantic web is something that information professionals have been doing for years. We know about using common formats so that information can be exchanged electronically, from SGML to HTML and then XML. In the 1990s, publishers used “field codes” to identify subject areas so that articles could be held in databases and re-used in multiple publications. In the library world, metadata standards like MARC and Dublin Core were devised to make it easier to share cataloguing data. The semantic web essentially just extends these principles.

So, why all the hype?

There is money to be made and lost on semantic web projects, and investors always want to try to predict the future so they can back winning horses. The recent Pew Report (thanks to Brendan for the link) shows the huge variety of opinions about what the semantic web will become.

At one extreme, the semantic evangelists hope that we can create a highly sophisticated system that can make sense of our content by itself. The familiar arguments are that this will free humans from mundane tasks so that we can do more interesting things, be better informed and connected, and build a better and more intelligent world. They describe systems that “know” that when you book a holiday you need to get from your house to the airport, that you must remember to reschedule an appointment you made for that week, and that you need to send off your passport tomorrow to renew it in time. This is helpful and can seem spookily clever, but is no more mysterious than making sure my holiday booking system is connected to my diary. There are all sorts of commercial applications of such “convenience data management” and lots of ethical implications about privacy and data security too, but we have had these debates many times in the past.

A more business-focused example might be that a search engine will “realise” that when you search for “orange” you mean the mobile phone company, because it “knows” you are a market analyst working in telecoms. It will then work out that documents that contain the words “orange” and “fruit” are unlikely to be what you are after, and so won’t return them in search results. You will also be able to construct more complex questions, for example to query databases containing information on tantalum deposits and compare them with information about civil conflicts, to advise you on whether the price of mobile phone manufacture is likely to increase over the next five years.

Again, this sort of thing can sound almost magical, but is basically just compiling and comparing data from different data sets. This is familiar ground. The key difference is that for semantically tagged datasets much of the processing can be automated, so data crunching exercises that were simply too time-consuming to be worthwhile in the past become possible. The evangelists can make the semantic web project sound overwhelmingly revolutionary and utopian, especially when people start talking in sci-fi sounding phrases like “extended cognition” and “distributed intelligence”, but essentially this is the familiar territory of structuring content, adding metadata, and connecting databases. We have made the cost-benefit arguments for good quality metadata and efficient metadata management many times.
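
To make the “compiling and comparing” point concrete, here is a minimal sketch in Python. The country names, figures, and field names are all invented for illustration, and real data sets would be far larger and messier. The only thing that matters is that both data sets use the same identifier for the same thing, which is what lets the comparison run without a human in the middle.

```python
# Two hypothetical data sets that share a common key (a country name).
# All names and numbers are invented for illustration only.
tantalum_deposits = {
    "Country A": 1200,  # estimated reserves in tonnes (made up)
    "Country B": 300,
    "Country C": 4500,
}

civil_conflicts = {
    "Country A": True,   # True = ongoing conflict (made up)
    "Country B": False,
    "Country D": True,   # no deposit data exists for this one
}


def reserves_at_risk(deposits, conflicts):
    """Return the deposits located in countries flagged as in conflict."""
    return {
        country: tonnes
        for country, tonnes in deposits.items()
        if conflicts.get(country, False)  # unknown countries count as no conflict
    }


at_risk = reserves_at_risk(tantalum_deposits, civil_conflicts)
share = sum(at_risk.values()) / sum(tantalum_deposits.values())
print(at_risk)
print(f"{share:.0%} of listed reserves are in conflict areas")
```

If one data set said “Country A” and the other used a different spelling or code for the same place, the match would silently fail, which is exactly the problem that shared standards are meant to solve.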

At the other extreme, the semantic web detractors claim that there is no point bothering with standardised metadata, because it is too difficult politically and practically to get people to co-operate and use common standards. In terms familiar to information professionals, you can’t get enough people to add enough good quality metadata to make the system work. Clay Shirky in “Ontology is overrated” argued that there is no point in trying to get commonality up front because it is just too expensive (there are no “tag police” to tidy up), so you just have to let people tag randomly and then try to work out what they meant afterwards. This is a great way of harvesting cheap metadata, but doesn’t help if you need to be sure that you are getting a sensible answer to a question. It only takes one person to mistag something for your dataset to be polluted and your complex query to generate false results. Shirky himself declares that he is talking about the web as a whole, which is fun to think about, but how many of us (apart from Google) are actually engaged in trying to sort out the entire web? Most of us just want to sort out our own little corner.

I expect the semantic web to follow all other standardisation projects. There will always be a huge “non-semantic” web that will contain vast quantities of potentially useful information that can’t be accessed by semantic web systems, but that is no different from the situation today where there are huge amounts of content that can’t be found by search engines (the “invisible web” or “deep web”), from proprietary databases to personal collections in unusual formats. No system has been able to include everything. No archive contains every jotting scrawled on a serviette, no bookshop stocks every photocopied fanzine, no telephone directory lists every phone number in existence. However, they contain enough to be useful for most people most of the time. No standard provides a perfect universal lingua franca, but common languages increase the number of people you can talk to easily. The adoption of XML is not universal, but for everyone who has “opted in” there are commercial benefits. Not everybody uses PDF files, but for many people they have saved hours of time previously spent converting and re-styling documents.

So, should I join in?

What you really need to ask is not “What is the future of the semantic web?” but “Is it worth my while joining in right now?”. The answer depends on your particular context and circumstances. It is much easier to think about a project, product, or set of services that is relevant to you than to worry about what everyone else is doing. If you can build a product quickly and cheaply using what is available now, it doesn’t really matter whether the semantic web succeeds in its current form or gets superseded by something else later.

I have made a start by asking myself very basic questions like:

  • What sort of content/data do we have?
  • How much is there?
  • What format is it in at the moment?
  • What proportion of that would we like to share? (Is it all public domain? Is some of it commercially sensitive? Are there data protection or rights restrictions?)

If you have a lot of data in well-structured and open formats (e.g. XML), there is a good chance it will be fairly straightforward to link your own data sets to each other, and link your data to external data. If there are commercial and legal reasons why the data can’t be made public, it may still be worth using semantic web principles, but you might be limited to working with a small data set of your own that you can keep within a “walled garden” – whether or not this is a good idea is another story for another post.
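
For anyone curious what “linking” might look like in practice, here is a rough Python sketch under some loud assumptions: the XML element and attribute names, the place names, and the population figures are all invented, and a real project would more likely use shared identifiers (such as URIs) than plain place names. It simply reads a small XML catalogue and attaches a fact from an outside source wherever the two share a value.

```python
import xml.etree.ElementTree as ET

# A hypothetical extract from our own catalogue, already held as XML.
# Element and attribute names are invented for this sketch.
OUR_XML = """
<articles>
  <article id="a1" place="Northtown"><title>Market report</title></article>
  <article id="a2" place="Eastville"><title>Transport survey</title></article>
  <article id="a3" place="Westport"><title>Port statistics</title></article>
</articles>
"""

# A hypothetical external data set keyed on the same place names.
EXTERNAL_POPULATION = {"Northtown": 750000, "Eastville": 200000}


def link_articles(xml_text, external):
    """Attach external facts to each article that shares a place name."""
    linked = []
    for article in ET.fromstring(xml_text).findall("article"):
        place = article.get("place")
        linked.append({
            "id": article.get("id"),
            "title": article.findtext("title"),
            "place": place,
            # None when the external source has nothing for this place
            "population": external.get(place),
        })
    return linked


linked = link_articles(OUR_XML, EXTERNAL_POPULATION)
for row in linked:
    print(row)
```

The third article comes back with no population, which illustrates the point made earlier: linked data is only as complete as the sources that have “opted in”.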

A more creative approach is to ask questions like:

  • What content/data services are we seeking to provide?
  • Who are our key customers/consumers/clients and what could we offer them that we don’t offer now?
  • What new products or services would they like to see?
  • What other sources of information do they access (users usually have good suggestions for connections that wouldn’t occur to us)?

Some more concrete questions would be ones like:

  • What information could be presented on a map?
  • How can marketing data be connected to web usage statistics?
  • Where could we usefully add legacy content to new webpages?

It is also worth investigating what others are already providing:

  • What content/data out there is accessible? (e.g. recently released UK government data)
  • Could any of it work with our content/data?
  • Whose data would it be really interesting to have access to?
  • Who are we already working with who might be willing to share data (even if we aren’t sure yet what sort of joint products/projects we could devise)?

It’s not as scary as it seems

Don’t be put off by talk about RDF, OWL, and SPARQL, how to construct an ontology, and whether or not you need a triple store. The first questions to ask are familiar ones: who would you like to work with, what could you create if you could get your hands on their content, and what new creations might arise if you let them share yours? Once you can see the semantic web in terms of specific projects that make sense for your organisation, you can call on the technical teams to work out the details. What I have found is that the technical teams are desperate to get their hands on high quality structured content (our content) and are more than happy to sort out the practicalities. As content creators and custodians, we are the ones who understand our content and how it works, so we are the ones who ought to be seizing the initiative and starting to be imaginative about what we can create if we link our data.

A bit of further reading:
Linked Data.org
Linked Data is Blooming: Why You Should Care
What can Data.gov.uk do for me?