Tag you’re it – but is your tag the same as my tag?

Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are lots of different types of tag, from HTML tags for marking up web pages to labels in databases, and this can lead to all sorts of confusion and problems in projects.

Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.

 1) A tag is a free text keyword you add as part of the metadata of something to help search

Free text tags are usually uncontrolled and unstructured (folksonomic) simple strings of characters. Free text tagging functionality is usually no more than a simple text field in a database, so it is very easy to implement technically. For limited collections, collections with low research value, user-generated collections, and collections that are not otherwise catalogued, free text tags provide the ability to do at least some searching (e.g. if you have a small collection of still images that have no other metadata attached, any subject keyword tags are better than none).

Folksonomic tagging was hailed as revolutionary a few years ago because it is cheap. However, it fails to solve numerous information retrieval problems. Most significantly, if you use free text tags, you need to do additional work later on to disambiguate them (apple, apple, or apple – company, record label, fruit?) or add any structure to them, including grouping synonyms to provide a more complete search (a search for “automobile” can’t retrieve items tagged “car” unless you can associate these synonyms in a synset, synonym ring, or thesaurus).
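
To make the synonym problem concrete, here is a minimal sketch of a synonym ring sitting on top of free text tags. The asset IDs, the tags, and the function names are all made up for illustration, and a real system would do this inside its search index rather than in a dictionary.

```python
# Hypothetical free text tags attached to three assets.
ASSET_TAGS = {
    "img-001": {"car", "red"},
    "img-002": {"automobile", "vintage"},
    "img-003": {"apple", "fruit"},
}

# A synonym ring: every term in a ring is treated as equivalent for search.
SYNONYM_RINGS = [
    {"car", "automobile", "motor car"},
]


def expand_query(term):
    """Return the term plus every synonym that shares a ring with it."""
    term = term.lower()
    expanded = {term}
    for ring in SYNONYM_RINGS:
        if term in ring:
            expanded |= ring
    return expanded


def search(term, use_synonyms=True):
    """Return the sorted IDs of assets tagged with the term (or a synonym)."""
    wanted = expand_query(term) if use_synonyms else {term.lower()}
    return sorted(asset for asset, tags in ASSET_TAGS.items() if tags & wanted)
```

Without the ring, `search("automobile", use_synonyms=False)` finds only the asset tagged "automobile"; with it, the asset tagged "car" comes back too. Note that the ring does nothing for the "apple" problem, which needs disambiguation rather than synonym grouping.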

 2) A tag is a keyword that is selected from a controlled vocabulary or authority list

Controlled keywords are more useful than free text tags because they reduce the problems of synonyms and disambiguation by making the person applying the tag choose from a limited set of terms. It is easier to build a thesaurus containing all the controlled keywords, as you are not trying to encompass every possible word in the language (or indeed any string of characters that somebody might make up). Controlled vocabularies also avoid apparently trivial but practically problematic issues such as spelling variants and errors and use of abbreviations. However, flat controlled vocabularies become very unwieldy once you have more than about 50 terms. There may be a numeric identifier associated with a controlled vocabulary keyword, but it is usually only some kind of local internal system identifier.

Tags taken from controlled lists are often used for process-driven functions, as opposed to search or browse functions. So, someone might apply a tag from a controlled list to designate a workflow status of an asset. For such processes, it is usually fairly straightforward to control the vocabulary options available, so that only a few labels are available. Linguistic nuances are not so important in such contexts – people are just taught what the options are and usually it doesn’t occur to them to try to use other terms. If the available terms are inadequate, this often means there is something wrong with the business process or the system design (e.g. we need a workflow state of “pending approval” but we only have the labels “created” and “approved”).
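
As a rough sketch of a controlled list used for a process, the workflow states from the example above can be modelled as an enumeration, so that anything outside the list is rejected when the tag is applied. The class name, function name, and the asset dictionary are assumptions for illustration only.

```python
from enum import Enum


class WorkflowStatus(Enum):
    """The complete controlled list of workflow labels (illustrative)."""
    CREATED = "created"
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"


def apply_status_tag(asset, label):
    """Attach a workflow status, rejecting labels outside the controlled list."""
    try:
        status = WorkflowStatus(label.strip().lower())
    except ValueError:
        allowed = ", ".join(s.value for s in WorkflowStatus)
        raise ValueError(
            f"'{label}' is not a controlled term; choose from: {allowed}"
        ) from None
    asset["status"] = status
    return asset
```

Because the list is closed, a request for a label that is not there (say "rejected") shows up as an error, which is usually the prompt to revisit the business process rather than to let people type whatever they like.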

 3) A tag is a keyword that is selected from a taxonomy

Once a controlled vocabulary becomes too long to be easy to navigate, it can be “chunked up” or “broken down” into a taxonomy.

Keywords in taxonomies are more useful than keywords in flat controlled vocabularies because the taxonomy holds information about the relationships between terms. The simplest relationship is broader>narrower (parent>child). This means you can “chunk up” your flat vocabulary list into sections, e.g. to make it easier to navigate or to offer ways a researcher can modify their search (didn’t find what you wanted? Try a broader search. Too many results? Try a narrower search). Usually internal IDs are used to connect the label displayed in the UI with the graph that contains the relationships between the concepts.

Often a taxonomy will also hold associative (“see also”) relationships, effectively extending the taxonomy to be a taxonomy-with-thesaurus.
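
Here is a minimal sketch of how internal IDs might connect display labels to broader>narrower relationships, and how that supports broadening or narrowing a search. The IDs, labels, and function names are invented for the example; real taxonomy tools store this in their own data models.

```python
# Internal ID -> label shown in the UI (illustrative values).
LABELS = {1: "vehicles", 2: "cars", 3: "bicycles", 4: "sports cars"}

# Child ID -> parent ID: the broader>narrower relationships.
BROADER = {2: 1, 3: 1, 4: 2}


def broader(concept_id):
    """Return the parent concept ID, or None for a top-level concept."""
    return BROADER.get(concept_id)


def narrower(concept_id):
    """Return the sorted IDs of the direct children of a concept."""
    return sorted(child for child, parent in BROADER.items() if parent == concept_id)


def descendants(concept_id):
    """Return every concept below this one, for an 'include narrower terms' search."""
    found = []
    for child in narrower(concept_id):
        found.append(child)
        found.extend(descendants(child))
    return found
```

A search on "cars" that returns too little can be retried on `broader(2)`, and a search on "vehicles" can be expanded to everything in `descendants(1)`.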

 4) A tag is a type of Uniform Resource Identifier (URI)

This is the Linked Open Data approach. There are important differences between tag URIs and other types of tag. URI tags have to conform to various technical conventions and standards that support interoperability. In Linked Open Data contexts, URI tags are usually public and shared, rather than being private IDs. Relationships between URIs are usually expressed in an ontology, rather than a taxonomy (although the ontology may associate taxonomies or the ontology may be derived from pre-existing taxonomies).
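
To show the difference from a plain label, here is a small sketch in which tags are URIs and the relationships are held as subject-predicate-object triples. The `http://example.org/` URIs are hypothetical placeholders; the SKOS and Dublin Core property URIs are the published ones. A plain Python set stands in for a triple store, which is an assumption made only to keep the example self-contained.

```python
EX = "http://example.org/"  # hypothetical namespace for this example
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Each statement is a (subject, predicate, object) triple.
TRIPLES = {
    (EX + "photo/42", DCT_SUBJECT, EX + "concept/apple-fruit"),
    (EX + "concept/apple-fruit", SKOS_PREF_LABEL, "apple"),
    (EX + "concept/apple-company", SKOS_PREF_LABEL, "Apple"),
    (EX + "concept/apple-fruit", SKOS_BROADER, EX + "concept/fruit"),
}


def objects(subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)
```

The photo is tagged with the URI for the fruit, not with the string "apple", so the company and the fruit can share a label without being confused.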

 5) A tag is metadata added to a web page for search engines to index

It is possible to add any of the above types of tag to a web page (you can say a web page is just another type of asset). Differences between tags on assets and tags on web pages are usually to do with the ways those tags are stored and how they are used by other systems (i.e. a stock management system will need different information to a search engine). Search engine optimisation (SEO) bad practices led to a decline in the use of keyword tagging for search engine indexing, although the Semantic Web returns to the principle that content creators are the best people to index their content (see next section).

For web pages, the tags are often added in the header information, along with other instructions to the browser. On individual assets (e.g. photos, videos) in content or asset management systems, the tags are often held in a particular field in a database. For Linked Open Data systems (whether managing web pages, traditional assets, or combinations of both), the tag URIs and their relationships (triples) are usually stored in a triple store, rather than a conventional database.

With web pages, tagging can become very complex, as there might be a mixture of URI tags and basic labels, and a web page can be a complex information system in its own right, containing sub-elements such as audio and video content that itself might have various tags.

 6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing

The language that is used to write web pages (HTML) is often described as comprising tags. So, you tag up flat text with instructions that tell the browser “this is a heading”, “this is a paragraph” etc. With the advent of HTML5 and vocabularies such as schema.org, more and more semantic information is being included in these tags. Search engines can use this information, for example to create more specific indexes.
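
The following sketch shows three senses of “tag” living in one small page: the HTML element tags themselves, keyword metadata in the header, and a schema.org type attached via microdata attributes. The page content is invented, and the parser is only a toy built on Python's standard `html.parser` module, not a description of how any search engine actually indexes pages.

```python
from html.parser import HTMLParser

# An invented page for illustration.
PAGE = """<html><head>
<title>Apple pie</title>
<meta name="keywords" content="apple, pie, baking">
</head><body>
<h1>Apple pie</h1>
<p itemscope itemtype="https://schema.org/Recipe"><span itemprop="name">Apple pie</span></p>
</body></html>"""


class TagCollector(HTMLParser):
    """Collect element names, header keywords, and schema.org item types."""

    def __init__(self):
        super().__init__()
        self.elements = []    # HTML tags in the markup sense
        self.keywords = []    # keyword tags from the page header
        self.item_types = []  # semantic types from microdata attributes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.elements.append(tag)
        if tag == "meta" and attrs.get("name") == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])


collector = TagCollector()
collector.feed(PAGE)
```

After parsing, `collector.elements` holds the markup tags, `collector.keywords` holds the header keywords, and `collector.item_types` holds the schema.org type, which is exactly the mix that makes “is it tagged?” an ambiguous question.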

So, when you ask someone if the content is tagged, and they say yes, it is always worth checking you both actually mean the same thing!

Libraries, Media, and the Semantic Web meetup at the BBC

In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. The videos are available on YouTube.

Historypin

Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.

MIMAS

Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.

rNews

Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.

schema.org

Dan Brickley’s “compare and contrast” of Universal Decimal Classification with schema.org was wonderful and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems far in advance of the invention of computers. He showed an example of “search log analysis” from 1912, queries sent to the Belgian international bibliographic service – an early “semantic question answering service”. The “search terms” were fascinating and not so very different to the sort of things you’d expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.

BBC Olympics online

Silver Oliver described how BBC Future Media is pioneering semantic technologies and using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic rich websites can be built and kept up to the minute, even once results start to pour in.

World Service audio archives

Yves Raimond talked about a BBC Research & Development project to automatically index World Service audio archives. The World Service, having been a separate organisation to the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed through speech-to-text software, and then automated indexing was applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort out the good tags from the bad (and occasionally offensive!) tags, but this is still a lot easier than tagging everything from scratch.

SLA Conference in Chicago

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians” and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself as one of a crowd! The conference has a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially the editorially curated and spam-free search engine Blekko, the legal information vendors FastCase and Law360, and EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes – Dublin Core, FOAF, SKOS, which provide structures but no meanings – and semantic schemes – taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted permission to the Smithsonian to release the work on the web under an open licence.

The books were scanned and processed with OCR, which achieved 99.97% accuracy. That sounds impressive, but it actually means 5,000-12,000 errors, far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team next identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage. They are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.
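
As a back-of-envelope check on those OCR figures, the sketch below works out what size of text the quoted error counts would imply. It assumes the 99.97% figure is measured per unit of text (per character, say), which the talk did not specify as far as I recall, so the unit and the resulting totals are my inference rather than figures from the presentation.

```python
from fractions import Fraction

# 99.97% accuracy leaves an error rate of 0.03%, i.e. 3 errors per 10,000 units.
ACCURACY = Fraction("0.9997")
ERROR_RATE = 1 - ACCURACY


def implied_units(errors):
    """Return how many units of text would produce this many errors."""
    return round(errors / ERROR_RATE)


def expected_errors(units):
    """Return the expected error count for a text of the given size."""
    return round(units * ERROR_RATE)
```

On that assumption, 5,000 errors corresponds to roughly 16.7 million units and 12,000 errors to 40 million, which shows how a tiny error rate still adds up across fifteen volumes.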

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.

Data Ghosts in the Facebook Machine by Fantasticlife

Understanding how data mining works is going to become increasingly important. There is a huge gap in popular and even professional knowledge about what organisations can now do “under the surface” with our data. For a very clear and straightforward explanation of how social graphs work and why we should be paying attention, read Data Ghosts in the Facebook Machine.

Transforming and extending classification systems – UDCC Seminar

This post is the last in a series about the UDC Consortium international seminar in The Hague, 19-20 September 2011.

Joan S. Mitchell, OCLC (USA), and Marcia Lei Zeng, Kent State University (USA), supported by Maja Žumer, University of Ljubljana (Slovenia), talked about extending models for controlled vocabularies to classification systems: modelling DDC with FRSAD, which led to interesting discussions about their concepts of “nomen” and “thema”.

Along with my former colleague Andy Heather, now CTO at DODS Parliamentary Communications Ltd, I talked about our work on the data migration of classifications from a legacy database into new taxonomy management software, presenting our paper: Transformation of a legacy UDC-based classification system: exploiting and remodelling semantic relationships.

Conclusions

The key ideas I took away from the conference were:
1) Classifications and ontologies are not an either/or choice. They have different properties and different strengths and weaknesses and so should be chosen according to the task in hand.
2) It is difficult to turn a classification into an ontology, but easy to turn an ontology into a taxonomy, so if you don’t have either to start with and can’t decide, an ontology is a safer bet. If you already have a classification, you need to think carefully about whether it is worth turning it into a fully modelled ontology, as converting it to RDF or SKOS is likely to be much easier. However, at the moment, RDF and SKOS have limitations, especially in handling faceted taxonomies, so beware of losing semantic richness in the conversion process. Polyhierarchies offer a way of expressing facets in SKOS.
3) Vocabulary control and alignment continue to be significant issues for the Semantic Web.
4) Ontology curation, management, and semantic alignment will be increasingly important issues for the Semantic Web.

Slides and audio recordings of all 21 talks can now be downloaded from the conference website.

Conference proceedings are published by Ergon Verlag and can now be purchased/ordered online from http://seminar.udcc.org/2011/php/proceedings.php.