controlled vocabularies

Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are lots of different types of tag from HTML tags for marking up web pages to labels in databases and this can lead to all sorts of confusion and problems in projects.

Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.

1) A tag is a free text keyword you add as part of the metadata of something to help search

Free text tags are usually uncontrolled and unstructured (folksonomic) simple strings of characters. Free text tagging functionaliy is usually no more than a simple text field in a database, so it very easy to implement technically. For limited collections, collections with low research value, user-generated collections, and collections that are not otherwise catalogued, free text tags provide the ability to do at least some searching (e.g. if you have a small collection of still images that have no other metadata attached, any subject keyword tags are better than none).

Folksonomic tagging was hailed as revolutionary a few years ago because it is cheap. However, it fails to solve numerous information retrieval problems. Most significantly, if you use free text tags, you need to do additional work later on to disambiguate them (apple, apple, or apple – company, record label, fruit?) or add any structure to them, including grouping synonyms to provide a more complete search (a search for “automobile” can’t retrieve items tagged “car” unless you can associate these synonyms in a synset, synonym ring, or thesaurus).

2) A tag is a keyword that is selected from a controlled vocabulary or authority list

Controlled keywords are more useful than free text tags because they reduce the problems of synonyms and disambiguation by making the person applying the tag choose from a limited set of terms. It is easier to build a thesaurus containing all the controlled keywords, as you are not trying to encompass every possible word in the language (or indeed any string of characters that somebody might make up). Controlled vocabularies also avoid apparently trivial but practically problematic issues such as spelling variants and errors and use of abbreviations. However, flat controlled vocabularies become very unwieldy once you have more than about 50 terms. There may be a numeric identifier associated with a controlled vocabulary keyword, but it is usually only some kind of local internal system identifier.

Tags taken from controlled lists are often used for process-driven functions, as opposed to search or browse functions. So, someone might apply a tag from a controlled list to designate a workflow status of an asset. For such processes, it is usually fairly straightforward to control the vocabulary options available, so that only a few labels are available. Linguistic nuances are not so important in such contexts – people are just taught what the options are and usually it doesn’t occur to them to try to use other terms. If the available terms are inadequate, this often means there is something wrong with the business process or the system design (e.g. we need a workflow state of “pending approval” but we only have the labels “created” and “approved”).

3) A tag is a keyword that is selected from a taxonomy

Once a controlled vocabulary becomes too long to be easy to navigate, it can be “chunked up” or “broken down” into a taxonomy.
Keywords in taxonomies are more useful than keywords in flat controlled vocabularies because the taxonomy holds information about the relationships between terms. The simplest relationship is broader>narrower (parent>child). This means you can “chunk up” your flat vocabulary list into sections, e.g. to make it easier to navigate, to offer ways a researcher can modify their search (didn’t find what you wanted – try a broader search, too many results – try a narrower search). Usually internal IDs are used to connect the label displayed in the UI with the graph that contains the relationships between the concepts.
Often a taxonomy will also hold associative (“see also”) relationships, effectively extending the taxonomy to be a taxonomy-with-thesaurus.

4) A tag is a type of Uniform Resource Identifier (URI)

This is the Linked Open Data approach. There are important differences between tag URIs and other types of tag. URI tags have to conform to various technical conventions and standards that support interoperability. In Linked Open Data contexts, URI tags are usually public and shared, rather than being private IDs. Relationships between URIs are usually expressed in an ontology, rather than a taxonomy (although the ontology may associate taxonomies or the ontology may be derived from pre-existing taxonomies).

5) A tag is metadata added to a web page for search engines to index

It is possible to add any of the above types of tag to a web page (you can say a web page is just another type of asset). Differences between tags on assets and tags on web pages are usually to do with the ways those tags are stored and how they are used by other systems (i.e. a stock management system will need different information to a search engine). Search engine optimisation (SEO) bad practices led to a decline in the use of keyword tagging for search engine indexing, although the Semantic Web returns to the principle that content creators are the best people to index their content (see next section).

For web pages, the tags are often added in the header information, along with other instructions to the browser. On indiviudal assets (e.g. photos, videos) in content or asset management systems, the tags are often held in a particular field in a database. For Linked Open Data systems (whether managing web pages, traditional assets, or combinations of both), the tag URIs and their relationships (triples) are usually stored in a triple store, rather than conventional database.

With web pages, tagging can become very complex, as there might be a mixture of URI tags and basic labels, and a web page can be a complex information system in its own right, containing sub-elements such as audio and video content that itself might have various tags.

6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing

The language that is used to write web pages (HTML) is often described as comprising tags. So, you tag up flat text with instructions that tell the browser “this is a heading”, “this is a paragraph” etc. With the advent of HTML5 and vocabularies such as schema.org, more and more semantic information is being included in these tags. Search engines can use this information, for example to create more specific indexes.

So, when you ask someone if the content is tagged, and they say yes, it is always worth checking you both actually mean the same thing!

Tag Archives: controlled vocabularies

Tag you’re it – but is your tag the same as my tag?

1) A tag is a free text keyword you add as part of the metadata of something to help search

2) A tag is a keyword that is selected from a controlled vocabulary or authority list

3) A tag is a keyword that is selected from a taxonomy

4) A tag is a type of Uniform Resource Identifier (URI)

5) A tag is metadata added to a web page for search engines to index

6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing

Pages

Recently

Blogroll