Essentials of Metadata and Taxonomy Conference
Monday 10th March, 2008
CBI Conference Centre, Centre Point Tower, LONDON
Organisers: Henry Stewart Events
Chaired by Madi Weland Solomon, former corporate taxonomist at Walt Disney, taxonomy consultant, and a highly entertaining and charismatic speaker, the conference also featured presentations by 10 other leading information experts, including the “grand master” of taxonomy, Seth Earley. A wealth of information and experience was presented, with additional contributions from leading software vendors, balanced by a round table discussion chaired by the principal of CMS Watch, Theresa Regli. Four more taxonomy consultants joined two panel sessions, one featuring “truth or dare” questions on hot topics, prompting a lively discussion session to close the proceedings.
The conference offered practical advice for anyone new to taxonomy and metadata about what to consider when devising a strategy and how to assess software, with insights into recent innovations made by some of the most information-savvy companies operating today, such as Siemens, Dow Jones, Walt Disney, Getty Images, the BBC, Woolworths, Diageo, Lowe Worldwide, and Icelandair.
The Message is the Medium: The Semantic Revolution
Madi Weland Solomon: Independent Taxonomy Consultant, Corporate Nomenclature Taxonomist for the Walt Disney Company, Project Associate for Standards and Vocabularies at the Getty Research Institute, Curatorial Assistant at the Eli Broad Art Foundation.
Key points: Search alone is woefully inadequate but is strengthened by metadata and taxonomy. Users need systems that are welcoming and familiar. Information is no longer scarce; attention is.
For content to be meaningful, it needs not just a creator but a user or consumer. Google, Yahoo!, Microsoft, Sun and other IT corporations are buying up huge tracts of land for data processing centres as the amount of information we generate expands exponentially. Typing two words into a search engine is a woefully inadequate way of trying to find anything relevant but search can be strengthened if it is supported by a classification scheme.
Metadata describes the object and a taxonomy provides the context so that the description makes sense, placing it in an intelligent system. Social tagging is limited in scope, and often produces ambiguous and irrelevant results, but there are ways to harness the power of user-generated content, in appropriate circumstances. The most successful of these projects happen when the mass of social tagging is placed within a well-defined structure.
New knowledge is being gained through Discovery Informatics, in which working with information about something becomes more important than working with the object itself. In a semantically enabled world, information is connected, so links and similarities are revealed. Separate information silos can be linked and modulated by a user to reflect their individual point of view.
There is currently no easy way of describing audio and visual content – it is not “self-describing”, so metadata is the key to being able to organise it. Metadata needs to be easy for users to understand and to reflect their needs.
Introduction to Metadata, Taxonomies, Taxonomy Governance, and DRM
Seth Earley: President and Founder of Earley & Associates Inc., Co-author of Practical Knowledge Management, founder of Boston Knowledge Management Forum, former adjunct professor at Northeastern University, founder of Taxonomy Community of Practice, co-founder of Search Community of Practice.
Key points: Taxonomy is an enabler of finding and sharing and the foundation of coherence. Taxonomies and metadata are part of a moving picture – by the time you do anything, things have changed.
The typical situation is disconnected systems, inconsistent metadata, and retrieval reliant on search, which makes it hard to find relevant information (for example, when one person has used an abbreviation in a document but another has used the full form of the word). Individuals often do not know how to turn a question into a meaningful search.
A common language is needed to link diverse information systems, such as digital asset management, customer relationship management, etc – all of which are focused on particular business processes. Information architecture is concerned with organising unambiguous data, such as prices and addresses. Semantic architecture is about exposing the nuances and ambiguities of meaning, helping users search more precisely.
Low-cost solutions, like social tagging, are appropriate for low-cost content, like emails or discussions about ongoing projects, but key documents, like approved methods or best practices, need to be easy for everyone to find. Taxonomy helps an organisation decide what is important and defines users, roles, and processes. Taxonomy enables navigation, but is not the same as navigation, as there are lots of ways to navigate through a taxonomy. Governance is necessary to avoid chaos and to prevent initial decisions from causing roadblocks and hampering future developments.
How Understanding Business Meaning Enables Business Change
John Jordan, Senior Consultant, Siemens
Key points: You cannot have business intelligence without understanding what the data held by the business is actually about. If you have a common business language, you can easily change parts of the system without affecting other parts, and you can easily communicate with the outside world.
In most organisations, many different words are used to describe the same things. By analysing the underlying data, you can draw out the shared meaning behind all the different terms. Conversely, some things that businesses think of as single items are often actually more complex.
Data analysis is not an end in itself, but when done as part of another project, it clarifies how data is related to business processes. By integrating business data, communication becomes possible, and change becomes straightforward. By understanding how all the data in an organisation is related conceptually, you are not dependent on any particular IT system. You know exactly what you need your IT system to do, and it becomes a way of supporting business needs instead of driving them or slowing them down.
It is better to have a common language that links your data, because this can operate as a “semantic bus” transferring data from one place to another. So, even if you have distinct IT applications, they can communicate with each other through the “semantic bus” (the middleware). Rather than trying to connect one application to another, you connect them all to the middleware. Instead of many point-to-point connections, each of which has to be rebuilt in every other application whenever just one application is replaced, all you need to do is make sure the new application can speak the “common language” and hop on the “semantic bus”. This also means that external applications that speak the common language can access your information systems. You could let suppliers look at your stock control system, for example, or pull in web-based information from outside sources.
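The “semantic bus” idea can be illustrated with a minimal sketch. This is not the Siemens implementation: the class name, the application names, and the field names below are all hypothetical, chosen only to show each application mapping its local field names to one shared vocabulary instead of to every other application.

```python
class SemanticBus:
    """Toy middleware: every application registers ONE mapping to the common language."""

    def __init__(self):
        self._to_common = {}    # app name -> {local field: common term}
        self._from_common = {}  # app name -> {common term: local field}

    def register(self, app, field_map):
        """field_map maps the application's local field names to common terms."""
        self._to_common[app] = dict(field_map)
        self._from_common[app] = {common: local for local, common in field_map.items()}

    def translate(self, source, target, record):
        """Convert a record from the source app's fields to the target app's fields."""
        to_common = self._to_common[source]
        from_common = self._from_common[target]
        # Step 1: express the record in the common language.
        common = {to_common[k]: v for k, v in record.items() if k in to_common}
        # Step 2: express the common record in the target's local terms.
        return {from_common[k]: v for k, v in common.items() if k in from_common}


# Hypothetical applications and field names, for illustration only.
bus = SemanticBus()
bus.register("stock_control", {"item_no": "product_id", "qty": "quantity_on_hand"})
bus.register("web_shop", {"sku": "product_id", "in_stock": "quantity_on_hand"})

result = bus.translate("stock_control", "web_shop", {"item_no": "A17", "qty": 4})
print(result)
```

Replacing the web shop in this sketch would mean registering one new mapping for the new application; the stock control mapping would not need to change.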
The Next Wave: Using Wikipedia as a controlled vocabulary
Chris Sizemore and Silver Oliver, BBC Future Media and Technology
Key points: The principles of indexing languages can lead to powerful information creation using the rich resources available for free on the Internet.
The BBC wanted to link external content to its topic pages, but needed an indexing language so that subjects could be represented in a consistent way and similar subject matter on external sites could be identified correctly. People have started to link to Wikipedia pages not to provide background information but to disambiguate terms on their websites, so a network of sites linked to Wikipedia has developed. This inspired the idea that Wikipedia page titles could perhaps act as a common interface. The Lucene search engine was used to index BBC content and Wikipedia pages and then the two sets of results were compared with a similarity algorithm. The matching Wikipedia page titles were then applied as tags to the original content. The tags are then used to pull in links from other websites, in order to build up complex web pages with text and images aggregated from external sources. This method has provided the BBC with a quick and cheap way to tag the masses of content in its archive.
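The matching step can be sketched roughly as follows. This is only an illustration of the general idea: it uses a plain word-count cosine similarity as a stand-in rather than Lucene or the BBC's actual similarity algorithm, and the function names and sample texts are invented for the example.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_wikipedia_tags(content, pages, top_n=1):
    """Return the titles of the reference pages most similar to the content."""
    content_vec = Counter(tokens(content))
    scored = sorted(
        ((cosine(content_vec, Counter(tokens(body))), title) for title, body in pages.items()),
        reverse=True,
    )
    return [title for score, title in scored[:top_n] if score > 0]


# Invented stand-ins for Wikipedia pages: title -> page text.
pages = {
    "Bath, Somerset": "Bath is a city in Somerset England known for its Roman baths and Georgian architecture",
    "Bathtub": "A bathtub is a container for holding water in which a person may bathe",
}
article = "Visitors to the city admire the Georgian architecture and the Roman remains in Somerset"

tags = suggest_wikipedia_tags(article, pages)
print(tags)
```

Because the tag is a page title rather than a free-text keyword, two sites that both arrive at the same title are unambiguously about the same subject.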
The system was tested using suggestions from the audience and worked well with general content.
The Inward and Outward-Facing Taxonomy: Enhancing Search
Bill Hutchison, Taxonomy consultant, founder of Wordmap
Key points: If content has value, it is worth taking the time to organise so it can be found easily.
Searches often produce lots of irrelevant results because many words have multiple meanings. “Bath” may be something to bathe in or a city in Somerset. People with low literacy, low technical literacy, or using foreign languages find it particularly hard to construct effective searches. Customers often use different search terms to companies. People typically only use one or two words in a search and expect to get good results. Search engines use search algorithms that make sense mathematically but that do not necessarily produce helpful results. A taxonomy can be described as a service that helps searchers by showing different options for different possible meanings, by letting them drill down through a subject via links to more detailed content (narrower terms), and by suggesting related content (associated terms).
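A small sketch of how a taxonomy might serve those three kinds of help to a searcher. The data structure, the function name, and the sample terms are assumptions made for illustration and do not describe Wordmap or any other product.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str                                    # one unambiguous meaning of a search word
    narrower: list = field(default_factory=list)  # more detailed topics to drill down into
    related: list = field(default_factory=list)   # associated terms worth suggesting


# Hypothetical taxonomy fragment: an ambiguous search word -> its possible meanings.
TAXONOMY = {
    "bath": [
        Concept("Bath (city in Somerset)",
                narrower=["Roman Baths", "Bath Abbey"],
                related=["Somerset"]),
        Concept("Bath (bathroom fixture)",
                narrower=["Freestanding baths", "Corner baths"],
                related=["Showers"]),
    ],
}


def search_options(query):
    """Return every meaning the taxonomy knows for a one-word query."""
    return TAXONOMY.get(query.strip().lower(), [])


options = search_options("Bath")
for concept in options:
    print(concept.label, "| narrower:", concept.narrower, "| related:", concept.related)
```

A one-word query such as “Bath” then yields a choice of meanings, each with its own narrower and related terms, rather than a single mixed list of results.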
Representing Metadata for Better Project Communication
Rachel Lovinger, Senior Content Strategist, Avenue A | Razorfish
Key points: Stakeholders need to understand what it is you as a taxonomist can offer them. Clear communication and good record-keeping are the keys to a successful project.
There is a lot of jargon associated with metadata and taxonomy strategies. It often helps to show clients sites that work well because they have good metadata in place and to explain how metadata will make their site work better. Clear simple diagrams, humour, and making sure you provide clear concrete examples rather than using abstract terms make it easier for clients to understand and become enthusiastic about your plans.
It is important that project goals and what you need from the client are clearly defined. It is worth thinking about metadata and taxonomy early on in a web development project and tracking all important decisions and keeping good records so that you can reverse a decision if you need to later on. Build consensus early, as rifts tend to grow, not shrink.
In the first phase of the project, define the broad scope and structure and find out who the users are and what they want. In the second phase identify the information relationships that need to be expressed and finally work on the detailed taxonomy and metadata plan.
Taxonomy and its impact on Web 2.0
Chris Foye, Solutions Architect, Dow Jones
Key points: Solutions should get the right content to users when they need it and how they need it. Folksonomy can be a rich resource but is only possible when users enjoy tagging for fun or feel they can profit from voluntarily adding tags.
New levels of metadata can be created by harnessing the mass of users of the Internet. Social networking sites like Facebook are becoming increasingly influential as campaigning groups. Folksonomy (social tagging) is an attempt to harness communities to create a meaningful organisation of information. Folksonomy divides into two types: narrow, where the material to be tagged is intended for a small, perhaps private, group (such as family photographs), and broad (such as del.icio.us), where the intention is to make metadata (on del.icio.us, in the form of bookmarks) publicly available.
A popular aspect of folksonomy is the immediacy of feedback: on sites such as Flickr, as a tag is added, the tags added by other users are suggested. This encourages the re-use of the same tags, reinforcing their usage, so that the most popular tags are amplified. Predictive text is also used to encourage consistency in spelling.
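The feedback loop can be sketched in a few lines. This is an assumed, simplified mechanism and not how Flickr or any other site actually works; the function name and the sample tag history are invented for the example.

```python
from collections import Counter


def suggest_existing_tags(prefix, tag_history, limit=3):
    """Suggest tags other users have already applied, most popular first."""
    # Normalise case and whitespace so "London" and "london" count as one tag.
    counts = Counter(tag.strip().lower() for tag in tag_history)
    wanted = prefix.strip().lower()
    matches = [(tag, n) for tag, n in counts.items() if tag.startswith(wanted)]
    # Sort by descending popularity, then alphabetically for a stable order.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return [tag for tag, _ in matches[:limit]]


# Invented history of tags previously applied by other users.
history = ["London", "london", "londres", "Londn", "london", "lake"]

suggestions = suggest_existing_tags("lon", history)
print(suggestions)
```

Offering the most-used tag first is what amplifies popular tags: each user who accepts the suggestion increases its count and pushes variant spellings further down the list.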
Users tend to want a reward and will only tag content when they either enjoy the process or can see some personal gain. Amazon launched a tagging option but it was hardly used because people could not see how the tags would help them in future. Other disadvantages are “tag bombing” where the same person repeatedly submits a tag in order to distort its importance, or “opinion tags”, where negative comments are submitted as tags. However, folksonomy can provide a rich source of candidate terms and popular synonyms for use in a formal taxonomy.
Automated tagging software like the Open Calais API produced by Reuters is good at identifying entities, but not relationships. For some areas, such as Records Management, content needs to be carefully controlled and so formal taxonomies are more helpful than social tagging.
EXECUTIVE ROUND TABLE
Moderator: Madi Solomon
Matt Adamson, Taxonomy Consultant, Dow Jones
Bob Bater, Vice-chair ISKO UK, Principal Associate InfoPlex Associates
Theresa Regli, Principal, CMS Watch
Georgina Wistow, process manager, BT Design, formerly Director of Content Operations at Getty Images
“Truth or Dare”: the panel were asked to choose either truth, in which case they had to be totally honest, or dare, when they had to reveal something they would not normally disclose at a conference.
Will RDF and OWL replace formal taxonomies?
Data is becoming more structured. The corporate sector has embraced taxonomies and this seems to be spreading to the commercial sector, but another format may appear that will replace everything.
How has your experience informed your consultancy?
You have to go with something. It doesn’t have to be perfect as long as you are agile enough to make changes.
Is search the new taxonomy?
Taxonomy of itself isn’t useful; it is only useful when implemented in a user interface. Actually they are both closely inter-related.
What’s the biggest promise that software vendors can’t fulfil?
“It does that out of the box.” In practice, 90% of buyers have to configure and customise the software themselves. However, vendors often get blamed when the problem is poor metadata or bad processes.
What is the most requested consultancy service?
It’s usually when people have come to an impasse or have some kind of problem.
What comes first, metadata or taxonomy?
Usually metadata, as you need to block out the top-level areas and map out metadata fields like format and audience. This is changing, however: it is now more common for organisations to want to bring together a number of taxonomies to match a metadata scheme.
Businesspeople seemed to latch on to the term taxonomy. They seem to like it.
AUDIENCE ROUND TABLE
Theresa Regli, Principal, CMS Watch
Key points: Software vendors overstate the benefits and minimise the weaknesses of their products. It is often hard to tell exactly what is being offered. A clever implementation of inexpensive software tools will usually work better than a poor implementation of the most expensive tools.
Two types of software package were considered – auto-categorisation and auto-classification. Auto-categorisation software – which might more usefully be called category auto-creation – goes through documents and suggests categories. Auto-classification software automatically assigns sets of pre-defined metadata to documents. The basic software has not changed much in the last 10-15 years.
Key vendors are Autonomy, FAST, IBM OmniFind, MS Sharepoint (sort of), SAP, Apache Lucene, Open Text, Recommind & H5, ISYS, Thunderstone, Inxight (now Business Objects) and Interwoven Meta Tagger. Clustering products include those offered by Oracle, Endeca and Vivisimo. Some work by processing rules, others look for patterns in the text. This can produce absurdities: when a common but unhelpful phrase (like “Contact Andrea”) is used repeatedly, the software identifies it as important. Software tools will never solve problems of poorly applied metadata. Auto-suggestion software can be more helpful, but time is needed to train the software manually. Accuracy rates tend to start at 40-60% and with training can be raised to 70%, but it is unusual to get better rates. Software tends to be based either on statistical analysis of text (how often words occur) or linguistic analysis (which words occur near other words).
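The “Contact Andrea” problem is easy to reproduce with a naive statistical approach. The sketch below simply counts two-word phrases across a set of documents; it is a deliberately crude illustration, not the method used by any of the products named above, and the sample documents are invented.

```python
import re
from collections import Counter


def top_phrases(documents, n=1):
    """Return the n most frequent two-word phrases across all documents."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        # Count every pair of adjacent words as a candidate phrase.
        counts.update(zip(words, words[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(n)]


# Invented documents that share only a boilerplate footer.
docs = [
    "Quarterly sales report. Contact Andrea",
    "Holiday rota for the warehouse. Contact Andrea",
    "Supplier price list. Contact Andrea",
]

suggested = top_phrases(docs)
print(suggested)
```

Pure frequency ranks the footer above every genuine subject, which is why such tools need rules, training, or human review before their suggestions are accepted as categories.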
Where software is very helpful is in recognising entities – all instances of a word, for example. It can also make bulk processing possible. Certain types of software are more appropriate to certain types of content – for example, highly structured content where patterns tend to repeat will work well with pattern-based software.
Image recognition software is mainly used in law enforcement and commercial applications are limited. Some work has been done on extracting content from captions in videos.
Preservation, the Archive, and the Working Library
Sue Malden, Media Archive consultant, Chair of FOCAL International
Roger Roberts, Head of Cultural Facilities, RTBF
Kevin Price, Chief Operational Officer, BAFTA
BAFTA opened up its archive, which helped it to fulfil its charitable remit of providing access. Metadata is the key to communication between organisations and the outside world. In the future, files will be insignificant, and rich, context-linked objects will be the norm.
Gaining investment for metadata projects is always difficult. Libraries tend to focus on exploitation of resources and archives on preservation, but that distinction does not necessarily remain valid in the digital world. Granularity is an issue when dealing with films, but standards have been developed to govern the use of metadata to describe scenes and clips.
Jeremy Bentley from Smart Logic talked about Semaphore, their award-winning taxonomy and classification support for the Google Appliance. Semaphore presents users with the familiar Google interface but, based on the underlying taxonomy, can then suggest broader and narrower term options, or suggest similar terms if the search has returned no results.
Finnur Magnusson from Vyre described their Unify taxonomy management software, which is used by many major commercial companies to support a range of operation processes from catalogue creation and online advertising to content and records management.
Petra Tant from aDAM described their “media intelligent” digital asset management software.