The Shape of Knowledge – review of ISKO UK event

6th September, 2012 Fran 1 comment
Estimated reading time 1–2 minutes

On Tuesday I attended a very interesting event about information visualization and I have written a review for the ISKO UK blog.

I was particularly fascinated by the ideas suggested by Martin Dodge of mapping areas that are not “space” and what this means for the definition of a “map”. So, the idea of following the “path” of a device such as a phone through the electromagnetic spectrum brings a geographical metaphor into a non-tangible “world”. Conversely, is the software and code that devices such as robots use to navigate the world a new form of “map”? Previously, I have thought of code as “instructions” and “graphs” but have always thought of the “graph” as a representation of coded instructions, visualized for the benefit of humans, rather than the machines. However, now that machines are responding more directly to visual cues, perhaps the gap between their “maps” and our “maps” is vanishing.


Digital Asset Management Techniques for Indexing Non-Textual Content – SLA Chicago

22nd August, 2012 Fran Start a conversation
Estimated reading time 2–3 minutes

David Riecks of Controlled Vocabulary gave a presentation about indexing images. He pointed out that metadata is all around us, but we don’t tend to notice it. He described the sort of metadata needed to make an asset “smart” and how organizations like the PLUS registry are attempting to provide a simple, one-stop shop for rights and licensing metadata. The Embedded Metadata Manifesto sets out details of metadata that needs to be included in image files to promote easy and legal re-use of content and so protect the rights of photographers and others in the content creation and related industries.

David also provided an extremely useful list of metadata resources, including a handy link to a website that checks whether metadata is being stripped from files at the point of upload.

Laura Fu talked us through the latest Digital Asset Management (DAM) implementation at Sears and the issues they face in indexing the images used in their product catalogues. She gained stakeholder buy-in with the slogan: “We’re here to save your assets”! I was also quietly pleased by her comment that “Sears have 1.1 million assets, but users …want taxonomies and tagging to make search more Google-like”.

Randall Marcinko of Marcinko Enterprises Inc. then talked about using different elements of assets to act as indexing mechanisms. He gave an example of where they were able to use the images associated with pieces of text as disambiguators to distinguish between the text. He also pointed out the dangers of trying to make every information project the same, and urged us to think carefully about what is needed. It is easy to fall into the trap of simply offering all clients the same solution, whether that works best for them or not. Depending on what you are trying to achieve, a simple list may be all that is needed, not a complex taxonomy or thesaurus, and the simpler the method of solving a problem, the easier and cheaper it is likely to be to implement.


Keeping your Taxonomy Fresh and Relevant – SLA Chicago

18th August, 2012 Fran 1 comment
< 1 minute

Matt Johnson from EMC gave an extremely clear and useful presentation, and I gave an overview of the taxonomy migration and revision project I have been working on for the past couple of years.

Matt and I were delighted to have such a big and lively audience for our session, especially as it was at 8 am! Thank you to everyone who joined us, to SLA’s Taxonomy division for organizing the session, to the session sponsor Gale Cengage Learning, and to Larry Lempert for moderating.


SLA Conference in Chicago

11th August, 2012 Fran Start a conversation
Estimated reading time 3–5 minutes

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians” and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself as one of a crowd! The conference has a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially the editorially curated and spam-free search engine Blekko, the legal information vendors FastCase and Law360, and EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes – Dublin Core, FOAF, SKOS, which provide structures but no meanings – and semantic schemes – taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.
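One way to see why asserting sameness is riskier than asserting relatedness is to look at how the two kinds of link behave in SKOS, where `skos:exactMatch` is symmetric and transitive while `skos:related` is symmetric but not transitive. The Python sketch below is only an illustration of that difference; the concept identifiers and the two link lists are invented, and the function names are hypothetical.

```python
# Hypothetical concept identifiers and mappings, invented for illustration.
EXACT_MATCH = [
    ("ours:Cars", "theirs:Automobiles"),
    ("theirs:Automobiles", "other:MotorVehicles"),
]
RELATED = [
    ("ours:Cars", "ours:Roads"),
    ("ours:Roads", "ours:Bridges"),
]


def equivalents(concept, pairs):
    """Follow equivalence links in both directions and through chains."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for a, b in pairs:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in found:
                    found.add(dst)
                    frontier.append(dst)
    found.discard(concept)
    return found


def related(concept, pairs):
    """Return direct neighbours only: association does not chain."""
    forward = {b for a, b in pairs if a == concept}
    backward = {a for a, b in pairs if b == concept}
    return forward | backward


if __name__ == "__main__":
    print(sorted(equivalents("ours:Cars", EXACT_MATCH)))
    print(sorted(related("ours:Cars", RELATED)))
```

In this toy data, two individually plausible equivalence claims chain together so that "Cars" ends up treated as the same as "MotorVehicles", which is a much broader concept. The association links stay local, so one loose "related" claim does less damage.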

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted permission to the Smithsonian to release the work on the web under an open licence.

The books were scanned using OCR, which achieved 99.97% accuracy. That sounds impressive, but it actually means 5,000–12,000 errors – far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team next identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage. They are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.
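As a rough check on the OCR figures, the arithmetic can be sketched in Python. This assumes the 99.97% figure is a per-character accuracy, which the talk summary does not state; the character counts are back-calculated from the quoted error range rather than taken from the project.

```python
# Assumption: the quoted accuracy is measured per character.
ACCURACY = 0.9997


def expected_errors(total_chars, accuracy=ACCURACY):
    """Expected number of misread characters in a text of the given length."""
    return total_chars * (1 - accuracy)


def implied_chars(errors, accuracy=ACCURACY):
    """Length of text that would yield the given number of errors."""
    return errors / (1 - accuracy)


if __name__ == "__main__":
    for errors in (5_000, 12_000):
        chars = implied_chars(errors)
        print(f"{errors:,} errors imply roughly {chars:,.0f} characters")
```

Under that assumption an error rate of 0.03% is about three wrong characters in every 10,000, so a corpus of tens of millions of characters would be enough to produce errors in the thousands.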

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.


New York Public Library and metadata

31st July, 2012 Fran Start a conversation
Estimated reading time 2–2 minutes

I spent a wonderful afternoon at the New York Public Library on July 20th, thanks to Phil Sutton, reference librarian, who was kind enough to talk to me about his work and introduce me to several of his colleagues in the NYPL Labs, website, and local history teams.

As the Library holds such vast and diverse collections, it is not surprising that the metadata work of the Labs team is varied and wide ranging. One project involves rationalising and mapping metadata across collections that use different standards, another involves creating metadata for content strategy and website navigation, while more experimental work includes looking to use Linked Data techniques to open up and cross reference data sets.

What’s on the Menu? is using crowd sourced help to transcribe the Library’s collection of restaurant menus. So far, they have completed 998,899 dishes transcribed from 14,872 menus, and are investigating ways of linking the data to enable researchers to make interesting connections. So far, the data is in a fairly raw form, but is available to access through an API.

The Labs team are also working on the Library’s numerous directories, with an emphasis on helping genealogists, starting with census data from 1940 in the DirectMe project.

Previous projects have opened up collections of stereographs and maps, as well as content related to musical theatre, theatrical lighting, and the Shelley-Godwin archive.


Photo metadata conference

5th July, 2012 Fran Start a conversation
Estimated reading time 7–11 minutes

I was very grateful to Sarah Saunders of Electric Lane for inviting me to speak at the CEPIC Conference at the IPTC Congress in May.

These are just a few of my personal highlights from a very full conference.

Image content for mobile devices

Dittmar Frohmann, Director of International Product at iStock and Getty Images, the keynote speaker of the day, covered a lot of ground, but I was struck by his recognition of the need for new business models for photo libraries. As has happened to the book publishing and music industries, the photo industries are reeling from the shock of the transition to a digital world.

Professional photographers are finding it harder to manage rights and licensing of their images, as digital copies are now so cheap and easy to produce and distribute around the world, and at the same time images taken on ubiquitous mobile phones have become fashionable. “Citizen photographers”, including those taking out-of-focus badly lit mobile phone photos, are producing huge numbers of images that often do not meet traditional professional standards. However, such images are seen as “authentic” and “intimate” and have become popular with consumers in an age of austerity where slick, aspirational hyper-reality and glamorous models (Photoshop handsome?) are increasingly failing to chime with ordinary people.

This means that “un-professional” images are actively being sought by advertising agencies. Photographic styles go in and out of fashion, but never before has it been so easy for “amateurs” to produce high resolution images. At the same time, image libraries find themselves faced with a deluge of digital files and have to manage these files to ensure they don’t inadvertently breach rights agreements, while trying to add value to their services.

For image libraries, rights management and search/retrieval have become the two hottest topics as the key areas where economies of scale can offer improvements over “DIY” online sales and marketing. Libraries are effectively aggregators, and therefore service providers – gathering independent collections and individual photographers in one place can provide a one-stop shop for purchasers. If this is combined with fast and easy rights and re-use clearing services, along with distribution, then the libraries can still provide a useful and profitable service to both the producers of content (the photographers) and the consumers.

(I was surprised that very little was said about an editorial role for image collections – another area where value can be added is through collection curation and branding. So, you know that the best place to get UK landscape shots is from such-and-such a collection, etc. However, this is much harder to maintain, manage, and promote.)

Image metadata

I gave an overview of the history of metadata for knowledge organisation, with an emphasis on aspects that are peculiar to image libraries. For example, still images do not come with text attached, so natural language processing and concept extraction techniques that can drive document and text-based search systems can only be a second step for image libraries, once some text has been generated to associate with stills.

I was very pleased that a couple of the key themes that I introduced in my talk were picked up and elaborated on by other presenters.

Linked Data and crowdsourcing

Mary Forster from Getty Images went into detail about Linked Data and how this is being used to enhance Getty’s services and image management, by using linked data concept URIs to index images. She explained the differences between text matching and concept linking, and how text matching is far more noisy and imprecise than concept linking, and how using concepts enables flexible management of metadata structures so that creation of complex associations can be automated.
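The contrast between text matching and concept linking can be shown with a toy index. This is a minimal sketch, not a description of Getty's system: the captions, image ids, and `example.org` concept URIs are all invented for illustration.

```python
# Hypothetical concept URIs, invented for illustration only.
JAGUAR_ANIMAL = "http://example.org/concept/jaguar-animal"
JAGUAR_CAR = "http://example.org/concept/jaguar-car"

# Hypothetical image records: a free-text caption plus indexed concept URIs.
IMAGES = [
    {"id": "img1", "caption": "A jaguar resting in the rainforest",
     "concepts": {JAGUAR_ANIMAL}},
    {"id": "img2", "caption": "Vintage Jaguar parked outside a hotel",
     "concepts": {JAGUAR_CAR}},
    {"id": "img3", "caption": "Panthera onca at the water's edge",
     "concepts": {JAGUAR_ANIMAL}},
]


def text_match(query, images):
    """Return ids of images whose caption contains the query string."""
    needle = query.lower()
    return [img["id"] for img in images if needle in img["caption"].lower()]


def concept_match(uri, images):
    """Return ids of images indexed with the given concept URI."""
    return [img["id"] for img in images if uri in img["concepts"]]


if __name__ == "__main__":
    print(text_match("jaguar", IMAGES))
    print(concept_match(JAGUAR_ANIMAL, IMAGES))
```

In this toy data the text search for "jaguar" picks up the car and misses the animal captioned by its Latin name, while the concept lookup returns both animal images and nothing else.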

Andrew Ellis from the My Paintings project with the Public Catalogue Foundation talked about how they had successfully managed crowdsourcing by putting in place a number of sophisticated ways of managing the capture of the metadata. For example, rather than only offering unconstrained free tagging, taggers were invited to select tags from a dictionary list, in order to disambiguate concepts. They were also invited to select from a number of pre-set facets driven by controlled vocabularies – image type, style, etc. This made it easy to integrate the free tagging within an existing navigational scheme.
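A constrained tagging workflow of this kind might look something like the following Python sketch. The dictionary entries, facet names, and allowed values are hypothetical and are not taken from the project; the point is only that typed words are resolved to disambiguated dictionary entries and facet values are checked against a controlled list.

```python
# Hypothetical dictionary: what a tagger types -> disambiguated choices.
TAG_DICTIONARY = {
    "bank": ["bank (river edge)", "bank (financial institution)"],
    "horse": ["horse (animal)"],
}

# Hypothetical facets, each driven by a small controlled vocabulary.
FACETS = {
    "image type": {"portrait", "landscape", "still life"},
    "style": {"impressionist", "abstract", "realist"},
}


def suggest_tags(typed):
    """Offer dictionary entries for a typed word instead of storing it raw."""
    return TAG_DICTIONARY.get(typed.strip().lower(), [])


def validate_facets(selection):
    """Return a list of problems; an empty list means the selection is valid."""
    problems = []
    for facet, value in selection.items():
        allowed = FACETS.get(facet)
        if allowed is None:
            problems.append(f"unknown facet: {facet}")
        elif value not in allowed:
            problems.append(f"{value!r} is not an allowed value for {facet}")
    return problems


if __name__ == "__main__":
    print(suggest_tags("Bank"))
    print(validate_facets({"image type": "portrait", "style": "cubist"}))
```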

Content-based image retrieval

Mathieu from Xerox then talked about content-based image retrieval. Xerox have been working on sophisticated image analysis techniques designed to find images that have similar qualities to other images. They have a series of algorithms that analyse image “texture” and create a “digital fingerprint” of an image. Other images with very similar fingerprints tend to look similar. This means that you can train the system with sets of example images, and it can then identify similar images in the collection. This can be used as an image autoclassification tool, as you can set up your training sets to be useful categories (famous landmarks, pop stars, tigers, etc.) and then sort your images into these categories. Xerox trained their system’s 706 categories using 1.5 million images.

The system works very well with distinct and easily recognisable images – iconic images like the Sydney Opera House for example – and on large collections where there are clear and obvious “hits” and “misses”. It doesn’t work well with concepts such as politics or history, as it is hard to come up with key images for the training set, nor moods – inspirational, happy, tranquil, etc. However, for large collections with no metadata, it offers a good way of adding structured metadata to make a collection navigable. Another interesting use is to identify duplicate images, so you could use it to assess the contents of a collection to find gaps (“we have hundreds of images of Tower Bridge, but none of the Golden Gate Bridge”, etc.).
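The general idea of training on example sets and then sorting by fingerprint similarity can be sketched with a very simple nearest-centroid classifier. This is not Xerox's algorithm, which was not described in detail. The three-number "fingerprints" below are made-up stand-ins for whatever features a real system would extract, and the duplicate threshold is an arbitrary assumption.

```python
import math

# Hypothetical training sets: category -> fingerprints of example images.
TRAINING = {
    "tigers": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "landmarks": [[0.1, 0.9, 0.8], [0.2, 0.8, 0.9]],
}


def centroid(vectors):
    """Average a list of equal-length fingerprints component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train(training_sets):
    """Reduce each category's examples to a single representative fingerprint."""
    return {name: centroid(vectors) for name, vectors in training_sets.items()}


def classify(fingerprint, centroids):
    """Assign the category whose centroid is closest to the fingerprint."""
    return min(centroids, key=lambda name: math.dist(fingerprint, centroids[name]))


def near_duplicates(fingerprints, threshold=0.05):
    """Return pairs of image ids whose fingerprints are almost identical."""
    ids = sorted(fingerprints)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if math.dist(fingerprints[a], fingerprints[b]) <= threshold
    ]


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify([0.7, 0.3, 0.2], model))
    print(near_duplicates({"a": [0.5, 0.5, 0.5], "b": [0.5, 0.5, 0.51]}))
```

The sketch also hints at why abstract concepts and moods are hard: the approach only works if the examples for a category cluster together in fingerprint space.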

Perhaps it even has a potential use for TV producers editing rushes on a shoot – “we already have hundreds of shots of the sunset over the mountains, but hardly any close-ups of skiers”, for example.

I guess one day there will be a market for “controlled imageries” – training sets of example images to use as basis for such autoclassification software.

You can try it here.

Rights, IPO, orphan works

Nancy Wolff and Antoinette Graves of the IPO talked about rights and the law. Nancy stated that the need to be found is becoming more critical. Orphan works legislation advocates in the US want to de-risk usage so that images can be used even when it is not clear who they belong to or the owner is known but cannot be found.
Nancy noted that proposals for rights registries are being enthusiastically supported by Google but also that whoever owns such registries will not only make a lot of money but will also control access to and usage of content.

Antoinette pointed out that in the UK at present there is no diligent search that will allow for the use of an “orphan work”. This makes it very hard for publishers to be sure that they will not be prosecuted. There is a notable difference between “old” orphan works in museums, etc. and “new” orphans caused by metadata stripping.

Future of image search and rights management

In the afternoon I attended an interesting breakout session on the future of search, with a large and impressive panel. Rights management was cited as a huge issue to resolve, with a call for slick seamless user-friendly payment systems, to enable people to buy images and re-use them legally, without friction and effort. Technology was seen as the answer to an essentially technology-created problem. Free distribution over the internet meant that people had a sense of entitlement – a sense that content ought to be free, mistaking the differences between free content and freedom of information.

Managing digital rights is not the same as imposing “lockout” DRM systems. There is a need to devise licensing methods that are based on understanding machine-to-machine communication, rights description metadata, etc. No-one wants to invest in content creation any more, largely because the protection of rights is so difficult, making content creation a very risky business. If this trend is to be reversed, technological solutions to the problems of rights clearances must be found.

Predictions for the future were that crowdsourcing would become increasingly important. Interestingly, crowdsourcing relies on the notion of people working for nothing, and I couldn’t help noticing the contrast between the professional photographers trying to stop “amateurs” destroying their living by providing images without expecting payment, but being perfectly happy for people to add metadata without being paid for their work.

The need to get money into the system somewhere in order to enable anyone to get paid was emphasised and I suppose when an industry is facing diminishing returns, everybody involved in the supply chain puts pressure on everyone else to cut their costs or work for nothing.
I can’t help thinking that the deluge of images from all sources is going to mean that findability – and hence metadata – will become even more significant as more and more images chase fewer and fewer users willing to pay for them.


Assessing information taxonomies using epistemology and the sociology of science

28th June, 2012 Fran Start a conversation
< 1 minute

I am delighted that the Journal of Documentation accepted my article about subjectivity and objectivity in taxonomy work for publication.

The article is based on the work I did for my MRes dissertation at UCL, and I am extremely grateful for the support of Vanda Broughton, everyone at the Department of Information Studies, the wonderful taxonomists and information professionals who helped me with my research, and ISKO UK.


Building bridges: Linking diverse classification schemes as part of a technology change project

19th June, 2012 Fran Start a conversation
< 1 minute

My paper about my work on the linking and migration of legacy classification schemes, taxonomies, and controlled vocabularies has been published in the journal Business Information Review.


Building, visualising and deploying taxonomies and ontologies; the reality – Content Intelligence Forum event

6th June, 2012 Fran Start a conversation
Estimated reading time 1–2 minutes

I have been trying to get to the Content Intelligence Forum meetups for some time as they always seem to offer excellent speakers on key topics that don’t tend to get the attention they deserve, so I was delighted to be able to attend Stephen D’Arcy’s talk a little while ago on taxonomies and ontologies.

Stephen has many years of experience designing semantic information systems for large organisations, ranging from health care providers, to banks, to media companies. His career illustrates the transferability and wide demand for information skills.

His 8-point checklist for a taxonomy project was extremely helpful – Define, Audit, Tools, Plan, Build, Deploy, Governance, Documentation – as were his tips for managing stakeholders, IT departments in particular. He warned against the pitfalls of not including taxonomy management early enough in search systems design, and the problems that you can be left with if you do not have a flexible and dynamic way of managing your taxonomy and ontology structures. He also included a lot of examples that illustrated the fun aspects of ontologies when used to create interesting pathways through entertainment content in particular.

The conversation after the talk was very engaging and I enjoyed finding out about common problems that information professionals face, including how best to define terms, how to encourage clear thinking, and how to communicate good research techniques.


Isn’t search the same as browse?

11th March, 2012 Fran Start a conversation
Estimated reading time 4–6 minutes

I nearly wept when one of our young rising IT stars queried in a meeting why we had separated “search” and “browse” as headings for our discussions on archive navigation functionality. So, to spare me further tears, here are some distinctions and similarities. There won’t be anything new for information professionals, but I hope it will be useful if any of your colleagues in IT need a little help. I am sure this is far from comprehensive, so please leave additions and comments!

Differences between search and browse

Search is making a beeline to a known target, browse is wandering around and exploring.
Search is for when you know what you are looking for, browse is for when you don’t.
Search is for when you know what you are looking for exists, browse is for when you don’t.

Search expects you to look for something that is findable, browse shows you the sort of thing you can find.
Search is for when you already know what is available in a collection or repository, browse is how you find out what is there, especially if you are a newcomer.
Search is difficult when you don’t know the right words to use, browse offers suggestions.
Search is a quickfire answer, browse is educative.
Search is about one-off actions, browse is about establishing familiar pathways that can be followed again or varied with predictable results.

Search relies on the seeker to do all the thinking, browse offers suggestions.
Search is a tricky way of finding content on related topics, browse is an easy way of finding related content.
Search is difficult when you are trying to distinguish between almost identical content, browse can highlight subtle distinctions.
Search rarely offers completeness, browse often offers completeness.

Search is pretty much a “black box” to most people, so it is hard to tell how well it has worked, browse systems are visible so it is easy to judge them.
Search uses complex processing that most people don’t want to see, browse uses links and connections that most people like to see.
Search is based on calculations and assumptions that are under the surface, browse systems offer frameworks that are more open.

Search works well on the web, because the web is so big no-one has had time to build an easy way to browse it, browse works well on smaller structured collections.
Search can run across vast collections, browse needs to be offered at human-readable scales.
Search does not usually give an indication of the size or scope of a collection, browse can be designed to indicate scale.
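For IT colleagues who prefer code, the distinctions above can be sketched over one tiny collection. This is a simplified illustration in Python with invented records and a deliberately naive keyword search; real search and browse systems are far richer.

```python
from collections import defaultdict

# Hypothetical archive records, invented for illustration.
COLLECTION = [
    {"title": "Tower Bridge at dawn", "category": "Landmarks"},
    {"title": "Golden Gate Bridge in fog", "category": "Landmarks"},
    {"title": "Tiger in long grass", "category": "Wildlife"},
]


def search(query, items):
    """Return titles containing every query word; the user must guess the words."""
    terms = query.lower().split()
    return [
        item["title"]
        for item in items
        if all(term in item["title"].lower().split() for term in terms)
    ]


def browse(items):
    """Group titles by category, showing what exists and roughly how much."""
    groups = defaultdict(list)
    for item in items:
        groups[item["category"]].append(item["title"])
    return dict(groups)


if __name__ == "__main__":
    print(search("bridge", COLLECTION))
    print(search("big cat", COLLECTION))
    for category, titles in browse(COLLECTION).items():
        print(f"{category} ({len(titles)})")
```

A search for "big cat" comes back empty even though there is a tiger in the collection, and gives no hint of what else is there. The browse view needs no query at all and shows both the categories and their sizes.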

Similarities between search and browse

Search and browse are both ways of finding content.
Search and browse can both be configured in a huge variety of ways.
Search and browse both have many different mechanisms and implementations.
Search and browse should both be tailored to users’ needs.
Search and browse systems both require thought and editorial judgement in their creation so that they work effectively for any particular collection.
Search and browse systems can often both be created largely automatically.
Search and browse often both involve metadata.
Search and browse behaviours may be intertwined, with users switching from one to the other.
Search and browse may be used by the same users for different tasks at different times.
Search and browse both offer serendipity, although serendipitous opportunities are often hidden by interface design.

Should I offer my users search or browse?

Almost always, you should offer both, unless you are very sure that your users will always be performing the same kind of task and have the same level of familiarity with your content. With small static collections of content, it may not matter too much, but for most content collections users will probably want both. Which you make your main focus depends on the context and collection.

Shops might have lots of images and very little text, so a beautifully designed navigation system will help customers find – and buy – products they might not know about, while only a simple search system might be needed to cover searches for product names. A library will need to support lots of searches for titles and across catalogue text with a good search system, but will also need to help educate and inform users with a clear user-friendly browsable navigation system. A large incoherent collection of unstructured text with no particular purpose is likely to be difficult to navigate no matter what you design, so will need good search, but – apart from the web itself – such unbounded and unmanaged collections tend to be quite unusual.
