Category Archives: KO

KO

Building Enterprise Taxonomies – Book Review

    Start a conversation 
Estimated reading time 2–2 minutes

There aren’t many books on taxonomies, so it is good to have another on the shelf. Darin L Stewart‘s book is based on a series of lectures and provides a good introduction to key topics. As a format, that means you can pick the sections that are relevant to you. It has a very American student textbook tone, with pop quotes and definitions of key concepts in information science (e.g. precision and recall), but that doesn’t mean it isn’t a useful refresher for professionals. I particularly enjoyed the sections on XML, RDF, and ontologies as most of the coverage of these topics is either highly technical or very abstract. As the title suggests, it has a very corporate focus and so doesn’t really cover scientific taxonomies or library classifications.

The chapters introduce the concept of findability, cover the basics of metadata, types of taxonomies, how to go about developing a taxonomy and performing a content audit, general guidance on choosing terms and structures, some of the technical issues – introducing XML, XSLT, RDF, and OWL, and summarising ontologies and folksonomies.

I found a few typos and a few places slightly odd – for example I found the use of “Google whacking” to illustrate “teleporting” confusing and the descriptions of how to go about taxonomy work to be a little prescriptive. However, textbooks have to simplify the world in order to provide students with a starting point. Overall the book covers a good range of topics and concepts and is a light but informative read.

The Organizational Digital Divide

    Start a conversation 
Estimated reading time 2–2 minutes

Catching up on my reading, I found this post by Jonah Bossewitch: Pick a Corpus, Any Corpus and was particularly struck by his clear articulation of the growing information gulf between organizations and individuals.

I have since been thinking about the contrast between our localised knowledge organization systems and the semantic super-trawlers of the information oceans that are only affordable – let alone accessible – to the megawealthy. It is hard not to see this as a huge disempowerment of ordinary people, swamping the democratizing promise of the web as a connector of individuals. The theme has also cropped up in KIDMM discussions about the fragmentation of the information professions. The problem goes far beyond the familiar digital divide, beyond just keeping our personal data safe, to how we can render such meta-industrial scale technologies open for ordinary people to use. Perhaps we need public data mines to replace public libraries? It seems particularly bad timing that our public institutions – our libraries and universities – are under political and financial attack just at the point when we need them to be at the technological (and expensive) cutting edge.

We rely on scientists and experts to advise us on how to use, store and transport potentially hazardous but generally useful chemicals, radioactive substances, even weapons, and information professionals need to step up to the challenges of handling our new potentially hazardous data and data analysis tools and systems. I am reassured that there are smart people like Jonah rising to the call, but we all need to engage with the issues.

KO

Review of The Accidental Taxonomist

    Start a conversation 
Estimated reading time 1–2 minutes

Rather late to the party on this one, but I finally got around to reading The Accidental Taxonomist by Heather Hedden. I have to confess to bias as I was very pleased to see that my FUMSI article on folksonomies was mentioned in the recommended reading section. Written in a clear, sensible and readable tone, Hedden gives a very thorough overview of practical taxonomy work. The book works as a textbook, but reads pleasantly and although I anticipate referring to it as a reference resource, I enjoyed reading it chapter by chapter.

I am very pleased to have so much practical and useful information in one place (for example lists of relevant standards, definitions of taxonomies, ontologies, and thesauri, the functions of taxonomies) as in my day-to-day work, I often have to explain the basics to people. I have been recommending the book to my team, especially those who are new to taxonomies, and they have appreciated its clarity and comprehensiveness as a “field guide”. It covered familiar ground, but for me much of that was my “tacit knowledge” that I had never fully articulated to myself, so I am sure that this “knowledge capture” from the mind of an experienced taxonomy practitioner will be very useful.

UK Archives Discovery Forum

    1 comment 
Estimated reading time 6–10 minutes

I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although not as detailed as the talk he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay who invented the screw-cutting lathe, which enabled standardisation of nuts and bolts. This basically enabled the industrial revolution to happen. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together but at the moment, all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easy to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! Always an engaging speaker, his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound to the way we do business than the changes brought about by the web itself. Chatting to him over tea, he has the impression that a year ago people were curious about Linked Data and just wanted to find out what it could do, but this year they are now feeling a bit more comfortable with the concepts and are starting to ask about how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped, but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories Project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and they still take delivery of data on drives through the post. They found this was much easier to manage than ftp or other electronic transfer, just because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySql as processing formats. They found that MySql couldn’t handle the volumes, and RDF they found too “verbose”. They chose to use a fully Lucene solution, which enabled them to bolt in new indexes, rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.

Historypin

Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos, and soon also audio and video, and set them in Google streetview. Although flickr has some similar functions, historypin has volunteers who help to place the image in exactly the right place, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made the distinction between subjective and objective metadata, with his definition being objective metadata is metadata that can be corrected and subjective metadata is data that can’t. So, he sees objective metadata as the time and the place that a photo was taken – if it is wrong someone might know better and be able to correct it, and subjective metadata as the stories, comments, and opinions that people have about the content, which others cannot correct – if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more use for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, etc. and he said these are important issues they are striving to work through.

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession, pointing out that it is an achievement that archives are still running despite the cuts, and that this shows how valued data and archives are in the national infrastructure, how important recording our history is, and that while archivists continue to value their collections, enjoy their visitors and users, and continue to want their data to reach a wider audience the sector will continue to progress. She described how federating the Genesis project within the Archives hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding typically is allocated on visitor numbers through the doors, and that this puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.

Serendipity and large video collections

    Start a conversation 
Estimated reading time 2–2 minutes

I enjoyed this blog post: On Serendipity. Ironically, it was recommended to me, and I am now recommending it!

Serendipity is rarely of use to the asset manager, who wants to find exactly what they expect to find, but is a delight for the consumer or leisure searcher. People sometimes cite serendipity as a being a reason to abandon classification, but in my experience classification often enhances serendipity and can be lost in simple online search systems.

For example, when browsing an alphabetically ordered collection in print, such as an encyclopedia or dictionary, you just can’t help noticing the entries that sit next to the one you were looking for. This can lead you to all sorts of interesting connections – for example, looking up crescendo, I couldn’t help noticing that crepuscular means relating to twilight, and that there is a connection between crepe paper and the crepes you can eat (from the French for “wrinkly”), but crepinette has a different derivation (from the French for “caul”). What was really interesting was the fact that there was no connection, other than an accident of alphabetical order. I wasn’t interested in things crepuscular, or crepes and crepinettes, and I can’t imagine anyone deliberately modelling connections between all these things as “related concepts”.

Wikipedia’s “random article” function is an attempt to generate serendipity alogrithmically. On other sites the “what people are reading/borrowing/watching now” functions use chronological order to throw out unsought items from a collection in the hope that they will be interesting. Twitter’s “trending topics” use a combination of chronological order and statistics on the assumption that what is popular just now is intrinsically interesting. These techniques look for “interestingness” out of what can be calculated and it is easy to see how they work, but the semantic web enthusiasts aim to open up to automated processing the kind of free associative links that human brains are so good at generating.

Online Information Conference – day three

    Start a conversation 
Estimated reading time 4–6 minutes

Managing content in a mobile world

On Day 3, I decided to try the “Mobile content” track. The speakers were Alan Pelz Sharpe from The Real Story Group, Russell Reeder from Libre Digital and Dave Kellogg from MarkLogic.

Augmented reality is one of my pet topics, so I was pleased to hear the speakers confirm it is all about data and meaning. It is just one aspect of how consumers want more and more data and information presented to them without delay on smaller and simpler devices. However, this means a greater need for metadata and more work for usability specialists.

The whole way people interact with content is very different when they are on the move, so simply trying to re-render websites is not enough. Entire patterns of information needs and user behaviour have to be considered. “A mobile person is not the same as a mobile device…the person needs to be mobile, not necessary the content.” For example, mobile workers often prefer to contact a deskbound researcher and get answers sent to them, not do the research themselves while on the move.

It is not enough just to worry about the technological aspects, or even just the information architecture aspects of versions of content for mobile users. A different editorial style needs to be used for small screens, so content needs to be edited to a very granular level for mobile – no long words!

Users don’t care about formats, so may get a very bad impression of your service if you allow them to access the wrong content. One customer was cited as complaining that they could watch You Tube videos easily on their phone (You Tube transcodes uploads so they are low res and streamable), but another video caused huge problems and took ages (it turned out to be a download of an entire HD feature film).

The session made me feel quite nostalgic, as back in 1999 we spent much time pondering how we would adapt our content to mobile devices. Of course, then we were trying to present everything on tiny text-only screens – 140 characters was seen as a luxury. There is just no comparison with today’s multifunctional multicoloured shiny touch screen devices.

New World of Search – Closing Keynote

I think every conference should include a speaker who rises above day-to-day business concerns and looks at really big pictures. Stephen Arnold outlined the latest trends in search technologies, both open source and proprietary, and how people are now getting better at picking elements from different systems, perhaps combining open source with proprietary options and building modular, integrated systems to prevent lock-in. However, he also talked engagingly about digital identity, privacy, and the balance of power between individuals and large organisations in the digital world.

He reiterated the point that Google (and other search engines) are not free. “Free is not what it seems to the hopeful customer” but that we haven’t yet worked out the value of data and content – let alone information and knowledge – in the light of the digital revolution. Traditional business models do not work and old economic theories no longer apply: “19th century accounting rules don’t work in the 21st century knowledge economy”.

He noted that Facebook has managed to entice users and developers to give up their content, work, time, and intellectual property to a locked-in proprietary walled garden. People have done this willingly and apparently without prompting, enabling Facebook to achieve something that software and content vendors such as IBM and AOL have been trying unsuccessfully to do for decades.

There is no clear way of evaluating the service that Facebook provides against the value of the content that is supplied to it free by users. However, it is clear that it is of huge value in terms of marketing. It is possibly the biggest and most sophisticated marketing database ever built.

As well as content, people are willing to surrender personal information, apparently with minimal concerns for privacy and security. It is not clear what the implications of this are: “What is the true cost of getting people to give up their digital identities?” It is clear that the combined data held by Facebook, Google, Foursquare, mobile phone companies, credit card companies, travel card companies, etc. creates a hugely detailed “digital profile” of human lives. Stephen urged the audience to take seriously the potential for this to be used to cause harm to individuals or society – from cyberstalking to covert government surveillance – because technology is moving so fast that any negative social and political effects may already be irreversible.

Online Information Conference – day two

    1 comment 
Estimated reading time 6–10 minutes

Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle‘s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett‘s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. With a concern to give users more access to and pathways into the collections, Martin Malmsten asked if APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable for online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.

During the panel discussion, provenance and traceability – which the W3C is now focusing on (parametadata) – was discussed and it was noted that allowing other people to link to your data does not destroy your data, and often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which present particular problems for free text searches, but are prime sources for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.
Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kind of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags is proving more useful.

Taxonomies remain key in providing the structure on which autocategorisation and text analytics is based, and so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is vital for constructing a helpful taxonomy and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs taxonomy to underpin it, to base contextualisation rules on. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.

He noted that the volumes of data and metadata that need to processed are growing at a furious rate. He highlighted Comcast as a company that is very highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata life cycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from Biowisdom gave a fascinating and detailed overview of Biowisdom’s use of semantic technologies, in particular ontology-driven concept extraction. Biowisdom handle huge complex databases of information to do with the biological sciences and pharmaceuticals, so face very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and “soft” descriptions of symptoms and side-effects typically given by patients.

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by Italian News agency AGI.