I recently enjoyed reading about the work Safari are currently doing to create a controlled vocabulary and topic aggregation pages to underpin navigation and discovery of their content.

Iterate, again

I very much liked the mix of manual and automated techniques the team used to capture as much value as possible from existing resources, while using machine processing to support the human editorial curation work. Lightweight iterative approaches have become standard in some areas of design, but achieving high quality information structures also usually requires several stages of revision and refinement. It is not always possible to predict what will happen in attempts to index or repurpose existing content, nor how users will respond to different information structures, and so the ability to iterate, correct, re-index, correct, adjust indexing methods, re-index, correct… is vital. Small samples of content are often not sufficient to reveal all potential issues or challenges, so it is always worth being prepared for surprises once you scale up.

Basics, as always

The Safari team identified the huge intellectual value locked into the existing human-created indexes, and it is great to see them extracting some of that value and then augmenting it using automated techniques. I was very interested to read that the level of granularity in the individual indexes was too fine for overall aggregation. The team realised that there were “missing subtopics” – key topics that tended to be the subjects of entire books. These “missing subtopics” were found at the level of book titles, and it struck me that this vital level of conceptualization aligns directly with Eleanor Rosch’s work on basic categories and prototype theory. It is not surprising that the concepts that are “basic categories” to the likely readership would be found at book title level, rather than index level.

This is further illustrated by the fact that the very broad high level topics such as “business” did not work well either. These needed to be broken down, not “clustered up”, and refined to the level of the “basic categories” that people naturally think of first.
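To make the idea of looking for “missing subtopics” at title level concrete, here is a minimal Python sketch. It is my own illustration, not the Safari team’s method: the sample books, the stopword list and the `title_level_topics` function are all hypothetical. It simply looks for words that recur across book titles but never appear as entries in any back-of-book index.

```python
from collections import Counter

# Hypothetical toy data: each book has a title and some back-of-book index terms.
books = [
    {"title": "Learning Python", "index": ["decorators", "generators", "list comprehensions"]},
    {"title": "Python Cookbook", "index": ["decorators", "iterators", "metaclasses"]},
    {"title": "JavaScript Patterns", "index": ["closures", "prototypes", "hoisting"]},
    {"title": "Fluent Python", "index": ["generators", "metaclasses", "descriptors"]},
]

# Hypothetical stopword list: title words that do not name a topic.
STOPWORDS = {"learning", "cookbook", "patterns", "fluent", "the", "a", "of"}


def title_level_topics(books, min_books=2):
    """Return title words shared by at least `min_books` titles
    that do not appear as an entry in any book's index."""
    title_counts = Counter()
    index_terms = set()
    for book in books:
        # Count each word once per title, ignoring case and stopwords.
        words = {w.lower() for w in book["title"].split()} - STOPWORDS
        title_counts.update(words)
        index_terms.update(term.lower() for term in book["index"])
    return sorted(
        word
        for word, count in title_counts.items()
        if count >= min_books and word not in index_terms
    )


candidates = title_level_topics(books)
print(candidates)
```

In this toy data the index entries are all fine-grained, and the only word shared by several titles is the subject the books are actually about. Real content would of course need proper phrase extraction and, as the team found, human editorial review of the candidates.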

So, the Safari team’s work is a very clear illustration of not only how to combine manual and automated techniques but also how to find the “basic categories” that match users’ natural level of thinking about the subject area.