Category Archives: search

UK Archives Discovery Forum

    1 comment 
Estimated reading time 6–10 minutes

I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although not as detailed as the talk he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay who invented the screw-cutting lathe, which enabled standardisation of nuts and bolts. This basically enabled the industrial revolution to happen. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together but at the moment, all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easy to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! Always an engaging speaker, his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound to the way we do business than the changes brought about by the web itself. Chatting to him over tea, he has the impression that a year ago people were curious about Linked Data and just wanted to find out what it could do, but this year they are now feeling a bit more comfortable with the concepts and are starting to ask about how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped, but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories Project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and they still take delivery of data on drives through the post. They found this was much easier to manage than ftp or other electronic transfer, just because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySql as processing formats. They found that MySql couldn’t handle the volumes, and RDF they found too “verbose”. They chose to use a fully Lucene solution, which enabled them to bolt in new indexes, rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.

Historypin

Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos, and soon also audio and video, and set them in Google streetview. Although flickr has some similar functions, historypin has volunteers who help to place the image in exactly the right place, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made the distinction between subjective and objective metadata, with his definition being objective metadata is metadata that can be corrected and subjective metadata is data that can’t. So, he sees objective metadata as the time and the place that a photo was taken – if it is wrong someone might know better and be able to correct it, and subjective metadata as the stories, comments, and opinions that people have about the content, which others cannot correct – if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more use for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, etc. and he said these are important issues they are striving to work through.

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession, pointing out that it is an achievement that archives are still running despite the cuts, and that this shows how valued data and archives are in the national infrastructure, how important recording our history is, and that while archivists continue to value their collections, enjoy their visitors and users, and continue to want their data to reach a wider audience the sector will continue to progress. She described how federating the Genesis project within the Archives hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding typically is allocated on visitor numbers through the doors, and that this puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.

Online Information Conference – day three

    Start a conversation 
Estimated reading time 4–6 minutes

Managing content in a mobile world

On Day 3, I decided to try the “Mobile content” track. The speakers were Alan Pelz Sharpe from The Real Story Group, Russell Reeder from Libre Digital and Dave Kellogg from MarkLogic.

Augmented reality is one of my pet topics, so I was pleased to hear the speakers confirm it is all about data and meaning. It is just one aspect of how consumers want more and more data and information presented to them without delay on smaller and simpler devices. However, this means a greater need for metadata and more work for usability specialists.

The whole way people interact with content is very different when they are on the move, so simply trying to re-render websites is not enough. Entire patterns of information needs and user behaviour have to be considered. “A mobile person is not the same as a mobile device…the person needs to be mobile, not necessary the content.” For example, mobile workers often prefer to contact a deskbound researcher and get answers sent to them, not do the research themselves while on the move.

It is not enough just to worry about the technological aspects, or even just the information architecture aspects of versions of content for mobile users. A different editorial style needs to be used for small screens, so content needs to be edited to a very granular level for mobile – no long words!

Users don’t care about formats, so may get a very bad impression of your service if you allow them to access the wrong content. One customer was cited as complaining that they could watch You Tube videos easily on their phone (You Tube transcodes uploads so they are low res and streamable), but another video caused huge problems and took ages (it turned out to be a download of an entire HD feature film).

The session made me feel quite nostalgic, as back in 1999 we spent much time pondering how we would adapt our content to mobile devices. Of course, then we were trying to present everything on tiny text-only screens – 140 characters was seen as a luxury. There is just no comparison with today’s multifunctional multicoloured shiny touch screen devices.

New World of Search – Closing Keynote

I think every conference should include a speaker who rises above day-to-day business concerns and looks at really big pictures. Stephen Arnold outlined the latest trends in search technologies, both open source and proprietary, and how people are now getting better at picking elements from different systems, perhaps combining open source with proprietary options and building modular, integrated systems to prevent lock-in. However, he also talked engagingly about digital identity, privacy, and the balance of power between individuals and large organisations in the digital world.

He reiterated the point that Google (and other search engines) are not free. “Free is not what it seems to the hopeful customer” but that we haven’t yet worked out the value of data and content – let alone information and knowledge – in the light of the digital revolution. Traditional business models do not work and old economic theories no longer apply: “19th century accounting rules don’t work in the 21st century knowledge economy”.

He noted that Facebook has managed to entice users and developers to give up their content, work, time, and intellectual property to a locked-in proprietary walled garden. People have done this willingly and apparently without prompting, enabling Facebook to achieve something that software and content vendors such as IBM and AOL have been trying unsuccessfully to do for decades.

There is no clear way of evaluating the service that Facebook provides against the value of the content that is supplied to it free by users. However, it is clear that it is of huge value in terms of marketing. It is possibly the biggest and most sophisticated marketing database ever built.

As well as content, people are willing to surrender personal information, apparently with minimal concerns for privacy and security. It is not clear what the implications of this are: “What is the true cost of getting people to give up their digital identities?” It is clear that the combined data held by Facebook, Google, Foursquare, mobile phone companies, credit card companies, travel card companies, etc. creates a hugely detailed “digital profile” of human lives. Stephen urged the audience to take seriously the potential for this to be used to cause harm to individuals or society – from cyberstalking to covert government surveillance – because technology is moving so fast that any negative social and political effects may already be irreversible.

Online Information Conference – day two

    1 comment 
Estimated reading time 6–10 minutes

Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle‘s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett‘s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. With a concern to give users more access to and pathways into the collections, Martin Malmsten asked if APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable for online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.

During the panel discussion, provenance and traceability – which the W3C is now focusing on (parametadata) – was discussed and it was noted that allowing other people to link to your data does not destroy your data, and often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which present particular problems for free text searches, but are prime sources for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.
Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kind of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags is proving more useful.

Taxonomies remain key in providing the structure on which autocategorisation and text analytics is based, and so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is vital for constructing a helpful taxonomy and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs taxonomy to underpin it, to base contextualisation rules on. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.

He noted that the volumes of data and metadata that need to processed are growing at a furious rate. He highlighted Comcast as a company that is very highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata life cycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from Biowisdom gave a fascinating and detailed overview of Biowisdom’s use of semantic technologies, in particular ontology-driven concept extraction. Biowisdom handle huge complex databases of information to do with the biological sciences and pharmaceuticals, so face very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and “soft” descriptions of symptoms and side-effects typically given by patients.

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by Italian News agency AGI.

Augmented reality

    Start a conversation 
Estimated reading time 3–5 minutes

I went to a British Computer Society talk on Augmented Reality a few weeks ago. The BCS audience is typically highly technical, but the talks themselves are always accessible and entertaining. People often wonder why I am interested in augmented reality, because they assume it has nothing to do with information, but to me it is all about information. I would love to be able to serve up archive content to someone’s mobile phone using location data – a clip of a scene from an episode of their favourite programme that was filmed in that location, or an old news report about an event that took place there. Managing vast data sets containing huge amounts of content in a searchable form will form the backbone of many augmented reality tools and applications. If this isn’t an area that information scientists should be exploring, I don’t know what is!

The speakers were Professor Anthony Steed, Head of Virtual Environments and Computer Graphics at UCL, and Lester Madden, founder and director of Augmented Planet.

They explained the difference between visual search, true augmented reality, and virtual reality. Visual search is using an image as a search term (as in Google Goggles) and then returning results. Because this can be done via a camera, the image can be one that is in the searcher’s immediate environment, and the results can be returned as an soverlay on the original image. True augmented reality is not just adding graphics to an unrelated camera feed, but is responsive to the real surroundings. Virtual reality is an entirely computer-generated environment.

3-D models of the world are being built, but keeping them up to date is proving a challenge, and crowdsourcing may be the only pragmatic option. Another technical challenge suggested was how to render the augmentation visually indistinguishable from “real” vision, which raises all sorts of interesting philosophical and ethical questions about how we handle the behaviour of people who become confused or cease to be able to tell the difference, either temporarily or permanently. At the moment, augmented reality is quite distinct from virtual reality, but eventually the two will presumably meet. However, nobody seems to think that is likely anytime soon.

In the meantime, there was a rather lovely video of an augmented reality audience, designed to help people who have difficulty speaking in public. Apparently, this is a particular problem for those people in the software industry who are not natural extroverts but find that their careers can only advance if they get out from behind the screen and start talking at conferences, trade shows, etc., where audiences can be quite hostile. University students are hopeless at pretending to be a hostile audience – they are too polite, apparently (this week’s events notwithstanding!) – and actors are too expensive. Avatars, however, can be programmed to look bored, chat on their mobiles, get up and walk out, etc., and real people tend to have similar emotional reactions to the behaviour of avatars as they do to other humans, making an augmented reality theatre a perfect place for practising speaking and building confidence.

Augmented reality is also finding practical applications in the construction industry, to create visualisations of buildings before they are constructed, in medicine to help surgeons, and for improved videoconferencing. There are also many ways that augmented reality can be used to sell things – show me information about the restaurant or shoe shop in this street. Amusingly, identifying unique buildings is quite easy, but for the branded chains disambiguation is proving a challenge – their outlets look the same in every town – which brings us back again to familiar information science territory.

There is also a BCS blog post about the event.

Re-intermediating research

    Start a conversation 
Estimated reading time 2–2 minutes

A fine example of how much inspiration you can get from randomly talking to the people who are actually engaging with customers was given to me by our Research Guide last week.

She wants a video-tagging tool that includes chat functionality, some kind of interactive “pointing” facility, and plenty of metadata fields for adding and describing tags. When she is helping a customer to find the perfect bit of footage, she often finds herself in quite detailed discussions trying to explain why she thinks a shot meets their needs or in trying to understand what it is they don’t like about a particular scene. If they could both view the same footage in real time linked by some sort of online meeting functionality, they would be able to show each other what they meant and discuss and explain requirements far more easily and precisely.

This struck me as exactly how we should as information professionals be seizing new technologies to “re-intermediate” ourselves into the search process. Discussing bits of video footage is a particularly rich example, but what if an expert information professional could have a look at your search results and give you guidance via a little instant chat window? You could call up a real person to help you when you needed it without leaving your desk, in just the same way that online tech support chats work (I’ve had mixed experiences with those, but the principle is sound). I’m thinking especially of corporate settings, but wouldn’t it be a fantastic service for public libraries to offer?

It seems such a good idea I can’t believe it’s not already being done and would be very pleased to hear from anyone out there who is offering those sorts of services and in particular if there are any tools that support real time remote discussion around audio visual research.

Google is not perfect

    9 comments 
Estimated reading time 6–10 minutes

Perhaps I am starting to suffer from “deformation professionelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.

The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things – just like Google – they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.

Google is excellent when searching text for something specific and known – pdf of a tube map of London, “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.

Of course most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, think up lots of alternative words if the first ones you tried didn’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source (if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, rather than the most up-to-date analysis to help you to decide whether or not to invest a large sum in a trading company). The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.

Disambiguation
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.; “budget” and “Spain” as in the Spanish economy, not budget holidays in Spain). This sort of search can produce thousands, if not millions of irrelevant results, so classification that can provide disambiguation – sorting Spanish holiday pages from Spanish economy pages – has real value in terms of saved time. This is why enterprise search solutions – where employees’ wasted time is an expense to the company – offer classification as a fundamental aspect of the service. This is why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, the planet, etc., depression in economics, meteorology, geography, psychiatry, etc., and is why Wikipedia’s disambiguation pages are so useful.

Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (what’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? are their any laws about importing pet parrots from France? what was that sad music I heard on the radio last night?).

It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search so making it more likely you will find what you need (short stories >science fiction immediately means you are not searching the whole of literature, a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).

If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.

Aboutness
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.

Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)

Complex queries
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both their systems depend on highly crafted classifications (taxonomies and ontologies). +Google Squared is Google’s own version.

Comprehensiveness
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its Page Rank is mainly based on popularity, and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.

Sound and vision
Google still is a bit patchy in still image, video, and audio search. Technologies are improving all the time, but we still have to be patient. Most still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.

Summary
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.

The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia, they aren’t likely to disappear overnight.

The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?) shows that even the master of the free text search recognises more can be done.

Public knowledge, private ignorance

    Start a conversation 
Estimated reading time 3–4 minutes

Public knowledge, private ignorance by Patrick Wilson is one of those fascinating books that reads as if it had been written yesterday, but in fact was written in 1977. In what struck me as such a contemporary theme, he discusses the importance of personal contacts and trusted authorities as sources of knowledge – a theme that has returned with a vengeance in the form of “social search” and leveraging social networks for recommendations etc. A wonderful example of this was given to me by a friend last night who told me about how their archives division suddenly gained recognition when the new media lot realised they needed metadata for their rapidly growing digital repositories. The new media folks were in a panic until the talked to the librarians and archivists and realised they had already worked out – and been assiduously cataloguing all the digital assets. Without that personal contact, the new media folk might have ended up building their own catalogue, and duplicating all that work! It’s a sadly familiar story even in these days of information abundance when you’d think such communication would be easy, but even the problem of how to make sure “public” knowledge is actually used is nothing new. Wilson quotes Lord Rayleigh at a meeting of the British Association of the Advancement of Science in1884 (yes – eighteen eighty four – it’s not a typo) as noting how much scientific knowledge was published but unread, saying “It is often forgotten that the rediscovery in the library may be a more difficult and uncertain process than the first discovery in the laboratory”. Wilson points out that “knowledge existing only ‘in the literature’ is no different from knowledge possessed by undiscoverable or inaccessible individuals”.
Wilson also claims that “where there is knowledge, there must be a knower”, which struck me as a challenge to notions that publishing alone – whether it be tweets or academic research papers – is only half the battle. This reinforces to me the absolute fundamentalness of findability and serving the needs of the user both within individual publications and across the whole of our “public” digital repositories. Again, this is nothing new. Way back in the 19th century Charles Cutter was anxious to serve library users better, Grace Kelley (disambiguated from the other one by the extra “e”!) in 1937 was what we would now refer to as a “usability evangelist” and both Ranganathan (1959) and Bliss (1935) had a passion for getting the right information out to the right people.
As Computer Science from the 1970s took over much of information retrieval and then with commercial products being heavily marketed, I worry that this sort of passion has been lost in a blur of what you can get an algorithm to do, rather than what people actually need. As Wilson says “we do not make knowledge available simply by making available documents in which knowledge is represented”. Google is wonderful, but that is, essentially, all it does. Of course the more sophisticated programmatic tools we have the better, but as information providers we should never be afraid to say “this gets us so far, but not far enough”. We need to keep reminding everyone that it is the minds of the knowers and potential knowers we need to be serving and so we should not be afraid to keep demanding ever more sophisticated systems that are mixed, variable, and downright difficult to automate.