
The Information Master – Louis XIV’s Knowledge Manager


I recently read The Information Master: Jean-Baptiste Colbert’s Secret State Intelligence System by Jacob Soll. It is a very readable but scholarly book that tells the story of how Colbert used the accumulation of knowledge to build a highly efficient administrative system and to promote his own political career. He seems to have been the first person to seize upon the notion of “evidence-based” politics and that knowledge, information and data collection, and scholarship could be used to serve the interests of statecraft. In this way he is an ancestor of much of the thinking that is commonplace not only in today’s political administrations but also in all organizations that value the collection and management of information. The principle sits at the heart of what we mean by the “knowledge economy”.

The grim librarian

Jean-Baptiste Colbert (1619-83) is depicted as ruthless, determined, fierce, and serious. He was an ambitious man and saw his ability to control and organize information as a way of gaining and then keeping political influence. By first persuading the King that an informed leadership was a strong and efficient leadership, and then by being the person who best understood and knew how to use the libraries and resources he collected, Colbert rose to political prominence. However, his work eventually fell victim to the political machinations of his rivals and after his death his collection was scattered.

Using knowledge to serve the state

Before Colbert, the scholarly academic tradition in France had existed independently from the monarchy, but Colbert brought the world of scholarship into the service of the state, believing that all knowledge – even from the most unlikely of sources – had potential value. This is very much in line with modern thinking about Big Data and how that can be used in the service of corporations. Even the most unlikely of sources might contain useful insights into customer preferences or previously unseen supply chain inefficiencies, for example.

Colbert’s career was caught up in the political machinations of the time. He worked as a kind of accountant for Cardinal Mazarin, but when Mazarin’s library was ransacked by political rivals and his librarian fell out of favour, Colbert restored the library and built a unified information system based on the combination of scholarship and administrative documentation, ending the former division between academia and government bureaucracy.

Importance of metadata

Colbert also instinctively grasped the importance of good metadata, cataloguing, and an accurate network of links and cross-references for obtaining relevant and comprehensive information quickly – issues that are more urgent than ever given the information explosion modern organizations, and indeed nations, face. This made him a better administrator than his rivals, and by becoming not only the source of politically expedient information but also the person who knew how to use the information resources most effectively, he was able to gain political influence and become a key minister under Louis XIV.

A personal vision

I was struck by how much his vast library, archive, and document management system was the result of his own personal vision, how it was built on the dismantling and rebuilding of the work of his predecessors, and how, after his death, the system itself fell victim to political changes and was left to collapse. This pattern repeats frequently in modern information projects. So often the work of the original system’s champion is wasted as infighting – often not directly connected to the information project itself – leads to budget cuts, staff changes, or other problems that leave the system to decay.

Soll argues that the loss of Colbert’s system hampered political administration in France for generations. Ironically, it was Colbert’s own archives that enabled successive generations of political rivals to find the documents with which to undermine the power of the crown, showing the double-edged nature of information work. It is often the same collections that can both redeem and condemn.

Secrecy or transparency?

Another theme that ran throughout Colbert’s career, with both political and practical implications, was the tension between demands for transparent government and the desire for a secret state. Much of the distinction between public and private archives was simply a matter of who was in control of them and who had set them up, so the situation in France under the monarchy was different to the situation in England where Parliament and the Monarchy maintained entirely separate information systems. In France, an insistence on keeping government financial records secret eventually undermined trust in the economy. Throughout his career Colbert was involved in debates over which and how much government information should be made public, with different factions arguing over the issue – arguments that are especially resonant today.

Libraries, Media, and the Semantic Web meetup at the BBC


In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year, but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. The videos of the event are available on YouTube.


Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.


Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.


Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.


Dan Brickley’s “compare and contrast” of the Universal Decimal Classification with schema.org was wonderful, and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems long before the invention of computers. He showed an example of “search log analysis” from 1912: queries sent to the Belgian international bibliographic service – an early “semantic question answering service”. The “search terms” were fascinating and not so very different from the sort of things you’d expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.

BBC Olympics online

Silver Oliver described how BBC Future Media is pioneering semantic technologies, using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic, rich websites can be built and kept up to the minute, even as results pour in.

World Service audio archives

Yves Raimond talked about a BBC Research & Development project to automatically index the World Service audio archives. The World Service, having been a separate organisation from the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed through speech-to-text software, and automated indexing was then applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort the good tags from the bad (and occasionally offensive!) ones, but this is still a lot easier than tagging everything from scratch.
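
The workflow described here – the machine suggests tags, a human confirms or rejects them – can be sketched in a few lines. This is purely an illustration, not the actual BBC R&D pipeline: the transcript, the tag vocabulary, the naive word-frequency scoring, and the threshold are all invented for the example.

```python
# Illustrative "machine suggests, human confirms" tagging flow.
# The scoring method, vocabulary, and threshold are invented for the example.

def suggest_tags(transcript, vocabulary):
    """Score each candidate tag by its relative frequency in the transcript."""
    words = transcript.lower().split()
    return {tag: (words.count(tag.lower()) / len(words) if words else 0.0)
            for tag in vocabulary}

def triage(scores, threshold):
    """Auto-accept confident tags; queue uncertain ones for human review."""
    accepted = [t for t, s in scores.items() if s >= threshold]
    review = [t for t, s in scores.items() if 0 < s < threshold]
    return accepted, review

transcript = "news news news from the world service world service"
scores = suggest_tags(transcript, ["news", "world", "sport"])
accepted, review = triage(scores, threshold=0.25)
```

The point of the sketch is the triage step: with roughly 70% accuracy, routing only the uncertain suggestions to humans is far cheaper than tagging everything from scratch.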

Online Information Conference 2010


Despite the recession, tube strikes, and snow, there was a fine collection of speakers, exhibitors, and delegates at a smaller than usual Online Information Conference and Exhibition this year.

Librarians seem to be getting heavily into Linked Data, while the corporate sector is still mainly concerned with business intelligence and search.

On day one I enjoyed the practical explanations of how Linked Data principles have been made to work at The Guardian, The Press Association, the Dutch Parliament, and the ALISS health project in Scotland.

Linked Data tags are a form of metadata that can be used to automatically generate content aggregations for web pages. This means that not only can you re-use your own content, increasing its lifespan, but you can also gather cheap content that is openly available online. This is very familiar territory to me, as we used to build products in the same way back in the 90s, the difference being that we didn’t have much of an external web to link to back then. In the meantime, using a linkable, interoperable format for your tags has very many benefits, and whether your focus is primarily on content within or beyond a firewall, the arguments for using standards that have the potential to link to the wider world seem very compelling. I can’t see any logical reason not to standardise the format your metadata is held in (technical and practical issues are another matter), although standardising the semantic content of the metadata is a far more difficult problem.
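
The aggregation idea can be sketched very simply: if every piece of content carries tags that are shared, linkable identifiers (URIs) rather than free-text strings, a topic page can be assembled automatically from everything that shares a tag, whether the content is your own or gathered from outside. A minimal sketch – the URIs, titles, and sources below are all invented for the example:

```python
# Illustrative sketch: tags as shared URIs enable automatic aggregation
# across sources. All URIs and items are invented for the example.

catalogue = [
    {"title": "Budget analysis", "source": "in-house",
     "tags": {"http://example.org/id/economy"}},
    {"title": "Open data on GDP", "source": "external",
     "tags": {"http://example.org/id/economy"}},
    {"title": "Election diary", "source": "in-house",
     "tags": {"http://example.org/id/politics"}},
]

def aggregate(tag_uri, items):
    """Assemble a topic-page aggregation from everything sharing a tag URI."""
    return [item["title"] for item in items if tag_uri in item["tags"]]

economy_page = aggregate("http://example.org/id/economy", catalogue)
```

Because the tag is a URI rather than a local string, the same lookup works across a firewall boundary: external openly licensed content that uses the same identifier joins the aggregation for free.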

It was reassuring to hear that everyone else is struggling with the problems of who mints IDs and URIs, who settles arguments about what exactly the IDs refer to – especially across domains – and who resolves and manages mappings. Such issues are difficult to resolve within a firewall; out on the open Web they become vast. The W3C is starting to work on provenance standards (the parametadata or meta-metadata), a pet worry of mine, because I am certain we need to get that layer of semantic information into our tags as soon as possible if we are going to advance the semantic web beyond crunching databases together.
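
The provenance worry amounts to attaching metadata to the metadata itself: recording who asserted a tag, when, and on what authority, so a consumer can decide how much to trust it. A minimal sketch of the idea – the field names and values are invented, and the real W3C provenance model is far richer than this:

```python
# Illustrative sketch of "parametadata": a tag assertion that carries
# its own provenance. Field names and values are invented for the example.
from dataclasses import dataclass

@dataclass
class TagAssertion:
    subject: str      # the thing being tagged
    tag_uri: str      # the tag itself, as a minted URI
    asserted_by: str  # who applied the tag
    asserted_on: str  # when it was applied
    basis: str        # on what authority (editorial, automated, ...)

assertion = TagAssertion(
    subject="http://example.org/doc/42",
    tag_uri="http://example.org/id/economy",
    asserted_by="press-association",
    asserted_on="2010-12-01",
    basis="editorial",
)
```

With that layer in place, a mapping or aggregation service can, for instance, prefer editorially asserted tags over automated ones when two sources disagree.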

In the meantime, Linked Data is working very well especially for mashups and information visualisations. I particularly liked the Dutch Parliament’s “Attaquograms” – a diagram showing how often MPs were interrupted in debates and how much they interrupted others, although it doesn’t appear to have changed their behaviour yet. I also enjoyed The Guardian’s “league tables” of MPs’ performance. When MPs protested that such analyses ignored qualitative issues, The Guardian pointed out that if MPs advocate such data crunching as a way to judge schools and hospitals, then it must be good enough to apply to MPs themselves.

Andy Hyde from the ALISS project is working on ways to use Linked Data to help people manage their health, especially for patients with long term conditions such as diabetes. He stressed the importance of involving the users in any information project and addressing them on their terms, stating “The most collaborative tool we have is the cup of tea”.

My only concern about using Linked Data to pull in Wikipedia content is whether the audience will start recognising it. If every website that mentions a topic has the same Wikipedia content attached to it, won’t people get bored? Perhaps there are so many people out there contributing, so many varieties of algorithmic aggregation, and so much content to read, that it will never happen!

There is a related Guardian technology blog post.

I will post summaries of days two and three shortly.

Web archiving


I went to an excellent Anglo-French scientific discussion seminar on web archiving on Friday at the Institut Français Cultural Centre in London. The speakers were Gildas Illien of the Bibliothèque Nationale de France (BnF) (Paris) and Dr Stephen Bury of the British Library (BL).

Gildas Illien described the web archiving project being undertaken by the BnF, which uses the Heritrix open source crawler to harvest the web from “seeds” (starting URLs). The crawler was charmingly illustrated with a picture of a “robot” (as people like to be able to see the “robot”), but the robot is a bit stupid: it sometimes misses things out, and sometimes falls into traps and collects the same thing over and over again. It also generates a great deal of material for the librarians to assess. Other problems include the short lifespan of websites – one figure puts this at only 44 days (although whether that refers to sites disappearing altogether or just changing through updates wasn’t clear) – and the “twilight zone” between what is public and what is private. In France the Legal Deposit Act was extended in 2006 to cover the web, so the BnF can collect any French website it wants without having to ask permission. However, librarians still have to choose whether to try to collect everything or only sites that are noteworthy in some way. It is also hard to guess who the future users of the archive will be and what sort of sites they will want to access.
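
The “robot falling into traps and collecting the same thing over and over” problem is, at its simplest, a matter of remembering what has already been fetched. A toy sketch of a seed-based crawl loop – this is not Heritrix itself, and the hard-coded “web” stands in for real fetching and link extraction:

```python
# Toy sketch of seed-based harvesting with a visited set, so the crawler
# does not re-collect pages it has already harvested. The dict below is
# a stand-in for real fetching and link extraction.
web = {
    "http://example.fr/":  ["http://example.fr/a", "http://example.fr/b"],
    "http://example.fr/a": ["http://example.fr/"],    # link back: a loop
    "http://example.fr/b": ["http://example.fr/b"],   # self-link: a trap
}

def crawl(seeds, max_pages=100):
    visited, frontier = set(), list(seeds)
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue          # already harvested: skip, don't collect again
        visited.add(url)
        frontier.extend(web.get(url, []))
    return visited

harvested = crawl(["http://example.fr/"])
```

Real crawlers face subtler traps (calendars, session IDs, near-duplicate pages) that a plain visited set cannot catch, which is part of why the harvest still needs librarians to assess it.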

So far some 130 terabytes of data have been collected, and some 12 billion files stored.

Harvesting is done in three stages: bulk harvesting once a year; focused crawls of specific sites; and collections of e-deposits (such as e-books) directly from publishers. Some sites are harvested only occasionally – the website of the Festival de Cannes, for example, only needs to be collected once a year – while newspaper sites are collected more frequently.

The archive can be searched by URL or by text, although the text search is rudimentary at present.

Classification is another challenge, as traditional library classifications are not appropriate for much web content. For example, election campaign websites were classified by what the politicians were saying about themselves and by what the public were saying about them, as this was thought to be a useful distinction.

However, the problems of how to provide full and useful access to the collection and how to catalogue it properly remain unresolved.

The process was an interesting merging of traditional library skills and software engineering skills, with some stages clearly being one or the other, but a number of stages falling “midway” and requiring a cross-skilled approach.

Dr Stephen Bury explained that the BL is something of a latecomer to web archiving, with the BnF, the Internet Archive, and the national libraries of Sweden and Australia all having more advanced web archiving programmes. This is partly due to the state of UK legal deposit law, which has not yet been extended to include websites.

Just as there are many books about books and books about libraries, so there are many websites about the web. It is a very self-referential medium. However, there is a paradox in the BL’s current programme. Because the BL has to seek permission to collect each and every site, it may collect sites that it cannot then provide access to at all, and it cannot provide any access to sites except to readers in its reading rooms. To be able to collect the web but then not to be able to serve it back up to people through the web seems very strange.

Another issue of preservation is that the appearance of websites is browser-dependent, so a site may not look the same to people using different technology.

It is important that online information is preserved, as websites are now considered authentic sources of information – cited in PhDs, for example – and so some way of verifying that they existed and what content they contained is needed.

Reports have been produced by JISC and the Wellcome Trust (Collecting and Preserving the World Wide Web, 2002) and by Andrew Charlesworth (Legal issues relating to the archiving of Internet resources in the UK, EU, USA and Australia, 2002).

The BL undertook a Domain UK project to establish what the scope of a web archiving project might be, using the Australian PANDAS software. The UK Web Archiving Consortium (UKWAC) was set up in 2003, but the need to obtain permissions has seriously limited its scope: most website owners simply do not respond to permissions requests (very few actively refuse permission), presumably dismissing them as spam.

The data has now been migrated from the PANDAS format to WARC and an access tool is in development. There are some 6 million UK websites, growing at a rate of 16% per year, and they are also growing in size (on average they are about 25MB, increasing at a rate of 5% per year).
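
Those two growth rates compound: site count grows at 16% a year and average size at 5% a year, so total volume grows by a factor of about 1.16 × 1.05 ≈ 1.22 annually. A quick back-of-the-envelope projection from the figures given (using decimal units, 1TB = 1,000,000MB):

```python
# Projection from the figures in the talk: 6 million UK sites at ~25MB
# each, counts growing 16%/yr and average size growing 5%/yr.
sites = 6_000_000
avg_mb = 25.0

total_tb = sites * avg_mb / 1_000_000   # one full snapshot today, in TB
growth = 1.16 * 1.05                    # combined annual growth factor

five_year_tb = total_tb * growth ** 5   # a single snapshot in five years
```

A single snapshot is roughly 150TB today and would be around 400TB in five years – and that is before keeping multiple snapshots per year, which is why scalability dominates the planning.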

Decisions have to be made on frequency of collection, depth of collection, and quality. There are other peripheral legal issues, such as sites that fall under terrorism-related legislation. At present the BL can collect these sites but not provide access to them.

Resource discovery remains a major challenge, including how to combine cataloguing and search engine technology. So far, a thematic approach to organisation has been taken. Scalability is also a big issue. What works for a few thousand sites will not necessarily work for a few million.

This means that the nature of the “collecting institution” is changing. It is much harder to decide if a site is in or out of scope. A site may have parts that are clearly in scope and parts that clearly aren’t or it may change through time, sometimes being in scope and sometimes not.

The Digital Lives Project in association with UCL and the University of Bristol is looking at how the web is becoming an everyday part of our social and personal lifestyles.

The talks were followed by a question and answer session. I asked for more detail about the “twilight zone” of public and private websites. Both speakers agreed that there is a great need for more education on digital awareness, so that young people appreciate that putting things up on the Internet really is a form of publishing, and that their blogs and comments in public forums are not just private “chats” with friends. In France, however, there has been little resistance to such personal material being collected. Most people are proud to have their sites considered part of the national heritage, and the BnF has done a lot of outreach work to explain the aims of the archive and discuss any concerns. Gildas Illien also pointed out that people do not necessarily have “the right to be forgotten”, and that this is not in fact new. It has happened in the past that people have asked for books and other information to be removed from libraries, perhaps because they had changed their political viewpoint, and a library would not simply remove a book from its shelves because the author had changed their mind about something in it.

There is a recent interview with Gildas Illien (in French) on YouTube called L’archivage d’Internet, un défi pour les bibliothécaires.

Essential Classification


Here’s a review of Essential Classification by Vanda Broughton, a core Library Studies textbook and a very easy read. It’s a sound introduction to classification – very practical and really aimed at trainee librarians, but it includes enough background and theory to keep me interested, including some pointers to the biases in the big classification systems. I was also intrigued by the assertion that people find it easier to remember numbers, so numerical shelfmarks are generally more popular than those based on letters. I always thought it was easier to remember letters, because you can make them into little phrases, but perhaps that’s just me!