
The Information Master – Louis XIV’s Knowledge Manager


I recently read The Information Master: Jean-Baptiste Colbert’s Secret State Intelligence System by Jacob Soll. It is a very readable but scholarly book that tells the story of how Colbert used the accumulation of knowledge to build a highly efficient administrative system and to promote his own political career. He seems to have been the first person to seize upon the notion of “evidence-based” politics – the idea that knowledge, information and data collection, and scholarship could be used to serve the interests of statecraft. In this way he is an ancestor of much of the thinking that is commonplace not only in today’s political administrations but also in all organizations that value the collection and management of information. The principle sits at the heart of what we mean by the “knowledge economy”.

The grim librarian

Jean-Baptiste Colbert (1619-83) is depicted as ruthless, determined, fierce, and serious. He was an ambitious man and saw his ability to control and organize information as a way of gaining and then keeping political influence. By first persuading the King that an informed leadership was a strong and efficient leadership, and then by being the person who best understood and knew how to use the libraries and resources he collected, Colbert rose to political prominence. However, his work eventually fell victim to the political machinations of his rivals and after his death his collection was scattered.

Using knowledge to serve the state

Before Colbert, the scholarly academic tradition in France had existed independently from the monarchy, but Colbert brought the world of scholarship into the service of the state, believing that all knowledge – even from the most unlikely of sources – had potential value. This is very much in line with modern thinking about Big Data and how that can be used in the service of corporations. Even the most unlikely of sources might contain useful insights into customer preferences or previously unseen supply chain inefficiencies, for example.

Colbert’s career was caught up in the political machinations of the time. He worked as a kind of accountant for Cardinal Mazarin, but when Mazarin’s library was ransacked by political rivals and his librarian fell out of favour, Colbert restored the library and built a unified information system based on the combination of scholarship and administrative documentation, ending the former division between academia and government bureaucracy.

Importance of metadata

Colbert also instinctively grasped the importance of good metadata, cataloguing, and an accurate network of links and cross references for obtaining relevant and comprehensive information quickly – issues that are more urgent than ever given the information explosion modern organizations, and indeed nations, face. This made him a better administrator than his rivals, and by becoming not only the source of politically expedient information but also the person who knew how to use the information resources most effectively, he was able to gain political influence and become a key minister under Louis XIV.

A personal vision

I was struck by how much his vast library, archive, and document management system was the result of his own personal vision, how it was built on the dismantling and rebuilding of the work of his predecessors, but also how, after his death, the system itself fell victim to political changes and was left to collapse. This pattern is repeated frequently in modern information projects. So often the work of the original system’s champion is wasted as infighting – frequently not directly connected to the information project itself – leads to budget cuts, staff changes, or other problems that cause the system to decay.

Soll argues that the loss of Colbert’s system hampered political administration in France for generations. Ironically, it was Colbert’s own archives that enabled successive generations of political rivals to find the documents with which to undermine the power of the crown, showing the double-edged nature of information work. It is often the same collections that can both redeem and condemn.

Secrecy or transparency?

Another theme that ran throughout Colbert’s career, with both political and practical implications, was the tension between demands for transparent government and the desire for a secret state. Much of the distinction between public and private archives was simply a matter of who was in control of them and who had set them up, so the situation in France under the monarchy was different to the situation in England where Parliament and the Monarchy maintained entirely separate information systems. In France, an insistence on keeping government financial records secret eventually undermined trust in the economy. Throughout his career Colbert was involved in debates over which and how much government information should be made public, with different factions arguing over the issue – arguments that are especially resonant today.

UK Archives Discovery Forum


I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although not as detailed as the talk he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay who invented the screw-cutting lathe, which enabled standardisation of nuts and bolts. This basically enabled the industrial revolution to happen. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together but at the moment, all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easy to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! Always an engaging speaker, his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound for the way we do business than the changes brought about by the web itself. Chatting to him over tea, I learned that he has the impression that a year ago people were curious about Linked Data and just wanted to find out what it could do, but that this year they are feeling a bit more comfortable with the concepts and are starting to ask how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.
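Sheridan’s nuts-and-bolts analogy can be sketched in a few lines of code. This is a purely illustrative toy (the URIs, predicates, and datasets are all invented): because two independently produced datasets identify the same thing with the same URI – the “standard thread” of Linked Data – they can be merged without any hand-crafted mapping between them.

```python
# Triples (subject, predicate, object) from two imaginary publishers.
# The shared URI is what makes the data "interchangeable".
archive_a = [
    ("http://example.org/place/london", "label", "London"),
    ("http://example.org/place/london", "heldPhotos", 1200),
]
archive_b = [
    ("http://example.org/place/london", "population", 8900000),
]

# Because both datasets use the same URI for London, merging is trivial:
# no per-dataset crosswalk, no hand-crafted matching pair.
merged = {}
for s, p, o in archive_a + archive_b:
    merged.setdefault(s, {})[p] = o

print(merged["http://example.org/place/london"])
# {'label': 'London', 'heldPhotos': 1200, 'population': 8900000}
```

Of course, as with the lathe, agreeing the thread is the easy part; the hard part is agreeing which URIs and predicates to share in the first place.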

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories Project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and that they still take delivery of data on drives through the post. They found this was much easier to manage than FTP or other electronic transfer, just because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySQL as processing formats: MySQL couldn’t handle the volumes, and RDF they found too “verbose”. They chose a purely Lucene-based solution, which enabled them to bolt in new indexes rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.
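The “bolt in new indexes” point is worth unpacking. A toy sketch of the idea (the field names and records are invented, not the project’s actual schema): if, Lucene-style, each field gets its own inverted index, adding a new index means tokenising just that field across the records, not reprocessing everything already indexed.

```python
from collections import defaultdict

# Hypothetical extracted records from a concept-extraction pass.
records = [
    {"id": 1, "person": "John Sheridan", "place": "London"},
    {"id": 2, "person": "Henry Maudslay", "place": "Lambeth"},
]

indexes = {}

def build_index(field):
    """Build an inverted index (token -> record ids) for one field only."""
    idx = defaultdict(set)
    for rec in records:
        for token in rec[field].lower().split():
            idx[token].add(rec["id"])
    indexes[field] = idx

build_index("person")   # the original index
build_index("place")    # later "bolted on" without touching "person" at all

print(sorted(indexes["place"]["london"]))   # [1]
```

At terabyte scale the independence of the indexes is what matters: a 27-day reprocessing run is only needed when the source data itself changes.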


Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos, and soon also audio and video, and set them in Google Street View. Although Flickr has some similar functions, Historypin has volunteers who help to place each image in exactly the right place, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made the distinction between subjective and objective metadata, his definition being that objective metadata is metadata that can be corrected and subjective metadata is metadata that can’t. So, he sees the time and the place that a photo was taken as objective metadata – if it is wrong, someone might know better and be able to correct it – and the stories, comments, and opinions that people have about the content as subjective metadata, which others cannot correct: if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more use for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, etc. and he said these are important issues they are striving to work through.
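The objective/subjective distinction maps neatly onto an editing policy. A minimal sketch (the field names and functions are my own invention, not Historypin’s API): objective fields are correctable by anyone better informed, while subjective contributions are append-only and never overwritten.

```python
# Fields anyone may correct: facts about the photo that can simply be wrong.
OBJECTIVE = {"taken_at", "latitude", "longitude"}

photo = {
    "taken_at": "1948-06-01",
    "latitude": 51.5072,
    "longitude": -0.1276,
    "stories": ["My grandfather is in this photo."],  # subjective, append-only
}

def correct(record, field, value):
    """Apply a correction - only permitted for objective metadata."""
    if field not in OBJECTIVE:
        raise ValueError(f"{field} is subjective and cannot be corrected")
    record[field] = value

def add_story(record, text):
    """Subjective contributions accumulate; no-one's memory is overwritten."""
    record["stories"].append(text)

correct(photo, "taken_at", "1949-06-01")  # a better-informed user fixes the date
add_story(photo, "I remember this street.")
```

Provenance tracking then becomes a matter of logging who corrected what and when for the objective fields, while the subjective record simply grows.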

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession. She pointed out that it is an achievement that archives are still running despite the cuts, and that this shows how valued data and archives are in the national infrastructure and how important recording our history is. While archivists continue to value their collections, enjoy their visitors and users, and want their data to reach a wider audience, the sector will continue to progress. She described how federating the Genesis project within the Archives Hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding is typically allocated on the basis of visitor numbers through the doors, which puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.

World Audio Visual Archives Heritage day


I went to an interesting event last Monday night for UNESCO World Audio Visual Archives Heritage day, held at BAFTA in London.

Professor John Ellis (Department of Media Arts, Royal Holloway, University of London) talked about the growing use of TV archives, particularly news footage, in academia. He pointed out that over time such material becomes increasingly valuable in areas as diverse as physiology – for example, studying the effects of ageing by analysing footage of presenters and actors who have had long careers – and town planning, as footage can reveal the buildings that previously occupied a site being considered for redevelopment.

As UK law permits academic institutions to record and keep TV and radio broadcasts for purely educational purposes, a database of material has been collected. Academia currently remains a verbal rather than a visual culture, but this seems to be changing. All politicians, for example, are now so TV literate that to study them without reference to their TV appearances would be strange.

Fiona Maxwell (Director of Operations at ITV Global Entertainment) then talked about the painstaking restoration of the 1948 film The Red Shoes. She provided lots of technical details about removing mould and correcting registration errors, and showed “before and after” clips so we could see the huge improvements.

Web archiving


I went to an excellent Anglo-French scientific discussion seminar on web archiving on Friday at the Institut Français Cultural Centre in London. The speakers were Gildas Illien of the Bibliothèque Nationale de France (BnF) (Paris) and Dr Stephen Bury of the British Library (BL).

Gildas Illien described the web archiving project being undertaken by the BnF, using the Heritrix open source crawler to harvest the web from “seeds” (URLs). The crawler was charmingly illustrated with a picture of a “robot” (as people like to be able to see the “robot”), but the “robot” is a bit stupid – he sometimes misses things out, and sometimes falls into traps and collects the same thing over and over again. The “robot” generates a lot of material for the librarians to assess, and the problems include the short lifespan of websites – one figure puts this at only 44 days (although whether that refers to sites disappearing altogether or just changing through updates wasn’t clear) – and the “twilight zone” of what is public and what is private. In France the Legal Deposit Act was extended in 2006 to cover the web, so the BnF can collect any French website it wants to without having to ask permission. However, librarians have to choose whether to try to collect everything or just sites that are noteworthy in some way. It is also hard to guess who the future users of the archive will be and what sort of site they will want to access.
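The seed-and-trap behaviour described above can be sketched very simply. This is not Heritrix itself, just an illustrative toy with an invented in-memory link graph standing in for real fetching: the crawl starts from seed URLs, follows links breadth-first, and keeps a visited set so the “robot” does not loop round a trap collecting the same pages over and over.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# Note the loop between /a and /b - a classic crawler trap.
site = {
    "http://example.fr/": ["http://example.fr/a", "http://example.fr/b"],
    "http://example.fr/a": ["http://example.fr/b"],
    "http://example.fr/b": ["http://example.fr/a"],
}

def harvest(seeds):
    """Breadth-first crawl from seed URLs, collecting each page once."""
    visited, queue = set(), deque(seeds)
    while queue:
        url = queue.popleft()
        if url in visited:          # the trap guard
            continue
        visited.add(url)
        queue.extend(site.get(url, []))
    return visited

print(len(harvest(["http://example.fr/"])))   # 3 - each page collected once
```

Real crawlers add politeness delays, scope rules (which is where the “what is French?” and public/private questions bite), and far subtler trap detection, but the visited-set idea is the core.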

So far some 130 terabytes of data have been collected, and some 12 billion files stored.

Harvesting is done in three stages – bulk harvesting once a year; focused crawls of specific sites; and collections of e-deposits (such as e-books) directly from publishers. Some sites are harvested only occasionally – the website of the Festival de Cannes, for example, only needs to be collected once a year – while newspaper sites are collected more frequently.

The archive can be searched by URL or by text, although the text search is rudimentary at present.

Classification is another challenge, as traditional library classifications are not appropriate for much web content. For example, election campaign websites were classified by what the politicians were saying about themselves and by what the public were saying about them, as this was thought to be a useful distinction.

However, the problems of how to provide full and useful access to the collection and how to catalogue it properly remain unresolved.

The process was an interesting merging of traditional library skills and software engineering skills, with some stages clearly being either one or the other but a number of stages being “midway” requiring a cross-skilled approach.

Dr Stephen Bury explained that the BL is somewhat of a latecomer to web archiving, with the BnF, the Internet Archive, and the national libraries of Sweden and Australia all having more advanced web archiving programmes. Partly this is due to the state of UK legal deposit law, which has not yet been extended to include websites.

Just as there are many books about books and books about libraries, so there are many websites about the web. It is a very self-referential medium. However, there is a paradox in the BL’s current programme. Because the BL has to seek permission to collect each and every site, it may collect sites that it cannot then provide access to at all, and it cannot provide any access to sites except to readers in its reading rooms. To be able to collect the web but then not to be able to serve it back up to people through the web seems very strange.

Another issue of preservation is that the appearance of websites is browser-dependent, so a site may not look the same to people using different technology.

It is important that online information is preserved, as websites are now considered to be authentic sources of information – cited in PhDs, for example – and so some way of verifying that they existed and what content they contained is needed.

Reports have been produced by JISC and the Wellcome Trust: Collecting and Preserving the World Wide Web (2002) and Legal issues relating to the archiving of Internet resources in the UK, EU, USA and Australia by Andrew Charlesworth (2002).

The BL undertook a Domain UK project to establish what the scope of a web archiving project might be, using the Australian PANDAS software. The UK Web Archiving Consortium (UKWAC) was set up in 2003, but the need to obtain permissions has seriously limited its scope, as most website owners simply do not respond to permission requests (very few actively refuse permission – presumably most dismiss the request as spam).

The data has now been migrated from the PANDAS format to WARC and an access tool is in development. There are some 6 million UK websites, growing at a rate of 16% per year, and they are also growing in size (on average a site is about 25 MB, increasing at a rate of 5% per year).
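Those growth figures compound quickly. A purely illustrative back-of-the-envelope projection from the numbers quoted (6 million sites growing 16% a year, 25 MB average size growing 5% a year) shows why scalability dominates the planning:

```python
sites = 6_000_000    # UK websites today
avg_mb = 25.0        # average site size in MB

for year in range(1, 6):
    sites *= 1.16                           # 16% more sites each year
    avg_mb *= 1.05                          # each site 5% bigger each year
    total_tb = sites * avg_mb / 1_000_000   # MB -> TB

# After five years: roughly 12.6 million sites and about 400 TB per full crawl.
print(f"after 5 years: ~{sites/1e6:.1f} million sites, ~{total_tb:,.0f} TB")
```

And that is per annual snapshot; repeated crawls at different frequencies multiply the storage requirement further.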

Decisions have to be made on frequency of collection, depth of collection, and quality. There are other peripheral legal issues, such as sites that fall under terrorism-related legislation. At present the BL can collect these sites but not provide access to them.

Resource discovery remains a major challenge, including how to combine cataloguing and search engine technology. So far, a thematic approach to organisation has been taken. Scalability is also a big issue. What works for a few thousand sites will not necessarily work for a few million.

This means that the nature of the “collecting institution” is changing. It is much harder to decide if a site is in or out of scope. A site may have parts that are clearly in scope and parts that clearly aren’t or it may change through time, sometimes being in scope and sometimes not.

The Digital Lives Project in association with UCL and the University of Bristol is looking at how the web is becoming an everyday part of our social and personal lifestyles.

The talks were followed by a question and answer session. I asked for more detail about the “twilight zone” of public and private websites. Both speakers agreed that there is a great need for more education on digital awareness, so that young people appreciate that putting things up on the Internet really is a form of publishing and their blogs and comments in public forums are not just private “chats” with friends. However, in France there has been little resistance to such personal material being collected. Most people are proud to have their sites considered to be part of the national heritage. A lot of outreach work has been done by the BnF to explain the aims of the archive and discuss any concerns. Gildas Illien also pointed out that people do not necessarily have “the right to be forgotten” and that this is in fact not new. It has happened in the past that people have asked for books and other information to be removed from libraries, perhaps because they have changed their political viewpoint, and a library would not simply remove a book from its shelves because the author decided that they had changed their mind about something in it.

There is a recent interview with Gildas Illien (in French) on YouTube called L’archivage d’Internet, un défi pour les bibliothécaires (“Internet archiving, a challenge for librarians”).