I went to an excellent Anglo-French scientific discussion seminar on web archiving on Friday at the Institut Français Cultural Centre in London. The speakers were Gildas Illien of the Bibliothèque Nationale de France (BnF) (Paris) and Dr Stephen Bury of the British Library (BL).
Gildas Illien described the web archiving project being undertaken by the BnF, using the Heritrix open source crawler to harvest the web from “seeds” (URLs). The crawler was charmingly illustrated with a picture of a “robot” (as people like to be able to see the “robot”), but the “robot” is a bit stupid – he sometimes misses things out and sometimes falls into traps and collects the same thing over and over again. The “robot” generates a lot of code for the librarians to assess. Problems include the short lifespan of websites – one figure puts this at only 44 days (although whether that refers to sites disappearing altogether or just changing through updates wasn’t clear) – and the “twilight zone” of what is public and what is private. In France the Legal Deposit Act was extended in 2006 to cover the web, so the BnF can collect any French website it wants to without having to ask permission. However, librarians have to choose whether to try to collect everything or just sites that are noteworthy in some way. It is also hard to guess who the future users of the archive will be and what sort of site they will want to access.
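To make the “seeds” and “traps” idea a little more concrete, here is a minimal sketch of my own in Python. It is not Heritrix (which is a Java application) and not the BnF's actual configuration. The function name `crawl`, the `get_links` callback and the toy link graph are all assumptions made for illustration. A set of already-seen URLs stops the “robot” collecting the same page over and over, and a depth limit stops it wandering forever into a trap that keeps generating new links.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=3):
    """Breadth-first harvest from seed URLs (toy illustration, not Heritrix).

    get_links is a hypothetical callback returning the URLs linked from a page.
    """
    seen = set(seeds)                      # URLs already queued: avoids re-collecting
    queue = deque((url, 0) for url in dict.fromkeys(seeds))
    harvested = []
    while queue:
        url, depth = queue.popleft()
        harvested.append(url)
        if depth >= max_depth:             # depth limit: crude protection against traps
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return harvested

# A made-up three-page site whose pages all link back to each other.
toy_web = {
    "http://example.fr/": ["http://example.fr/a", "http://example.fr/b"],
    "http://example.fr/a": ["http://example.fr/", "http://example.fr/b"],
    "http://example.fr/b": ["http://example.fr/a"],
}
pages = crawl(["http://example.fr/"], lambda url: toy_web.get(url, []))

# A "trap": every page links to a brand-new URL, so only the depth limit stops it.
trapped = crawl(["s"], lambda url: [url + "x"], max_depth=3)
```

Even in this toy version the trade-off the librarians face is visible: a small depth limit keeps the robot out of traps but also means it “misses things out” on deep sites.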
So far some 130 terabytes of data have been collected, and some 12 billion files stored.
Harvesting is done in three stages – bulk harvesting once a year; focused crawls of specific sites; and collections of e-deposits (such as e-books) directly from publishers. How often a site is harvested depends on the site: the website of the Festival de Cannes, for example, only needs to be collected once per year, whereas newspaper sites are collected more frequently.
The archive can be searched by URL or by text, although the text search is rudimentary at present.
Classification is another challenge, as traditional library classifications are not appropriate for much web content. For example, election campaign websites were classified by what the politicians were saying about themselves and by what the public were saying about them, as this was thought to be a useful distinction.
However, the problems of how to provide full and useful access to the collection and how to catalogue it properly remain unresolved.
The process was an interesting merging of traditional library skills and software engineering skills, with some stages clearly being either one or the other but a number of stages being “midway” requiring a cross-skilled approach.
Dr Stephen Bury explained that the BL is something of a latecomer to web archiving, with the BnF, the Internet Archive, and the national libraries of Sweden and Australia all having more advanced web archiving programmes. This is partly due to the state of UK legal deposit law, which has not yet been extended to include websites.
Just as there are many books about books and books about libraries, so there are many websites about the web. It is a very self-referential medium. However, there is a paradox in the BL’s current programme. Because the BL has to seek permission to collect each and every site, it may collect sites that it cannot then provide access to at all, and it cannot provide any access to sites except to readers in its reading rooms. To be able to collect the web but then not to be able to serve it back up to people through the web seems very strange.
Another issue of preservation is that the appearance of websites is browser-dependent, so a site may not look the same to people using different technology.
It is important that online information is preserved, as websites are now considered to be authentic sources of information – cited in PhDs for example – and so some way of verifying that they existed and what content they contained is needed.
Reports have been produced by JISC and the Wellcome Trust: Collecting and Preserving the World Wide Web (2002) and Legal issues relating to the archiving of Internet resources in the UK, EU, USA and Australia (2002) by Andrew Charlesworth.
The BL undertook a Domain UK project to establish what the scope of a web archiving project might be. The BL used Australian PANDAS software. The UK Web Archiving Consortium (UKWAC) was set up in 2003, but the need to obtain permissions has seriously limited its scope, as most website owners simply do not respond to permissions requests. Very few actively refuse permission; presumably most ignore the request as spam or simply fail to reply.
The data has now been migrated from the PANDAS format to WARC and an access tool is in development. There are some 6 million UK websites, growing at a rate of 16% per year, and they are also growing in size (on average they are about 25 MB, increasing at a rate of 5% per year).
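Out of curiosity, I tried putting those figures together. The following is a rough back-of-envelope sketch in Python, not anything presented at the seminar. It assumes that the average size quoted is 25 megabytes, that both growth rates stay constant, and that 1 TB is a million MB. The function name `projected_total_tb` is my own.

```python
def projected_total_tb(years, sites=6_000_000, avg_mb=25.0,
                       site_growth=0.16, size_growth=0.05):
    """Estimate the total size of the UK web after `years`, in decimal terabytes.

    Assumes constant compound growth in both the number of sites and their
    average size, which is a simplification of the figures quoted in the talk.
    """
    n_sites = sites * (1 + site_growth) ** years
    mb_per_site = avg_mb * (1 + size_growth) ** years
    return n_sites * mb_per_site / 1_000_000   # MB -> TB (decimal)

# Print the estimate for today and for each of the next five years.
for year in range(6):
    print(year, round(projected_total_tb(year)), "TB")
```

On these assumptions the starting point is 6 million × 25 MB = 150 TB, and because the two rates compound (1.16 × 1.05 = 1.218), the total grows by nearly 22% per year rather than 16%, which helps to show why scalability is such a concern.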
Decisions have to be made on frequency of collection, depth of collection, and quality. There are other peripheral legal issues, such as sites that fall under terrorism-related legislation. At present the BL can collect these sites but not provide access to them.
Resource discovery remains a major challenge, including how to combine cataloguing and search engine technology. So far, a thematic approach to organisation has been taken. Scalability is also a big issue. What works for a few thousand sites will not necessarily work for a few million.
This means that the nature of the “collecting institution” is changing. It is much harder to decide if a site is in or out of scope. A site may have parts that are clearly in scope and parts that clearly aren’t or it may change through time, sometimes being in scope and sometimes not.
The talks were followed by a question and answer session. I asked for more detail about the “twilight zone” of public and private websites. Both speakers agreed that there is a great need for more education on digital awareness, so that young people appreciate that putting things up on the Internet really is a form of publishing and their blogs and comments in public forums are not just private “chats” with friends. However, in France there has been little resistance to such personal material being collected. Most people are proud to have their sites considered to be part of the national heritage. A lot of outreach work has been done by the BnF to explain the aims of the archive and discuss any concerns. Gildas Illien also pointed out that people do not necessarily have “the right to be forgotten” and that this is in fact not new. In the past, people have asked for books and other information to be removed from libraries, perhaps because they have changed their political viewpoint, but a library would not simply remove a book from its shelves because the author had changed their mind about something in it.
There is a recent interview with Gildas Illien (in French) on YouTube called L’archivage d’Internet, un défi pour les bibliothécaires.