A survey of Web preservation initiatives
Michael Day, UKOLN, University of Bath, m.day@ukoln.ac.uk
7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003
Presentation overview
The importance of the Web
Challenges:
Technical, legal, and organisational challenges
Approaches to collection:
Harvesting-based, selective, and deposit; combined approaches
Discussion:
Collection and access policies, software, costs, long-term preservation
Importance of the Web
An all pervasive communication medium:
In research:
Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
Wider societal role:
personal communication, e-commerce, etc.
"… the information source of first resort for millions of readers" (Lyman, 2002)
The UKOLN study
Feasibility study produced for:
Joint Information Systems Committee (JISC)
Wellcome Library
A survey of initiatives
Recommendations for the JISC and Wellcome Library
Supplementary legal study (Charlesworth)
Published February 2003
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Technical challenges (1)
Size of Web:
Surface Web > 50 TB (2000) … and still growing
The 'deep Web'
Scale of task means that Web-archiving needs to be a collaborative activity
Technical challenges (2)
Dynamic nature of Web:
Web pages disappear on average after 75 days
Many leave no trace
Evolution of Web-based technologies:
Increasing reliance on databases, scripts, plug-ins, etc.
A 'moving target'
Legal challenges
Copyright
Content liability, e.g.:
Defamation
Data protection
In the UK:
Selective approach would be the safest solution (unless law changes)
See: Charlesworth (2003)
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Organisational challenges
Decentralised organisation:
Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
National domain, subject, organisation type
Need for co-operation between initiatives
Quality:
Much of the Web is low-quality (or worse)
Is there a need to preserve all of this?
Initiatives (1)
The Internet Archive
Largest initiative, running since 1996
Co-operates on special collections and with other repositories
National Libraries:
Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
Now many, many more
Changes to legal deposit legislation in some countries
Initiatives (2)
National archives:
Focus on government Web-sites (however defined)
Guidance for Web-site managers:
e.g., UK and Australia
Snapshots:
e.g., USA and UK
Other:
Universities and scholarly societies:
e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)
Approaches (1)
Automatic harvesting:
Use of Web crawler technologies
Crawler follows links and downloads content
Pioneered by Internet Archive and Kulturarw3 project
Also used for the gathering of the Finnish and Austrian Web
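The harvesting loop described above (follow links, download content) can be illustrated with a minimal sketch. It is not the code of any harvester named in this presentation. The `fetch` callable, the `max_pages` limit and the function names are assumptions made for illustration, and a production crawler would also need politeness delays, robots.txt handling and persistent storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first harvest starting at seed.

    fetch is a hypothetical callable: URL -> HTML text, or None on failure.
    Returns a dict mapping each harvested URL to its content.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page unavailable: skip it
        archive[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archive
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and lets it be exercised against an in-memory set of pages.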
Approaches (2)
Selective approaches:
Selection of individual Web sites
Negotiate rights with site owners
Collection using gathering or mirroring software, ftp, or e-mail
Pioneered in PANDORA project
Experimented with by Library of Congress and British Library
Deposit approaches:
Site owners/administrators deposit site in repositories
Approaches (3)
Combined approaches:
Combines the advantages of the harvesting and selective approaches
Pioneered by the Bibliothèque nationale de France
Experimented with enhancements to the harvesting approach
(e.g., noting the change frequency of sites, and their 'importance')
Uses the selective approach for the 'deep Web'
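One way to picture how change frequency and 'importance' might steer a harvester is a simple re-capture ordering. The sketch below is an assumption-laden illustration, not the method used by the Bibliothèque nationale de France. The scoring formula (importance multiplied by the number of expected changes since the last capture) and all field names are hypothetical.

```python
import heapq


def recrawl_order(sites, today):
    """Order sites for re-capture, most urgent first.

    Each site is a dict with hypothetical fields:
      url                  - site address
      importance           - weight between 0 and 1
      change_interval_days - estimated days between changes
      last_capture_day     - day number of the last capture
    """
    heap = []
    for site in sites:
        elapsed = today - site["last_capture_day"]
        # Expected number of changes missed since the last capture.
        staleness = elapsed / site["change_interval_days"]
        priority = site["importance"] * staleness
        # heapq is a min-heap, so negate to pop the highest priority first.
        heapq.heappush(heap, (-priority, site["url"]))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Under this scoring, a frequently changing site of moderate importance can outrank an important but static one, which is the trade-off the two signals are meant to capture.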
Collection policies
Dependent on technical approach chosen
National domain ++ (for harvesting-based approaches)
Collection guidelines (for selective approaches)
Based on relevance, provenance, quality, etc.
Frequency of capture
Possible overlap with subject gateway initiatives - e.g. the Resource Discovery Network (RDN) in the UK
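For harvesting-based approaches, a 'national domain ++' policy can be read as a scope test applied to each candidate URL: accept hosts under the national top-level domain, plus an explicit list of additional hosts. The sketch below assumes that reading. The function name, parameters and example hosts are hypothetical, and real collection policies involve further criteria.

```python
from urllib.parse import urlparse


def in_scope(url, national_tld, extra_hosts=frozenset()):
    """Return True if the URL falls inside the collection scope.

    national_tld - top-level domain without the dot, e.g. "se"
    extra_hosts  - hosts outside that domain that are collected anyway
    """
    host = urlparse(url).hostname
    if host is None:
        return False  # no network location, e.g. a mailto: address
    return host.endswith("." + national_tld) or host in extra_hosts
```

For example, `in_scope("http://www.kb.se/page", "se")` accepts the URL, while a host under another domain is accepted only if it appears in `extra_hosts`.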
Approximate size (2002)
Source: Day (2003)
Access policies
Access policies differ:
Internet Archive and the PANDORA archive make data available
e.g., the Wayback Machine
Other collections effectively closed (for legal reasons or because experimental)
Need for specialised Web indexes that can search and navigate large collections of Web material
e.g., Nordic Web Archive (NWA) Toolset
Software
Various software in use:
Harvesting:
Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
Selective:
HTTrack (popular), etc.
PANDAS (PANDORA Digital Archiving System) - helps with managing the process, adding metadata, etc.
Costs
Costs vary widely:
Selective approach much more expensive (per TB) than bulk harvesting
But resulting archives are more widely accessible
Significant costs in undertaking rights clearance
Long-term preservation
Many initiatives have so far focused mainly on the collection of resources:
Need to consider the longer-term
Descriptive and technical metadata
Migration needs (e.g. for complex sites)
Need for Web archiving initiatives to become trusted repositories
Need to be embedded into the 'core activities' of their host organisation
Summing up
Much experimentation to date, but now moving into implementation phase
Co-operation and collaboration are important
Combined technical approaches offer best way forward
Legal challenges still problematic
Long-term preservation issues still to be explored in detail
Acknowledgements
UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.
http://www.ukoln.ac.uk/