DLF NY 2003

Presentation Transcript

Global Digital Format Registry
  Stephen L. Abrams, Harvard University Library
  MacKenzie Smith, Massachusetts Institute of Technology
  DLF Spring Forum, New York, May 14-16, 2003

Why Do We Need a Registry?
  - Repository functions are performed on a format-specific basis
  - Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
  - Interchange requires mutual agreement of format syntax and semantics

Potential Use Cases
  - Identification: “I have a digital object; what format is it?”
  - Validation: “I have an object purportedly of format F; is it?”
  - Transformation: “I have an object of format F, but need G; how can I produce it?”
  - Characterization: “I have an object of format F; what are its significant properties?”
  - Risk assessment: “I have an object of format F; is it at risk of obsolescence?”
  - Delivery: “I have an object of format F; how can I render it?”

Repository Format Dependencies
  - Ingest
    - Validation
    - SIP-to-AIP
  - Access
    - AIP-to-DIP
    - Rendering
  - Preservation planning
    - Migration
    - Emulation
    - UVC

What’s Wrong with MIME Types?
  - Insufficient depth of detail
    - Syntax and semantics
    - Public and proprietary
  - Insufficient granularity
    - Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 → image/tiff
    - All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A → application/pdf

A Bit of History
  - DLF-sponsored invitational meetings
  - Ad-hoc committee
  - Collected use cases
  - Working groups on data and governance models
  During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns.

Ad-Hoc Committee
  - Bibliothèque nationale de France
  - California Digital Library
  - Digital Library Federation
  - Harvard University
  - IETF
  - JISC
  - JSTOR
  - Library of Congress
  - MIT
  - NARA
  - National Archives of Canada
  - New York University
  - NIST
  - OCLC
  - Public Records Office, UK
  - RLG
  - Stanford University
  - University of Pennsylvania

Global Digital Format Registry
  The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.

What is a Format?
  - No assumption regarding byte size
  - An information model is a formal expression of exchangeable knowledge
  A format is a fixed, byte-serialized encoding of an information model.

What is Representation Information?
  - Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value
  Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats.

Data Model
  - Registry
  - Format
  - Descriptive: General descriptive properties
  - Characterization: Technical syntactic/semantic properties
  - Processing: Services and systems using format as input or output
  - Administrative: Provenance

Informative, not Evaluative
  - Legal liability
  - May discourage deposit of proprietary information
  - Investigate ways to include (by reference?)
    third-party evaluations/recommendations
    - Insofar as this doesn’t hamper primary goal
  The format properties stored in the registry should be factual, not judgmental.

Data Model Sources
  - ISO 14721, Open archival information system -- Reference model
    - CCSDS OAIS reference model
    - Representation information
      - Interpret, or provide “additional meaning” to Data Object
      - Structure and semantic information
  - PRONOM
    - Public Records Office, UK
    - “information about file formats and the application software needed to open them”
    - Format, vendor, product

Data Model Sources
  - Diffuse
    - EC’s Information Society Technologies programme
    - “reference and guidance information on available and emerging standards and specifications”
    - Business Guides: “application of standards and specifications in specific areas”
  - OCLC/RLG Preservation Metadata Framework
    - “information necessary to render/display, understand, and interpret the Content Data Object”
    - Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata

Data Model Sources
  - NIST National Software Reference Library
    - File profiles for the NSRL Reference Data Set
    - Vendor, product, operating system
    - Used for forensic identification
  - Media features
    - Protocol-independent content negotiation
    - Selection of an “appropriate representation” of a resource
    - RFCs 2506, 2533, 2534

Data Model Sources
  - Typed Object Model (TOM)
    - “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions”
    - Format is aggregate of type (attributes, operations, semantics) and encoding
  - JISC File Format Representation and Rendering Project
    - Assessment of formats and rendering software
    - Representation system to track formats and their rendering software

Data Model Sources
  - Bitstream Syntax Description Language
    - MPEG-21 content adaptation
    - XML schema to model multimedia bitstreams
  - Useful for administrative properties and data types:
    - ISO/IEC 11179,
      Specification and standardization of data elements
    - OASIS/ebXML Registry Information Model

Data Model

High-Level Format Properties

Descriptive Properties
  - Identifiers
    - Canonical and alias
  - Arbitrary relationships
    - Equivalence
    - Encapsulation
    - Sub-typing, with strict substitutability
      - PDF 1.0 ← … ← PDF 1.4 ← PDF/A
      - XML ← SVG
    - Versioning
  - Ontological classification

Format Ontology
  Content stream Logical Numeric Scalar Integer Unsigned Real Floating point Complex Text Structured text Mark-up language Programming language Message Mail News Image Still Font Outline Raster Graphic Vector Raster Page description Motion Audio Music Application CAD Communication Database Executable GIS Presentation Spreadsheet Word processing Transformation Compression Lossless Lossy Container File system Transfer 7-bit safe Physical media Magnetic Disk Tape Reel Cartridge Optical Disk CD-ROM DVD Film Paper Card Tape

Characterization Properties
  - Specification documents
    - Actionable links
    - Public identifiers
    - Hard copy
    - Public, on-site, license, and escrow access
  - Signatures
    - External
      - File extension, Mac OS data fork type
    - Internal
      - Magic number

Centralized vs. Distributed
  - Allowing arbitrary granularity may lead to an explosion of registered formats
    - Versions
    - Local profiles
  - Typed relationships support internal and external references
    - Enable distributed architecture without mandating it

Core Registry Services
  - Management Services
    - Approval: Level of review, level of public disclosure
    - Maintenance: Add, update, delete format entries
    - Notification: Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.)
    - Introspection: Determine local policies (scope, coverage, implemented services, etc.)
      of a given registry to identify appropriate registry to use

Core Registry Services
  - Access Services
    - Description: Representation information returned on request for single format
    - Export: Entire registry or selected subset sent to external repository

Supported Services
  - Representation Services
    - Identification services: Determine format of a specific digital object by comparing its attributes to the attribute profiles retrieved from the registry
    - Validation services: Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format.

Supported Services
  - Brokerage Services
    - Rendering service: Identify current rendering conditions for supplied DO
    - Transformation service: Convert DO from current (source) format to target format
    - Metadata Extraction services: Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format

Service Model Sources
  - ANSI X3.285, Metamodel for Management of Shareable Data
    - Service model for ISO/IEC 11179
  - IANA MIME media type registry
  - OASIS/ebXML Registry Services Specification

Registry Operation
  - Trust is necessary to encourage deposit of proprietary information
  - Sustainability is necessary to justify expense
    - As for all preservation activities, how do we generate income today, for services not needed until tomorrow?
  The registry is valuable insofar as it is trustworthy and sustainable.

Registry Operation
  - Will registry staff collect and manage representation information, or
  - Will knowledgeable community members submit information?
  - What is the level of technical review, and by whom?
    - IETF model
  Is the registry self-populating, or a public bulletin board?

Governance Model
  - Can this initiative reasonably be placed under the umbrella of an existing organization?
  - Is global scope in conflict with national prerogatives?
  - How to build sufficient trust models
  Governance model becomes more important as the operational model becomes more pro-active (distributed and contributory)

Business Model
  - Costs depend on level of quality and authority required (e.g. wiki vs. OCLC)
  - Assuming the registry needs to be cost-recovered, options for supporting “common good” services include:
    - Subsidy
    - Subscription
    - Pay to submit: Format registration accompanied by fee
    - Pay to view: Queries on a for-fee basis
    - Added-value services

Next Steps
  - Tell people what we’re doing
    - National, academic, private libraries/archives
    - Standards bodies
    - Commercial
      - Regulated industries
      - Software vendors (developers and consumers of formats)
      - Publishers
    - Anyone with long-term digital preservation needs
  - Refine documentation for a general audience
    - Vision statement and high-level project plan

Next Steps
  - Look for project funding
    - Potentially two phases:
      - Design and implementation: Can be funded through grants, in-kind participation
      - Operational: Need reliable, sustainable income stream
    - Planning grant to sustain initial activity
      - Data and service models
      - Governance and business model
      - Development and operations plan
    - Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre

Why Is This Important to You?
  If you care about the long-term usability of your digital assets:
  - The registry will allow typing of digital objects at an appropriate level of granularity
  - The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects
  - The registry is an enabling technology underlying digital repository operations and preservation activities

… thanks!
  hul.harvard.edu/formatregistry
  stephen_abrams@harvard.edu
  kenzie@mit.edu
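
Illustrative sketch (not part of the original slides). The identification use case, the internal-signature ("magic number") property, the sub-typing relationships, and the MIME granularity problem described in the transcript can be sketched together in a few lines of Python. Everything named here is an assumption made for illustration: the `FormatEntry` record, the `example:` identifiers, and the three helper functions are hypothetical and do not reflect the registry's actual data model or any real API. Only the TIFF and PDF byte signatures are real.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FormatEntry:
    """Hypothetical registry record; field names are illustrative only."""
    identifier: str                   # made-up public identifier
    mime_type: str                    # coarse MIME type for comparison
    magics: Tuple[bytes, ...] = ()    # internal signatures expected at offset 0
    parent: Optional[str] = None      # super-type (left of the arrow on the slide)


ENTRIES = [
    # TIFF files start with a byte-order mark followed by the number 42.
    FormatEntry("example:tiff", "image/tiff", (b"II*\x00", b"MM\x00*")),
    FormatEntry("example:pdf-1.4", "application/pdf", (b"%PDF-1.4",)),
    # PDF 1.4 <- PDF/A and XML <- SVG, as on the Descriptive Properties slide.
    FormatEntry("example:pdf-a", "application/pdf", parent="example:pdf-1.4"),
    FormatEntry("example:xml", "application/xml"),
    FormatEntry("example:svg", "image/svg+xml", parent="example:xml"),
]
REGISTRY: Dict[str, FormatEntry] = {e.identifier: e for e in ENTRIES}


def identify(data: bytes) -> List[str]:
    """Return identifiers of entries whose magic number prefixes the data."""
    return [e.identifier for e in ENTRIES
            if any(data.startswith(m) for m in e.magics)]


def is_subtype(candidate: str, ancestor: str) -> bool:
    """True if candidate is ancestor or reaches it by following parent links."""
    current: Optional[str] = candidate
    while current is not None:
        if current == ancestor:
            return True
        current = REGISTRY[current].parent
    return False


def formats_by_mime() -> Dict[str, List[str]]:
    """Group identifiers by MIME type to expose the granularity problem."""
    groups: Dict[str, List[str]] = {}
    for e in ENTRIES:
        groups.setdefault(e.mime_type, []).append(e.identifier)
    return groups


if __name__ == "__main__":
    print(identify(b"II*\x00..."))                          # TIFF signature
    print(is_subtype("example:pdf-a", "example:pdf-1.4"))   # sub-type check
    print(formats_by_mime()["application/pdf"])             # two formats, one MIME type
```

A prefix match on a magic number only supports coarse identification: it cannot tell PDF/A from plain PDF 1.4, which is one reason the slides list richer characterization properties alongside signatures.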