logging in or signing up data warehouse AirietDamaris Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 1688 Category: Science & Tech.. License: All Rights Reserved Like it (2) Dislike it (2) Added: May 14, 2010 This Presentation is Public Favorites: 3 Presentation Description No description available. Comments Posting comment... By: kittu505 (14 month(s) ago) i nee this one Saving..... Post Reply Close Saving..... Edit Comment Close By: tushki.kuku (14 month(s) ago) nice work Saving..... Post Reply Close Saving..... Edit Comment Close By: sareddyrajgopal (15 month(s) ago) Hi i need this presentaion give download permission Saving..... Post Reply Close Saving..... Edit Comment Close By: zeqiraj (15 month(s) ago) shume e mir veq suksese Saving..... Post Reply Close Saving..... Edit Comment Close By: rmonu007 (19 month(s) ago) how can i download this presentation Saving..... Post Reply Close Saving..... Edit Comment Close loading.... See all Premium member Presentation Transcript Data Warehouse : Data Warehouse DATA TRANSFORMATION Extract Transform Insert : Extract Transform Insert Extract data from operational system, transform and insert into data warehouse Why ETI? Will your warehouse produce correct information with the current data? How ho w can I ensure warehouse credibility? Excuses for NOT Transforming Legacy Data : Excuses for NOT Transforming Legacy Data Old data works fine, new will work as well. Data will be fixed at point of entry through GUI. If needed, data will be cleaned after new system populated; After proof-of-concept pilot. Keys join the data most of the time. Users will not agree to modifying or standardizing their data. Levels of Migration Problem : Levels of Migration Problem Existing metadata is insufficient and unreliable Metadata must hold for all occurrences Metadata must represent business and technical attributes Data values incorrectly typed and accessible Values form extracted from storage Values meaning inferred from its content Entity keys unreliable or unavailable Inferred from related values Metadata Challenge : Metadata Challenge Metadata gets out of synch with details it summarizes Business grows faster than systems designed to capture business info Not at the right level of detail Multiple values in a single field Multiple meanings to a single field No fixed format for value Expressed in awkward of limited terms Program/compiler view rather than business view Character-level Challenge : Character-level Challenge Value instance level Spelling, aliases Abbreviations, truncations, transpositions Inconsistent storage formats Named type level Multiple meanings, contextual meanings Synonyms, homonyms Entity level No common keys or representation No integrated view across records, files, systems Some Data Quality examples : Some Data Quality examples The magic shrinking vendor file 127 ways to spell... Data surprises in individual fields Cowbirds and Data Fields Magic numbers and embedded intelligence The Magic Shrinking Vendor File : The Magic Shrinking Vendor File A Medical claims processor was having trouble with their Insurance Vendor file. They thought they had 300,000 Insurance Vendors. When they cleaned up their data, they discovered they had only 27,000 unique Insurance Vendors. 127 ways to spell... : 127 ways to spell... Have over 127 different ways to spell AT&T Have over 1000 ways to spell duPont Slide 10: NAME SOC. SEC. # TELEPHONE Data surprises in individual fields Source: Vality Slide 11: NAME SOC. SEC. # TELEPHONE Meta Data surprises in individual fields Source: Vality Slide 12: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K Meta Actual Data Values Data surprises in individual fields Source: Vality Slide 13: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 Meta Actual Data Values Data surprises in individual fields Source: Vality Slide 14: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 6173380300 3380321 415-392-2000 508-466-1200 212-235-1000 FAX 528-9825 5436 Meta Actual Data Values Data surprises in individual fields Source: Vality Cowbirds and Data Fields : Cowbirds and Data Fields Cowbirds lay their eggs in other birds nets Users use data fields that are not used for other purposes Magic Numbers and Embedded Intelligence : Magic Numbers and Embedded Intelligence Customer Number = XXXX-YY-ZZ XXXX = 1st 4 Positions of Zip Code If YY = 00-70 Then Cust = Pharmacy If YY = 80-89 Then Cust = Hospital Except if YY = 82 and ZZ = ** Which Means... Orr's Laws of Data Quality : Orr's Laws of Data Quality Law #1 - “Data that is not used cannot be correct!” Law #2 - “Data quality is a function of its use, not its collection!” Law #3 - “Data will be no better than its most stringent use!” Law #4 - “Data quality problems increase with the age of the system!” Law #5 - “Data quality laws apply equally to meta-data!” Law #6 - The less likely something is to occur, the more traumatic it will be when it happens! Legacy Data Contaminants Found in Migrations : Legacy Data Contaminants Found in Migrations Lack of standards Data surprises in individual fields Legacy information buried in free form fields Legacy myopia – multiple account numbers block consolidated view Anomaly nightmare – complex matching and consolidation 4 Fundamental Types of Transformation : 4 Fundamental Types of Transformation Simple Transformation Fundamental building blocks of all data transformations One field at a time Cleansing and Scrubbing Ensure consistent formatting and usage of field or related group of fileds Checks valid values 4 Fundamental Types of Transformation (con’t) : 4 Fundamental Types of Transformation (con’t) Integration Takes operational data from one or more sources and maps it, field by field to new data structure Aggregation and Summarization Remove low level of detail Data for data mart Simple Transformation : Simple Transformation Convert data element from one type to another semantic value same rename elements Date time conversion standard warehouse format Decode encoded fields M F vs C S MM Cleansing and Scrubbing : Cleansing and Scrubbing Actual content examined Range checking, enumerated lists, dependency checking Uniform representation for dw address information parse to components Integration : Integration Simple field level mappings -80-90% Complex integration No common identifier probable matches 2-stage process, isolation/reconciliation Multiple sources for same target element contradictory Missing data Derived/calculated data redundant? Aggregation and Summarization : Aggregation and Summarization Summarization is the addition of like values along one or more business dimensions add daily sales by stores for monthly sales by region Aggregation is the addition of different business elements into common total daily product sales plus monthly consulting sales give monthly combined sales amount Details of process available in metadata Data Re-engineering Problem : Data Re-engineering Problem Programming for the unknown Unanticipated values, structures and patterns Programming for noise and uncertainty Conflicting and missing values Programming for productivity and efficiency Changing data values, changing user requirements High volumes, non-linear searches Conventional data transformation methods do not solve the metadata and data value challenges – need data re-engineering Stephen Brown, Vality Corp. Data Re-engineering Process : Data Re-engineering Process Natural Laws of Data Re-engineering : Natural Laws of Data Re-engineering Data has no standard You can’t predict or legislate format or content Data will evolve faster than its capture and storage systems You can’t write rules for what you don’t know and can’t see Instructions for handling data are within the data Don’t trust the metadata, make the data reveal itself Revealed metadata is knowledge about the business Revealed metadata validates warehouse design Revealed metadata supports conversion project management Revealed metadata is insurance against misinformation Buy tool or manually code programs ? : Buy tool or manually code programs ? Slide 29: 3 - DW Tools 1st Generation ETL Job Schedulers RDBMS Utilities Replication/Distribution Tools Universal Repositories Q&R/MQE/MRE MOLAP/ROLAP/LowLAP Data Mining CASE DB Design Repositories DB & System Monitors Technologies Processes EIS Data Visualization Meta Data Browsers Design Mapping Extract Scrub Transform Load Index Aggregation Replication Data Set Distribution Access & Analysis Resource Scheduling & Distribution Meta Data System Monitoring 2nd Generation ETL Suites / Environments Transformation : Transformation Choosing between tool and manually coded programs Time frames - tools take longer select, configure, learn Budgets - short term or long term Size of warehouse - initial project small enough for coding Size and skills of warehouse team Tool automatically generates and maintains metadata Hand Generated Code : Hand Generated Code Upside No learning curve Inherent skills In house capabilities Usually simple No culture change/mandate (CASE) Downside Manual meta data Maintenance challenge when talent level changes No automation Tools : Tools Upside Easy to maintain as talent level changes Automatic meta data May gain efficiencies Integration with repositories Integration with other tools Schedulers Monitors Meta data management Tools : Tools Downside Cost (1st generation tools very high $) Learning curve Enforced culture change Must use tool for all changes Speed, may be slower to implement May require additional resources Slide 34: Copyright © 1997, Enterprise Group, Ltd. Manual Code / 1st Generation ETL Tools Process Extract Program Source OLTP Systems Transform Program File Transfer Program File Load Program Source Mainframe or C/S System Data Warehouse Client/Server System Index Program Aggregation Program External Job Scheduling and Control - External Meta Data Load/Maintenance Slide 35: Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Tools Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse/Mart C/S System Data Warehouse or Data Mart Slide 36: Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Environment Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Request Broker Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse C/S System Enterprise Meta Data Data Mart C/S System Data Mart Data Mart C/S System Data Mart Surf Meta Data Request Resource Schedule Delivery User Process Data Warehouse 1st Generation ETL Tools Hampered by: : 1st Generation ETL Tools Hampered by: High cost (average deal prices in the $250-400k range) Long learning curves Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor) Core capabilities (complex transformations still required manual code) Management requirements (users still had to manage all the programs generated) Performance issues (the resulting programs could not leverage parallelism) Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Transformation engine design Ability to leverage parallel server technology CDC (Change Data Capture, which allows only the new data to be extracted) Incremental aggregation (ability to add CDC incremental data to existing aggregations) Limited or no use of temporary files or data base tables (virtual caching only) Common, open and extensible meta data repository Enterprise scalability Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Common UI (User Interface) across all tools Extensive selection of transformation algorithms Easily extensible scrub and transform algorithm library Extensive heterogeneous source and target support Native OLAP data set target support System monitoring & management Enterprise meta data repository (content, resources, structure, etc.) Transform once, populate many (populate multiple targets with a single transformation output) Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Integrated enterprise scale scrubbing capabilities Seamless interoperability with external point solution tools Integrated information access, analysis, scheduling and delivery Aggregate aware information request broker (enables virtual data warehouse) Ad hoc aggregation monitoring and management Pipeline parallelism / very high throughput Native drivers (source and target) OLTP <> OLAP : OLTP <> OLAP OLTP normalized OLAP tools must provide multidimensional conceptual view of data ??????Providing OLAP to User Analysts, E.F.Codd redundant data Multidimensional Model : Multidimensional Model Data stored as facts and dimensions Sales Fact Cube You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
data warehouse AirietDamaris Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 1688 Category: Science & Tech.. License: All Rights Reserved Like it (2) Dislike it (2) Added: May 14, 2010 This Presentation is Public Favorites: 3 Presentation Description No description available. Comments Posting comment... By: kittu505 (14 month(s) ago) i nee this one Saving..... Post Reply Close Saving..... Edit Comment Close By: tushki.kuku (14 month(s) ago) nice work Saving..... Post Reply Close Saving..... Edit Comment Close By: sareddyrajgopal (15 month(s) ago) Hi i need this presentaion give download permission Saving..... Post Reply Close Saving..... Edit Comment Close By: zeqiraj (15 month(s) ago) shume e mir veq suksese Saving..... Post Reply Close Saving..... Edit Comment Close By: rmonu007 (19 month(s) ago) how can i download this presentation Saving..... Post Reply Close Saving..... Edit Comment Close loading.... See all Premium member Presentation Transcript Data Warehouse : Data Warehouse DATA TRANSFORMATION Extract Transform Insert : Extract Transform Insert Extract data from operational system, transform and insert into data warehouse Why ETI? Will your warehouse produce correct information with the current data? How ho w can I ensure warehouse credibility? Excuses for NOT Transforming Legacy Data : Excuses for NOT Transforming Legacy Data Old data works fine, new will work as well. Data will be fixed at point of entry through GUI. If needed, data will be cleaned after new system populated; After proof-of-concept pilot. Keys join the data most of the time. Users will not agree to modifying or standardizing their data. Levels of Migration Problem : Levels of Migration Problem Existing metadata is insufficient and unreliable Metadata must hold for all occurrences Metadata must represent business and technical attributes Data values incorrectly typed and accessible Values form extracted from storage Values meaning inferred from its content Entity keys unreliable or unavailable Inferred from related values Metadata Challenge : Metadata Challenge Metadata gets out of synch with details it summarizes Business grows faster than systems designed to capture business info Not at the right level of detail Multiple values in a single field Multiple meanings to a single field No fixed format for value Expressed in awkward of limited terms Program/compiler view rather than business view Character-level Challenge : Character-level Challenge Value instance level Spelling, aliases Abbreviations, truncations, transpositions Inconsistent storage formats Named type level Multiple meanings, contextual meanings Synonyms, homonyms Entity level No common keys or representation No integrated view across records, files, systems Some Data Quality examples : Some Data Quality examples The magic shrinking vendor file 127 ways to spell... Data surprises in individual fields Cowbirds and Data Fields Magic numbers and embedded intelligence The Magic Shrinking Vendor File : The Magic Shrinking Vendor File A Medical claims processor was having trouble with their Insurance Vendor file. They thought they had 300,000 Insurance Vendors. When they cleaned up their data, they discovered they had only 27,000 unique Insurance Vendors. 127 ways to spell... : 127 ways to spell... Have over 127 different ways to spell AT&T Have over 1000 ways to spell duPont Slide 10: NAME SOC. SEC. # TELEPHONE Data surprises in individual fields Source: Vality Slide 11: NAME SOC. SEC. # TELEPHONE Meta Data surprises in individual fields Source: Vality Slide 12: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K Meta Actual Data Values Data surprises in individual fields Source: Vality Slide 13: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 Meta Actual Data Values Data surprises in individual fields Source: Vality Slide 14: NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 6173380300 3380321 415-392-2000 508-466-1200 212-235-1000 FAX 528-9825 5436 Meta Actual Data Values Data surprises in individual fields Source: Vality Cowbirds and Data Fields : Cowbirds and Data Fields Cowbirds lay their eggs in other birds nets Users use data fields that are not used for other purposes Magic Numbers and Embedded Intelligence : Magic Numbers and Embedded Intelligence Customer Number = XXXX-YY-ZZ XXXX = 1st 4 Positions of Zip Code If YY = 00-70 Then Cust = Pharmacy If YY = 80-89 Then Cust = Hospital Except if YY = 82 and ZZ = ** Which Means... Orr's Laws of Data Quality : Orr's Laws of Data Quality Law #1 - “Data that is not used cannot be correct!” Law #2 - “Data quality is a function of its use, not its collection!” Law #3 - “Data will be no better than its most stringent use!” Law #4 - “Data quality problems increase with the age of the system!” Law #5 - “Data quality laws apply equally to meta-data!” Law #6 - The less likely something is to occur, the more traumatic it will be when it happens! Legacy Data Contaminants Found in Migrations : Legacy Data Contaminants Found in Migrations Lack of standards Data surprises in individual fields Legacy information buried in free form fields Legacy myopia – multiple account numbers block consolidated view Anomaly nightmare – complex matching and consolidation 4 Fundamental Types of Transformation : 4 Fundamental Types of Transformation Simple Transformation Fundamental building blocks of all data transformations One field at a time Cleansing and Scrubbing Ensure consistent formatting and usage of field or related group of fileds Checks valid values 4 Fundamental Types of Transformation (con’t) : 4 Fundamental Types of Transformation (con’t) Integration Takes operational data from one or more sources and maps it, field by field to new data structure Aggregation and Summarization Remove low level of detail Data for data mart Simple Transformation : Simple Transformation Convert data element from one type to another semantic value same rename elements Date time conversion standard warehouse format Decode encoded fields M F vs C S MM Cleansing and Scrubbing : Cleansing and Scrubbing Actual content examined Range checking, enumerated lists, dependency checking Uniform representation for dw address information parse to components Integration : Integration Simple field level mappings -80-90% Complex integration No common identifier probable matches 2-stage process, isolation/reconciliation Multiple sources for same target element contradictory Missing data Derived/calculated data redundant? Aggregation and Summarization : Aggregation and Summarization Summarization is the addition of like values along one or more business dimensions add daily sales by stores for monthly sales by region Aggregation is the addition of different business elements into common total daily product sales plus monthly consulting sales give monthly combined sales amount Details of process available in metadata Data Re-engineering Problem : Data Re-engineering Problem Programming for the unknown Unanticipated values, structures and patterns Programming for noise and uncertainty Conflicting and missing values Programming for productivity and efficiency Changing data values, changing user requirements High volumes, non-linear searches Conventional data transformation methods do not solve the metadata and data value challenges – need data re-engineering Stephen Brown, Vality Corp. Data Re-engineering Process : Data Re-engineering Process Natural Laws of Data Re-engineering : Natural Laws of Data Re-engineering Data has no standard You can’t predict or legislate format or content Data will evolve faster than its capture and storage systems You can’t write rules for what you don’t know and can’t see Instructions for handling data are within the data Don’t trust the metadata, make the data reveal itself Revealed metadata is knowledge about the business Revealed metadata validates warehouse design Revealed metadata supports conversion project management Revealed metadata is insurance against misinformation Buy tool or manually code programs ? : Buy tool or manually code programs ? Slide 29: 3 - DW Tools 1st Generation ETL Job Schedulers RDBMS Utilities Replication/Distribution Tools Universal Repositories Q&R/MQE/MRE MOLAP/ROLAP/LowLAP Data Mining CASE DB Design Repositories DB & System Monitors Technologies Processes EIS Data Visualization Meta Data Browsers Design Mapping Extract Scrub Transform Load Index Aggregation Replication Data Set Distribution Access & Analysis Resource Scheduling & Distribution Meta Data System Monitoring 2nd Generation ETL Suites / Environments Transformation : Transformation Choosing between tool and manually coded programs Time frames - tools take longer select, configure, learn Budgets - short term or long term Size of warehouse - initial project small enough for coding Size and skills of warehouse team Tool automatically generates and maintains metadata Hand Generated Code : Hand Generated Code Upside No learning curve Inherent skills In house capabilities Usually simple No culture change/mandate (CASE) Downside Manual meta data Maintenance challenge when talent level changes No automation Tools : Tools Upside Easy to maintain as talent level changes Automatic meta data May gain efficiencies Integration with repositories Integration with other tools Schedulers Monitors Meta data management Tools : Tools Downside Cost (1st generation tools very high $) Learning curve Enforced culture change Must use tool for all changes Speed, may be slower to implement May require additional resources Slide 34: Copyright © 1997, Enterprise Group, Ltd. Manual Code / 1st Generation ETL Tools Process Extract Program Source OLTP Systems Transform Program File Transfer Program File Load Program Source Mainframe or C/S System Data Warehouse Client/Server System Index Program Aggregation Program External Job Scheduling and Control - External Meta Data Load/Maintenance Slide 35: Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Tools Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse/Mart C/S System Data Warehouse or Data Mart Slide 36: Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Environment Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Request Broker Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse C/S System Enterprise Meta Data Data Mart C/S System Data Mart Data Mart C/S System Data Mart Surf Meta Data Request Resource Schedule Delivery User Process Data Warehouse 1st Generation ETL Tools Hampered by: : 1st Generation ETL Tools Hampered by: High cost (average deal prices in the $250-400k range) Long learning curves Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor) Core capabilities (complex transformations still required manual code) Management requirements (users still had to manage all the programs generated) Performance issues (the resulting programs could not leverage parallelism) Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Transformation engine design Ability to leverage parallel server technology CDC (Change Data Capture, which allows only the new data to be extracted) Incremental aggregation (ability to add CDC incremental data to existing aggregations) Limited or no use of temporary files or data base tables (virtual caching only) Common, open and extensible meta data repository Enterprise scalability Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Common UI (User Interface) across all tools Extensive selection of transformation algorithms Easily extensible scrub and transform algorithm library Extensive heterogeneous source and target support Native OLAP data set target support System monitoring & management Enterprise meta data repository (content, resources, structure, etc.) Transform once, populate many (populate multiple targets with a single transformation output) Important 2nd Generation ETL tool features: : Important 2nd Generation ETL tool features: Integrated enterprise scale scrubbing capabilities Seamless interoperability with external point solution tools Integrated information access, analysis, scheduling and delivery Aggregate aware information request broker (enables virtual data warehouse) Ad hoc aggregation monitoring and management Pipeline parallelism / very high throughput Native drivers (source and target) OLTP <> OLAP : OLTP <> OLAP OLTP normalized OLAP tools must provide multidimensional conceptual view of data ??????Providing OLAP to User Analysts, E.F.Codd redundant data Multidimensional Model : Multidimensional Model Data stored as facts and dimensions Sales Fact Cube