data warehouse

Views:
 
     
 

Presentation Description

No description available.

Comments

By: kittu505 (14 month(s) ago)

i nee this one

By: tushki.kuku (14 month(s) ago)

nice work

By: sareddyrajgopal (15 month(s) ago)

Hi i need this presentaion give download permission

By: zeqiraj (15 month(s) ago)

shume e mir veq suksese

By: rmonu007 (19 month(s) ago)

how can i download this presentation

See all

Presentation Transcript

Data Warehouse : 

Data Warehouse DATA TRANSFORMATION

Extract Transform Insert : 

Extract Transform Insert Extract data from operational system, transform and insert into data warehouse Why ETI? Will your warehouse produce correct information with the current data? How ho w can I ensure warehouse credibility?

Excuses for NOT Transforming Legacy Data : 

Excuses for NOT Transforming Legacy Data Old data works fine, new will work as well. Data will be fixed at point of entry through GUI. If needed, data will be cleaned after new system populated; After proof-of-concept pilot. Keys join the data most of the time. Users will not agree to modifying or standardizing their data.

Levels of Migration Problem : 

Levels of Migration Problem Existing metadata is insufficient and unreliable Metadata must hold for all occurrences Metadata must represent business and technical attributes Data values incorrectly typed and accessible Values form extracted from storage Values meaning inferred from its content Entity keys unreliable or unavailable Inferred from related values

Metadata Challenge : 

Metadata Challenge Metadata gets out of synch with details it summarizes Business grows faster than systems designed to capture business info Not at the right level of detail Multiple values in a single field Multiple meanings to a single field No fixed format for value Expressed in awkward of limited terms Program/compiler view rather than business view

Character-level Challenge : 

Character-level Challenge Value instance level Spelling, aliases Abbreviations, truncations, transpositions Inconsistent storage formats Named type level Multiple meanings, contextual meanings Synonyms, homonyms Entity level No common keys or representation No integrated view across records, files, systems

Some Data Quality examples : 

Some Data Quality examples The magic shrinking vendor file 127 ways to spell... Data surprises in individual fields Cowbirds and Data Fields Magic numbers and embedded intelligence

The Magic Shrinking Vendor File : 

The Magic Shrinking Vendor File A Medical claims processor was having trouble with their Insurance Vendor file. They thought they had 300,000 Insurance Vendors. When they cleaned up their data, they discovered they had only 27,000 unique Insurance Vendors.

127 ways to spell... : 

127 ways to spell... Have over 127 different ways to spell AT&T Have over 1000 ways to spell duPont

Slide 10: 

NAME SOC. SEC. # TELEPHONE Data surprises in individual fields Source: Vality

Slide 11: 

NAME SOC. SEC. # TELEPHONE Meta Data surprises in individual fields Source: Vality

Slide 12: 

NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K Meta Actual Data Values Data surprises in individual fields Source: Vality

Slide 13: 

NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 Meta Actual Data Values Data surprises in individual fields Source: Vality

Slide 14: 

NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K 228-02-1975 999999999 025-37-1888 34-2671434 101010101 LN#12-756 18-7534216 111111111 6173380300 3380321 415-392-2000 508-466-1200 212-235-1000 FAX 528-9825 5436 Meta Actual Data Values Data surprises in individual fields Source: Vality

Cowbirds and Data Fields : 

Cowbirds and Data Fields Cowbirds lay their eggs in other birds nets Users use data fields that are not used for other purposes

Magic Numbers and Embedded Intelligence : 

Magic Numbers and Embedded Intelligence Customer Number = XXXX-YY-ZZ XXXX = 1st 4 Positions of Zip Code If YY = 00-70 Then Cust = Pharmacy If YY = 80-89 Then Cust = Hospital Except if YY = 82 and ZZ = ** Which Means...

Orr's Laws of Data Quality : 

Orr's Laws of Data Quality Law #1 - “Data that is not used cannot be correct!” Law #2 - “Data quality is a function of its use, not its collection!” Law #3 - “Data will be no better than its most stringent use!” Law #4 - “Data quality problems increase with the age of the system!” Law #5 - “Data quality laws apply equally to meta-data!” Law #6 - The less likely something is to occur, the more traumatic it will be when it happens!

Legacy Data Contaminants Found in Migrations : 

Legacy Data Contaminants Found in Migrations Lack of standards Data surprises in individual fields Legacy information buried in free form fields Legacy myopia – multiple account numbers block consolidated view Anomaly nightmare – complex matching and consolidation

4 Fundamental Types of Transformation : 

4 Fundamental Types of Transformation Simple Transformation Fundamental building blocks of all data transformations One field at a time Cleansing and Scrubbing Ensure consistent formatting and usage of field or related group of fileds Checks valid values

4 Fundamental Types of Transformation (con’t) : 

4 Fundamental Types of Transformation (con’t) Integration Takes operational data from one or more sources and maps it, field by field to new data structure Aggregation and Summarization Remove low level of detail Data for data mart

Simple Transformation : 

Simple Transformation Convert data element from one type to another semantic value same rename elements Date time conversion standard warehouse format Decode encoded fields M F vs C S MM

Cleansing and Scrubbing : 

Cleansing and Scrubbing Actual content examined Range checking, enumerated lists, dependency checking Uniform representation for dw address information parse to components

Integration : 

Integration Simple field level mappings -80-90% Complex integration No common identifier probable matches 2-stage process, isolation/reconciliation Multiple sources for same target element contradictory Missing data Derived/calculated data redundant?

Aggregation and Summarization : 

Aggregation and Summarization Summarization is the addition of like values along one or more business dimensions add daily sales by stores for monthly sales by region Aggregation is the addition of different business elements into common total daily product sales plus monthly consulting sales give monthly combined sales amount Details of process available in metadata

Data Re-engineering Problem : 

Data Re-engineering Problem Programming for the unknown Unanticipated values, structures and patterns Programming for noise and uncertainty Conflicting and missing values Programming for productivity and efficiency Changing data values, changing user requirements High volumes, non-linear searches Conventional data transformation methods do not solve the metadata and data value challenges – need data re-engineering Stephen Brown, Vality Corp.

Data Re-engineering Process : 

Data Re-engineering Process

Natural Laws of Data Re-engineering : 

Natural Laws of Data Re-engineering Data has no standard You can’t predict or legislate format or content Data will evolve faster than its capture and storage systems You can’t write rules for what you don’t know and can’t see Instructions for handling data are within the data Don’t trust the metadata, make the data reveal itself Revealed metadata is knowledge about the business Revealed metadata validates warehouse design Revealed metadata supports conversion project management Revealed metadata is insurance against misinformation

Buy tool or manually code programs ? : 

Buy tool or manually code programs ?

Slide 29: 

3 - DW Tools 1st Generation ETL Job Schedulers RDBMS Utilities Replication/Distribution Tools Universal Repositories Q&R/MQE/MRE MOLAP/ROLAP/LowLAP Data Mining CASE DB Design Repositories DB & System Monitors Technologies Processes EIS Data Visualization Meta Data Browsers Design Mapping Extract Scrub Transform Load Index Aggregation Replication Data Set Distribution Access & Analysis Resource Scheduling & Distribution Meta Data System Monitoring 2nd Generation ETL Suites / Environments

Transformation : 

Transformation Choosing between tool and manually coded programs Time frames - tools take longer select, configure, learn Budgets - short term or long term Size of warehouse - initial project small enough for coding Size and skills of warehouse team Tool automatically generates and maintains metadata

Hand Generated Code : 

Hand Generated Code Upside No learning curve Inherent skills In house capabilities Usually simple No culture change/mandate (CASE) Downside Manual meta data Maintenance challenge when talent level changes No automation

Tools : 

Tools Upside Easy to maintain as talent level changes Automatic meta data May gain efficiencies Integration with repositories Integration with other tools Schedulers Monitors Meta data management

Tools : 

Tools Downside Cost (1st generation tools very high $) Learning curve Enforced culture change Must use tool for all changes Speed, may be slower to implement May require additional resources

Slide 34: 

Copyright © 1997, Enterprise Group, Ltd. Manual Code / 1st Generation ETL Tools Process Extract Program Source OLTP Systems Transform Program File Transfer Program File Load Program Source Mainframe or C/S System Data Warehouse Client/Server System Index Program Aggregation Program External Job Scheduling and Control - External Meta Data Load/Maintenance

Slide 35: 

Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Tools Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse/Mart C/S System Data Warehouse or Data Mart

Slide 36: 

Copyright © 1997, Enterprise Group, Ltd. 2nd Generation ETL Environment Process Source OLTP Systems Transformation Engine Monitoring Scheduling Extraction Scrubbing Transformation Load Index Aggregation Meta Data Load Meta Data Maint. Request Broker Caching Source Mainframe or C/S System Transformation Engine C/S System Data Warehouse C/S System Enterprise Meta Data Data Mart C/S System Data Mart Data Mart C/S System Data Mart Surf Meta Data Request Resource Schedule Delivery User Process Data Warehouse

1st Generation ETL Tools Hampered by: : 

1st Generation ETL Tools Hampered by: High cost (average deal prices in the $250-400k range) Long learning curves Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor) Core capabilities (complex transformations still required manual code) Management requirements (users still had to manage all the programs generated) Performance issues (the resulting programs could not leverage parallelism)

Important 2nd Generation ETL tool features: : 

Important 2nd Generation ETL tool features: Transformation engine design Ability to leverage parallel server technology CDC (Change Data Capture, which allows only the new data to be extracted) Incremental aggregation (ability to add CDC incremental data to existing aggregations) Limited or no use of temporary files or data base tables (virtual caching only) Common, open and extensible meta data repository Enterprise scalability

Important 2nd Generation ETL tool features: : 

Important 2nd Generation ETL tool features: Common UI (User Interface) across all tools Extensive selection of transformation algorithms Easily extensible scrub and transform algorithm library Extensive heterogeneous source and target support Native OLAP data set target support System monitoring & management Enterprise meta data repository (content, resources, structure, etc.) Transform once, populate many (populate multiple targets with a single transformation output)

Important 2nd Generation ETL tool features: : 

Important 2nd Generation ETL tool features: Integrated enterprise scale scrubbing capabilities Seamless interoperability with external point solution tools Integrated information access, analysis, scheduling and delivery Aggregate aware information request broker (enables virtual data warehouse) Ad hoc aggregation monitoring and management Pipeline parallelism / very high throughput Native drivers (source and target)

OLTP <> OLAP : 

OLTP <> OLAP OLTP normalized OLAP tools must provide multidimensional conceptual view of data ??????Providing OLAP to User Analysts, E.F.Codd redundant data

Multidimensional Model : 

Multidimensional Model Data stored as facts and dimensions Sales Fact Cube