dw olap

Uploaded from authorPOINT
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

By: atabari (43 month(s) ago)

Dear Sir, . I'm doing a research on the structure of OLAP systems . So I want to know if I can have a copy of your useful presentation about OLAP . My email is : ali.tabari_univer@yahoo.com Thank you in advance.

Presentation Transcript

Data Warehousing andOLAP: 

Data Warehousing and OLAP Hector Garcia-Molina Stanford University

Warehousing: 

Warehousing Growing industry: $8 billion in 1998 Range from desktop to huge: Walmart: 900-CPU, 2,700 disk, 23TB Teradata system Lots of buzzwords, hype slice andamp; dice, rollup, MOLAP, pivot, ...

Outline: 

Outline What is a data warehouse? Why a warehouse? Models andamp; operations Implementing a warehouse Future directions

What is a Warehouse?: 

What is a Warehouse? Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile

What is a Warehouse?: 

What is a Warehouse? Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse

Warehouse Architecture: 

Warehouse Architecture Metadata

Why a Warehouse?: 

Why a Warehouse? Two Approaches: Query-Driven (Lazy) Warehouse (Eager)

Query-Driven Approach: 

Query-Driven Approach

Advantages of Warehousing: 

Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information

Advantages of Query-Driven: 

Advantages of Query-Driven No need to copy data less storage no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources

OLTP vs. OLAP: 

OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse

OLTP vs. OLAP: 

OLTP vs. OLAP Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users OLTP OLAP

Data Marts: 

Data Marts Smaller warehouses Spans part of organization e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems?

Warehouse Models & Operators: 

Warehouse Models andamp; Operators Data Models relations stars andamp; snowflakes cubes Operators slice andamp; dice roll-up, drill down pivoting other

Star: 

Star

Star Schema: 

Star Schema

Terms: 

Terms Fact table Dimension tables Measures

Dimension Hierarchies: 

Dimension Hierarchies store sType city region è snowflake schema è constellations

Cube: 

Cube Fact table view: Multi-dimensional cube: dimensions = 2

3-D Cube: 

3-D Cube dimensions = 3 Multi-dimensional cube: Fact table view:

ROLAP vs. MOLAP: 

ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates: 

Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

Aggregates: 

Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example: 

Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup

Aggregates: 

Aggregates Operators: sum, count, max, min, median, ave 'Having' clause Using dimension hierarchy average by region (within store) maximum by month (within date)

Cube Aggregation: 

Cube Aggregation day 2 day 1 129 . . . Example: computing sums

Cube Operators: 

Cube Operators day 2 day 1 129 . . . sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

Extended Cube: 

Extended Cube day 2 day 1 * sale(*,p2,*)

Aggregation Using Hierarchies: 

Aggregation Using Hierarchies customer region country (customer c1 in Region A; customers c2, c3 in Region B)

Pivoting: 

Pivoting Multi-dimensional cube: Fact table view:

Implementing a Warehouse: 

Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing, ... Managing: Metadata, Design, ...

Monitoring: 

Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh

Monitoring Techniques: 

Monitoring Techniques Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring è Advantages andamp; Disadvantages!!

Monitoring Issues: 

Monitoring Issues Frequency periodic: daily, weekly, … triggered: on 'big' change, lots of changes, ... Data transformation convert data to uniform format remove andamp; add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways

Integration: 

Integration Data Cleaning Data Loading Derived Data

Data Cleaning: 

Data Cleaning Migration (e.g., yen ð dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules andamp; relationships (like data mining)

Loading Data: 

Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load

Derived Data: 

Derived Data Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh

Materialized Views: 

Materialized Views Define new warehouse relations using SQL expressions

Processing: 

Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms

ROLAP Server: 

ROLAP Server Relational OLAP Server tools Special indices, tuning; Schema is 'denormalized'

MOLAP Server: 

MOLAP Server Multi-Dimensional OLAP Server multi-dimensional server M.D. tools could also sit on relational DBMS

Index Structures: 

Index Structures Traditional Access Methods B-trees, hash tables, R-trees, grids, … Popular in Warehouses inverted lists bit map indexes join indexes text indexes

Inverted Lists: 

Inverted Lists . . . age index inverted lists data records

Using Inverted Lists: 

Using Inverted Lists Query: Get people with age = 20 and name = 'fred' List for age = 20: r4, r18, r34, r35 List for name = 'fred': r18, r52 Answer is intersection: r18

Bit Maps: 

Bit Maps . . . age index bit maps data records

Using Bit Maps: 

Using Bit Maps Query: Get people with age = 20 and name = 'fred' List for age = 20: 1101100000 List for name = 'fred': 0100000001 Answer is intersection: 010000000000 Good if domain cardinality small Bit vectors can be compressed

Join: 

Join 'Combine' SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT

Join Indexes: 

Join Indexes join index

What to Materialize?: 

What to Materialize? Store in warehouse results useful for common queries Example: day 2 day 1 129 . . . total sales materialize

Materialization Factors: 

Materialization Factors Type/frequency of queries Query response time Storage cost Update cost

Cube Aggregates Lattice: 

Cube Aggregates Lattice city, product, date city, product city, date product, date city product date all 129 use greedy algorithm to decide what to materialize

Dimension Hierarchies: 

Dimension Hierarchies all state city

Dimension Hierarchies: 

Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

Interesting Hierarchy: 

Interesting Hierarchy all years quarters months days weeks conceptual dimension table

Design: 

Design What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?

Tools: 

Tools Development design andamp; edit: schemas, views, scripts, rules, queries, reports Planning andamp; Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System andamp; Network Management measure traffic (sources, warehouse, clients) Workflow Management 'reliable scripts' for cleaning andamp; analyzing data

Current State of Industry: 

Current State of Industry Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage andamp; update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything