dataware house

Views:
 
Category: Education
     
 

Presentation Description

bahut achha

Comments

Presentation Transcript

What is a Data Warehouse? :

What is a Data Warehouse? Shipra Varshney Lecture –MBA

What Is a Data Warehouse?:

What Is a Data Warehouse? Nobody can agree So I’m not actually going to define a DW Don’t feel cheated, though By the end of this talk, you’ll Understand key concepts that underlie all warehouse implementations (“talk the talk”) Understand the various components out of which DW architects construct real-world data warehouses Understand what a data warehouse project looks like

Why Are Schools Setting Up Data Warehouses?:

Why Are Schools Setting Up Data Warehouses? A data warehouse makes it easier to: Optimize classroom, computer lab usage Refine admissions ratings systems Forecast future demand for courses, majors Tie private spreadsheet data into central repositories Correlate admissions and IR data with outcomes such as: GPAs Placement rates Happiness, as measured by alumni surveys Notify advisors when extra help may be needed based on Admissions data (student vitals; SAT, etc.) Special events: A-student suddenly gets a C in his/her major Slower trends: Student’s GPA falls for > 2 semesters/terms (Many other examples could be given!) Better information = better decisions Better admission decisions Better retention rates More effective fund raising, etc.

Talking The Talk:

Talking The Talk To think and communicate usefully about data warehouses you’ll need to understand a set of common terms and concepts: OLTP ODS OLAP, ROLAP, MOLAP ETL Star schema Conformed dimension Data mart Cube Metadata Even if you’re not an IT person, pay heed: You’ll have to communicate with IT people More importantly: Evidence shows that IT will only build a successful warehouse if you are intimately involved!

OLTP:

OLTP OLTP = online transaction processing The process of moving data around to handle day-to-day affairs Scheduling classes Registering students Tracking benefits Recording payments, etc. Systems supporting this kind of activity are called transactional systems

Transactional Systems:

Transactional Systems Transactional systems are optimized primarily for the here and now Can support many simultaneous users Can support heavy read/write access Allow for constant change Are big, ugly, and often don’t give people the data they want As a result a lot of data ends up in shadow databases Some ends up locked away in private spreadsheets Transactional systems don’t record all previous data states Lots of data gets thrown away or archived, e.g.: Admissions data Enrollment data Asset tracking data (“How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?”)

Simple Transactional Database:

Simple Transactional Database Map of Microsoft Windows Update Service (WUS) back-end database Diagrammed using Sybase PowerDesigner Each green box is a database “table” Arrows are “joins” or foreign keys This is simple for an OLTP back end

More Complex Example:

More Complex Example Recruitment Plus back-end database Used by many admissions offices Note again: Green boxes are tables Lines are foreign key relationships Purple boxes are views Considerable expertise is required to report off this database! Imagine what it’s like for even more complex systems Colleague SCT Banner (over 4,000 tables)

ODS:

ODS ODS = operational data store ODSs were an early workaround to the “reporting problem” To create an ODS you Build a separate/simplified version of an OLTP system Periodically copy data into it from the live OLTP system Hook it to operational reporting tools An ODS can be an integration point or real-time “reporting database” for an operational system It’s not enough for full enterprise-level, cross-database analytical processing

OLAP:

OLAP OLAP = online analytical processing OLAP is the process of creating and summarizing historical, multidimensional data To help users understand the data better Provide a basis for informed decisions Allow users to manipulate and explore data themselves, easily and intuitively More than just “reporting” Reporting is just one (static) product of OLAP

OLAP Support Databases:

OLAP Support Databases OLAP systems require support databases These databases typically Support fewer simultaneous users than OL T P back ends Are structured simply; i.e., denormalized Can grow large Hold snapshots of data in OLTP systems Provide history/time depth to our analyses Are optimized for read (not write) access Updated via periodic batch (e.g., nightly) ETL processes

ETL Processes:

ETL Processes ETL = extract, transform, load Extract data from various sources Transform and clean the data from those sources Load the data into databases used for analysis and reporting ETL processes are coded in various ways By hand in SQL, UniBASIC, etc. Using more general programming languages In semi-automated fashion using specialized ETL tools like Cognos Decision Stream Most institutions do hand ETL; but note well: Hand ETL is slow Requires specialized knowledge Becomes extremely difficult to maintain as code accumulates and databases/personnel change!

Where Does the Data Go?:

Where Does the Data Go? What sort of a database do the ETL processes dump data into? Typically, into very simple table structures These table structures are: Denormalized Minimally branched/hierarchized Structured into star schemas

So What Are Star Schemas?:

So What Are Star Schemas? Star schemas are collections of data arranged into star-like patterns They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs) Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors) For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term) Star schemas should Have descriptive column/field labels Be easy for users to understand Perform well on queries

A Very Simple Star Schema:

A Very Simple Star Schema Data Center UPS Power Output Dimensions: Phase Time Date Facts: Volts Amps Etc.

A More Complex Star Schema:

A More Complex Star Schema Freshman survey data (HERI/CIRP) Dimensions: Questions Survey years Data about test takers Facts: Answer (text) Answer (raw) Count (1) Oops Not a star Snowflaked! Oops, answers should have been placed in their own dimension (creating a “factless fact table”). I’ll demo a better version of this star later!

Data Marts:

Data Marts One definition: One or more star schemas that present data on a single or related set of business processes Data marts should not be built in isolation They need to be connected via dimensional tables that are The same or subsets of each other Hierarchized the same way internally So, e.g., if I construct data marts for… GPA trends, student major trends, enrollments Freshman survey data, senior survey data, etc. …I connect these marts via a conformed student dimension Makes correlation of data across star schemas intuitive Makes it easier for OLAP tools to use the data Allows nonspecialists to do much of the work

Simple Data Mart Example:

Simple Data Mart Example UPS Battery star By battery Run-time % charged Current Input star By phase Voltage Current Output star By phase Voltage Current Sensor star By sensor Temp Humidity Note conformed date, time dimensions!

ROLAP, MOLAP:

ROLAP, MOLAP ROLAP = OLAP via direct relational query E.g., against a (materialized) view Against star schemas in a warehouse MOLAP = OLAP via multidimensional database (MDB) MDB is a special kind of database Treats data kind of like a big, fast spreadsheet MDBs typically draw data in from a data warehouse Built to work best with star schemas

Metadata:

Metadata Metadata = data about data In a data warehousing context it can mean many things Information on data in source OLTP systems Information on ETL jobs and what they do to the data Information on data in marts/star schemas Documentation in OLAP tools on the data they manipulate Many institutions make metadata available via data malls or warehouse portals, e.g.: University of New Mexico UC Davis Rensselear Polytechnic Institute University of Illinois Good ETL tools automate the setup of malls/portals!

The Data Warehouse:

The Data Warehouse OK now we’re experts in terms like OLTP, OLAP, star schema, metadata, etc. Let’s use some of these terms to describe how a DW works: Provides ample metadata – data about the data Utilizes easy-to-understand column/field names Feeds multidimensional databases (MDBs) Is updated via periodic (mainly nightly) ETL jobs Presents data in a simplified, denormalized form Utilizes star-like fact/dimension table schemas Encompasses multiple, smaller data “marts” Supports OLAP tools (Access/Excel, Safari, Cognos BI) Derives data from (multiple) back-end OLTP systems Houses historical data, and can grow very big

A Data Warehouse is Not…:

A Data Warehouse is Not… Vendor and consultant proclamations aside, a data warehouse is not: A project With a specific end date A product you buy from a vendor Like an ODS (such as SCT’s) A canned “warehouse” supplied by iStrategy Cognos ReportNet A database schema or instance Like Oracle SQL Server A cut-down version of your live transactional database

Kimball & Caserta’s Definition:

Kimball & Caserta’s Definition According to Ralph Kimball and Joe Caserta, a data warehouse is: A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making. Another def.: The union of all the enterprise’s data marts Aside: The Kimball model is not without some critics: E.g., Bill Inmon

Example Data Warehouse (1):

Example Data Warehouse (1) This one is RPI’s 5 parts: Sources ETL stuff DW proper Cubes etc. OLAP apps

Implementing a Data Warehouse:

Implementing a Data Warehouse In many organizations IT people want to huddle and work out a warehousing plan, but in fact The purpose of a DW is decision support The primary audience of a DW is therefore College decision makers It is College decision makers therefore who must determine Scope Priority Resources Decision makers can’t make these determinations without an understanding of data warehouses It is therefore imperative that key decision makers first be educated about data warehouses Once this occurs, it is possible to Elicit requirements (a critical step that’s often skipped) Determine priorities/scope Formulate a budget Create a plan and timeline, with real milestones and deliverables!

What Takes Up the Most Time?:

What Takes Up the Most Time? You may be surprised to learn what DW step takes the most time Try guessing which: Hardware Physical database setup Database design ETL OLAP setup Acc. to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%. The most often underestimated part of the warehouse project!

Conclusion:

Conclusion Information is held in transactional systems But transactional systems are complex They don’t talk to each other well; each is a silo They require specially trained people to report off of For normal people to explore institutional data, data in transactional systems needs to be Renormalized as star schemas Moved to a system optimized for analysis Merged into a unified whole in a data warehouse Note: This process must be led by “customers” Yes, IT people must build the infrastructure But IT people aren’t the main customers So who are the customers? Admissions officers trying to make good admission decisions Student counselors trying to find/help students at risk Development offers raising funds that support the College Alumni affairs people trying to manage volunteers Faculty deans trying to right-size departments IT people managing software/hardware assets, etc….