Challenges in Building a Strategic Information Infrastructure:
Laura Haas
Distinguished Engineer and Manager,
Information Integration Architecture
Agenda:
The integration challenge
Two motivating examples
Technologies for information integration
Areas where invention is needed
The Integration Challenge:
Complex and heterogeneous environments
Many different types of systems
Many inter-related applications
Escalating needs
Variety, velocity, volume
People are expensive
The world produces 250 MB of information every year for every man, woman and child on earth.
The Challenge Continued…:
40% of IT budgets may be spent on integration.
30% of people’s time is spent searching for relevant information.
The average billion dollar company has:
48 disparate financial systems
2.7 ERP systems
42% of transactions are still paper-based.
85% of information is unstructured.
79% of companies have more than two repositories and 25% have more than 15.
60%+ of CEOs: need to do a better job capturing and understanding information rapidly in order to make swift business decisions.
Only 1/3 of CFOs believe that the information is easy to use, tailored, cost effective or integrated.
30-50% of application design time is spent on copy management.
[Figure labels: Trx., Documents, Reports, e-Mails, Media, Customers, Employees, Partners, Databases, Orgs., Financials, Products, Web Content]
Sources: IBM & Industry Studies, Customer Interviews, Forrester
Taikang Life Insurance:
Background
4th largest Chinese insurance company
8,000 employees, 150,000 agents
3.5 million customers
28 branches, 170 sub-branches
Technical Challenge
Data in DB2 UDB, Informix, Oracle, SQL Server, XML, e-mail, CRM and Portal applications
Business Challenge
Goals:
Up-to-the-minute status for executives
Increased employee productivity
Better customer service
Taikang Information Platform – Before Integration:
[Figure labels: channels (Phone, Fax, SMS, E-mail, Web, Store Front, Letter); systems (CSC, Personal Life Insurance Systems, Group Life, Banking Insurance, Financials), each with its own Client Data]
Channels not effectively integrated, client data dispersed and not effectively shared
Multiple application systems, multiple application development tools
Taikang Integrated Information Platform Architecture:
[Figure labels: Channels (Phone, Fax, SMS, Email, Web, Store Front, Mail, Agents, Financial Planner); Application Platform; Information Integration Platform (Mapping (nicknames), Integrated Information, Data Service, XML, SQL, Web Services); Core Systems (Group & Banking, CSC, Personal Life, Financials)]
Creating Enterprise Reference Information:
[Figure labels: Web Hierarchy and Sub Category, Marketing Benefits, Cross-Sell & Up-sell, Promo., Price, Sizes, Colors, Images]
Challenges in Integrating Information:
Structured and unstructured data
Diversity of data sources (content repos, pricing application, databases, …)
Coming up with the model of how information fits together
Understanding what info exists
Finding related pieces
Creating a common format
Deciding how to access and transform data
What should be materialized, what accessed in real-time, how maintained
What pre-defined paths, what unplanned (navigation vs. search)
Configuring the appropriate software
Accessing information in the application
Monitoring the system and understanding usage, problems, etc.
Multiple Technologies Are Needed:
Discovery and preparation
Metadata and information registries
Exploration, analysis, cleansing
Transformation
Within and across models (e.g., record -> record, relational -> XML)
Integration
Consolidation
Federation
Connection to the applications
Push and pull
Interfaces appropriate to the tasks
Services
Systems management
Maintenance, monitoring, fault tolerance, …
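The "relational -> XML" transformation mentioned on this slide can be illustrated with a minimal sketch. It is not taken from the talk: the function name `rows_to_xml` and the `row` element name are assumptions made for illustration, and the sample record reuses the ITEMS layout shown later in the deck.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table, columns, rows):
    """Turn relational rows into one XML document (hypothetical helper)."""
    root = ET.Element(table)
    for row in rows:
        record = ET.SubElement(root, "row")  # one element per relational row
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = rows_to_xml(
    "items",
    ["id", "man", "category"],
    [("00001", "SONY", "Television")],
)
print(xml_text)
```

A real transformation layer would also have to handle nesting, type conversion, and names that are not valid XML element names; this sketch covers only the flat case.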
A Platform for Information Integration:
Multiple access paradigms
Multiple integration disciplines
Complementary Information Integration Approaches:
Consolidate (“place”) data for local access
Access performance or availability requirements demand centralized data.
Currency requirements demand point-in-time consistency, e.g. close of business
Complex transformation is required to achieve semantically consistent data
Production applications, data warehouses, operational data stores
Typically managed by ETL (Extract, Transform, and Load) or replication technologies
Federate data for integrated access to distributed sources
Access performance and load on source systems can be traded for overall lower cost implementation
Currency requirements demand a fresh copy of the data
Data security, licensing restrictions, or industry regulations restrict data movement
Combining mixed format data, e.g. customer ODS with related contract documents or images
Query requires real-time data, e.g. stock quote, on-hand inventory
Search provides a third option
Search a local index, create a local result set (hit list)
Distributed access to live data possible through result set links
No single approach is the best for all scenarios
Consolidation prepares the data in advance:
Scripts and hand-written applications
Extract, Transform, Load (ETL) tools
Uses: build warehouses, data marts, operational data stores, …
Typically include Data Flow editor to define jobs, code generation
Libraries of functions for doing transformations
Connectors to many information sources
Replication
Uses: high availability, warehouse maintenance, application integration, …
Includes changed data capture (log readers, triggers, application programs)
Transport/storage for changes
Logic to apply changes to replica
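The three replication pieces listed above (captured changes, a transported change list, and apply logic) can be sketched as follows. This is a toy model under stated assumptions, not a description of any replication product: the `Change` record and `apply_changes` function are hypothetical, and the replica is a plain dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class Change:
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # changed column values (None for delete)

def apply_changes(replica: Dict[int, Dict[str, Any]], changes: List[Change]) -> None:
    """Apply captured changes to the replica in capture order."""
    for change in changes:
        if change.op == "delete":
            replica.pop(change.key, None)
        elif change.op in ("insert", "update"):
            # Merge so that a partial update keeps the untouched columns.
            replica[change.key] = {**replica.get(change.key, {}), **(change.row or {})}
        else:
            raise ValueError(f"unknown operation: {change.op}")

# Replica state before the captured changes arrive (made-up data).
replica = {1: {"name": "Ann", "tier": "silver"}}
change_log = [
    Change("update", 1, {"tier": "gold"}),
    Change("insert", 2, {"name": "Bo", "tier": "silver"}),
    Change("delete", 2),
]
apply_changes(replica, change_log)
print(replica)
```

Applying changes strictly in capture order is the simplest way to keep the replica consistent with the source; real systems add batching, conflict handling, and restart logic on top.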
SQL Federation:
[Figure labels: Database Application, SQL API (JDBC/ODBC), Federated Database Server, Global Catalog, Wrappers, Relational Data Source (TRANSACTIONS), Data Source (ITEMS)]
Request: list the number of TV sales per manufacturer in 2001
Query issued by the database application:
SELECT I.man, count(*)
FROM transactions T,
items I
WHERE I.id=T.item_id
AND I.category='Television'
AND YEAR(T.tran_date)=2001
GROUP BY I.man;
Query against the TRANSACTIONS source:
SELECT tran_date, item_id
FROM transactions T
WHERE YEAR(T.tran_date)=2001
Sample ITEMS records:
00001|SONY|Television|...
00002|RCA|VideoPlayer|..
00004|SONY|DVDPlayer
00003|SONY|VideoRecorder
.......
Desired properties:
Transparency – Heterogeneity – Extensibility
High Function – Autonomy – Performance
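The federated query on this slide can be tried out in miniature. The sketch below is only an analogy: it uses SQLite's ATTACH DATABASE so that two separate in-memory databases stand in for the two remote sources, which is not how a federated server with wrappers works internally. The schema names `sales` and `catalog`, the item `00005`, and all transaction rows are made up for illustration. SQLite has no YEAR() function, so strftime('%Y', ...) is used instead.

```python
import sqlite3

# One connection plays the federated server; each attached database
# plays a separate data source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS catalog")

conn.execute("CREATE TABLE sales.transactions (tran_date TEXT, item_id TEXT)")
conn.execute("CREATE TABLE catalog.items (id TEXT, man TEXT, category TEXT)")

conn.executemany("INSERT INTO catalog.items VALUES (?, ?, ?)", [
    ("00001", "SONY", "Television"),
    ("00002", "RCA", "VideoPlayer"),
    ("00005", "RCA", "Television"),   # hypothetical item, not on the slide
])
conn.executemany("INSERT INTO sales.transactions VALUES (?, ?)", [
    ("2001-03-01", "00001"),
    ("2001-07-15", "00001"),
    ("2000-12-31", "00001"),          # wrong year, filtered out
    ("2001-05-05", "00002"),          # not a television, filtered out
    ("2001-09-09", "00005"),
])

# The application writes one query and does not care where each table lives.
rows = conn.execute("""
    SELECT I.man, COUNT(*)
    FROM sales.transactions T, catalog.items I
    WHERE I.id = T.item_id
      AND I.category = 'Television'
      AND strftime('%Y', T.tran_date) = '2001'
    GROUP BY I.man
    ORDER BY I.man
""").fetchall()
print(rows)
```

The point of the analogy is transparency: the join is expressed once, in SQL, against a single connection, even though the tables are stored separately.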
Search versus Query:
Search
User doesn’t need to know where the information is
User doesn’t need to know structure of information (schema)
User doesn’t need to know precisely how the information is expressed
User can use native language
“Search” typically focuses on returning documents, but could become “information finding”
Query
Information need can be (must be) expressed precisely
Information can be combined and summarized in powerful ways
Both provide great value in integration scenarios
SELECT I.man, count(*)
FROM transactions T, items I
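To make the contrast concrete, here is a minimal sketch of the search side: a local inverted index that returns a hit list of document ids. It assumes simple keyword matching where every term must appear; the document ids, sample texts, and the functions `tokenize`, `build_index`, and `search` are all illustrative, not from the talk.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))

docs = {
    "d1": "Sony television repair manual",
    "d2": "RCA video player warranty",
    "d3": "Sony DVD player warranty terms",
}
index = build_index(docs)
print(search(index, "sony warranty"))
print(search(index, "player warranty"))
```

The user supplies words, not table names or a schema, and gets back documents. Counting or grouping those hits precisely is what the query side adds.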
Invention Needed for Information Integration:
Semantic integration
Metadata
Discovery and Design Tools
Virtualizing large-scale systems
Information integration in grid environments
Data placement
Precise queries over unstructured information
Text analytics
Metadata Landscape (not exhaustive):
Metadata-driven Design Across Integration Disciplines:
[Figure labels: "Build These" and "Using These"; New Business Process, New Integrated View, New DataFlow, Web Service, Legacy and packaged apps, Relational databases, XML documents; WBI, II, ETL]
Integration Tasks:
Find and visualize related information
Connect it together
Generate useful information or artifacts
Remember what you discovered and share it
Clio: Schema Discovery and Mapping for Integration:
Find it: Discovery
Use ontologies and graph algorithms to find similar objects (for mapping, e.g.)
Connect it: Mapping algorithms
Using mapping composition to handle schema evolution
Inverse mapping
Advanced features in mapping semantics
Conditional mapping, “nested” mapping, ETL-like procedural constructs
Round trip support between mappings and generated queries
Mapping-based data lineage in the context of query execution
Generate it: Transformations
XML transformation engine
Schema integration
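The "find it" step can be illustrated with a deliberately naive sketch that proposes column correspondences from name similarity alone. This is not Clio's algorithm (the slide mentions ontologies and graph algorithms); it swaps in Python's difflib string matching, and the function names, threshold, and column names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Ignore case and underscores when comparing column names."""
    return name.replace("_", "").lower()

def similarity(a, b):
    """Similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def propose_mapping(source_cols, target_cols, threshold=0.6):
    """Propose source -> target correspondences that score at least the threshold."""
    mapping = {}
    for source in source_cols:
        best = max(target_cols, key=lambda target: similarity(source, target))
        if similarity(source, best) >= threshold:
            mapping[source] = best
    return mapping

mapping = propose_mapping(["cust_name", "zip"], ["CustName", "PostalCode"])
print(mapping)
```

The sketch pairs `cust_name` with `CustName` but leaves `zip` unmapped, because nothing in the spelling connects it to `PostalCode`. Closing that gap takes semantic knowledge such as an ontology, which is the kind of discovery the slide calls for.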
Grid Computing:
Distributed Computing Over Heterogeneous Resources, Using Open Standards to Provide Virtual Services
[Figure labels: Storage, Applications, Processing, Operating System, Data, I/O]
Information Integration in a Large-Scale Grid:
Dynamic, distributed data access
Directory service and/or p2p discovery protocols
Logical specification of data desired
Handle dynamic arrival and departure of data sources
Automated data transformations without human intervention
Graceful degradation/Fault tolerance
To handle data source failures, missing data sources and performance issues
Use of redundancy to mask failures
Partial-result ("friendly") delivery when failures can't be hidden
Automatic placement of data for performance, scalability, availability
Policy- and workload-driven
Quality of service and data guarantees
Economic model for location, placement?
Defining and working with data quality
What characteristics matter? What’s a “good” answer?
How does quality compose across sources? Across characteristics? For different activities?
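The graceful-degradation bullets above (redundancy to mask failures, partial results when a failure cannot be hidden) can be sketched as follows. This is a toy illustration, not a grid protocol: the function `federated_fetch`, the source names, and the use of ConnectionError to signal an unavailable replica are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[], List[str]]

def federated_fetch(sources: Dict[str, List[Fetch]]) -> Tuple[List[str], List[str]]:
    """Try each replica of each logical source in turn.

    Returns the rows that could be fetched plus the names of the sources
    for which every replica failed, so the caller can report a partial result.
    """
    rows: List[str] = []
    missing: List[str] = []
    for name, replicas in sources.items():
        for fetch in replicas:
            try:
                rows.extend(fetch())
                break                 # first healthy replica masks earlier failures
            except ConnectionError:
                continue              # fall through to the next replica
        else:
            missing.append(name)      # no replica answered: cannot hide the failure
    return rows, missing

def down() -> List[str]:
    raise ConnectionError("source unavailable")

sources = {
    "orders": [down, lambda: ["o1", "o2"]],  # first replica fails, second answers
    "claims": [down, down],                  # every replica fails
}
rows, missing = federated_fetch(sources)
print(rows, missing)
```

Returning the list of missing sources alongside the rows lets the application tell the user that the answer is incomplete, rather than failing outright or silently dropping data.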
Tesla:
Data Placement:
Goal: most critical data accesses are local, subject to constraints on space and other resources
Best blend of federation and consolidation for workload
How: policy-based advice on data caching and data movement, driven by workload
Highest priority applications see the best performance
DB2 II Workload:
Gold customers who bought expensive products
Recent transactions involving expensive products
Gold customers in the 95120 zipcode
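One simple way to think about workload-driven placement under a space constraint is a greedy heuristic: keep locally whatever delivers the most priority-weighted accesses per gigabyte until the space runs out. The sketch below is an illustrative assumption, not the placement advisor described in the talk. The candidate names, sizes, and weights are invented, and a greedy choice is not guaranteed to be optimal.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str
    size_gb: int
    weighted_accesses: int  # access frequency scaled by application priority

def choose_local(candidates: List[Candidate], capacity_gb: int) -> List[str]:
    """Greedily pick data sets to keep local, best value per GB first."""
    ranked = sorted(
        candidates,
        key=lambda c: c.weighted_accesses / c.size_gb,
        reverse=True,
    )
    chosen = []
    free = capacity_gb
    for candidate in ranked:
        if candidate.size_gb <= free:
            chosen.append(candidate.name)
            free -= candidate.size_gb
    return chosen

candidates = [
    Candidate("all_transactions", 40, 1500),
    Candidate("gold_customers", 2, 900),
    Candidate("recent_expensive_transactions", 6, 1200),
]
local = choose_local(candidates, capacity_gb=10)
print(local)
```

Everything not chosen stays at its source and is reached through federation, which is the blend of federation and consolidation the slide describes.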
Extending enterprise search with analytics:
Distinguish between different semantics of the same term
rock (stone), rock (music), rock (to move back and forth)
Search for information about higher level concepts that are not directly expressed in text tokens
named-entities (drug, gene, person)
relationships (inhibits, causes, is CEO of)
Find answers, not just documents
who is the CEO of IBM?
Support advanced applications such as
patent mining
repair record analysis for early detection of problems
drug discovery
Unstructured Information Management Architecture (UIMA):
A framework for integrating advanced text analytics technologies
Natural language, Machine learning, Information Retrieval, Bayesian Statistics
In combination these can give higher-accuracy results than individually
Encouraging reuse and sharing across organizations
UIMA is being developed in collaboration with
The academic community
Government sponsored organizations
Being applied to problems in Life Sciences, Compliance, Finance, etc.
[Figure: analysis stages: Identify Language, Find Words, Find Word Roots, Add Synonyms, Find Parts of Speech, Named-Entity Extraction (drugs, people, etc.), Find Relationships, Index Documents]
A well-disciplined architecture for natural-language content analysis and integration
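The idea of chaining analysis stages, each one adding annotations that later stages build on, can be sketched in a few lines. This is not the UIMA API: the shared `analysis` dictionary, the annotator functions, the crude suffix-stripping rule, and the one-entry drug dictionary are all stand-ins invented for illustration.

```python
import re
from typing import Any, Callable, Dict, List

Analysis = Dict[str, Any]
Annotator = Callable[[Analysis], None]

def find_words(analysis: Analysis) -> None:
    """Stage 1: split the text into words."""
    analysis["words"] = re.findall(r"[A-Za-z]+", analysis["text"])

def find_roots(analysis: Analysis) -> None:
    """Stage 2: crude root finding (strip a trailing 's'); a stand-in for a real stemmer."""
    roots = []
    for word in analysis["words"]:
        lower = word.lower()
        roots.append(lower[:-1] if lower.endswith("s") and len(lower) > 3 else lower)
    analysis["roots"] = roots

DRUG_NAMES = {"aspirin"}  # hypothetical dictionary for named-entity lookup

def find_drugs(analysis: Analysis) -> None:
    """Stage 3: dictionary-based named-entity extraction over the roots."""
    analysis["drugs"] = [root for root in analysis["roots"] if root in DRUG_NAMES]

def run_pipeline(text: str, annotators: List[Annotator]) -> Analysis:
    """Run each annotator in order over one shared analysis structure."""
    analysis: Analysis = {"text": text}
    for annotate in annotators:
        annotate(analysis)
    return analysis

result = run_pipeline("Aspirin inhibits clots", [find_words, find_roots, find_drugs])
print(result["roots"], result["drugs"])
```

Because every stage reads and writes the same structure, annotators can be reordered, replaced, or shared across organizations without rewriting the pipeline, which is the reuse argument the slide makes.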
3 Research Paradigms for Exploring Text with Data:
The SCORE approach:
Automatically determine a ‘context’ for queries
Use to select documents relevant to that query’s results
Request: the 5 poorest performing stocks in my portfolio over the last 6 months
Inferred ‘context’ will potentially fetch analyst reports about their common sector
Business Insights Workbench
Examine characteristics of a collection of business documents
Select a subset for more detailed text mining
Special UIMA annotators extract chemical names from patents
Special chemical converter takes names to SMILES strings
SMILES strings used to retrieve chemical & physical properties
Chemical & Physical properties combined with patent metadata and Patents to augment the warehouse cube.
Competitive positioning analyses can all now be run against this more precise and data rich warehouse.
The AvatarBI approach
Enrich a Business Intelligence cube with ‘qualitative’ information extracted and quantified from business documents
Probabilistic OLAP
From which regions, and for which products, are we getting angry service calls?
[Figure labels: UIMA Annotators, Text Search]
Summary:
Information integration is a challenging task
Structured and unstructured data from many diverse sources
Variety, velocity, volume
Application needs vary widely
Several technologies needed – no silver bullet
Consolidation, federation, search
Metadata, cleansing, transformation
Need a unified framework for their use
Exciting research opportunities
Metadata and tools
Large-scale grids
Text analytics