Data Integration chapter outline

Revision 3

Christian Convey

Introduction

What is data integration?
Why do people want it?

i. Having integrated data can greatly simplify querying efforts (tan book, 596)

ii. Integrated data facilitates the recognition of new facts. For example, OLAP and data mining applications can benefit from more data than is stored in any one system.

iii. Supports organizational cooperation. E.g., merged banks can keep each original bank's OLTP systems but gain unified reporting abilities.

iv. As needs change, an organization may want more information from legacy systems, or integration of their information, even though altering or replacing those systems is highly inadvisable. Information integration lets you develop new supplemental systems while the old systems continue to do their jobs and to contribute the information they always have.

v. Data Interchange (TBD: Is this the right place for a discussion on Data Interchange?)

Overview of issues that can make data integration hard

i. Heterogeneity of data sources

ii. Availability of data sources

iii. Dynamicity of individual data sources

iv. Autonomy of data sources - TBD: What kinds of autonomy exist?

v. Correctness of the integrated view of the data

vi. Query performance

The Technical Problem

Precise statement - Remember Stan's symbolic representation on the whiteboard? Something like that.
What makes it a data management problem?

i. Affects how organizations structure their information systems

ii. Deals with some information systems issues that arise with the evolution of organizations and of their information systems goals.

What Makes the Problem Hard?

Issues

i. Heterogeneity of data sources

1. Kinds of heterogeneity: List the criteria to use for classifying different kinds of data sources, and either give examples or a full categorization of all noteworthy data sources.
(Perhaps in doing so, the following 7 items will be covered automatically)

a. The basic semantic problem

b. Data type differences (tan book, 596)

c. Value differences (tan book, 597)

d. Semantic differences (tan book, 597)

e. Missing values (tan book, 597)

f. Inconsistent data

g. Disagreeing values (Harvey, 4)

h. Inter-system communication differences (3270 streams, HTTP, CORBA, Java RMI, raw TCP/IP, etc.)

i. Different performance characteristics for accessing data on different sources.

2. Problems that result from heterogeneity

a. Human time required to hand-program a means of extracting data from each different source that the system must get information from. Especially hard if you want to rapidly add new sources.

b. Selection of a query plan in a mediated system can be really complicated.

ii. Availability of data sources

iii. Dynamicity of data sources

1. Changing data values

2. Changing schemas, presentation formatting, data models, etc.

iv. Autonomy of data sources

1. Unannounced changes to data values

2. Unannounced changes to data formats / semantics

3. Volitional uncooperativeness (MS SMB change vs. Samba?)

4. (Other kinds of autonomy?)

v. Don't always know when new data sources are available

vi. Trying to get hard facts from unstructured / semi-structured data

vii. Freshness of data

viii. Multi-source chronological consistency

ix. Query performance

1. Sometimes you can get the same information from different sources.

2. Different sources might have different performance characteristics.

x. Caching of information from various sources.
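
Several of the heterogeneity items listed above (data type differences, value differences, missing values, disagreeing values) can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the two source layouts, the field names, the "C-" id prefix, and the use of -1 as a "missing" marker are hypothetical and do not come from any system discussed in this chapter.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional, Tuple


@dataclass
class Customer:
    """Common (integrated) representation of a customer record."""
    customer_id: str
    balance_cents: Optional[int]  # None means the source had no value


def from_source_a(row: dict) -> Customer:
    # Hypothetical source A: integer id, balance as a dollar string ("12.50").
    raw = row.get("balance")
    # int() truncates any sub-cent fraction.
    cents = None if raw in (None, "") else int(Decimal(raw) * 100)
    return Customer(str(row["id"]), cents)


def from_source_b(row: dict) -> Customer:
    # Hypothetical source B: id like "C-42", balance in integer cents,
    # with -1 standing for "unknown" (a missing value in disguise).
    cents = row["bal_cents"]
    return Customer(row["cust_no"].removeprefix("C-"), None if cents == -1 else cents)


def merge(a: Customer, b: Customer) -> Tuple[Customer, List[str]]:
    """Merge two views of one customer and report disagreeing fields."""
    if a.customer_id != b.customer_id:
        raise ValueError("records describe different customers")
    conflicts = []
    if None not in (a.balance_cents, b.balance_cents) and a.balance_cents != b.balance_cents:
        conflicts.append("balance_cents")
    # Prefer source A's value when present; fall back to B to fill a gap.
    balance = a.balance_cents if a.balance_cents is not None else b.balance_cents
    return Customer(a.customer_id, balance), conflicts
```

Preferring source A on a conflict is an arbitrary choice made here; a real system would need a policy for disagreeing values, which is one of the issues this section is meant to discuss.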

Hardware and resource restrictions

i. TBD

Research topics

i. Extraction of information from semi-structured / unstructured text documents

ii. Automatic generation of wrappers

iii. Knowledge representation systems for doing semantic matching of data from different sources (?)

Some abstract solutions to the above problems

Commonly mentioned system architectures

i. Federated Databases

ii. Data Warehouses

iii. Mediation Systems
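
As a rough illustration of how two of these architectures differ, the sketch below contrasts a warehouse, which copies source data ahead of time, with a mediator, which consults the sources at query time. It is a deliberately simplified model under stated assumptions: sources are modeled as plain Python callables, the class and method names are hypothetical, and federated databases are not covered.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Source = Callable[[], List[Row]]   # calling a source returns its current rows
Predicate = Callable[[Row], bool]


class Warehouse:
    """Materializes source data in a local store; queries never touch the sources."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources
        self._store: List[Row] = []

    def load(self) -> None:
        # Copy every row from every source into the local store.
        self._store = [dict(row) for fetch in self._sources.values() for row in fetch()]

    def query(self, predicate: Predicate) -> List[Row]:
        return [row for row in self._store if predicate(row)]


class Mediator:
    """Keeps no copy of the data; each query is answered from the live sources."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources

    def query(self, predicate: Predicate) -> List[Row]:
        return [dict(row) for fetch in self._sources.values()
                for row in fetch() if predicate(row)]
```

In this model, a row added to a source after `Warehouse.load()` is visible to the mediator but not to the warehouse until the next load. That is the freshness trade-off in miniature. The warehouse, in turn, can still answer queries while a source is unavailable.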

Wrappers (a.k.a. Translators?)

i. Basic concept

ii. Ways Wrappers / Extractors can get data from back end systems. (Doesn't necessarily address semantic model differences.)

1. Database servers

a. Database gateways

2. Providers of semi-structured / unstructured documents

a. AI-like systems

i. Intelligent Agents (?) (Harvey, 3)

b. Textual pattern matching systems ("Extracting Semistructured Information from the Web", Hammer et al.)

c. Semistructured query languages: X-Query, WebSQL, etc. (Susac, 2…4)

3. Terminal-based systems

a. Screen scraping

b. Direct access to files maintained by those programs

iii. Wrappers: Automatic wrapper generation

iv. Wrappers: Templates for query patterns
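
The basic wrapper concept can be sketched as a common interface with one implementation per kind of back end. The example below is a minimal sketch, not the design of any system named in this outline: the `Wrapper` interface, the two classes, the line format matched by the regular expression, and the column names are all hypothetical. One wrapper uses textual pattern matching over a semi-structured document; the other renames columns from a table-like source, standing in for a database gateway.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class Wrapper(ABC):
    """Presents one back end's data as records in a common format."""

    @abstractmethod
    def records(self) -> Iterator[dict]:
        ...


class PatternWrapper(Wrapper):
    """Extracts records from text lines shaped like 'Boston: 12 C' (assumed format)."""

    LINE = re.compile(r"^\s*(?P<city>[A-Za-z ]+?)\s*:\s*(?P<temp_c>-?\d+)\s*C\s*$")

    def __init__(self, text: str) -> None:
        self.text = text

    def records(self) -> Iterator[dict]:
        for line in self.text.splitlines():
            match = self.LINE.match(line)
            if match:  # lines that do not fit the pattern are skipped
                yield {"city": match.group("city"), "temp_c": int(match.group("temp_c"))}


class TableWrapper(Wrapper):
    """Renames a table-like source's columns to the common field names."""

    def __init__(self, rows: List[dict], column_map: Dict[str, str]) -> None:
        self.rows = rows
        self.column_map = column_map  # local column name -> common field name

    def records(self) -> Iterator[dict]:
        for row in self.rows:
            yield {common: row[local] for local, common in self.column_map.items()}
```

A consumer can iterate over `records()` from either wrapper without knowing where the data came from. As noted above, this addresses access and format only, not differences in the sources' semantic models.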

Some systems and how they do/do not make use of the above solutions

Walmart's data warehouse (does it actually do integration?)
AT&T's data warehouse (does it actually do integration?)
TSIMMIS
ARANEUS
TraumaGEN (Harvey)
Netcraft (The survey of what Web servers are used around the Web.)

Literature survey/Previous work

The state of the practice

For each technology covered in detail by this chapter, do products exist that actually use that technology, and if so, what are examples?
What areas of Data Integration is the software development industry focusing on developing products for?

Open questions

TBD as each author delves into his subtopic

Summary and conclusions