Data Integration chapter outline

Revision 3

Christian Convey

 

  1. Introduction
    1. What is data integration?
    2. Why do people want it?

      i. Integrated data can greatly simplify querying (tan book, 596).
      ii. Integrated data helps reveal new facts, for example in OLAP and data mining applications that benefit from more data than any one system stores.
      iii. Supports organizational cooperation. For example, merged banks can keep their original banks' OLTP systems but gain unified reporting.
      iv. As needs change, an organization may want more information from legacy systems, or integration of their information, but altering or replacing those systems is highly inadvisable. Information integration lets you develop new supplemental systems while the old systems continue to do their jobs and contribute the information they always have.
      v. Data Interchange (TBD: Is this the right place for a discussion on Data Interchange?)

    3. Overview of issues that can make data integration hard

      i. Heterogeneity of data sources
      ii. Availability of data sources
      iii. Dynamicity of individual data sources
      iv. Autonomy of data sources (TBD: What kinds of autonomy exist?)
      v. Correctness of the integrated view of the data
      vi. Query performance

  2. The Technical Problem
    1. Precise statement: Remember Stan's symbolic representation on the whiteboard? Something like that.
    2. What makes it a data management problem?

      i. Affects how organizations structure their information systems
      ii. Deals with information systems issues that arise as organizations and their information systems goals evolve.

  3. What Makes the Problem Hard?
    1. Issues

      i. Heterogeneity of data sources
        1. Kinds of heterogeneity: List the criteria for classifying different kinds of data sources, and give either examples or a full categorization of all noteworthy data sources. (Perhaps in doing so, the following 7 items will be covered automatically.)

          a. The basic semantic problem
          b. Data type differences (tan book, 596)
          c. Value differences (tan book, 597)
          d. Semantic differences (tan book, 597)
          e. Missing values (tan book, 597)
          f. Inconsistent data
          g. Disagreeing values (Harvey, 4)
          h. Intra-systems communications differences (3270 streams, HTTP, CORBA, Java RMI, raw TCP/IP, etc.)
          i. Different performance characteristics for accessing data on different sources.

        2. Problems that result from heterogeneity
          a. Human time required to hand-program a means of extracting data from each source that the system must draw on. This is especially hard if you want to add new sources rapidly.
          b. Selecting a query plan in a mediated system can be very complicated.

      ii. Availability of data sources
      iii. Dynamicity of data sources
        1. Changing data values
        2. Changing schemas, presentation formatting, data models, etc.
      iv. Autonomy of data sources
        1. Unannounced changes to data values
        2. Unannounced changes to data formats / semantics
        3. Volitional uncooperativeness (MS SMB change vs. Samba?)
        4. (Other kinds of autonomy?)

      v. Don't always know when new data sources are available
      vi. Trying to get hard facts from unstructured / semi-structured data
      vii. Freshness of data
      viii. Multi-source chronological consistency
      ix. Query performance
        1. Sometimes you can get the same information from different sources.
        2. Different sources might have different performance characteristics.
      x. Caching of information from various sources.
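
Illustrative sketch for the heterogeneity issues listed under item i above (data type differences, value differences, missing values, disagreeing values). Everything here is an assumption made for illustration: the two sources, their field names, the `Customer` record, and the tolerance are hypothetical and not taken from any system named in this outline. The sketch normalizes each source's record into one common form and reports a disagreement instead of silently picking a value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Customer:
    customer_id: str
    balance_usd: Optional[float]  # None means the source supplied no value


def from_source_a(row: dict) -> Customer:
    # Hypothetical source A: numeric id, balance stored as a dollar string.
    raw = row.get("balance")
    balance = float(raw) if raw not in (None, "") else None
    return Customer(str(row["id"]), balance)


def from_source_b(row: dict) -> Customer:
    # Hypothetical source B: padded string id, balance stored as integer cents.
    cents = row.get("balance_cents")
    balance = cents / 100 if cents is not None else None
    return Customer(row["cust_no"].strip(), balance)


def merge(a: Customer, b: Customer, tolerance: float = 0.01) -> Tuple[Customer, bool]:
    """Combine two views of one customer; return (record, values_disagree)."""
    if a.customer_id != b.customer_id:
        raise ValueError("records describe different customers")
    if a.balance_usd is None or b.balance_usd is None:
        # Missing value: use whichever source has one (possibly neither).
        value = a.balance_usd if a.balance_usd is not None else b.balance_usd
        return Customer(a.customer_id, value), False
    # Disagreeing values: keep source A's value but flag the conflict.
    disagree = abs(a.balance_usd - b.balance_usd) > tolerance
    return Customer(a.customer_id, a.balance_usd), disagree


a = from_source_a({"id": 42, "balance": "10.50"})
b = from_source_b({"cust_no": " 42 ", "balance_cents": 1050})
merged, disagree = merge(a, b)
```

Preferring source A on a conflict is an arbitrary choice in this sketch; a real system would need a stated policy for which source to trust.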

    2. Hardware and resource restrictions
      i. TBD
    3. Research topics
      i. Extraction of information from semi-structured / unstructured text documents
      ii. Automatic generation of wrappers
      iii. Knowledge representation systems for doing semantic matching of data from different sources (?)

  4. Some abstract solutions to the above problems
    1. Commonly mentioned system architectures
      i. Federated Databases
      ii. Data Warehouses
      iii. Mediation Systems
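
Illustrative sketch for the mediation architecture listed above. It assumes a deliberately simple model that is not drawn from any cited system: each source advertises which attributes it can supply, a relative access cost, and whether it is currently available, and the mediator greedily picks the cheapest available source per attribute. The `Source` class, the cost numbers, and the sample data are all hypothetical. Real mediators face a much harder plan-selection problem than this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Source:
    name: str
    attributes: frozenset                       # attributes this source can supply
    cost: float                                 # assumed relative access cost
    available: bool
    fetch: Callable[[str], Dict[str, object]]   # key -> record


def plan(sources: List[Source], wanted: List[str]) -> Dict[str, Source]:
    """For each wanted attribute, choose the cheapest available source."""
    assignment = {}
    for attr in wanted:
        candidates = [s for s in sources if s.available and attr in s.attributes]
        if not candidates:
            raise LookupError(f"no available source supplies {attr!r}")
        assignment[attr] = min(candidates, key=lambda s: s.cost)
    return assignment


def answer(sources: List[Source], key: str, wanted: List[str]) -> Dict[str, object]:
    """Execute the plan and assemble one integrated record."""
    return {attr: src.fetch(key)[attr] for attr, src in plan(sources, wanted).items()}


warehouse = Source("warehouse", frozenset({"name", "balance"}), 1.0, True,
                   lambda key: {"name": "Ann", "balance": 10.0})
legacy = Source("legacy", frozenset({"name", "balance", "branch"}), 5.0, True,
                lambda key: {"name": "Ann", "balance": 10.0, "branch": "Main"})

result = answer([warehouse, legacy], "42", ["name", "branch"])
```

Here `name` is served by the cheaper source and `branch` by the only source that has it; marking a source unavailable makes the plan fall back to the remaining candidates.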

    2. Wrappers (a.k.a. Translators?)
      i. Basic concept
      ii. Ways wrappers / extractors can get data from back-end systems. (Doesn't necessarily address semantic model differences.)

        1. Database servers
          a. Database gateways
        2. Providers of semi-structured / unstructured documents
          a. AI-like systems
            i. Intelligent Agents (?) (Harvey, 3)
          b. Textual pattern matching systems ("Extracting Semistructured Information from the Web", Hammer et al.)
          c. Semistructured query languages: X-Query, WebSQL, etc. (Susac, 2…4)
        3. Terminal-based systems
          a. Screen scraping
          b. Direct access to files maintained by those programs

      iii. Wrappers: Automatic wrapper generation
      iv. Wrappers: Templates for query patterns
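
Illustrative sketch of the basic wrapper concept from this subsection: two back ends of different kinds are exposed through one record-oriented interface. The class names, the regular expression, the sample text, and the table are assumptions for illustration only; this is not how TSIMMIS or any other cited system implements wrappers. One wrapper uses textual pattern matching over semi-structured text, the other reads a relational table through Python's standard `sqlite3` module.

```python
import re
import sqlite3
from typing import Dict, Iterator


class PatternWrapper:
    """Extracts records from semi-structured text with a regular expression."""

    def __init__(self, text: str, pattern: str):
        self._text = text
        self._pattern = re.compile(pattern, re.MULTILINE)

    def records(self) -> Iterator[Dict[str, str]]:
        for match in self._pattern.finditer(self._text):
            yield match.groupdict()  # named groups become field names


class SqlWrapper:
    """Exposes the rows of a relational query through the same interface."""

    def __init__(self, conn: sqlite3.Connection, query: str):
        self._conn = conn
        self._query = query

    def records(self) -> Iterator[Dict[str, str]]:
        cursor = self._conn.execute(self._query)
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            yield {c: str(v) for c, v in zip(columns, row)}


# Hypothetical semi-structured source.
page = "Name: Ann | City: Boston\nName: Bob | City: Providence\n"
text_src = PatternWrapper(page, r"^Name: (?P<name>\w+) \| City: (?P<city>\w+)$")

# Hypothetical database source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
conn.execute("INSERT INTO people VALUES ('Cy', 'Newport')")
db_src = SqlWrapper(conn, "SELECT name, city FROM people")

# A consumer sees uniform records regardless of where they came from.
all_records = [r for src in (text_src, db_src) for r in src.records()]
```

As the outline notes, a wrapper of this kind only unifies access; it does not by itself resolve differences in semantic models between the sources.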

  5. Some systems and how they do/do not make use of the above solutions
    1. Walmart's data warehouse (does it actually do integration?)
    2. AT&T's data warehouse (does it actually do integration?)
    3. TSIMMIS
    4. ARANEUS
    5. TraumaGEN (Harvey)
    6. Netcraft (the survey of which Web servers are used around the Web)
  6. Literature survey/Previous work
    1. TBD
  7. The state of the practice
    1. For each technology covered in detail by this chapter, do products exist that actually use that technology, and if so, what are examples?
    2. What areas of Data Integration is the software development industry focusing on developing products for?
  8. Open questions
    1. TBD as each author delves into his subtopic
  9. Summary and conclusions
    1. TBD

TBD: To be determined