Revision 3
Christian Convey
i. Having integrated data can greatly simplify querying efforts (tan book, 596)
ii. Integrated data facilitates the recognition of new facts. For example, OLAP and Data Mining applications can benefit from more data than is stored in any one system.
iii. Supports organizational cooperation. E.g., merged banks get to keep their original banks' OLTP systems but gain unified reporting abilities.
iv. As needs change, an organization may want more information from legacy systems, or integration of information across them, but altering or replacing those systems is highly inadvisable. Information Integration lets you develop new supplemental systems while the old systems continue to do their jobs and to contribute the information they always have.
v. Data Interchange (TBD: Is this the right place for a discussion on Data Interchange?)
i. Heterogeneity of data sources
ii. Availability of data sources
iii. Dynamicity of individual data sources
iv. Autonomy of data sources - TBD: What kinds of autonomy exist?
v. Correctness of the integrated view of the data
vi. Query performance
i. Affects how organizations structure their information systems
ii. Deals with some information systems issues that arise with the evolution of organizations and of their information systems goals.
i. Heterogeneity of data sources
1. Kinds of heterogeneity: List the criteria to use for classifying different kinds of data sources, and either give examples or a full categorization of all noteworthy data sources. (Perhaps in doing so, the following 7 items will be covered automatically)
a. The basic semantic problem
b. Data type differences (tan book, 596)
c. Value differences (tan book, 597)
d. Semantic differences (tan book, 597)
e. Missing values (tan book, 597)
f. Inconsistent data
g. Disagreeing values (Harvey, 4)
h. Intra-systems communications differences (3270 streams, HTTP, CORBA, Java RMI, raw TCP/IP, etc.)
i. Different performance characteristics for accessing data on different sources.
j. Human time required to hand-program a means of extracting data from each different source that the system must get information from. Especially hard if you want to rapidly add new sources.
2. Problems that result from heterogeneity
a. Human time required to hand-program a means of extracting data from each different source that the system must get information from. Especially hard if you want to rapidly add new sources.
b. Selection of a query plan in a mediated system can be really complicated.
ii. Availability of data sources
iii. Dynamicity of data sources
1. Changing data values
2. Changing schemas / presentation formatting / data models / etc.
iv. Autonomy of data sources
1. Unannounced changes to data values
2. Unannounced changes to data formats / semantics
3. Volitional uncooperativeness (MS SMB change vs. Samba?)
4. (Other kinds of autonomy?)
v. Don't always know when new data sources are available
vi. Trying to get hard facts from unstructured / semi-structured data
vii. Freshness of data
viii. Multi-source chronological consistency
ix. Query performance
1. Sometimes you can get same information from different sources.
2. Different sources might have different performance characteristics.
x. Caching of information from various sources.
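Illustration for the heterogeneity items above (data type differences, value differences, missing values, disagreeing values): a minimal sketch, not a proposed design. The two sources, their field names, and the shared schema are all hypothetical, invented only to show what reconciling such differences might look like.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical records from two sources describing the same customer.
# Source A: id as zero-padded text, balance as a dollar string, date as ISO text.
# Source B: id as an integer, balance in integer cents, opening date missing.
SOURCE_A = {"cust_id": "0042", "balance": "1250.50", "opened": "1998-03-15"}
SOURCE_B = {"id": 42, "balance_cents": 125050, "opened": None}


def from_source_a(rec):
    """Map a source A record onto the (assumed) shared schema."""
    return {
        "customer_id": int(rec["cust_id"]),                    # data type difference
        "balance_cents": int(Decimal(rec["balance"]) * 100),   # value (unit) difference
        "opened": datetime.strptime(rec["opened"], "%Y-%m-%d").date(),
    }


def from_source_b(rec):
    """Map a source B record onto the (assumed) shared schema."""
    return {
        "customer_id": rec["id"],
        "balance_cents": rec["balance_cents"],
        "opened": rec["opened"],                               # may be a missing value
    }


def merge(records):
    """Combine normalized records field by field.

    Missing values (None) are filled from any source that has one.
    Fields where sources disagree are left as None and reported in
    `conflicts` rather than silently resolved.
    """
    merged, conflicts = {}, {}
    fields = {name for rec in records for name in rec}
    for name in sorted(fields):
        values = [rec[name] for rec in records if rec.get(name) is not None]
        distinct = set(values)
        if not distinct:
            merged[name] = None
        elif len(distinct) == 1:
            merged[name] = values[0]
        else:
            merged[name] = None
            conflicts[name] = values
    return merged, conflicts


merged, conflicts = merge([from_source_a(SOURCE_A), from_source_b(SOURCE_B)])
```

Here the missing opening date in source B is filled from source A, and no conflicts are reported. Reporting disagreeing values instead of picking a winner is only one possible policy; which source to trust is exactly the kind of question the outline leaves open.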
i. TBD
i. Extraction of information from semi-structured / unstructured text documents
ii. Automatic generation of wrappers
iii. Knowledge representation systems for doing semantic matching of data from different sources (?)
i. Federated Databases
ii. Data Warehouses
iii. Mediation Systems
i. Basic concept
ii. Ways Wrappers / Extractors can get data from back end systems. (Doesn't necessarily address semantic model differences.)
1. Database servers
a. Database gateways
2. Providers of semi-structured / unstructured documents
a. AI-like systems
i. Intelligent Agents (?) (Harvey, 3)
b. Textual pattern matching systems ("Extracting Semistructured Information from the Web", Hammer et al.)
c. Semistructured query languages: X-Query, WebSQL, etc. (Susac, 2…4)
3. Terminal-based systems
a. Screen scraping
b. Direct access to files maintained by those programs
iii. Wrappers: Automatic wrapper generation
iv. Wrappers: Templates for query patterns
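Illustration for the textual pattern matching item above: a minimal sketch of a hand-written wrapper that pulls structured records out of a semi-structured document. The page contents, the pattern, and the output fields are hypothetical, and this is not the method of any of the cited systems. It is only meant to show the kind of per-source hand-programming the outline identifies as costly.

```python
import re

# Hypothetical semi-structured source: an HTML-like product listing.
PAGE = """\
<li><b>Widget</b> - price: $4.50 - in stock</li>
<li><b>Gadget</b> - price: $12.00 - sold out</li>
<li>Malformed entry without a price</li>
"""

# One hand-written pattern per source format. Entries that do not
# match the pattern are skipped rather than guessed at.
ITEM_PATTERN = re.compile(
    r"<b>(?P<name>[^<]+)</b>\s*-\s*price:\s*\$(?P<price>\d+\.\d{2})"
    r"\s*-\s*(?P<status>[^<]+)</li>"
)


def extract_items(text):
    """Return one dict per listing entry that matches ITEM_PATTERN."""
    items = []
    for match in ITEM_PATTERN.finditer(text):
        items.append({
            "name": match.group("name").strip(),
            # "4.50" -> 450: keep prices as integer cents.
            "price_cents": int(match.group("price").replace(".", "")),
            "in_stock": match.group("status").strip() == "in stock",
        })
    return items


items = extract_items(PAGE)
```

Any change to the source's presentation format breaks the pattern, which ties this approach directly to the dynamicity and autonomy problems listed earlier and motivates automatic wrapper generation.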