The 30th IPP Symposium

Turning the Web Into a Knowledge Base: Information Extraction with Finite-State Models

Andrew McCallum, UMass, Amherst

The Web is the world's largest knowledge base. However, its data is in a form intended for human reading, not manipulation, mining and reasoning by computers. Today's search engines help people find web pages. Tomorrow's search engines will also help people find "things" (like people, jobs, companies, products), facts and their relations.

Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. Finite-state machines are the dominant model for information extraction in both research and industry. In this talk I give several examples of information extraction tasks performed at WhizBang Labs, and then describe two new finite-state models designed to take special advantage of the multifaceted nature of text on the web. Maximum entropy Markov models (MEMMs) are discriminative sequence models that allow each observation to be represented as a collection of arbitrary, overlapping features (such as word identity, capitalization, part-of-speech and formatting, plus agglomerative features of the entire sequence and features from the past and future). Conditional random fields (CRFs) are a generalization of MEMMs that solve a fundamental limitation of MEMMs and all other discriminative Markov models based on directed graphical models. I introduce both models, skim over their parameter estimation algorithms, and present experimental results on real-world tasks.
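As an illustration (not part of the talk itself), the kind of arbitrary, overlapping per-observation features the abstract describes might be computed like this; the function name and feature set here are hypothetical:

```python
def token_features(tokens, i):
    """Return overlapping features for the token at position i,
    in the style an MEMM or CRF can condition on."""
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1.0,                        # word identity
        "is_capitalized": float(w[:1].isupper()),        # capitalization
        "is_all_caps": float(w.isupper()),
        "has_digit": float(any(c.isdigit() for c in w)),
        "suffix3=" + w[-3:].lower(): 1.0,                # crude morphology
    }
    # Context features: these models may also look at past and future
    # observations in the sequence.
    if i > 0:
        feats["prev_word=" + tokens[i - 1].lower()] = 1.0
    if i < len(tokens) - 1:
        feats["next_word=" + tokens[i + 1].lower()] = 1.0
    return feats

tokens = "Andrew McCallum works at WhizBang Labs".split()
print(token_features(tokens, 1))
```

Because the features are just keys in a dictionary, they can overlap freely (a token can fire "is_capitalized", a word-identity feature, and a suffix feature at once), which is exactly what generative models handle poorly and discriminative models like MEMMs and CRFs handle naturally.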

(Joint work with Fernando Pereira, John Lafferty, Dayne Freitag, and many others at WhizBang Labs.)