Nestorov, S. and Abiteboul, S. and Motwani, R. (1998) Extracting Schema from Semistructured Data. In: ACM International Conference on Management of Data (SIGMOD 1998), June 2-4, 1998, Seattle, Washington.
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.
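To make the idea of typing a labeled, directed graph under greatest fixpoint semantics more concrete, here is a minimal sketch. It is not the paper's algorithm: the function name `greatest_fixpoint_typing`, the edge and rule encodings, and the toy data are all assumptions made for illustration, and the rules only constrain outgoing edges. Each type starts with every object in its extent, and objects are removed until every remaining object satisfies its type's rule.

```python
def greatest_fixpoint_typing(edges, rules):
    """Compute type extents under greatest fixpoint semantics (illustrative sketch).

    edges: dict mapping object -> list of (label, target_object) pairs.
    rules: dict mapping type -> list of (label, target_type) requirements;
           an object has the type only if, for each requirement, it has an
           outgoing edge with that label to an object of the target type.
    """
    # Collect every object that appears as a source or a target.
    objects = set(edges)
    for targets in edges.values():
        objects.update(tgt for _, tgt in targets)

    # Greatest fixpoint: start from the largest candidate extents...
    extent = {type_name: set(objects) for type_name in rules}

    # ...and shrink them until nothing changes.
    changed = True
    while changed:
        changed = False
        for type_name, required in rules.items():
            keep = {
                obj
                for obj in extent[type_name]
                if all(
                    any(lbl == label and tgt in extent[target_type]
                        for lbl, tgt in edges.get(obj, ()))
                    for label, target_type in required
                )
            }
            if keep != extent[type_name]:
                extent[type_name] = keep
                changed = True
    return extent


# Hypothetical toy graph: p1 and p2 name each other as friends; p3 has no name.
edges = {
    "p1": [("name", "s1"), ("friend", "p2")],
    "p2": [("name", "s2"), ("friend", "p1")],
    "p3": [("friend", "p1")],
    "s1": [],
    "s2": [],
}
# Hypothetical rules: "any" has no requirements; "person" is recursive.
rules = {
    "any": [],
    "person": [("name", "any"), ("friend", "person")],
}

result = greatest_fixpoint_typing(edges, rules)
print(sorted(result["person"]))
```

In this toy graph the `person` rule is recursive, and `p1` and `p2` support each other's membership through a cycle. Starting from empty extents and adding objects (a least fixpoint) would never admit either of them, whereas shrinking from the full set keeps both, which is one reason greatest fixpoint semantics suits cyclic graph data.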
|Item Type:||Conference or Workshop Item (Paper)|
|Uncontrolled Keywords:||semistructured data, schema extraction, clustering, summarization|
|Subjects:||Computer Science > Semistructured Data|
|Related URLs:||Project Homepage||http://infolab.stanford.edu/lore/|
|Deposited By:||Import Account|
|Deposited On:||25 Feb 2000 16:00|
|Last Modified:||29 Dec 2008 11:18|