Motivation
As XML data becomes a de facto standard on the Web, querying XML data is becoming an
issue of great importance, a real research challenge. Several solutions have been proposed
in the last years to this challenge, described in various working drafts of W3C or research
papers. As a core of this common effort has been identified a navigational approach for
information localization in XML data, comprised in a powerful language called XPath [1].
Initially, speaking about XML data was not so different as
speaking about other non-XML Web documents, where the size does not play an important
role. It was considered straightforward to get all the XML data into memory and then
to query it. Approaches like DOM [2] stands for this. Nowadays, there
are available huge XML repositories, where size matters. Even more, unbounded XML streams
gain more importance due to their practical applications from publish-subscribe
systems to data integration over the Web. As quickly realised by the XML community,
the random access to XML data precludes efficient query processing with no outstanding gain,
and sometimes it makes it even impossible.
Stream-based processing comes as an
attempt to overcome pitfalls of the traditional DOM-based processing:
- documents can be too big to be stored in main memory, as in DOM. Such documents are often encountered in natural language processing [3], biological [4] and astronomical data [5];
- unbounded XML streams that are automatically generated (e.g. news) impose new querying approaches to be considered, where the memory usage does really matter;
Progressive processing, i.e. stream-based processing generating partial results as soon as they are available, even before the full input is completely read, gives rise to a more efficient evaluation in certain contexts:
- Selective dissemination of information(SDI), where documents are filtered before being distributed to the subscribers [6, 7];
- Data integration, where the input of slow sources can be processed before the full data is retrieved over the internet [8];
- Pipelined processing, where the input is sent through a chain of processors each of which taking the output of the preceding processor as input. A good example of pipelined processing is Cocoon [9].
XPath Rewriter
XPath proposes constructs for back and forth navigation in the XML document tree. The backward navigation precludes efficient evaluation of XPath expressions on XML data streams, as the evaluation of backward navigation requires to buffer already encountered stream fragments. Therefore, we rewrite XPath expressions involving backward navigation into forward-only expressions.
Streamed XPath Evaluator
- SPEX is an efficient single-pass evaluator based on networks of independent deterministic pushdown transducers. Thus, it is especially suitable for implementation on devices with low-memory and simple logic as used, e.g., in mobile computing.
- The currently supported XPath fragment covers all forward axes of XPath, set operations, and restricted predicates. The reverse axes can also be treated, as direct result of previous work on XPath rewriting. As predicates, are supported (arbitrarily nested) structural predicates and value-based comparisons. Currently, SPEX does not support value-based joins.
- SPEX has a polynomial combined complexity (in the size of the stream and the query).
- A Java prototype has been implemented and shows the SPEX efficiency when comparing SPEX to current widely accepted XPath implementations, like Xalan.
SPEX Viewer: A Graphical User Interface for SPEX
SPEX has a graphical user interface. In this way, the user can see at every incoming message from the stream the processing status in terms of current states, stack configurations, and result candidates. Find a tutorial here.
References
- [1] XML Path Language (XPath) Version 1.0. W3C Recommendation. J. Clark, S. DeRose. http://www.w3.org/TR/xpath, 1999.
- [2] Document Object Model (DOM). W3C DOM Working Group. http://www.w3.org/DOM/, 2002.
- [3] XCES: An XML-based standard for linguistic corpora. N. Ide, P. Bonhomme, L. Romary. 2nd Annual Conf. on Language Processing Resources and Evaluation, 2000.
- [4] Modeling of Biological Data. P. Kroeger. University of Munich, 2001.
- [5] NASA. Astronomical Data Center, http://adc.gsfc.nasa.gov/.
- [6] Efficient Filtering of XML Documents with XPath Expressions. C. Chan, P. Felber, M. Garofalakis, R. Rastogi. In Proc. of ICDE 2002.
- [7] Efficient Filtering of XML Documents for Selective Dissemination of Information. M. Altinel, M. Franklin. In Proc. of VLDB 2000.
- [8] Efficient Evaluation of Regular Path Expressions on Streaming XML Data. A. Levy and Z. Ives and D. Weld. http://data.cs.washington.edu/integration/x-scan/, University of Washington, 2000.
- [9] Cocoon 2.0: XML publishing framework. Apache Project. http://xml.apache.org/cocoon/index.html, 2002.