Schema-Based Partitioning for Optimized Querying of XML data

Efficient querying and retrieval of data from XML databases is challenging, especially when a database contains a large number of documents, potentially millions. One way to scale out is to spread the documents over multiple nodes, but running every query on every node scales poorly. We propose a solution in which the document space is partitioned on the basis of a document characteristic that the queries reference and that can be inferred efficiently for a given query. This allows the query dispatch system to avoid issuing a query to nodes that are known to return an empty result. Our partitioning scheme supports automated placement of data and thus lets users (i.e., query authors or clients) operate at a higher level of abstraction: the user does not need to supply any information about where the data is located. XPath satisfiability, that is, deciding at compile time whether the answer to an expression is empty or non-empty, plays an important role in our strategy for determining data location and optimizing queries. To improve query performance beyond what context identification alone achieves, we use indexes built on the XML elements that matter most for the expected queries. The strategies have been designed, implemented and tested on top of xDB. We demonstrate that the performance of distributed XQueries improves after they pass through our context identification phase. The precision of our context identification is 90% on our dataset and query set. We compare our approach with alternatives and discuss the tradeoffs, measuring cost in terms of xDB data pages rather than time-based parameters in order to obtain highly reliable and consistent results. Large datasets and XPath/XQuery expressions with different syntactic properties have been used, which shows that the proposed techniques are applicable to real-world XML database systems.
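
The dispatch idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation and does not use the xDB API: the `Node` class, the per-node summary of element names, and the functions `required_elements` and `candidate_nodes` are all hypothetical, and the sketch assumes simple path expressions made only of element names, with no predicates, functions, or non-default axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical storage node with a summary of its documents."""
    name: str
    # Assumed summary: every element name that occurs in documents on this node.
    element_names: frozenset


def required_elements(xpath: str) -> set:
    """Collect element names from a simple path such as /a/b or //c.

    Assumption: steps are plain element names, '*' wildcards, or '@attr';
    predicates and explicit axes are not handled.
    """
    steps = [s for s in xpath.split("/") if s]
    return {s for s in steps if s != "*" and not s.startswith("@")}


def candidate_nodes(xpath: str, nodes: list) -> list:
    """Return only the nodes on which the path could be non-empty.

    A node is skipped when some element named in the path never occurs
    there, so the result on that node is known to be empty.
    """
    needed = required_elements(xpath)
    return [n for n in nodes if needed <= n.element_names]


nodes = [
    Node("n1", frozenset({"library", "book", "title"})),
    Node("n2", frozenset({"orders", "order", "item"})),
]
targets = candidate_nodes("/library/book/title", nodes)
print([n.name for n in targets])
```

The check is deliberately conservative: a node is dropped only when the path cannot match there, so a node that stays in the list may still return an empty result.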

Subject

query
XML
data partitioning

To reference this document use:

http://resolver.tudelft.nl/uuid:313c929c-9543-4037-9fa7-06645e654cf3

Embargo date

2012-09-18

Part of collection

Student theses

Document type

master thesis

Rights

Files

PDF

Thesis_-_Johar_Syed.pdf

3.65 MB
