Information Integration

York University
Fall 2013
Small Project

This project is to integrate two XML sources using XQuery.

The project should be done as teams of two or three people.

  The Task

Do the following.

  1. Find two XML data sets that are related topic-wise. Each should be reasonably large; say, 500 nodes or so.
  2. Compose two or three integration tasks in English, one task per member of your team, each of which requires integrating information from the two sources.
  3. Devise XQuery queries for each of your integration tasks.
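
As a rough illustration of what an integration query can look like, here is a minimal sketch of a join across two sources. Everything named in it is hypothetical: the files books.xml and reviews.xml, and the elements book, review, isbn, title, and rating, stand in for whatever your own data sets contain.

```xquery
(: Hypothetical sources, for illustration only:
     books.xml   holds <book><isbn/><title/></book> elements
     reviews.xml holds <review><isbn/><rating/></review> elements :)
for $b in doc("books.xml")//book
(: Join: collect the reviews whose isbn matches this book's isbn. :)
let $r := doc("reviews.xml")//review[isbn = $b/isbn]
where exists($r)
order by string($b/title)
return
  <book-with-reviews>
    { $b/title }
    <review-count>{ count($r) }</review-count>
    <average-rating>{ avg($r/rating) }</average-rating>
  </book-with-reviews>
```

The point of the sketch is that each result element draws on both documents; a query that reads only one source is not an integration task.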

Deliver the following in a project report.

  1. The two data sources as XML files.
  2. For each task:
    1. the integration task in English;
    2. your XQuery query for the task; and
    3. the evaluation of the XQuery query against the two sources.
  3. A brief write up.
    1. Attribution: Project title, members of the team, etc.
    2. Description of the two data sources, including where they are from and what they describe.
    3. A discussion of the strengths and weaknesses of XQuery as an approach to integration, as you experienced it in this project. (This ought not be more than half a page, a couple of paragraphs, in length.)

Do all in plain text (UTF-8 is fine). Turn in to godfrey by email.

  Example Queries

We did these in class, and I ran them against Zorba.

Every XPath expression is also a legal XQuery.

  1. path-Authors
  2. path-AuthorsURI
  3. path-FirstAuthors
  4. path-BigPaper
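
For instance, a bare path expression such as the following is a complete query on its own. This is a sketch, not one of the class files listed above; it assumes the bibliography.xml example file is reachable at that relative URI and that it contains title elements.

```xquery
(: Every title element, in any namespace, anywhere in the document. :)
doc("bibliography.xml")//*:title
```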

Of course, XQuery syntax is more flexible and more expressive.

  1. query-titles
  2. query-AuthorsURI
  3. query-FirstAuthors
  4. query-VeryFirstAuthor
  5. query-VeryFirstAuthor-let
  6. query-Portfolios
  7. query-PortfoliosSession
  8. query-PortfoliosCount
  9. query-DistinctAuthors-fn
  10. query-DistinctAuthors-groupby
  11. query-DistinctPortfoliosCount
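
To give a flavor of the extra expressiveness, here is a small FLWOR sketch. It is not one of the class files listed above; it assumes only the bibliography.xml example file and its title elements, and the names of the constructed result elements are made up for illustration.

```xquery
(: Bind the title nodes once, then reuse the binding twice. :)
let $titles := doc("bibliography.xml")//*:title
return
  <titles count="{ count($titles) }">{
    (: Sort the titles and rebuild each as a fresh element. :)
    for $t in $titles
    order by string($t)
    return <title>{ string($t) }</title>
  }</titles>
```

Binding variables, sorting, and constructing new elements are all things a plain path expression cannot do.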

Using Zorba, with an XQuery query stored in a file named query, I can run it from the command line.

% zorba --indent -f -q query

  Potential Data Sources

Here are a few leads on tracking down interesting XML data sources (data sets).

stand-alone XQuery engine

You will likely want an XQuery engine to play with, both for validating your XQuery queries and for producing the results for your integration tasks. You are welcome to use any available engine, installed or online, that supports at least XQuery 1.0, which is built on XPath 2.0. (The group by clause used in one of the class examples needs XQuery 3.0.)

In class, I have been using Zorba. Well-supported, open-source XQuery engines include

  • BaseX:
    a native XML database system with XQuery. (Requires a server to be running.)
  • eXist:
    a native XML database system with XQuery. (Requires a server to be running.)
  • XQilla:
    an XQuery and XPath 2.0 library built on Xerces-C, with a command-line tool. (Not itself a database system.)
  • Zorba:
    a stand-alone, command-line XQuery evaluator.

Zorba is up and running on indigo and red! Special thanks to Paul, Seela, and the tech team of EECS for putting this up on short notice for us. (It will push out to all the PRISM machines within a day.)

in the browser

The other way to run XQuery is inside a web browser, such as Firefox, by way of JavaScript; the browser has much of the needed machinery hidden underneath. While XPath is readily available through JavaScript, XQuery is not. This is a bit messy, but convenient, and quite fun.

XQIB provides a JavaScript “library” that makes running XQuery in the browser accessible. One grabs mxqueryjs (which contains mxqueryjs.nocache.js) and plants it under one's www-site. Alternatively, link to it remotely from the www-page that contains your XQuery query. (You could link to mine in the example below.)

See titles.html under my www home directory as an example. Look at its source. It queries the title nodes out of the bibliography.xml example file (a local copy, in this case).

This approach has extra complications.

  • Because of namespace issues (the default namespace within the HTML document is that of the HTML document itself, not that of the source XML document you are querying), node tests in the XQuery query have to provide a proper namespace, or wildcard it so that any namespace will match. E.g., $mydoc//*:title instead of just $mydoc//title.

  • How to return the results? Easiest is to modify the HTML page to display them. This requires modifying the DOM of the page, and the results should be cast as HTML rather than general XML, so the browser will render the results correctly. (The example shows this.)
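
The two points above can be sketched together in plain XQuery. This is only an illustration under assumptions: it loads bibliography.xml with doc(), and it builds an HTML-style list with element names (ul, li) chosen here for the example. It does not show the XQIB-specific calls for inserting the result into the page; see the source of titles.html for those.

```xquery
let $mydoc := doc("bibliography.xml")
return
  <ul>{
    (: *:title matches title elements in any namespace, so the
       HTML page's default namespace does not hide them. :)
    for $t in $mydoc//*:title
    return <li>{ string($t) }</li>
  }</ul>
```

Wrapping each result in an li element, rather than returning the raw title nodes, is what lets the browser render the results as ordinary HTML.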

parke godfrey