
Maintaining UR XML Longevity

One of the most important tasks an ESTEST system administrator can perform is maintaining the integrity of the collection of UR XML documents in the eXist database as the UR XML schema changes, the data mining capabilities of the code translators improve, and bugs are fixed.

The UR XML document schema is still evolving to meet scientific needs and code optimizations, so the document format changes from time to time. It is the system administrator's task to upgrade the collection of users' documents so that it is consistent with the latest version of the schema. This matters because ESTEST development always reflects the latest schema, and documents consistent with an older version may not work with components of the ESTEST framework or its plugins. Additionally, exchanging document content between ESTEST servers depends on the servers agreeing on the document format.

Compared to other formats, XML can change its structure fairly easily thanks to XUpdate (the XML update language) and the use of XQuery to execute update instructions on data located by XPath expressions. The XUpdate process in eXist-db is described in the eXist-db documentation. Using XUpdate we can move, replace, rename, and delete any XML element, attribute, or data item. For example, we may add an attribute to an element, rename the element, or move it (and its children) from one location in the XML tree to another.
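The three operations mentioned above (adding an attribute, renaming an element, and moving it with its children) can be illustrated with a small Python sketch. It uses the standard library's xml.etree.ElementTree as a stand-in, not XUpdate itself, and every element and attribute name in it is hypothetical rather than taken from the UR XML schema.

```python
import xml.etree.ElementTree as ET

# Toy document: all element and attribute names are hypothetical,
# chosen only to illustrate the kind of change a schema update makes.
doc = ET.fromstring(
    "<root><old_group><item>1.0</item></old_group><target/></root>"
)

item = doc.find("./old_group/item")

# 1. Add an attribute to an element.
item.set("units", "eV")

# 2. Rename the element; its text and children are untouched.
item.tag = "value"

# 3. Move the element (and its children) to another place in the tree.
doc.find("./old_group").remove(item)
doc.find("./target").append(item)

print(ET.tostring(doc, encoding="unicode"))
```

In an actual schema update the same kind of change would be expressed as XUpdate instructions run from an XQuery script, with XPath expressions selecting the elements to modify.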

Schema Updates in ESTEST

In ESTEST we write XQuery scripts that execute XUpdate commands on UR XML elements. These XQuery schema update scripts are located in the www/schema/xqladmin directory of an ESTEST distribution. Each script is named after the revision number after which all new UR XML documents are generated consistently with the changes that script implements. The scripts are meant to be applied to every document one at a time, in sequence, starting with the first script whose revision is newer than the current revision of the ESTEST distribution.

For example, if my ESTEST distribution is an archive of revision 500 and has not been updated since that revision's source code was checked out, then I would apply schema533.xql, schema574.xql, and schema605.xql, in that order and one at a time, until I have exhausted the sequence of schema XQuery script "patches". Then I would upgrade the source code of my ESTEST distribution to the latest available revision. All documents generated after a software update will be consistent with the newer schema. It is important that these XQuery schema update scripts be applied in order and only once, because each script assumes a document schema consistent with the revisions prior to itself.
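The selection rule in this example can be sketched in Python. The function name pending_schema_scripts is hypothetical and not part of ESTEST. The sketch assumes the schemaNNN.xql naming shown above and that a script applies only when its number is strictly greater than the distribution's revision.

```python
import re

def pending_schema_scripts(script_names, current_revision):
    """Return the schema update scripts newer than current_revision,
    ordered from oldest to newest (hypothetical helper, not ESTEST code)."""
    pattern = re.compile(r"^schema(\d+)\.xql$")
    numbered = []
    for name in script_names:
        match = pattern.match(name)
        # Assumption: a script applies only if its revision is strictly newer.
        if match and int(match.group(1)) > current_revision:
            numbered.append((int(match.group(1)), name))
    # Sort numerically so that, e.g., schema99.xql precedes schema533.xql.
    return [name for _, name in sorted(numbered)]

scripts = ["schema605.xql", "schema533.xql", "schema480.xql", "schema574.xql"]
print(pending_schema_scripts(scripts, 500))
```

Sorting on the parsed integer rather than on the file name keeps the order correct even when revision numbers have different numbers of digits.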

How to determine the revision of your ESTEST distribution

If your distribution is cloned from a Mercurial archive, it is a simple matter of checking the revision of the tip. Installations of ESTEST based on a source distribution of an archived repository (i.e. without revision control itself) always come with a plain text file, .hg_archival.txt, that contains a line such as

node: e04ff99fb6498cb01ea82d520236f4ccab490eee

where the first 12 hexadecimal digits of the string after "node:" correspond to a numbered revision in the repository from which the distribution was archived.
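Reading that identifier can be automated with a few lines of Python. This is a minimal sketch: the function name archived_short_node is hypothetical, and the sample text contains only the node line quoted above.

```python
def archived_short_node(archival_text):
    """Return the first 12 hex digits of the 'node:' entry, or None
    if the text has no such entry (hypothetical helper)."""
    for line in archival_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "node":
            return value.strip()[:12]
    return None

sample = "node: e04ff99fb6498cb01ea82d520236f4ccab490eee\n"
print(archived_short_node(sample))  # prints e04ff99fb649
```

For a real installation the text would come from reading the .hg_archival.txt file at the top of the distribution.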

Re-translation of UR XML documents

Production distributions of ESTEST will ship with a CGI script called urrt.cgi that re-translates the UR XML data from the input/output files stored in the UR XML document itself. This means running the code translator on the I/O files recovered from the UR XML elements corresponding to <stdin> and <stdout>. The <est_data> portion of the newer translation is grafted in place of the same node in the original document, and in this way the UR XML document's data is replaced. The document's annotations and its original input/output storage elements are left unchanged, as is the UUID that uniquely identifies the document.
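The grafting step can be sketched in Python with xml.etree.ElementTree. This is an illustration under stated assumptions and not the urrt.cgi implementation: the function graft_est_data, the root element name, the uuid attribute, and the annotations element are all hypothetical, and only <est_data> and <stdin> come from the description above.

```python
import xml.etree.ElementTree as ET

def graft_est_data(original_root, retranslated_root):
    """Replace the <est_data> child of original_root with the one from
    retranslated_root, keeping its position among the siblings."""
    old = original_root.find("est_data")
    new = retranslated_root.find("est_data")
    if old is None or new is None:
        raise ValueError("both documents need an <est_data> child")
    index = list(original_root).index(old)
    original_root.remove(old)
    original_root.insert(index, new)
    return original_root

# Hypothetical documents: only the est_data and stdin names come from the text.
original = ET.fromstring(
    '<urxml uuid="1234"><annotations>note</annotations>'
    "<est_data><energy>1</energy></est_data><stdin>in</stdin></urxml>"
)
retranslated = ET.fromstring(
    "<urxml><est_data><energy>1</energy><force>2</force></est_data></urxml>"
)

merged = graft_est_data(original, retranslated)
print(ET.tostring(merged, encoding="unicode"))
```

Everything outside <est_data> in the original document, including the attribute standing in for the UUID, passes through untouched.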

The re-translation can produce a perfect replica of the UR XML document, except perhaps for a comment stating that a re-translation has taken place. But when the translators themselves change, either to mine new physical data into UR XML documents or to correct bugs that affect existing translations, the re-translation will likely be different. These two scenarios show the utility of the re-translation process: capturing more physical data from the original I/O files, or correcting mistakes in reading, parsing, or translating existing data.

An administrative version of the urrt.cgi script, named urrtall.py, is also available for the system administrator to run on the local host. It re-translates every UR XML document of type simulation. Depending on the size of the UR XML collection, this script can take many minutes to finish.

$ cd estest/scripts/admin
$ python urrtall.py

What is the difference between schema updates and re-translations?

Schema updates can act on any and all UR XML elements, whereas re-translations are limited to replacing the <est_data> element and its children. It is the opinion of the author that schema updates pertain exclusively to structural modifications of the XML, that is, changes to elements and perhaps their attributes that leave the content effectively intact. A re-translation can act on the UR XML contents, changing or adding to the data as necessary. Re-translation also brings the <est_data> element and its children immediately up to date with the schema supported by the code translators, which should be the latest as of that revision of the source code.

But can't a re-translation also bring <est_data> elements up to date with the newest schema? Yes it can, provided that the translators are compliant with the latest schema. But keep in mind that schema updates are almost always more efficient and to the point, since they can precisely manipulate the UR XML elements down to the level of a leaf based on XPaths. Replacing the entire <est_data> element, which is a second-tier element of the UR XML tree, is wasteful and has more side effects, even if those effects are only to produce a replica of the XML.

But can't schema updates also act on <est_data> elements? Certainly, schema updates can and should act on the <est_data> element and its children when the change is an appropriate modification to the XML structure. Changes to the contents, however, are typically more complex to code in XQuery/XUpdate and may be impossible if, for example, the change requires high-level mathematical manipulation beyond the capabilities of XQuery.

For these two reasons it is useful to adhere to the (sometimes contrived) distinction between the structure domain of schema updates and the content domain of re-translations of UR XML documents.

When actually administering changes to UR XML documents

Always back up the eXist database first!

Last modified on 06/10/11 15:08:02