Cocoon and 4Suite for Content Management: The Best of Both Worlds at Seattle University School of Law

By Evan Lenz and Uche Ogbuji

I jointly presented this paper with Uche Ogbuji at XML Europe 2003. It describes a custom content management system I built for Seattle University School of Law.

Abstract

Seattle University School of Law based its Web site on Cocoon, an XML-based Web publishing framework, and based its custom content management system for campus announcements and events on 4Suite, a platform for XML/RDF processing. In facing the challenge to integrate the two applications, the solution we found was to continue employing both frameworks, integrating them through the use of established Web standards. This enabled us to embrace the strong points of each platform: Cocoon for its Web publishing framework, and 4Suite for its XML repository, and metadata storage and management. In this paper, we introduce Cocoon, 4Suite, and the specific techniques used to integrate them in a custom content management system. We also delve into some of the details of the application's implementation, including Cocoon configuration, example data, XSLT stylesheets, and RDF queries.

Introduction
Web site management with Cocoon
     URI design discussion
         Leave out file extensions
         Leave out topic/classification by subject
The 4Suite XML and RDF repository
     Server-side scripting in XSLT
     XML/metadata synchronization using document definitions
     Versa RDF query
Using Cocoon and 4Suite together
Future considerations
Bibliography

Introduction

The Seattle University School of Law began its content management initiative in the spring of 2002, with two primary targets for content publication:

the "Sullivan Docket", which is the name given to four 50-inch touch-screen plasma displays placed throughout Sullivan Hall, home to the Law School; and
the new School of Law Web site.

In addition to displaying local traffic, weather, building information, and live CNN streaming, the Docket's primary responsibility is to propagate campus-wide announcements and events. The first iteration of the Docket authoring application was completed in the summer of 2002. It provided a Web interface for school staff to add, update, and delete announcements and events for display on the Docket. It also provided basic workflow support, in which all new items must be approved by an editor before going live. The application was built on 4Suite ([4SUITE]), an open-source platform for XML (Extensible Markup Language) and RDF (Resource Description Framework) ([RDFINTRO1], [RDFINTRO2]) processing.

Fig. 1: Sullivan Docket - "Desktop Version"

The School of Law Web site, on the other hand, was built using Cocoon ([COCOON]), another open-source platform for XML processing. Among the requirements for the new Web site was that it provide three versions of each page ("Flash", "Standard", and "Text-only"). It also needed to be completely XML-based in order to support subsequent integration with other XML-based content management projects at the Law School. Cocoon's flexible platform for Web publishing was more than enough to meet these needs. The new site was launched in December of 2002.

Fig. 2: Seattle University School of Law Web site

The SU Law Web site can be accessed at http://www.law.seattleu.edu.

Our project from that point forward was to integrate Docket content, where appropriate, into the Web site, enabling Law School staff to slate certain announcements or events for publication on the Docket, the Web site, or both. At first glance, it appeared that a choice had to be made between the two frameworks in use, Cocoon and 4Suite. They have much in common: both are platforms for XML processing; both make heavy use of XSLT (Extensible Stylesheet Language Transformations); both include Web application functionality. However, while Cocoon excelled at its flexible pipeline approach to XML transformations and Web publishing, it lacked a real story for the storage and querying of metadata. 4Suite's XML and RDF repository, on the other hand, provided the metadata storage and querying needed for the management of kiosk announcements and events. Although the metadata was initially primitive, we did not want to forgo the continued use of 4Suite, because the metadata was growing more complex with the addition of the Web site as a publishing target and with new user requirements that had arisen over time.

The solution we found was to continue using both Cocoon and 4Suite as two separate systems, loosely coupled using REST (REpresentational State Transfer) principles ([REST]). In a nutshell, Cocoon is used for multi-channel publishing and 4Suite is used for content authoring and management. The mechanism by which they are hooked together is the most natural mode of operation in Cocoon: the FileGenerator, which, in addition to reading files from disk, can make parameterized HTTP GET requests for XML from other servers. 4Suite, in turn, services these requests by executing parameterized RDF queries, aggregating resulting XML resources, and transforming them into a format suitable for convenient processing by Cocoon.

We should note that there is greater opportunity for integration between the two frameworks than what we have exploited so far at the Law School and demonstrate in this paper. For example, Cocoon could serve as the front end to the entire application: not just the Web publishing (read-only) side, but also the content authoring (read/write) side. More thoughts on this are included in the "Future considerations" section at the end of this paper. As it happens, our current project did not require such flexibility in the front end of the authoring side, which after all has only 10-15 users. 4Suite alone fulfills the needs of the content authoring/management application, and Cocoon simply imports the live content into select pages of the school's Web site.

Web site management with Cocoon

Cocoon (http://cocoon.apache.org) is an open-source XML-based Web publishing framework, designed to enable the separation of concerns between content, logic, and style. It achieves this goal through the use of a generic pipeline framework, based on SAX (Simple API for XML), which allows XML content to go through a configurable series of transformations (usually XSLT). Each pipeline begins with exactly one generator, is followed by zero or more transformers, and is concluded with exactly one serializer.

Generators produce XML content using any number of mechanisms: reading a file, submitting an HTTP request, calling a database, invoking a server page script, etc.
Transformers perform further processing (e.g. XSLT or XInclude (XML Inclusions)) on the XML stream before handing it to either another transformer or the serializer.
Serializers are the final step in the pipeline; they determine the actual serialization format, whether HTML (Hypertext Markup Language), well-formed XML, browser-compatible XHTML (Extensible Hypertext Markup Language), SVG (Scalable Vector Graphics), PDF (Portable Document Format) (via XSL-FO (Extensible Stylesheet Language Formatting Objects) and FOP), rasterized images (via SVG and Batik), etc.
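
The generator/transformer/serializer structure can be pictured with a small sketch. The following Python is only an analogy and not Cocoon code (Cocoon is a Java framework that streams SAX events); every function name and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# Hypothetical stand-ins for Cocoon components; none of these names are Cocoon APIs.
Generator = Callable[[], ET.Element]
Transformer = Callable[[ET.Element], ET.Element]
Serializer = Callable[[ET.Element], str]


def run_pipeline(generate: Generator,
                 transformers: List[Transformer],
                 serialize: Serializer) -> str:
    """Exactly one generator, zero or more transformers, exactly one serializer."""
    tree = generate()
    for transform in transformers:
        tree = transform(tree)
    return serialize(tree)


def generate_news() -> ET.Element:
    # A generator produces XML; here it parses an in-memory string.
    return ET.fromstring("<news><item>Redhawk launched</item></news>")


def news_to_list(news: ET.Element) -> ET.Element:
    # A transformer maps one XML tree to another (Cocoon would typically use XSLT).
    ul = ET.Element("ul")
    for item in news.findall("item"):
        ET.SubElement(ul, "li").text = item.text
    return ul


def serialize_xml(tree: ET.Element) -> str:
    # A serializer fixes the output format of the response body.
    return ET.tostring(tree, encoding="unicode")


result = run_pipeline(generate_news, [news_to_list], serialize_xml)
```

Because the three roles only meet at the XML tree passed between them, a transformer can be added or removed without touching the generator or the serializer, which is the separation of concerns the pipeline model is after.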

Pipelines in turn are mapped to HTTP (Hypertext Transfer Protocol) requests, using URI (Uniform Resource Identifier) patterns, so that when a matching HTTP request comes in, the corresponding pipeline is invoked.

The sitemap is the central point of configuration in Cocoon, where all pipelines are declared and associated with incoming HTTP request URI patterns. The sitemap (usually named sitemap.xmap) is itself written in XML. Below is a simplified example of a pipeline (buried inside sitemap.xmap):

<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
 ...
 <map:pipelines>
  <map:pipeline>
    ...
    <map:match pattern="career/news">
      <map:generate
        src="http://redhawk/?xslt=getNews.xsl&amp;category=career"/>
      <map:transform src="stylesheets/newsToHtml.xsl"/>
      <map:serialize type="xhtml"/>
    </map:match>
    ...
  </map:pipeline>

 </map:pipelines>
  ...
</map:sitemap>

The actual sitemap used by the Law School Web site is much more sophisticated than this. The need to support multiple versions of each page, served from the same URL (Uniform Resource Locator), based on the value of a client cookie, requires the use of other Cocoon sitemap features such as actions (arbitrary Java code that can supply new context information to the sitemap) and resources (which are like subroutines that can be used by multiple pipelines).

Flow of control is handled using constructs that have a syntax familiar to users of XSLT (e.g. map:when, map:otherwise, etc.). An excerpt from the Cocoon sitemap in production on the Law School Web site is shown here:

   <map:resource name="front-door">
     <map:select type="request-parameter">
       <map:parameter name="parameter-name" value="set-version"/>
       <map:when test="flash">
         <map:call resource="check-flash"/>
       </map:when>
       <map:when test="flash-confirmed">
         <map:call resource="set-preference-to-flash"/>
       </map:when>
       <map:when test="standard">
         <map:call resource="set-preference-to-standard"/>
       </map:when>
       <map:when test="simple">
         <map:call resource="set-preference-to-simple"/>
       </map:when>
       <map:otherwise>
         <!-- more logic -->
       </map:otherwise>
     </map:select>
   </map:resource>

The map:select element is similar in behavior and syntax to xsl:choose; it functions as a conditional statement. In this case, the set-version HTTP request parameter is tested for, and other sitemap resources are invoked depending on its value. The full logic of the Cocoon sitemap as deployed in the SU Law Web site is depicted in the following flow chart. This should help convey the power and flexibility of Cocoon as a Web publishing framework.

Fig. 3: Cocoon Sitemap Logic - SU Law Web site

This flow chart depicts how the majority of requests dispatched to Cocoon (from Apache) are handled in the SU Law Web site. A more readable, full-fidelity version of this graphic is available on the SU Law Web site at http://www.law.seattleu.edu/xmleurope/sitemap-flow.png.

URI design discussion

The SU Law Web site design, in terms of URIs, was inspired by Tim Berners-Lee's 1998 essay, "Cool URIs don't change" ([COOLURIS]). In particular, we aimed to fulfill two of the essay's suggestions regarding what should be left out of well-designed URIs:

Leave out file extensions
Leave out topic/classification by subject

Leave out file extensions

For the SU Law Web site, we chose to leave out file extensions in the URLs of all XHTML pages, regardless of whether they are statically or dynamically generated. For files of other types such as JPG, GIF, Flash, Word, etc., we used the corresponding standard file extensions. This made it easy to dispatch such requests in bulk configuration directives, using URL patterns such as *.jpg and *.gif. In Cocoon, such directives would look something like this:

    <map:match pattern="**.jpg">
      <map:read src="{1}.jpg"/>
    </map:match>

    <map:match pattern="**/*.gif">
      <map:read src="{1}/{2}.gif"/>
    </map:match>

The use of map:read is an alternative to the sequence of map:generate, map:transform, map:serialize. It is appropriate for non-XML content that is read directly from disk and passed unchanged in the HTTP response. The map:match element utilizes the default URI matcher. The value of the pattern attribute in this example contains a double asterisk (**), which is a wildcard that will match any sequence of characters in the URL path, including slash (/) characters. A single asterisk (*), as included in the second pattern, would match any sequence of characters within one step of the URL path, i.e. before the next slash character. Thus, the two patterns shown in the example are equivalent (apart from the file extensions). One final thing to note about this example is the use of curly braces ({}) to invoke the sitemap value replacement mechanism. When matching patterns, the matched value of wildcards is available for later reference using the special numeric variables 1, 2, etc.
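
To make the wildcard and substitution rules concrete, here is a minimal sketch in Python. It is not Cocoon's matcher; it only encodes the semantics described above (a double asterisk crosses slashes, a single asterisk does not, and matched values are numbered from 1), and Cocoon's real implementation may differ in edge cases. The function names and sample paths are invented for illustration.

```python
import re
from typing import List, Optional


def wildcard_to_regex(pattern: str) -> str:
    """Translate a sitemap-style wildcard pattern into a regular expression.

    Assumed semantics, taken from the prose description:
    '**' matches any characters including '/', '*' matches any characters except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append("(.*)")
            i += 2
        elif pattern[i] == "*":
            parts.append("([^/]*)")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return "".join(parts)


def match(pattern: str, uri: str) -> Optional[List[str]]:
    """Return the list of wildcard values if the whole URI matches, else None."""
    m = re.fullmatch(wildcard_to_regex(pattern), uri)
    return list(m.groups()) if m else None


def substitute(template: str, values: List[str]) -> str:
    """Replace {1}, {2}, ... with the corresponding matched wildcard values."""
    return re.sub(r"\{(\d+)\}", lambda m: values[int(m.group(1)) - 1], template)


jpg_values = match("**.jpg", "images/photos/dean.jpg")
gif_values = match("**/*.gif", "images/icons/arrow.gif")
gif_source = substitute("{1}/{2}.gif", gif_values)
```

In this sketch the single-asterisk pattern `*.jpg` would not match `images/photos/dean.jpg`, because the single wildcard stops at the first slash.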

One issue not directly addressed by Berners-Lee's essay was what to do about HTTP requests for directory URLs that do or do not contain trailing slashes. This issue particularly arises when trying to fulfill the goal of leaving off file extensions from URIs. Most Web sites' URLs include a file extension, such as .html. One advantage of this practice is that it is easy to distinguish between requests for directories (which do not contain an extension) and requests for pages (which do contain an extension). However, when removing the distinction altogether, as Berners-Lee suggests, one has to make a choice.

By far the most common approach to solving this problem is to force the use of a trailing slash via a browser redirect. For example, a request to the following URL:

http://example.com/foo/bar

would result in a redirect to the following URL:

http://example.com/foo/bar/

We decided to take the opposite approach at SU School of Law. When a request comes in with a trailing slash, we serve a redirect to the URL with the trailing slash stripped. For example, a request for the following URL:

http://www.law.seattleu.edu/career/

results in a redirect to:

http://www.law.seattleu.edu/career

This too can be easily achieved in Cocoon with the use of a simple URI matcher:

    <map:match pattern="**/">
      <map:redirect-to uri="{1}"/>
    </map:match>

The pattern in this example will match all URIs with a trailing slash and will send an HTTP redirect response (via the map:redirect-to directive) to the same URI with the trailing slash stripped.

On the SU Law Web site, the "career" in http://www.law.seattleu.edu/career, for example, can also function as a directory, in addition to identifying a page. Here is an example of such a URI path "extension":

http://www.law.seattleu.edu/career/alumni

Mapping these URLs to the filesystem (from which the SU Law Web server grabs most of its pages) would be problematic with the default configuration of most Web servers, since it would imply the existence of a file and a directory with the same name residing in the same directory, which is not possible in most file systems. Cocoon, on the other hand, handles this easily, because the sitemap is flexible in mapping external URIs to internal file names. In the SU Law Web site configuration, the above example URIs map to files in the file system whose names end with .html:

    <map:match pattern="**/*">
      <map:generate src="{1}/{2}.html"/>
      <map:serialize type="xml"/>
    </map:match>

This example shows that the .html suffix is simply appended to the incoming URI (or, in the case of our production sitemap, a URI requested for further processing from another part of the sitemap via the cocoon:/ protocol, which we won't discuss here). In this way, there is no conflict between /career and /career/alumni, because they map to these distinct files on the filesystem:

{site-root}/career.html
{site-root}/career/alumni.html
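
The two rules discussed so far (strip a trailing slash via redirect, otherwise append .html) can be summarized in a short Python sketch. This is an illustration of the mapping only, not how Cocoon evaluates the sitemap; the SITE_ROOT value and the function name are assumptions, and the site's home page is deliberately left out because the matchers shown do not cover it.

```python
from typing import Tuple

SITE_ROOT = "/var/www/law"  # hypothetical location; the paper only says {site-root}


def dispatch(path: str) -> Tuple[str, str]:
    """Return ("redirect", new_path) or ("file", file_name) for a request path."""
    if path == "/":
        # The home page is outside the scope of the two matchers sketched here.
        raise ValueError("root path is not handled by this sketch")
    if path.endswith("/"):
        # Mirrors the "**/" matcher: strip the trailing slash and redirect.
        return ("redirect", path.rstrip("/"))
    # Mirrors the "**/*" matcher: append .html to the requested path.
    return ("file", SITE_ROOT + path + ".html")


career = dispatch("/career")
alumni = dispatch("/career/alumni")
slash = dispatch("/career/")
```

Because the page file career.html and the directory career/ have different names on disk, both results can coexist under the same site root.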

One consequence of this approach is that there is a subtle distinction that has to be kept clear when using relative URIs within a document. Specifically, a relative URI inside the /career page is resolved relative to the root of the site, rather than to the career directory. That's because /career is not treated as a directory by the browser; there is no trailing slash. Consider an example HTML anchor embedded in the /career page:

<a href="alumni">Career alumni page</a>

The above is most likely a mistake, because it links to /alumni rather than to /career/alumni. The correct relative URL would instead appear as in the following link:

<a href="career/alumni">Career alumni page</a>
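
The resolution rule at work here is the standard one for relative references, and it can be checked with Python's urllib.parse.urljoin. The snippet is only a demonstration of that rule, using the URLs from the example above.

```python
from urllib.parse import urljoin

page = "http://www.law.seattleu.edu/career"

# Without a trailing slash, "career" is the last path segment and gets replaced.
wrong = urljoin(page, "alumni")

# Spelling out the path from the site root gives the intended target.
right = urljoin(page, "career/alumni")

# With a trailing slash, the short reference would resolve inside "career".
with_slash = urljoin(page + "/", "alumni")
```

Here `wrong` resolves to http://www.law.seattleu.edu/alumni, while `right` and `with_slash` both resolve to http://www.law.seattleu.edu/career/alumni.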

An advantage of this approach is that any URL path can be extended, i.e. function as a directory in another URL, without shuffling files around or reconfiguring the Web server. In contrast, attaining that flexibility with the traditional approach of forcing the trailing slash usually means storing every page in an index.html file and providing a directory for every page, resulting in an unnecessary proliferation of directories. In any event, we suspected that the tradition of forcing the trailing slash had more to do with the limitations and conveniences of standard server technology than with URI design.

Leave out topic/classification by subject

This one, as Berners-Lee acknowledges, is a bit trickier, as well as subjective. One surefire way to avoid including the "topic" or "classification" in the URI would be to simply use a date stamp and little else. Of course, such an approach would fail to include much meaning in resulting URIs, which we had as a distinct goal, at least for mnemonic value. Nevertheless, we took this proscription to heart, thought about how it might be interpreted, and reformulated it.

Berners-Lee provides an important clue to the nature of the problem: "Because the relationships between subjects are web-like rather than tree-like, even for people who agree on a web may pick a different tree representation." (sic)

URI structure is, of necessity, hierarchical. Site navigation tends to be hierarchical, and the SU Law Web site is no exception. To address the problem Berners-Lee raises (without, by the way, suggesting very many solutions), we formulated the following mandate:

Decouple navigational structure from URI structure.

It is very common for Web site navigation to mirror URI structure. When two pages of a site are in the same navigational "section", they are also typically in the same directory, as exposed in the pages' URIs. That correlation is precisely what we sought to avoid. We achieved this goal through the use of a custom XML configuration file that maps between the two independent hierarchies (navigation and URI structure). An excerpt from navigation.xml is shown below:

<navigation xmlns="http://law.seattleu.edu">
  <menu display="Welcome" sectionId="welcome">
    <link href="/" display="SU Law Home"/>
    <link display="Contact Information" href="/contactus"/>

    <link display="Directions" href="/directions"/>
    <link href="/welcome" display="From the Dean"/>
    <link href="/history" display="History"/>
    <link href="/calendar" display="Master Calendar"/>
    <link href="/mission" display="Mission"/>
    <link href="/search" display="Search"/>

    <link href="/sitemap" display="Site Map"/>
    <link href="http://www.seattleu.edu"
          display="Seattle University Home"/>
    <hidden href="/news" display="News"/>
    <hidden pattern="/news"/>
    <hidden href="/privacy" display="Privacy Statement"/>
  </menu>

  <menu display="Students" sectionId="students">
    <menu display="Academics">
      <link href="/academics" display="Introduction"/>
      <link href="/academics/calendar" display="Academic Calendar"/>
      <link href="/courses" display="Course Descriptions"/>
      <link href="/classassignments" display="Class Assignments"/>

      <hidden pattern="/classassignments"/>
      <link href="/finals" display="Final Exam Schedule"/>
      <link href="/academics/foci" display="Focus Areas"/>
      <hidden href="/academics/foci/businesslaw"
              display="Business Law"/>
      <hidden href="/academics/foci/civiladvocacy"
              display="Civil Advocacy"/>
    </menu>

    <!-- more submenus -->
  </menu>
  <!-- more menus -->
</navigation>

The above snippets from navigation.xml are organized hierarchically as a mirror to the hierarchical navigation of our site (menus containing submenus containing submenus), hence the name navigation.xml. It is used to automatically generate all three navigational schemes of the SU Law Web site, via XSLT: JavaScript for the DHTML (Dynamic Hypertext Markup Language) menu in the "Standard" version, straight XML processed directly by the Flash player in the "Flash" version, and nested HTML lists in the "Text-only" version. It is also used to determine which section a given page should appear in, i.e. which section-specific images and sidebar should be displayed. The hidden elements designate pages that are in a particular section of the site but that are excluded from the navigation menu.
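
As an illustration of the section lookup, the following Python sketch builds an index from page URI to section using a shortened copy of the excerpt above. The production site does this with XSLT; this code is only a sketch of the idea. The function name is invented, and hidden elements that carry only a pattern attribute are skipped because their matching rules are not spelled out here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = "{http://law.seattleu.edu}"

# A shortened copy of the navigation.xml excerpt.
NAVIGATION_XML = """
<navigation xmlns="http://law.seattleu.edu">
  <menu display="Welcome" sectionId="welcome">
    <link href="/" display="SU Law Home"/>
    <link href="/history" display="History"/>
    <hidden href="/privacy" display="Privacy Statement"/>
  </menu>
  <menu display="Students" sectionId="students">
    <menu display="Academics">
      <link href="/academics" display="Introduction"/>
      <hidden href="/academics/foci/businesslaw" display="Business Law"/>
    </menu>
  </menu>
</navigation>
"""


def build_section_index(xml_text: str) -> Dict[str, str]:
    """Map each page href to the sectionId of its top-level menu."""
    root = ET.fromstring(xml_text)
    index: Dict[str, str] = {}
    for top_menu in root.findall(NS + "menu"):
        section = top_menu.get("sectionId")
        # iter() walks nested submenus, so depth in the menu tree does not matter.
        for node in top_menu.iter():
            if node.tag in (NS + "link", NS + "hidden") and node.get("href"):
                index[node.get("href")] = section
    return index


section_of = build_section_index(NAVIGATION_XML)
```

Note that /history lands in the "welcome" section even though nothing in its URI says so; moving the link element to another menu would change the section without changing the URI.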

In an early iteration of the site, Cocoon performed a lookup in navigation.xml on-the-fly for each page of the site in order to determine which section to display the page in. Since then, we have made some optimizations through the use of a static site generation script, Apache, and mod_rewrite, in order to serve static pages directly, without dispatching those requests to Cocoon. Cocoon, however, still plays a crucial role in handling the requests for all dynamic (e.g. database-driven) pages, as well as handling all the cookie-setting and detection logic, by which users' site version preferences are stored. (This updated role of Cocoon is what is reflected in the sitemap flow chart above.) In any case, navigation.xml still serves as the single, authoritative source for the site's navigation hierarchy.

In the end, some of the URIs and navigation section names do in fact coincide, but the point is that this is a coincidence rather than a necessity. This decoupling has already become immensely useful, as some users expressed dislike of the site's navigational structure when it was first released. Because we had previously decoupled site navigation from URI structure, it was very easy to immediately respond to the requests of school faculty and staff by simply moving page declarations to other sections of the site in navigation.xml. Thus, site reorganization was achieved by editing a single file. In contrast, if navigational structure and URI structure had been coupled, such a rearrangement would have necessitated the error-prone update of many HTML pages, not to mention violation of the maxim that "cool URIs don't change". To this day, we have enjoyed the freedom of being able to fine-tune the site's navigational structure without any worry of creating more work or of breaking any links.

The 4Suite XML and RDF repository

As previously mentioned, the majority of the SU Law Web site's content is currently stored on the filesystem. However, there is a growing number of specialized content types, such as announcements and events, that are being managed under a homegrown content management system that has come to be known as "Redhawk" (after the SU mascot). Redhawk is implemented as a 4Suite application.

Fig. 4: Create New Announcement Form

The forms for adding new, or modifying existing, announcements or events were built using Altova's free XML editor, Authentic 5 Browser Edition.

4Suite is a platform for XML and RDF development, especially for Web applications. It is implemented in Python and C, and has two main aspects. At the core is a library of integrated tools for XML processing, including DOM (Document Object Model), SAX, RDF, XPath (XML Path Language), XSLT, RELAX NG, XInclude, XPointer (XML Pointer Language), XUpdate (XML Update Language), XLink (XML Linking Language), and more. A Python developer can use these to stitch together custom XML-processing applications, or use the command-line tools for XML scripting. 4Suite also comes with a repository and server framework, which provides a more complete basis for application developers. It can be used stand-alone to drive Web applications. It can also be accessed through a Python API or the command line in order to integrate with other applications and frameworks. Or, as in the Redhawk project, it can be integrated with other applications by relying on the fact that basic Web architecture (also known as REST) allows for very straightforward composition of Web frameworks.

The 4Suite repository first of all allows the storage of XML and management of persistent RDF models for the developer. It provides the basic expected facilities such as transaction management, access control with respect to authenticated users, and groups of users, various APIs for create, update and query, and administrative tools. It is designed to support multiple platforms and access protocols. The Redhawk project made special use of several particular 4Suite features. For example, in Redhawk, declarative workflow management is enabled through the use of 4Suite user groups and access control lists.

Server-side scripting in XSLT

Systems for developing Web-based applications usually include a mechanism for processing HTTP requests with scripts on the server side. These range from simple CGI (Common Gateway Interface) and Java servlets, which delegate directly to code, to the likes of ASP (Active Server Pages), JSP (Java Server Pages), and Zope's DTML, which allow one to embed code snippets into presentation templates. The 4Suite developers chose a server-side scripting mechanism along the lines of the latter approaches but, rather than inventing a new presentation template language, decided to take advantage of XSLT. XSLT offers a very flexible extension mechanism, and 4Suite includes a library of built-in extensions that allow one to access the repository and XML processing tools from XSLT scripts, and that provide some of the other facilities common to Web-based server-side scripting, such as:

HTTP header access
Access to data entered into Web-based forms (application/x-www-form-urlencoded and multipart/form-data)
Session management

The developer can then use all the presentation and templating power of XSLT as usual. Content can be separated from presentation using chained stylesheets, modularized templates and XSLT imports.

The following is an example of a 4Suite server-side script. It illustrates the basic extension functionality used in Redhawk to create new XML resources via a Web interface.

<?xml version="1.0"?>
<xsl:transform version="1.0"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:f="http://xmlns.4suite.org/ext"
              xmlns:fcore="http://xmlns.4suite.org/4ss/score"
              xmlns:frdf="http://xmlns.4suite.org/4ss/rdf"
              xmlns:fhttp="http://xmlns.4suite.org/4ss/http"
              xmlns:exslt="http://exslt.org/common"
              extension-element-prefixes="f frdf fhttp fcore exslt"
>

  <!-- These parameters are automatically passed from
       values entered into the Web form -->
  <xsl:param name="xml"/>
  <xsl:param name="redirect-to"/>

  <xsl:template match='/'>
    <xsl:variable name="new-file-name" select="f:generate-uuid()"/>

    <fcore:create-document
      type="http://schemas.4suite.org/4ss#xmldocument"
      path="/redhawk/announcements/data/{$new-file-name}"
      docdef="/redhawk/announcements/docdef"
    >
      <xsl:copy-of select="f:parse-xml($xml)"/>
    </fcore:create-document>

    <!-- redirect to a response "thank you" URL -->
    <fhttp:response-uri uri="{$redirect-to}"/>

  </xsl:template>

</xsl:transform>

First you'll notice the namespaces used for extensions, described below according to the prefixes used in the example.

f: common 4Suite processor XSLT extension functions and elements that offer general, handy facilities to the developer and are not directly tied to the repository.
fcore: extensions for basic repository tasks, such as adding, updating and deleting documents (reading can be done using specialized extensions or the plain old XSLT document() function).
fhttp: extensions for handling HTTP request or response details.
frdf (not actually used in this example, but used elsewhere in Redhawk): special XSLT extensions for accessing the repository RDF database, including for embedding queries.
exslt: EXSLT (Community Extensions to Extensible Stylesheet Language Transformations), a popular library of extension functions and elements that are supported by a variety of processors, including 4Suite.

4Suite XSLT scripts can accept data from Web forms as top-level XSLT parameters. When the user enters data into a Web browser form, usually, the fields in the form are communicated to the server encoded in the HTTP request. This form data is interpreted by 4Suite and passed to the XSLT handler. The parameter name typically matches the name of the corresponding Web form element, and must be declared in the XSLT script. Parameters can also be passed in as URL parameters, and, in fact, the above example is designed to handle an HTML form similar to the following snippet:

<form method="post"
      action="/source/?xslt=handler.xslt&amp;redirect-to=/thankyou-page">
  <textarea name="xml"></textarea>
</form>

You can see the parameters xml and redirect-to: the former comes from the text area field posted through the form, and the latter is embedded in the target handler URL for that form. The /source/?xslt=handler.xslt portion of that URL instructs 4Suite to retrieve the resource at /source/ as an XML source file for an XSLT transform, and then apply the XSLT script at /source/handler.xslt (matching the above listing) to this source document. The XSLT script in this case doesn't really make use of the source document, and draws all its input from the passed-in parameters, so you could use any dummy file as source.
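
To show where each parameter travels, here is a sketch in Python of the request such a form would produce. It only constructs the request object and sends nothing. The host name http://redhawk is borrowed from the earlier sitemap example; that the form handler lives on that host is an assumption, as is the sample announcement text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

BASE = "http://redhawk"  # host from the sitemap example; assumed to serve /source/

# Parameters carried in the handler URL (the form's action attribute).
query = urlencode({"xslt": "handler.xslt", "redirect-to": "/thankyou-page"})
url = BASE + "/source/?" + query

# Parameter carried in the POST body (the textarea named "xml").
body = urlencode({"xml": "<announcement><headline>Hello</headline></announcement>"})

request = Request(url, data=body.encode("ascii"), method="POST")
request.add_header("Content-Type", "application/x-www-form-urlencoded")

# Decode both parts again to see which names arrive by which route.
url_params = parse_qs(urlsplit(request.full_url).query)
body_params = parse_qs(request.data.decode("ascii"))
```

The xslt parameter selects the script, while redirect-to (from the URL) and xml (from the body) are the two values the script declares with xsl:param.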

In the starting template of the XSLT script (match='/'), the processing of the form begins in earnest. The goal is to create a new XML document using a unique file name, containing the content entered into the form. The first step is to create the unique file name. This is done by generating a UUID (Universally Unique Identifier), for example 264f537a-1e2d-496d-9118-ec15745d3be3. This is a common technique which makes chances of filename collision infinitesimally small. 4Suite provides UUID generation using the extension function f:generate-uuid(). An XML document is created in the repository using this file name, as shown in the following excerpt:

    <fcore:create-document
      type="http://schemas.4suite.org/4ss#xmldocument"
      path="/redhawk/announcements/data/{$new-file-name}"
      docdef="/redhawk/announcements/docdef"
    >
      <xsl:copy-of select="f:parse-xml($xml)"/>
    </fcore:create-document>

The fcore:create-document element creates a document in the repository at the given path using the content created by instantiating its child instructions, which are treated as an XSLT template. The type="http://schemas.4suite.org/4ss#xmldocument" attribute marks the file as a plain XML document, which triggers some special processing in 4Suite. Other types that can be specified include XSLT and RDF documents or even plain-text (non-XML) files. The path="/redhawk/announcements/data/{$new-file-name}" attribute is an XSLT AVT (Attribute Value Template) which prepends a base path to the computed file name to specify the full path of the generated file. The docdef="/redhawk/announcements/docdef" attribute associates the newly created document with an XPath document definition, which will be discussed in the next section of this paper. The content uses the f:parse-xml() extension function to parse the flat XML text coming from the form into a node set. This node set is then used as the source for the created XML file. This XML file is created in the current transaction, which will automatically be committed if the script completes without error (unless the user uses an explicit <fcore:rollback/> instruction to instruct 4Suite otherwise). The final task is to present a response page to the user, which is the next thing they'll see after submitting the form:

    <fhttp:response-uri uri="{$redirect-to}"/>

An AVT is used to provide a URL which is sent to the browser as a redirect courtesy of the fhttp:response-uri instruction. The browser is thus redirected to the page coded into the original form's action attribute. This overrides the default behavior, which is to send the XSLT result tree serialization as the body of an HTTP 200 response. If there is an fhttp:response-uri element, as in this example, then an HTTP 302 response is sent instead, so the browser never sees the XSLT result.
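The steps this script performs (generate a unique name, parse the submitted text, store it, redirect) can be sketched outside 4Suite. The following Python is only an illustration of that flow, not 4Suite code: the dictionary standing in for the repository, the handle_submission name, and the use of uuid.uuid4() in place of f:generate-uuid() are all assumptions made for the example.

```python
import uuid
import xml.etree.ElementTree as ET

# Base path taken from the excerpt above.
BASE_PATH = "/redhawk/announcements/data/"


def handle_submission(xml_text, redirect_to, repository):
    """Store submitted XML under a UUID-based name and answer with a redirect.

    `repository` is a plain dict standing in for the 4Suite repository
    (an assumption made for this sketch).
    """
    # Parse first: malformed input raises ET.ParseError before anything is
    # stored, loosely mirroring a transaction that is not committed on error.
    root = ET.fromstring(xml_text)

    # Random UUID used as the file name (stand-in for f:generate-uuid()).
    new_file_name = str(uuid.uuid4())
    path = BASE_PATH + new_file_name
    repository[path] = ET.tostring(root, encoding="unicode")

    # A 302 with a Location header, instead of a 200 with a response body.
    return 302, {"Location": redirect_to}, path
```

Calling handle_submission with well-formed XML returns the status, the headers and the path under which the document was stored; the browser would then follow the Location header and never see a response body.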

4Suite's administration tools include a Web-based console, called the Dashboard, which can be used to view data and metadata from the repository. The following figure shows the Dashboard view of a document created using the Redhawk forms handled by the above XSLT script.

Fig. 5: 4Suite Dashboard

After a resource is stored in the repository, it can be seen in the 4Suite Dashboard, a Web-based administrative interface that comes with 4Suite.

XML/metadata synchronization using document definitions

4Suite provides a mechanism, called document definitions, for defining rules that govern how documents in the repository are managed, including facilities to synchronize XML content with metadata in the RDF model. A document definition can specify the schema type and location for validation, whether or not to maintain a full-text index of associated documents, and mappings for updating the repository's RDF database with information from the XML file. The following is an example of content in the system, an announcement. Some parts are written for human readers, while others are formatted more for machine processing. 4Suite allows the latter to be easily mirrored into the RDF model.

<?xml version="1.0" encoding="UTF-8"?>
<announcement>
  <headline>SU School of Law Launches Redhawk</headline>
  <body>
    <p>The School of Law Technology Department today launched the new
Redhawk Content Management System. Redhawk enables information owners
to post announcements and events to The Sullivan Docket as well as the
School of Law Web site. A user's ability to post is dictated by the
type of information they are responsible for.</p>

  </body>
  <display-properties>
    <docket priority="normal" end-dateTime="2003-03-17T01:00:00"
            start-dateTime="2003-03-14T01:00:00"/>
    <web priority="high" category="home"
         start-date="2003-03-17" end-date="2003-03-31"/>
  </display-properties>
</announcement>

One form of mapping from the XML into the 4Suite RDF database uses a set of triples of XPath expressions to create corresponding RDF triples in the repository database. Document definitions that use this approach are called XPath document definitions. The following is an example of a document definition designed for processing the above announcement format:

<!DOCTYPE DocDef [
<!ENTITY ann "http://redhawk.seattleu.edu/announcements/schema#">
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
]>
<DocDef xmlns="http://xmlns.4suite.org/reserved" xmlns:rdf="&rdf;">
  <RdfMappings>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&rdf;type'</Predicate>
      <Object type="rdf:Resource">'&ann;Announcement'</Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;headline'</Predicate>
      <Object type="rdf:Literal">/announcement/headline</Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;docket-priority'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/docket/@priority
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;docket-start-dateTime'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/docket/@start-dateTime
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;docket-end-dateTime'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/docket/@end-dateTime
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;web-category'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/web/@category
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;web-priority'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/web/@priority
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;web-start-date'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/web/@start-date
      </Object>
    </RdfMapping>
    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;web-end-date'</Predicate>
      <Object type="rdf:Literal">
        /announcement/display-properties/web/@end-date
      </Object>
    </RdfMapping>
  </RdfMappings>
</DocDef>

The RdfMappings element wraps a series of mappings, each of which has three child elements that contain XPath expressions. The expressions are evaluated to become the corresponding part of an RDF statement. Take the following excerpt from the document definition:

    <RdfMapping>
      <Subject>$uri</Subject>
      <Predicate>'&ann;headline'</Predicate>
      <Object type="rdf:Literal">/announcement/headline</Object>
    </RdfMapping>

The Subject content is an XPath expression that evaluates to the ID of the XML document being added or updated, by referencing the special variable $uri. The Predicate expression is a simple string literal (partially supplied by an entity reference). The Object, which is specified as an RDF literal rather than a resource, comes from another XPath expression that retrieves the XPath string value of the headline element child of the root announcement element.

We have made a graph available (at http://www.law.seattleu.edu/xmleurope/sample-announcement-rdf.jpg) that displays the resulting RDF model from the operation of this document definition on the sample announcement above. It should help make the operation of the RDF mappings clearer. This graph was produced using the Triclops feature of the 4Suite administrative console, which allows you to produce interactive graphs of portions of the RDF model, including the results of Versa queries.
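To make the mapping step concrete, here is a minimal Python sketch of the same idea: walk a list of mappings and emit one (subject, predicate, object) triple per mapping. It is not how 4Suite implements document definitions. The MAPPINGS table, the extract_triples function, and the simplified paths (ElementTree supports only a subset of XPath, so each attribute is named separately) are assumptions made for the example; the namespace URIs are the ones from the document definition above.

```python
import xml.etree.ElementTree as ET

ANN = "http://redhawk.seattleu.edu/announcements/schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical, simplified form of the RdfMapping entries:
# (predicate local name, path below <announcement>, attribute name or None)
MAPPINGS = [
    ("headline", "headline", None),
    ("docket-priority", "display-properties/docket", "priority"),
    ("web-category", "display-properties/web", "category"),
    ("web-start-date", "display-properties/web", "start-date"),
]


def extract_triples(uri, xml_text):
    """Return the RDF-style triples the mappings yield for one document."""
    root = ET.fromstring(xml_text)
    # The first mapping in the document definition types the resource.
    triples = [(uri, RDF_TYPE, ANN + "Announcement")]
    for name, path, attr in MAPPINGS:
        node = root.find(path)
        if node is None:
            continue
        # Element: concatenated text content (like the XPath string value).
        # Attribute: the attribute value.
        value = "".join(node.itertext()) if attr is None else node.get(attr)
        if value is not None:
            triples.append((uri, ANN + name, value))
    return triples
```

Applied to the sample announcement, every triple has the document's URI as its subject, which is what lets queries later find a document from its metadata.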

Versa RDF query

The metadata needed to process Redhawk announcements and events is managed in the 4Suite RDF database system. 4Suite provides a strong binding between XSLT and RDF processing, making it straightforward to express the queries that guide the content. The query language used is Versa ([VERSA]), which is embedded in XSLT through extension elements. Versa is designed to be integrated into other programming languages and systems, such as XSLT, and is inspired in many ways by XPath. It takes a very different point of view from other RDF query languages, which are based either on SQL idioms or on formal logic idioms. Some of the advantages of this approach are:

Strong alignment with XML. From the start Versa was designed for integration into other XML processing technologies. An example is the use of Versa directly from XSLT, through which one can process query results in XML form.
XPath-like idiom. The Versa syntax is inspired by XPath. There are some key deviations in order to accommodate the fact that RDF is a graph rather than a hierarchy, but Versa chained traversals should be familiar to those who have worked with XPath location paths. Other matters such as the emphasis on functions borrow strengths from XPath.
Extensibility. Versa borrows XPath's extension mechanism, allowing users to add needed features to Versa in a standard way.
Ease of learning. Many users have reported that they became proficient very quickly with Versa. Some other RDF query languages have the advantage of similarity to the popular SQL, but this doesn't help those with no SQL background. Also, RDF differs enough from SQL that it is easy to run into model mismatches that the syntax doesn't make clear.
Querying features. Versa offers features that are unfortunately sparse in RDF query languages, such as a full complement of logical and string operations.

Here are some examples of Versa queries, as used in the Redhawk application:

type(ann:Announcement) - ann:headline -> *

This basic construct is called a traversal expression, and retrieves all the ann:headline values for resources of RDF type ann:Announcement. The abbreviated names use the RDF serialization convention, concatenating a base URI associated with the prefix to the tail part. Versa, like RDF/XML, calls this mapping from prefixes to URIs "namespaces", but it doesn't in itself define a mechanism for namespace mappings. However, 4Suite allows you to embed Versa in XSLT, and in this case, Versa borrows the namespace declarations in scope on the element in which it is invoked. In the above example, all resources of RDF type ann:Announcement are taken as the starting point of the traversal. It returns the results of traversing along the arcs identified by the resource ann:headline. Another way of putting this is that it returns the objects of all statements whose subject is of RDF type ann:Announcement and whose predicate is ann:headline.

all() |- ann:web-start-date -> eq("2003-03-17")

This is also a traversal expression, but with some differences from the above example. First of all, it uses the |- notation, which means "return the starting points, or subjects of statements, rather than the objects". The starting point in this case is not a set of resources limited to a specific type, but rather the result of the all() function, which returns a set of all resources in the RDF model. In addition to using a different predicate resource, there is an added filter on the objects. In this case, they must be equal to the string "2003-03-17". So this query returns all resources which have an ann:web-start-date property with the value "2003-03-17".
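The semantics of these two traversals can be sketched over an in-memory set of triples. The Python below is not Versa and makes several assumptions for illustration: the function names, the abbreviated identifiers used as plain strings, the second sample announcement, the use of sets for results, and the simplification that all_resources returns only resources appearing as subjects.

```python
# Hypothetical sample model; prefixed names are kept as plain strings.
TRIPLES = {
    ("/data/a1", "rdf:type", "ann:Announcement"),
    ("/data/a1", "ann:headline", "SU School of Law Launches Redhawk"),
    ("/data/a1", "ann:web-start-date", "2003-03-17"),
    ("/data/a2", "rdf:type", "ann:Announcement"),
    ("/data/a2", "ann:headline", "Library Hours Extended"),
    ("/data/a2", "ann:web-start-date", "2003-04-01"),
}


def of_type(triples, rdf_type):
    """Rough analogue of type(...): resources with the given rdf:type."""
    return {s for s, p, o in triples if p == "rdf:type" and o == rdf_type}


def all_resources(triples):
    """Rough analogue of all(), simplified to resources used as subjects."""
    return {s for s, _, _ in triples}


def forward(triples, subjects, predicate, keep=lambda o: True):
    """subjects - predicate -> filter : return the matching objects."""
    return {o for s, p, o in triples
            if s in subjects and p == predicate and keep(o)}


def backward(triples, subjects, predicate, keep=lambda o: True):
    """subjects |- predicate -> filter : return the matching subjects."""
    return {s for s, p, o in triples
            if s in subjects and p == predicate and keep(o)}


# type(ann:Announcement) - ann:headline -> *
headlines = forward(TRIPLES, of_type(TRIPLES, "ann:Announcement"),
                    "ann:headline")

# all() |- ann:web-start-date -> eq("2003-03-17")
starting = backward(TRIPLES, all_resources(TRIPLES), "ann:web-start-date",
                    lambda o: o == "2003-03-17")
```

The two helper functions differ only in which end of the statement they return, which is the whole distinction between the - and |- notations.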

Using Cocoon and 4Suite together

To understand how Cocoon is used with 4Suite at the Seattle University School of Law, recall the first Cocoon configuration example we presented:

    <map:match pattern="career/news">
      <map:generate
        src="http://redhawk/?xslt=getNews.xsl&amp;category=career"/>
      <map:transform src="stylesheets/newsToHtml.xsl"/>
      <map:serialize type="xhtml"/>
    </map:match>

Again, this example is a simplification of the actual logic used in the SU Law Web site (which uses internal redirects among other things), but it clearly shows the extent of the contract between the Web site (Cocoon) and the Redhawk application (4Suite). To generate the content for the /career/news page, an HTTP GET request is made to the redhawk server (which may or may not reside on the same machine), passing two query string parameters: xslt, which identifies the stylesheet to invoke on request; and category, which indicates the Web site category we are currently interested in. 4Suite holds up its end of the contract by invoking the getNews.xsl stylesheet, which includes embedded Versa queries such as the following:

<frdf:versa-query query="
    type(ann:Announcement) |- ann:web-category -> eq('{$category}')
"/>

The frdf:versa-query extension element executes a Versa query and instantiates its result (using a standard XML serialization for Versa results) in its place in the result tree. In this case, an AVT is used to parameterize the query with the value of the category XSLT parameter as passed from the HTTP GET request initiated by Cocoon. The stylesheet can subsequently process the resulting list of resource URIs by aggregating their contents via use of the standard XSLT document() function, which retrieves the contents of individual XML resources in the 4Suite repository. Finally, the resulting aggregation of XML announcements is sent as an HTTP response to Cocoon, completing 4Suite's role in the interchange.
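The contract between the two servers is small enough to express in a few lines. As a hedged illustration in Python (the function names are hypothetical and no request is actually sent), the sketch below builds the GET URL from the two query string parameters and shows the form that URL takes inside an XML attribute, where the ampersand has to be escaped.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr


def redhawk_url(stylesheet, category, base="http://redhawk/"):
    """Build the GET URL: `xslt` names the stylesheet, `category` the page."""
    return base + "?" + urlencode({"xslt": stylesheet, "category": category})


def as_src_attribute(url):
    """Render the URL as a src="..." attribute, escaping & for XML."""
    return "src=" + quoteattr(url)


url = redhawk_url("getNews.xsl", "career")
attribute = as_src_attribute(url)
```

Here url is the address the HTTP client requests, while attribute is how the same address is written in an XML configuration file; after XML parsing the two denote the same string.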

Future considerations

To utilize the flexibility of Cocoon on the authoring side in addition to the publishing side would require a proxy-like generator to faithfully forward HTTP requests (POST form submissions in addition to GET requests) to another server (4Suite in our case), and return the response as the generated XML. Certain efforts are underway in the Cocoon project to provide this sort of functionality. In particular, the WebServiceProxyGenerator (http://xml.apache.org/cocoon/userdocs/generators/wsproxy-generator.html) shows some promise in this area. Another aspect of this problem is providing support for reverse redirects (a la Apache's ProxyPassReverse directive).

The current Redhawk application uses HTTP POST with a request body of type application/x-www-form-urlencoded, in which a single HTTP parameter (named "xml") carries the new XML document. Given that 4Suite has budding support for WebDAV (Web-based Distributed Authoring and Versioning), future versions of the Redhawk application may be based on WebDAV instead.
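For readers unfamiliar with this encoding, the following Python sketch shows what such a request body looks like and how the document is recovered from it. The function names are hypothetical, nothing is sent over the network, and only the parameter name "xml" and the content type come from the description above.

```python
from urllib.parse import urlencode, parse_qs


def encode_submission(xml_text):
    """Return (body, headers) for a form-style POST carrying the document."""
    # urlencode percent-encodes markup characters, so the body is ASCII.
    body = urlencode({"xml": xml_text}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return body, headers


def decode_submission(body):
    """Recover the flat XML text from the `xml` form parameter."""
    return parse_qs(body.decode("ascii"))["xml"][0]
```

The receiving side gets the document back as flat text, which is why the server-side script still has to parse it into nodes before storing it.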

Currently, 4Suite API calls and HTML output instructions are merged into single XSLT scripts. A more flexible design would follow the model/view split in the scripts, which has recently become possible with 4Suite's addition of stylesheet chaining primitives (similar to Cocoon's). This would also allow the current apparatus to be re-purposed more smoothly for other output media.

Finally, there has been a lot of discussion in the 4Suite community about schema-driven processing, in which RELAX NG schemata could provide not only the RDF/XML mapping hints currently provided through document definitions, but could also drive automatic form generation and handling using technologies such as XForms. This would allow the Redhawk portion of the application to be designed in an even more declarative manner than at present, improving maintenance and extensibility. The current Redhawk application uses Altova's Authentic 5 browser-based XML editor, which fits a similar bill (declarative validation, auto-generated forms, etc.). However, since it runs only on Internet Explorer, this approach is only appropriate when you have control over which Web browsers are being used. An XForms or RELAX NG server-side solution would provide many of the same benefits without sacrificing interoperability.

Bibliography

[4SUITE]: 4Suite, http://4Suite.org
[RDFINTRO1]: U. Ogbuji, An introduction to RDF, http://www-106.ibm.com/developerworks/xml/library/w-rdf/
[RDFINTRO2]: U. Ogbuji, The Languages of the Semantic Web, http://www.newarchitectmag.com/documents/s=2453/new1020218556549/index.html
[REST]: REpresentational State Transfer (REST), http://internet.conveyor.com/RESTwiki/moin.cgi/
[COCOON]: Cocoon, http://cocoon.apache.org
[COOLURIS]: T. Berners-Lee, Cool URIs don't change, http://www.w3.org/Provider/Style/URI.html
[VERSA]: The Versa Home Page, http://uche.ogbuji.net/tech/rdf/versa/
[VERSAINTRO]: U. Ogbuji, RDF Query using Versa, http://www-106.ibm.com/developerworks/xml/library/x-think10/index.html
[WEBDAV]: WebDAV, http://www.webdav.org/
[SUSL]: Seattle University School of Law, http://www.law.seattleu.edu/
[SAX]: Simple API for XML, http://www.saxproject.org/
[XHTML]: XHTML, http://www.xhtml.org/
[SVG]: Scalable Vector Graphics (SVG), http://www.w3.org/TR/SVG/
[XSLFOINTRO]: G. K. Holman, What is XSL-FO?, http://www.xml.com/pub/a/2002/03/20/xsl-fo.html
[XINCLINTRO]: E. R. Harold, Using XInclude, http://www.xml.com/pub/a/2002/07/31/xinclude.html
[LINKING]: W3C XML Pointer, XML Base and XML Linking, http://www.w3.org/XML/Linking
[RELAX NG]: RELAX NG, http://www.oasis-open.org/committees/relax-ng/
[URI]: Universal Resource Identifiers in WWW, http://www.w3.org/Addressing/URL/uri-spec.html
[XFORMS]: J. Rivera, L. Taing, Get ready for XForms, http://www-106.ibm.com/developerworks/xml/library/x-xforms/?dwzone=xml
[XSLT]: W3C XSL Transformations (XSLT), http://www.w3.org/TR/xslt
[RFC2388]: RFC 2388 defines multipart/form-data, http://www.faqs.org/rfcs/rfc2388.html