+ Page 1 + ----------------------------------------------------------------- The Public-Access Computer Systems Review Volume 5, Number 3 (1994) ISSN 1048-6542 ----------------------------------------------------------------- To retrieve an article file as an e-mail message, send the GET command given after the article information to listserv@uhupvm1.uh.edu. (Files are also available from the University of Houston Libraries' Gopher server: info.lib.uh.edu, port 70.) CONTENTS COMMUNICATIONS Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries By John Price-Wilkin (pp. 5-21) To retrieve this file: GET PRICEWIL PRV5N3 F=MAIL The World-Wide Web (also called the Web) is a very promising tool for libraries to use to explore the delivery of rich and complex documents. Nevertheless, there are many limitations in the Web's HTML markup language and the ability of Web servers to deliver structured information. This paper explores the benefits and limitations of the Web in the context of several projects taking place at the University of Virginia, both in the Library and in the University's Institute for Advanced Technology in the Humanities. A gateway between the Web and the SGML-based PAT system that helps to overcome the Web's inherent limitations is also described. + Page 2 + ----------------------------------------------------------------- The Public-Access Computer Systems Review ----------------------------------------------------------------- Editor-in-Chief Charles W. Bailey, Jr. University Libraries University of Houston Houston, TX 77204-2091 (713) 743-9804 Internet: lib3@uhupvm1.uh.edu Associate Editors Columns: Leslie Pearse, OCLC Communications: Dana Rooks, University of Houston Editorial Board Ralph Alberico, University of Texas, Austin George H. Brett II, Clearinghouse for Networked Information Discovery and Retrieval Priscilla Caplan, University of Chicago Steve Cisler, Apple Computer, Inc. 
Walt Crawford, Research Libraries Group Lorcan Dempsey, University of Bath Pat Ensor, University of Houston Nancy Evans, Pennsylvania State University, Ogontz Charles Hildreth, READ, Ltd. Ronald Larsen, University of Maryland Clifford Lynch, Division of Library Automation, University of California David R. McDonald, Tufts University R. Bruce Miller, University of California, San Diego Paul Evan Peters, Coalition for Networked Information Mike Ridley, University of Waterloo Peggy Seiden, Skidmore College Peter Stone, University of Sussex John E. Ulmschneider, North Carolina State University + Page 3 + Technical Support Tahereh Jafari, University of Houston Publication Information Published on an irregular basis by the University Libraries, University of Houston. Technical support is provided by the Information Technology Division, University of Houston. Circulation: 8,202 subscribers in 65 countries (PACS-L) and 2,562 subscribers in 52 countries (PACS-P). Back issues are available from listserv@uhupvm1.uh.edu. To retrieve a cumulative index to the journal, send the following e-mail message to the list server: GET INDEX PR F=MAIL. Back issues are also available from the University of Houston Libraries' Gopher server. Point your Gopher client at info.lib.uh.edu, port 70, and follow this menu path: Looking for Articles Electronic Journals University of Houston Libraries E-Journals The Public-Access Computer Systems Review The journal's URL is gopher://info.lib.uh.edu:70/11/articles/e-journals/uhlibrary/pacsreview. The first three volumes of The Public-Access Computer Systems Review are also available in book form from the American Library Association's Library and Information Technology Association (LITA). The price of each volume is $17 for LITA members and $20 for non-LITA members. All three volumes can be ordered as a set for $45 (indicate that you want the PACS Review set, order number 7712-X).
To order, contact: ALA Publishing Services, Order Department, 50 East Huron Street, Chicago, IL 60611-2729, (800) 545-2433. + Page 4 + ----------------------------------------------------------------- The Public-Access Computer Systems Review is an electronic journal that is distributed on the Internet and on other computer networks. There is no subscription fee. To subscribe, send an e-mail message to listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name Last Name. The Public-Access Computer Systems Review is Copyright (C) 1994 by the University Libraries, University of Houston. All Rights Reserved. Copying is permitted for noncommercial use by academic computer centers, computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission. ----------------------------------------------------------------- + Page 5 + ----------------------------------------------------------------- Price-Wilkin, John. "Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries." The Public-Access Computer Systems Review 5, no. 3 (1994): 5-21. To retrieve this file, send the following e-mail message to listserv@uhupvm1.uh.edu: GET PRICEWIL PRV5N3 F=MAIL. (The file is also available from the University of Houston Libraries' Gopher server: info.lib.uh.edu, port 70.) ----------------------------------------------------------------- 1.0 Introduction The World-Wide Web (also called the Web) is a very promising tool for libraries to use to explore the delivery of rich and complex documents. [1] Nevertheless, there are many limitations in the Web's HTML markup language and the ability of Web servers to deliver structured information.
This paper explores the benefits and limitations of the Web in the context of several projects taking place at the University of Virginia, both in the Library and in the University's Institute for Advanced Technology in the Humanities. A gateway between the Web and the SGML-based PAT system that helps to overcome the Web's inherent limitations is also described. 2.0 SGML and TEI The most worthwhile products that libraries can buy are ones that conform to standards and are not tied to a specific software package or operating system. These are the only products with enduring value. Certainly, there are exciting electronic resources being produced for specific software packages and operating systems, but the extent to which libraries can build collections of hypertext resources that are usable in the future will depend entirely on the conformance of their resources to true national and international standards. The most important standard for this discussion is SGML, a standard designed to express the organization of documents and to accommodate even the most complex multimedia materials. + Page 6 + A brief (and admittedly superficial) discussion of SGML and the Text Encoding Initiative may be helpful. SGML (Standard Generalized Markup Language, ISO 8879) is a standard approved by the ISO for the descriptive markup of documents. The language of SGML is sufficiently flexible that the sense of "document" has been expanded to include coordinated time-based elements of hypermedia (e.g., animated dance, music, and character-based score and choreography moving in synchrony at a pace controllable by the user). SGML is not a tag set: there are no pre-set tags. Instead, SGML is a set of rules (or a grammar) for articulating that vocabulary. These rules are sufficiently rigorous so that specialized software can check the validity or conformance of a document. 
The specification of that grammar is a DTD (Document Type Definition); the DTD can also function to document many decisions about the organization of a text. Without that validity--i.e., without being parsed against a DTD--the document is not SGML encoded, although it may share many of the characteristics of SGML. For our work at Virginia, the most notable of these characteristics has been the descriptive nature of the tagging. Rather than saying that an element of the text appears in bold, 17 point Helvetica, centered at the top of a new page, we use the tags to define the function of a textual element (e.g., a title). The tag set used must necessarily elaborate the elements of the texts we see in an academic environment: a tag set designed for articles or documentation, for example, will omit important elements needed for encoding poetry. To serve those needs, the Text Encoding Initiative (or TEI) has published a set of guidelines for the application of SGML to texts in the humanities. Functions of the text or hypertext, expressed descriptively and with a standard language, are freed from the constraints of a specific software package or application. SGML-encoded works can serve a variety of functions, depending on the user's needs and available software. + Page 7 + 3.0 The Potential of the Web The Web uses a client/server architecture. Sophisticated Web clients, such as Mosaic, offer an exciting sense of the possibilities of electronic publishing on the network. Several long-awaited revolutionary concepts are incipient in all aspects of the relationship between client, server, and publication. These concepts are: o Open systems--the ability to make resources available to a variety of operating systems and a variety of applications is evident throughout the Web. Computers running X Windows, Microsoft Windows, and the Macintosh System 7 all participate equally.
In addition to Mosaic, other clients, such as Cello and OmniWeb, are available. Multimedia tools, such as image viewers, are a matter of personal choice. o Standards--given the Web's use of HTML, the importance of standards is heightened, and HTML is inexorably moving toward greater expressiveness and greater conformance to the SGML standard. o Distributed information--the notion of a universe of distributed information, scattered throughout the Internet while being conceptually linked to other information, is becoming a reality through the use of the Web. 4.0 Representative Web Projects Over the past two years at the University of Virginia, faculty and staff involved in several projects began to develop a variety of electronic materials using the SGML standard. Partly this was to serve already apparent needs, but it was also to take advantage of the potentials of electronic publishing. While the Library's Electronic Text Center and, later, its Digital Image Center began to develop skills in creating electronic materials in standard formats for networked access, scholars at the Institute for Advanced Technology in the Humanities undertook the daunting task of composing advanced, standards-based electronic research materials without having the tools with which to publish these materials. With the introduction of Mosaic, the Web was quickly seen as a way to deliver these materials, and, with relative ease, large bodies of SGML-encoded material were converted to HTML for Web access. In order to focus on particular aspects of those projects, the following example projects are divided into sections on editions, history, image archives, and instruction. + Page 8 + 4.1 Editions In general, the Web offers creators of editions of literary or other works the ability to represent a vast, interconnected web of scholarly resources in a variety of different ways. 
The user might view the resources simply, as in an edition of a work without the introduction of a critical apparatus. A more complex approach is also possible, with the user following the critical apparatus at every turn. And finally, a rich and scholarly approach is possible, allowing the user to view manuscript (or printing) evidence or to examine the editor's assessment of the evidence by comparing high-quality scans of original pages to the marked-up transcriptions. With proper markup, an edition can be viewed in as many ways as the reader desires. It can be a variorum, a study edition, a critical edition, or historical evidence. The form the edition takes is defined by the user's needs or preferences. 4.1.1 British Poetry The British Poetry Archive documents are perhaps the simplest of those discussed here. (The project's URL is http://www.lib.virginia.edu/etext/britpo/britpo.html.) The two texts now available were transcribed by students in Jerome McGann's graduate courses. In addition to the SGML-encoded text itself, each work includes material such as introductions, notes, and glosses as well as high-quality digital facsimiles of pages from the original editions. The materials are freely available on the Internet, and Mr. McGann hopes that others will contribute to the archive. These texts represent the simplest of the hypertext editions available on the University of Virginia's Web, with supporting materials providing potential deviations from an otherwise linear progression. The texts were encoded in TEI-conformant SGML with the assistance of the Library's Electronic Text Center, and they were then converted to HTML for the purpose of making them available on the Web. + Page 9 + 4.1.2 Dante Gabriel Rossetti To date, the most fully developed project is Jerome McGann's ongoing edition--or archive--of the works of Dante Gabriel Rossetti. (The project's URL is http://jefferson.village.virginia.edu/rossetti/rossetti.html.)
According to McGann, the Rossetti archive is: a hypermedia environment for studying the works of the Pre-Raphaelite poet and painter D. G. Rossetti (1828-1882). The archive is a structured database holding digitized images of Rossetti's works in their original documentary forms. Rossetti's poetical manuscripts, early printed texts--including proofs and first editions--as well as his drawings and paintings are stored in the archive, in full color as needed. The materials are marked up for electronic search and analysis, and they are supplied with full scholarly annotations and notes. [2] The organization of the archive is designed to capitalize on the uniquely intertwined nature of Rossetti's artistic process, linking image to text and text to image. When Rossetti accompanied a painting with sonnets, the poems are included in the archive along with an image of the painting. When Rossetti illustrated a poem with a painting, an image of the painting is included. Since Rossetti frequently designed his own editions, electronic versions of his print works, with linked text and images, are also available. McGann describes the difficulty of studying Rossetti's works in a traditional print environment, and then sets about trying to overcome those difficulties by melding the resources in a way that allows the reader to follow the threads of art, poetry, or translations without losing access to the other materials. 4.1.3 Piers Plowman The third project was begun in the 1994-95 academic year by one of the most recent Institute fellows, Hoyt Duggan. (The project's URL is http://jefferson.village.virginia.edu/piers/archive.goals.html.) + Page 10 + Mr. Duggan, an accomplished editor of Middle English texts, created an edition of the Piers Plowman B text using the Web. More in the model of the traditional scholarly edition, Mr. Duggan's project brings together transcription and facsimile to resolve vexing editorial problems.
When the scribe uses an abbreviation to represent a letter combination (e.g., a barred "p" for "pre"), the reader typically wants the editor's best judgement in rendering what was intended (i.e., "pre"). Many of those decisions deal with unambiguous evidence, and some with less certain evidence. Through SGML, both the suspension or abbreviation and the editor's reading of the character are registered. To the greatest extent possible, digital facsimiles of all seventeen surviving manuscripts will be included. With facsimile evidence, it is always possible to return to something resembling the original document to evaluate the editor's decision. Duggan has also found that it is possible to create extremely high-resolution images that, with enlargement and other digital treatments, can reveal important new information about the original composition. 4.2 History With new technological tools, historians are offered both challenges and opportunities. Electronic resources allow them to blend evidence and interpretation in ways that help both student and researcher. A simple approach in using the materials is possible, where the reader follows the argument without examining evidence. It is also possible for the reader to examine the methodology of the researcher, either to scrutinize the research or to be instructed in the methodology of research. The process of bringing evidence and interpretation together brings challenges of immense proportions. For example, the role geography plays in defining an event can be brought to bear on the problem, but it may involve the use of sophisticated systems of geographic analysis. Two projects at the Institute have used many diverse resources to explore their topics, incorporating nineteenth-century Census data, geographic models, and animated sequences. 4.2.1 Ayers (Valley of the Shadow) Edward Ayers, a historian of the Civil War and the Reconstruction, was one of the Institute's first two fellows.
(The project's URL is http://jefferson.village.virginia.edu/vshadow/vshadow.html.) According to Ayers, the project: + Page 11 + interweaves the histories of places on both sides of the Mason-Dixon line. It is the story of two communities relatively close to one another, sharing considerable prewar characteristics and similar experiences in the war itself. There was one area in the United States for which that was most clearly the case: the Great Valley that stretched from Pennsylvania, through Maryland and Virginia, into Tennessee. [3] Ayers focuses on two towns--Staunton, Virginia and Chambersburg, Pennsylvania--as representative communities from that Valley that served as such an important economic, cultural, and military locus of the War. The Web serves the historical ends by balancing narrative--a filtering or interpretation of evidence--with the presentation of that evidence. Ayers has described one dilemma of the historian as a tight-rope act between providing access to evidence and creating an organizing argument that does not also obscure that evidence. His approach, providing the deepening layers of evidence as "rhizomes" beneath the surface of narrative, has been well-supported by the Web. 4.2.2 Dobbins (The Forum at Pompeii) Dobbins, a classical archaeologist, reconstructs Pompeii from archaeological evidence in a virtual space to advance his argument. (The project's URL is http://jefferson.village.virginia.edu/pompeii/page-1.html.) He uses computer-aided design (CAD) tools to bring precision to his reconstruction. Animation is being added to the CAD representations to provide a three-dimensional perspective of buildings and space. Structures that are normally seen in isolation from each other are assembled in a total vision of Pompeii that may suggest a degree of planning and coordination. 4.3 Image Archives The Digital Image Center's image collections can be seen as passive collections of standards-based images.
(The project's URL is http://www.lib.virginia.edu/dic/class/arh102.) The image collections are organized to reflect the focus of an individual class or an art exhibit. All of the images are TIFF files subjected to JPEG compression. As such, they can be examined with a variety of image tools, ranging from simple viewers to software with analytical capabilities. Most importantly, the tool used is largely the choice of the user. As a result of planning and philosophy, all images are durable enough to stand close scrutiny: they were scanned in 24-bit color at a sufficiently high resolution to be enlarged several times without significant degradation. + Page 12 + The most developed collection is representative of this archival philosophy. William Westphal's graduate architectural history course on urban form includes hundreds of architectural images, primarily from the Italian Renaissance, organized around his lectures. Students can access these resources at all times over the network as well as in a closed classroom environment designed to efficiently access the images. Since they were scanned at high resolutions, the images compare favorably with the original slides, and they can be examined closely on screen. The original slides have frequently degraded or had imperfections that were corrected in the scanning process. 4.4 Instruction The final project demonstrates the instructional capabilities of the Web. (The project's URL is http://www.lib.virginia.edu/etext/scanner.html.) Using the Web to provide access to training materials has many strengths. It gives variation to what would otherwise be a flat, linear document. The document is dynamic and can easily accommodate other elements as they are created by staff. Scanning text is one of the most repetitive training operations provided in the Electronic Text Center.
Unlike searching electronic texts, where every research need may entail a different approach and different training needs, many of the scanning decisions are generalizable and can be represented in a training document. The project's instructional Web pages on scanning were designed to reduce the amount of staff intervention and give a greater degree of freedom to users. 4.5 Evaluation of the Projects While the majority of the projects discussed here could be supported by numerous stand-alone, operating-system specific hypertext products, the Web has several advantages. The projects' electronic resources are widely available on the Internet, and users can access them on a variety of computer platforms, regardless of the fact that the Web server is running on a UNIX computer. (Attractive graphical Web clients, such as Mosaic and OmniWeb, are available for Macintoshes, IBM-compatible computers using Microsoft Windows, UNIX computers with X Windows, and NeXTs.) + Page 13 + Another key advantage is that the source material for the editions either conforms to or is in the process of being composed using international standards; it is marked up to suggest the functional characteristics of the collections, rather than their representational characteristics. Elements, such as titles, quotations, and headings, are marked to suggest their functional role in the document, rather than any presumed display value. Displays depend instead on the capabilities of the user's software, which utilizes the functional characteristics of the elements to determine how to present the information. 
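The separation of function from display described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role names (title, heading, quotation), the sample text, and both sets of display rules are hypothetical and are not drawn from the TEI guidelines or from any of the projects discussed here. The point is that one functionally tagged document can be presented differently by two pieces of software without being altered.

```python
import html

# Illustrative only: the roles and display rules below are assumptions,
# not tags from the TEI guidelines or from any project described here.
DOCUMENT = [
    ("title", "A Sample Title"),
    ("heading", "Part One"),
    ("quotation", "A quoted passage."),
]


def render_plain(doc):
    """Character-cell display: capitals, underlining, and indentation."""
    rules = {
        "title": lambda text: text.upper(),
        "heading": lambda text: text + "\n" + "-" * len(text),
        "quotation": lambda text: "    " + text,
    }
    return "\n".join(rules[role](text) for role, text in doc)


def render_html(doc):
    """Graphical display: the same roles mapped onto HTML elements."""
    rules = {
        "title": "<H1>{}</H1>",
        "heading": "<H2>{}</H2>",
        "quotation": "<BLOCKQUOTE>{}</BLOCKQUOTE>",
    }
    return "\n".join(rules[role].format(html.escape(text)) for role, text in doc)
```

Calling render_plain(DOCUMENT) and render_html(DOCUMENT) yields two different presentations of the same source. Neither function changes DOCUMENT, so the functional markup remains available to any other tool.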
This reliance on functional--not representational-- characteristics means that the same materials can be used in a variety of different ways, supporting the creation of editions with other software packages (e.g., Electronic Book Technology's DynaText), use with different analytical tools (e.g., morphological parsers), and access through different database schemes (e.g., text-specific systems or relational database managers designed for images). A high degree of flexibility, viability, and multi-platform access can be maintained. Each of the mentioned editions and historical analyses was first composed in a very rich SGML format that was designed to discriminate between the functional characteristics of low-level elements. They were subsequently converted (as automatically as possible) to static HTML versions for use with the Web. Elements, such as discrete descriptive bibliographic characteristics, become simple list items, and most complex prose and verse elements are reduced to paragraphs and line breaks. After this conversion, it was discouraging to see that richness disappear, but the original document remained unchanged. There is a continued expectation by the scholars who created these resources that better tools will be developed to tap the inherent complexity of these materials. The standards-based format of the materials ensures that these scholars will be able to take advantage of these new tools when they become available. 5.0 The Web as an Authoring and Document Delivery Environment The authoring and document delivery capabilities of the Web are significantly limited for documents of even moderate complexity. Authoring for the Web is usually done in HTML. HTML has many virtues, not least of which is its striving for expressiveness and SGML validity. It is, however, an impoverished tag set with little ability to reflect the complexities of most of the documents discussed earlier, despite their being offered through the Web. 
It is important to note that the Web is a limited document delivery environment. Its inability to recognize or use structural features of documents forces unpleasant administrative decisions that will likely restrict the later use of these documents. + Page 14 + 5.1 HTML's Lack of Expressiveness The range of HTML tags available to users is limited. In contrast to the hundreds of tags made available by the TEI guidelines, roughly two dozen tags are made available in HTML. While HTML will be expanded with HTML+ to give greater precision in areas such as tabular data, HTML+ cannot be expected to provide the breadth needed to support literary and historical documents, or even to support standard journal literature. This lack of expressiveness and insufficient breadth of tags also leads to the author's inability to differentiate important elements with HTML. In HTML, the same small set of tags is necessarily used for diverse sets of elements. For example, the
<BR> code (line break) is used for verse lines, table elements, stanza divisions, dramatis personae, and many other features. Authors are also left with little ability to represent the structural organization of a document. Where the author wishes to define a bounded segment of text, such as a stanza or chapter, no tag is available for this purpose. Instead, authors rely extensively on dividing documents into files representing major structural divisions. Elements that are normally defined as structural tags in SGML, such as the paragraph (or <P>
) tag, are not defined by HTML in a way that reliably defines the contents of a paragraph. This paucity of tags in HTML results in the author of any document of moderate complexity using many tags to effect a desired appearance, rather than to characterize the content. This type of tagging confuses function and appearance. The inability of HTML to represent complexity is often closely linked to the inability of Web servers to provide access to complex representations of documents. This inability is fundamentally linked to the notion of structure. Even where structural distinctions exist in the markup language, the Web has no inherent ability to deliver an individual element. So, for example, HTML defines glossaries and glossary entries, but, in order to provide access to an individual glossary entry from a hypertext link, the server must send the entire file (i.e., the file containing the glossary) to the user. Smaller glossaries cause few problems, but this makes it effectively impossible to provide access to individual "glossary" entries in a document such as the Oxford English Dictionary, where all 500 MB would be transferred across the network. While Web browsers are intelligent enough to move automatically within the file to the chosen glossary entry, the file transfer paradigm is impractical for large-scale information delivery. Given this, it must also be pointed out that there are very few HTML tags that define structural relationships. Structures such as chapters, sections, or poems are not represented. + Page 15 + The Web's deficiency with regard to structural features leads to decisions with serious negative administrative consequences. Because the Web does not include structure awareness in its protocol and because HTML markup provides so little support for structural representation of features, the author and the administrator are forced to fragment documents into a set of reasonably sized components.
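To make the file-transfer problem described above concrete, the following Python sketch contrasts the two models on a toy glossary. Everything in it is an assumption made for illustration: the three glossary entries, the function names, and the offset index are hypothetical, and the second model is not something the Web itself provides. It only shows what delivering a single structural element, rather than a whole file, would involve.

```python
import re

# Toy glossary in HTML-style markup; the entries are illustrative only.
GLOSSARY = (
    "<DL>\n"
    '<DT><A NAME="dtd">DTD</A><DD>Document Type Definition.\n'
    '<DT><A NAME="sgml">SGML</A><DD>Standard Generalized Markup Language.\n'
    '<DT><A NAME="tei">TEI</A><DD>Text Encoding Initiative.\n'
    "</DL>\n"
)


def whole_file(doc, anchor):
    """File-transfer model: the anchor only tells the client where to scroll."""
    return doc  # every byte is sent, whichever entry was requested


def build_entry_index(doc):
    """Map each anchor name to the (start, end) offsets of its entry."""
    starts = [(m.group(1), m.start())
              for m in re.finditer(r'<DT><A NAME="([^"]+)">', doc)]
    end_of_list = doc.index("</DL>")
    index = {}
    for i, (name, start) in enumerate(starts):
        # An entry runs up to the next entry, or to the end of the list.
        end = starts[i + 1][1] if i + 1 < len(starts) else end_of_list
        index[name] = (start, end)
    return index


def single_entry(doc, index, anchor):
    """Structure-aware model: send only the requested entry."""
    start, end = index[anchor]
    return doc[start:end]
```

With an index built once by build_entry_index(GLOSSARY), single_entry returns only the requested entry, while whole_file always returns the complete text. On a three-entry glossary the difference is trivial; on a file of several hundred megabytes it decides whether the link is usable at all.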
In converting the ARL book University Libraries and Scholarly Communication (URL: http://www.lib.virginia.edu/mellon/mellon.html) to HTML, I found that, using the Web and HTML alone, it was necessary to divide the dozen chapters into separate files. While this may not sound onerous, extending this practice to a large collection of documents--or even a small collection of large documents--would be very difficult. An HTML version of the OED would become a set of 300,000 files. Chadwyck-Healey's English Poetry Database would become either 2,500 files (if the administrator wished to provide access at the volume level) or 65,000 files (if access to individual poems were supported). Even this severe approach does not solve needs that might arise for substructures, such as quotations and definitions within the OED or specific stanzas within a poem. 5.2 Overall Limitations of HTML For documents of limited complexity, HTML is an effective authoring environment; however, it seriously limits the ways in which a more complex document or a set of documents can be used. No differentiation of important elements (e.g., stanzas and subdivisions of prose) can take place, and it will be necessary to upgrade the coding of HTML documents within the year. The Web also lacks inherent document management or document access capabilities. In part because of the limitations of the markup language and in part because of the design of the protocol, there is a paucity of structure represented and no structure recognized. I emphasize "inherent," however, because the Web also provides a gateway capability that can more than compensate for this deficiency. 6.0 Exploring Alternatives I have been developing a gateway from the Web to an indexed collection of texts in an SGML-aware system to take advantage of the complexity of the documents and yet make them available through the Web. The texts are nearly all in fully validated SGML tag sets, each with significant expressiveness. 
In contrast to an HTML collection, potentially consisting of many files representing the many component parts of the collection, each text is a single file with as many as hundreds of thousands of structural components. + Page 16 + 6.1 Collections Three diverse examples are provided to help the reader understand the nature of the collections used in the gateway. 6.1.1 University of Virginia Middle English Collection The Middle English collection assembled by the University of Virginia's Electronic Text Center consists of approximately thirty texts in a single file. (The collection's URL is http://etext.virginia.edu/Mideng.query.html.) Texts vary in size from several dozen pages to several hundred pages. One of the Library's smaller collections, it is approximately 11 MB of raw text, but it grows as new materials become available. The markup language used is SGML complying with the Oxford Text Archive's DTD, a tag set that will eventually represent a valid subset of the TEI DTD. The tags differentiate major structural elements, such as tales in the Canterbury Tales, bibliographic elements, and elements of composition (e.g., verse lines, stanzas, and paragraphs). Markup is rich enough to support a wide range of analytical requirements, and the texts have been made available for the purpose of analysis to the University of Virginia community for much of the past two years. With the permission of Open Text, the Oxford Text Archive, and creators of individual texts, access to this collection is unrestricted. It can be accessed in a variety of ways, including the Web. 6.1.2 Chadwyck-Healey English Poetry Database The Chadwyck-Healey English Poetry Database is purchased on tape from the publisher and made available indexed by PAT. Access to this collection is restricted to a consortium of five universities in Virginia. As yet incomplete, the collection currently consists of nearly 1,600 works with more than 64,000 poems and 233,000 pages.
The raw text is relatively large (340 MB), but, because it is indexed with PAT, searches usually yield results in less than one second.

The SGML used with the English Poetry Database is a very rich set of tags designed in consultation with a TEI representative. It is more than adequately expressive about the poems, including structural markup for poems, poem divisions such as stanzas, lineation, and attributes such as whether rhyme is used.

+ Page 17 +

6.1.3 Oxford English Dictionary

The Oxford English Dictionary is the largest and arguably the most complex resource made available through this service. The 570 MB document contains approximately 300,000 entries, many with more than fifty subelements. Strictly speaking, it is not in SGML form because it has not been validated against a DTD. The electronic version was, however, designed to take advantage of SGML's characteristics, and it significantly benefits from the file's structural and descriptive markup.

6.2 Web to PAT Gateway

I have constructed a gateway between the Web and the more sophisticated SGML texts using the Web's CGI (Common Gateway Interface) and PAT, an SGML-aware text retrieval program. Text is returned from PAT in the richer SGML and converted on the fly to HTML, which serves primarily to control the appearance of the text on the screen. This gateway is being documented elsewhere (URL: http://sansfoy.lib.virginia.edu/pub/www-to-pat/), but several facets are relevant to this discussion.

6.2.1 Expressive Representation of Text Is Retained

The original unmodified texts are accessed through the gateway without compromising the expressiveness of the original markup. Although the sophisticated SGML markup is dynamically rendered as HTML as the user retrieves results, the text remains in the original rich SGML form behind the Web representation.
Decisions about the way that the fuller tag set maps to HTML are registered in filters, and, as HTML becomes more expressive, a better match between the original tags and the HTML can be made.

6.2.2 Simple Queries and Simple Access

Users need not be familiar with PAT's query language to search texts and take advantage of the structural characteristics of the more expressive markup. A word or phrase search returns keywords-in-context (KWIC) views to the user, from which a view of larger context is possible. Eventually, this process may lead the user to retrieval of entire sections (e.g., chapters or acts). All expanded views are made from hypertext links that initiate structural retrievals such as "the chapter that includes this search result."

+ Page 18 +

6.2.3 Menu-Driven Structural Queries

It is possible to facilitate complex queries through menus. For example, in the OED, the word lookup function facilitated by the Web includes queries such as: "give me entries that include my word within the Lookup field of the Headword Group field," or "give me entries that include my word in the Variant Form field." The user is not aware of the complexity of the query taking place, but can modify the type of query by selecting different variations on the search menus.

Boolean queries that ask for the intersection of document structures have been challenging to users employing command-line and analytically oriented interfaces. However, through simple fill-out forms and menu selections, queries such as "(stanzas including [word/phrase]) INTERSECT (stanzas including [word/phrase])" are executed without the user needing to understand the system's command syntax. While we also offer access through several complex, analytical interfaces (PatMotif and PowerSearch from Open Text as well as a locally developed VT100 interface), most users can avoid these more complicated interfaces.
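
As a minimal sketch of two gateway steps described in sections 6.2.1 through 6.2.3, the Python below turns fill-out form fields into a structural query and filters an SGML fragment into HTML. It is an illustration under stated assumptions, not the gateway's actual code: the function names, the filter table, and the element names (stanza, l, title) are hypothetical, and the query string follows the informal INTERSECT notation quoted above rather than PAT's real command syntax.

```python
import re
from html import escape

# Hypothetical filter table: source SGML element -> (HTML open, HTML close).
# The element names are illustrative and not taken from any DTD named here.
TAG_FILTER = {
    "stanza": ("<p>", "</p>"),
    "l": ("", "<br>"),
    "title": ("<h2>", "</h2>"),
}

# Matches an SGML start or end tag, with or without attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def build_structural_query(region, terms):
    """Combine form fields into the informal INTERSECT notation.

    This is NOT real PAT syntax; it mirrors the notation used in the text.
    """
    # Drop empty fields and strip double quotes so the terms cannot
    # break out of the quoted phrase.
    cleaned = [t.strip().replace('"', "") for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("at least one search term is required")
    parts = ['({} including "{}")'.format(region, t) for t in cleaned]
    return " INTERSECT ".join(parts)


def sgml_to_html(fragment):
    """Render an SGML fragment as HTML using TAG_FILTER.

    Tags without a filter entry are dropped; text between tags is
    HTML-escaped. Entity references in the source are not resolved.
    """
    out = []
    pos = 0
    for match in TAG_RE.finditer(fragment):
        out.append(escape(fragment[pos:match.start()]))
        closing, name = match.group(1), match.group(2).lower()
        open_html, close_html = TAG_FILTER.get(name, ("", ""))
        out.append(close_html if closing else open_html)
        pos = match.end()
    out.append(escape(fragment[pos:]))
    return "".join(out)
```

Calling build_structural_query("stanzas", ["love", "death"]) yields (stanzas including "love") INTERSECT (stanzas including "death"). Because the mapping lives in one table, a richer HTML target could be adopted later by editing TAG_FILTER alone, leaving the stored SGML untouched.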
6.2.4 Access to Structure

Finally, the administrator of a collection need not resort to fragmenting files to make it possible to provide access to the component parts of a collection. As mentioned earlier, an HTML approach to the OED would require us to divide it into 300,000 files. I was recently able to represent the dozens of parts, chapters, sections, and subsections of a voluminous SGML technical document through this strategy, using the fairly rich markup to generate hypertext links and to make each component accessible; the document itself remained a single file. Resource management becomes more tractable with a system that is cognizant of a file's structure.

+ Page 19 +

6.2.5 Future Approaches

This strategy has many possibilities. Journal literature coded in SGML may be successfully accessed through this sort of strategy. For example, a journal run marked up according to the more elaborate Association of American Publishers DTD could return articles to the user through PAT queries. Another approach would facilitate browsing by recognizing the structural relationship of author and abstract to article, article to issue, issue to volume, and volume to collection. Throughout, the collection would exist as a single file, searchable across all articles by a single query. The collection would not need to be compromised by converting the articles to HTML, but would instead remain in the more expressive AAP SGML format, filtered for display in the process of retrieving information.

Through this strategy, the Web can be an effective means of accessing the original files in a fuller SGML, without resorting to fragmenting the material into files corresponding to the individual articles or even parts of articles. Similar strategies for books and documentation are possible.

7.0 What Does the Web Offer Libraries?

The Web is a complex system with great potential and serious limitations.
We should use caution as we consider composing in HTML: it is a short-term coding strategy. Documents composed in HTML will have limited expressiveness, and, because HTML is not yet stable, they are likely to need continuing enhancement to be used in the Web.

There is much to be excited about with the Web: it is a viable system that suggests what electronic publishing on the Internet can be. We have lacked credible, demonstrable examples of standards-based, networked hypertext in the past, and the Web has changed that. There is a great deal of untapped potential in the Web. By exploiting the Web's ability to talk to other more sophisticated programs, we can begin to take advantage of that potential and make tomorrow's promise real today.

+ Page 20 +

A subtext of this article has been the importance of standards--both employing them in creating hypertexts and extending the Web to take greater advantage of them. Standards have been attractive to libraries because they help ensure long-term viability. However, as Jefferson remarked in 1790, standards are also an important key to information being generally useful, regardless of context:

     Measures, weights and coins, thus referred to standards unchangeable in their nature . . . will themselves be unchangeable. These standards, too, are such as to be accessible to all persons, in all times and places. The measures and weights derived from them . . . are within the calculation of every one who possesses the first elements of arithmetic, and of easy comparison, both for foreigners and citizens, with the measures, weights, and coins of other countries. [4]

Notes

1. A version of this article was presented as a paper at the Yale Hypertext Conference, May 1994. An HTML version of the original speech, with active links to the resources discussed, is available via the World-Wide Web; URL: http://sansfoy.lib.virginia.edu/pub/yale.html.

2.
Jerome McGann, The Complete Writings and Pictures of Dante Gabriel Rossetti: A Hypermedia Research Archive (Charlottesville, VA: Institute for Advanced Technology in the Humanities, University of Virginia, 1994). (Electronic document available via the World-Wide Web; URL: http://jefferson.village.virginia.edu/rossetti/rossetti.html.)

3. Edward Ayers, The Valley of the Shadow: Living the Civil War in Pennsylvania and Virginia (Charlottesville, VA: Institute for Advanced Technology in the Humanities, University of Virginia, 1994). (Electronic document available via the World-Wide Web; URL: http://jefferson.village.virginia.edu/vshadow/vshadow.html.)

4. Thomas Jefferson, "Public Papers," in Writings (New York: Literary Classics of the U.S., 1984), 410.

About the Author

John Price-Wilkin, Systems Librarian for Information Services, Alderman Library, University of Virginia, Charlottesville, VA 22903. Internet: jpw@virginia.edu.

+ Page 21 +

-----------------------------------------------------------------
The Public-Access Computer Systems Review is an electronic journal that is distributed on the Internet and on other computer networks. There is no subscription fee.

To subscribe, send an e-mail message to listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name Last Name.

This article is Copyright (C) 1994 by John Price-Wilkin. All Rights Reserved.

The Public-Access Computer Systems Review is Copyright (C) 1994 by the University Libraries, University of Houston. All Rights Reserved.

Copying is permitted for noncommercial use by academic computer centers, computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission.
-----------------------------------------------------------------