| Acquisition of Content: MathML in an Academic Setting
Michael Kohlhase, Matthew Szudzik, Dana Scott, and Klaus Sutner
The CCAPS Project
In this paper, we describe a case study in transforming courseware to the Content-MATHML based OMDOC format. This effort is part of a larger project currently under way at Carnegie Mellon University, the Course Capsules Project (CCAPS). Its goal is to design and implement a system that will store, organize, index, and present course content. A course capsule is a collection of electronic documents together with an infrastructure that supports presentation over the web, detailed indexing, sophisticated search algorithms, and reuse of course materials. Capsules need not be bound to traditional units such as a semester-long course, but can be constructed around specific topics that are relevant in various places in the curriculum. Ultimately, CCAPS is expected to cover virtually all academic disciplines, but at present our focus is on the highly technical course content typically encountered in mathematics, computer science, and the sciences in general.
As a generic example, consider a mathematical tool such as recurrence relations. In our curriculum, any course dealing with algorithms will rely on recurrence relations. They are systematically analyzed at different levels of complexity in several discrete mathematics courses. Hence, a course capsule on recurrence relations needs to be able to respond intelligently to requests by a user in a variety of situations. The presentation appropriate for a first encounter in an introductory course is very different from one in an advanced course, or from a quick overview needed to brush up on established knowledge in connection with an assignment or a test, years after the initial course. The recurrence relations capsule has to provide examples, both theoretical and computational, allow for assessment, and link seamlessly to other capsules. To support these features, we rely on substantial content markup. In particular, the meaning of mathematical expressions, assertions, and theories is expressed in terms of established standards such as MATHML, OPENMATH, and OMDOC.
Content Markup in MATHML, OPENMATH, and OMDOC
In this paper we concern ourselves with two complementary case studies: content extraction from courses represented as Mathematica notebooks and/or as PowerPoint presentations, and their transformation to OMDOC format. Once these courses are available, they have to be integrated into existing computer-supported education systems.
Several of us (Scott, Sutner) have used Mathematica extensively for a number of years in teaching topics such as calculus, algebra, discrete mathematics, and automata theory. Mathematica notebooks are electronic documents that contain text, code, graphics, executable commands, and inline mathematics. They provide a computational environment that is ideally suited to interactive experimental mathematics, and can be used as a presentation system in lectures and recitations, as well as in self-study. They are less suitable as a permanent repository for managing course content, and do not provide enough structure for search and systematic reuse. Even though Mathematica 4 includes a notebook-to-MathML converter, the generated documents are marked up only for presentation, which is insufficient for our purposes. The overall document structure of a notebook, with titles, sections, subsections, and so forth, is readily represented in OMDOC format. To facilitate translation into OMDOC, we have developed a special notebook style sheet, XMLStyle.nb, that provides cell types for the items typically encountered in a course notebook: purely mathematical statements such as Theorem, Lemma, Corollary, and Definition, as well as auxiliary material such as Question, Comment, Hint, and so forth. We have a conversion tool written in Mathematica that translates legacy notebooks to the XMLStyle format, with some degree of human intervention. The converter from XMLStyle to OMDOC is also written in Mathematica and incorporates an extension of H. Wilbrink's FullForm2OM translator for OPENMATH. The key difficulty in the conversion process is the use of inline mathematics. Recent versions of the Mathematica front end have sophisticated typesetting capabilities, forcing us to contend with expressions such as $a_0 + a_1 x + \dots + a_n x^n$ or $f : \mathbb{N} \to \mathbb{N}$. Wholesale conversion of such expressions is exceedingly difficult, so we have focused our efforts on supporting automatic conversion of frequently encountered idiomatic constructs, such as the two mentioned above. To this end, we have developed a special Mathematica palette that accompanies XMLStyle and that represents these expressions in a form suitable for parsing while maintaining their visual representation within the notebooks.
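To make the target of such a conversion concrete, the following Python sketch builds Content MathML for the polynomial idiom $a_0 + a_1 x + \dots + a_n x^n$ at a fixed degree. It is an illustration only and not the Mathematica-based converter described above: the function names `indexed_identifier` and `polynomial_to_cmml`, the choice of Python, and the encoding of each coefficient as a `ci` element wrapping a presentation subscript are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def indexed_identifier(base, index):
    """Return <ci><msub><mi>base</mi><mn>index</mn></msub></ci> (assumed encoding of a_k)."""
    ci = ET.Element("ci")
    msub = ET.SubElement(ci, "msub")
    ET.SubElement(msub, "mi").text = base
    ET.SubElement(msub, "mn").text = str(index)
    return ci


def polynomial_to_cmml(degree, coeff="a", var="x"):
    """Build Content MathML for coeff_0 + coeff_1*var + ... + coeff_degree*var^degree."""
    total = ET.Element("apply")
    ET.SubElement(total, "plus")
    for k in range(degree + 1):
        a_k = indexed_identifier(coeff, k)
        if k == 0:
            # The constant term is just the coefficient itself.
            total.append(a_k)
            continue
        term = ET.SubElement(total, "apply")
        ET.SubElement(term, "times")
        term.append(a_k)
        if k == 1:
            ET.SubElement(term, "ci").text = var
        else:
            power = ET.SubElement(term, "apply")
            ET.SubElement(power, "power")
            ET.SubElement(power, "ci").text = var
            ET.SubElement(power, "cn").text = str(k)
    return total


# Serialize the degree-2 instance as a string of Content MathML.
markup = ET.tostring(polynomial_to_cmml(2), encoding="unicode")
```

Unlike presentation markup, the resulting tree records which subexpressions are sums, products, and powers, which is the kind of structure that indexing and search can exploit.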
PowerPoint is another standard authoring tool in our curriculum. Unlike the Mathematica notebooks, which are naturally highly structured, PowerPoint slides address exclusively the issue of presentation: the placement of text, symbols, and images on the screen, carefully sequenced and possibly animated or embellished by sound. While such slide shows can be used to great effect in classroom presentations, they fail to provide a suitable vehicle for building a long-term repository of information. Indexing and searching, for example, become very nearly impossible. To eventually be able to incorporate the large quantity of existing material (some courses are associated with thousands of slides!) into appropriate course capsules, we have focused on building tools that semi-automatically extract content from PowerPoint slides and convert the result to OMDOC format. Representational details are largely ignored in this process. Microsoft provides PowerPoint with a built-in PowerPoint-to-HTML converter. We used the resulting so-called published web pages as input to a suite of Perl programs and XSLT stylesheets that extract and categorize their content. The overall structure of a slide (as visualized in the outline view) can be mapped onto structural components of the OMDOC format. While PowerPoint offers, in theory, the capability of keeping track of semantic classifications by the use of special placeholders, in practice it turns out that people tend to ignore this. Furthermore, it is not always easy to make the distinction between semantic and presentational features. Consider, for instance, the use of the Greek alphabet to denote special constants, indices, or parameters in a mathematical context. As far as PowerPoint is concerned, the difference between the integer m and an index μ is simply a difference in font, a presentational issue; in fact, there is no semantic connection between them at all. Part of our effort here consisted of singling out special fonts like the Symbol font and translating text set in those fonts into OPENMATH objects by establishing temporary content dictionaries. Mathematical content for PowerPoint slides is often generated with TexPoint, a PowerPoint add-in which converts LaTeX code so that it can be used in a PowerPoint show; see http://raw.cs.berkeley.edu/texpoint. Whenever these TexPoint sections are used in PowerPoint's display mode, it is possible to capture the LaTeX source code and carry it along through the usual transformation process. However, when used in inline mode, no LaTeX directives are kept within the PowerPoint document. While semantic markup for PowerPoint presentations can be supported, the task of automating it fully is currently beyond our scope.
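The Symbol-font translation described above can be sketched as a simple lookup. The Python fragment below is a hedged illustration rather than the Perl and XSLT tooling actually used: the content dictionary name `ccaps-temp`, the table `SYMBOL_FONT_TABLE`, and the function `symbol_char_to_openmath` are hypothetical, and the table covers only a handful of Symbol-font characters (in that font the Latin letter m is rendered as μ, a as α, and so on).

```python
import xml.etree.ElementTree as ET

# Hypothetical name for a temporary content dictionary (an assumption for this sketch).
TEMP_CD = "ccaps-temp"

# Partial map from the character stored in the slide to the Greek letter
# that the Symbol font displays for it.
SYMBOL_FONT_TABLE = {
    "a": "alpha",
    "b": "beta",
    "g": "gamma",
    "d": "delta",
    "l": "lambda",
    "m": "mu",
    "p": "pi",
}


def symbol_char_to_openmath(char):
    """Wrap a Symbol-font character as an OpenMath symbol, or return None if unmapped."""
    name = SYMBOL_FONT_TABLE.get(char)
    if name is None:
        return None  # leave unknown characters for manual treatment
    obj = ET.Element("OMOBJ")
    ET.SubElement(obj, "OMS", {"cd": TEMP_CD, "name": name})
    return obj


# A character "m" that was set in the Symbol font becomes the symbol "mu".
mu = ET.tostring(symbol_char_to_openmath("m"), encoding="unicode")
```

Keeping the mapping in a dedicated dictionary means that the letter m in an ordinary font and the same character in the Symbol font end up as unrelated objects, which reflects the point that the two have no semantic connection.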
Once the course content is available in OMDOC form, it can immediately be used for computer-supported education. To this end, we are currently using the ACTIVEMATH system for the generation of personalized and user-adaptive course documents, and CMU's installation of the BLACKBOARD system for general course delivery. In the former, the content markup in OMDOC is used in various ways: the structure of mathematical expressions is used for dependency analysis, mathematical expressions provide links to the definitions of the concepts involved, and those expressions can also be sent to external mathematical software systems (in our case Mathematica) for interaction and experimentation. Finally, user interaction data is stored in a detailed, concept-oriented user model that can be used for navigation and course content personalization. We are currently working on upgrading the XSL style sheets for HTML to MATHML. Of course, the OMDOC content can be transformed back to notebooks and PowerPoint slides, albeit with loss of special visual effects. In fact, this use of OMDOC as an interlingua makes it possible to import course materials into other course materials that originate in a different representation; for example, part of a PowerPoint-based course could be integrated into a Mathematica notebook. Note that the successful delivery of a Mathematica-based course poses significant administrative problems beyond those one encounters in a more traditional setting. Apart from the usual course materials such as lecture notes, assignments, tests, and so forth, one also needs to organize and provide access to notebooks and macro packages. All of these components have to be seamlessly integrated into the course environment, and should be easily accessible over the web. 
As a case in point, consider the maintenance of the code packages that typically accompany a Mathematica-based course and that implement the requisite algorithms or other special-purpose computational tools. For example, the package used in our automata theory course currently contains some 500 functions and comprises some 300K of Mathematica code. In our system, the source code for these functions, together with usage messages, option settings, and browser documentation, is considered primary course content and is stored in OMDOC's code management system. From there, XSL transformations, like the ones for presentation, are used to generate add-on packages complete with help messages, protection, and stubs. Likewise, HTML- or browser-based help can be generated automatically. Maintaining Mathematica code in a heavily marked-up format turns out to be hardly more cumbersome than in a notebook, and it allows us to achieve full integration in a course capsule. Thus, the Capsule becomes the unique central repository from which appropriate representations of the material such as notebooks, dotm-files, HTML pages, lecture slides, etc. can be generated.
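The dependency analysis mentioned earlier for ACTIVEMATH relies on the fact that content markup names the concepts an expression uses. As a minimal sketch of that idea, and not a description of how ACTIVEMATH itself is implemented, the Python fragment below collects the symbols referenced by an OpenMath object. The function name `referenced_symbols` and the sample expression x + 2y are assumptions made for illustration, and the OpenMath XML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Sample OpenMath object for x + 2*y, using symbols from the arith1 content dictionary.
# The XML namespace is left out here to keep the sketch short.
OM_SAMPLE = """
<OMOBJ>
  <OMA>
    <OMS cd="arith1" name="plus"/>
    <OMV name="x"/>
    <OMA>
      <OMS cd="arith1" name="times"/>
      <OMI>2</OMI>
      <OMV name="y"/>
    </OMA>
  </OMA>
</OMOBJ>
"""


def referenced_symbols(openmath_xml):
    """Return the set of (content dictionary, symbol name) pairs used in the object."""
    root = ET.fromstring(openmath_xml.strip())
    return {(s.get("cd"), s.get("name")) for s in root.iter("OMS")}


# Each pair identifies a concept whose definition the expression depends on.
deps = referenced_symbols(OM_SAMPLE)
```

Each pair in the result could serve as a link target to the definition of the corresponding concept, which is something presentation-only markup cannot provide.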
Conclusion
We have described a case study in transforming courseware from Mathematica notebooks and PowerPoint presentations to the Content-MathML based OMDOC format. The case of Mathematica notebooks is especially interesting, as notebooks are highly structured documents that contain content markup for mathematical expressions, code, and interaction points with the Mathematica kernel. Producing content markup from legacy material is a difficult and tedious task at best, and we are in the process of developing an OMDOC editor that will provide some support for the inevitable manual processing. Transformation into OMDOC format allows us to use alternative, open front ends for the material, including ones like the ACTIVEMATH system that are geared towards computer-supported education. We intend to make our first experimental Capsules available for the Fall 2002 semester, and to test them in several computer science courses. We expect that in a few years all the core computer science courses at CMU will have been "encapsulated." Ultimately, we hope to help restructure and improve a large part of the curriculum, based on systematic reuse and iterated improvement of all the core content.
References
[1] Olga Caprotti and Arjeh M. Cohen. Draft of the Open Math standard. The Open Math Society, 1998.
[2] Casl -- the CoFI algebraic specification language -- summary, version 1.0. 1998.
[3] Mathematical Markup Language (MATHML) 2.0. W3C Recommendation, 2001.
[4] Michael Kohlhase. OMDOC: An Open Markup Format for Mathematical Documents (Version 1.1). Open Specification
[5] Extensible Markup Language (XML) Specification. Version 1.0. W3C Recommendation, 1999.
[6] XSL Transformations (XSLT). Version 1.0, W3C Recommendation, 1999.
|