Powerful Use of New Database Technologies in Indexing Mathematics and Displaying Dynamic Math on the Web

Safdar Raza Syed
Wysitech Inc.

Abstract

Digital information is stored both in database columns and, as unstructured data (primarily text), in the file system. Some text data is stored in database character-type columns as well. Traditionally, retrieving specific text data from database columns or file systems has been a cumbersome and expensive process, often requiring third-party tools. New database technologies address this problem by allowing full-text queries to be issued against plain character-based data in a relational database table. The text format of MathML [1] and the availability of these new database technologies have allowed us to build search engines that take mathematical formulas as search criteria.

Since MathML [1] is plain text, it occurred to me that if scientific documents containing mathematical formulas or expressions encoded in MathML are stored in a relational database, the new database technologies can be used to search those documents by the formulas or expressions they contain. Acting on this idea, I have designed a tool, mathSearch, that stores scientific documents in a relational database installed on a web server and searches them by the mathematical formulas or expressions they contain. What is demonstrated here is part of a larger project whose aim is to provide advanced Mathematical Web Services.


Document Indexing and Storage Process

Documents are written in Microsoft Word and saved directly from Word to a database in XML + MathML format that complies with the Universal MathML Stylesheet [2]. This makes it possible to view the documents in almost any browser. Users who have the MathPlayer [3] or TechExplorer [4] plug-in installed are given the choice of viewing the documents with that plug-in. Moments after a document is saved to the database, it is available on a website for searching and viewing. mathSearch stores documents in one table and the mathematical formulas contained in the documents in another table. Database triggers ensure that only distinct formulas are stored. The formulas are indexed by assigning a unique formula ID to each one. A separate table stores the formula IDs together with the document IDs of the documents that contain each formula. The database uses this table and the formula table to find the documents containing the formula given as a search criterion, so it does not have to scan every document. Users can use Amaya [5], a free editor/browser from W3C, as an equation editor to embed MathML content in their documents. If Amaya is installed on the client system, mathSearch makes it directly accessible from the MS Word menu bar. At present, only presentation markup is supported.
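The three-table layout described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual mathSearch schema: the table names, column names, and the helper functions open_store and store_document are hypothetical, SQLite stands in for the server database, and a UNIQUE constraint with INSERT OR IGNORE is used in place of the triggers mentioned in the text.

```python
import sqlite3

# Hypothetical schema: one table for documents, one for distinct formulas,
# and one linking formula IDs to the IDs of the documents that contain them.
SCHEMA = """
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    author TEXT NOT NULL,
    body   TEXT NOT NULL
);
CREATE TABLE formulas (
    formula_id INTEGER PRIMARY KEY,
    mathml     TEXT NOT NULL UNIQUE
);
CREATE TABLE formula_documents (
    formula_id INTEGER NOT NULL REFERENCES formulas(formula_id),
    doc_id     INTEGER NOT NULL REFERENCES documents(doc_id),
    PRIMARY KEY (formula_id, doc_id)
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_document(conn, title, author, body, formulas):
    """Store one document and index the MathML formulas it contains."""
    cur = conn.execute(
        "INSERT INTO documents (title, author, body) VALUES (?, ?, ?)",
        (title, author, body),
    )
    doc_id = cur.lastrowid
    for mathml in formulas:
        # The UNIQUE constraint keeps only distinct formulas; a repeated
        # formula is ignored here and its existing ID is reused below.
        conn.execute(
            "INSERT OR IGNORE INTO formulas (mathml) VALUES (?)", (mathml,)
        )
        (formula_id,) = conn.execute(
            "SELECT formula_id FROM formulas WHERE mathml = ?", (mathml,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO formula_documents (formula_id, doc_id) "
            "VALUES (?, ?)",
            (formula_id, doc_id),
        )
    conn.commit()
    return doc_id
```

With this layout, a formula shared by many documents is stored once, and the link table alone identifies which documents contain it.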


Search Process

The application has a web user interface with a menu command that opens Amaya [5] directly. A text box is provided for entering the MathML [1] content of the formula to search for; users can generate this MathML content with Amaya. If documents matching the search criteria are found, the application displays their titles and the names of their authors. Clicking on a title displays the document in the user's browser. For example, suppose a student notices that her math instructor uses the integral [formula image: Gamma Function] for the Gamma function, whereas her physics professor uses [formula image: Gamma Function] for the Gamma function. The student is confused and, using mathSearch, searches for [formula image: Gamma Function] in her math professor's class notes, posted on her math department's website. She finds a class note in which the following was proved (or perhaps just noted):
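A minimal sketch of the lookup behind such a search is given here, assuming the formula is matched exactly as text. The table and column names, the function find_documents, and the sample rows in demo_connection are all hypothetical placeholders, and SQLite stands in for the server database.

```python
import sqlite3


def find_documents(conn, mathml):
    """Return (title, author) pairs for documents containing the formula.

    The formula table is matched first; the link table then gives the
    documents, so document bodies are never scanned.
    """
    rows = conn.execute(
        """
        SELECT d.title, d.author
        FROM formulas AS f
        JOIN formula_documents AS fd ON fd.formula_id = f.formula_id
        JOIN documents AS d ON d.doc_id = fd.doc_id
        WHERE f.mathml = ?
        ORDER BY d.title
        """,
        (mathml.strip(),),  # tolerate stray whitespace from the text box
    )
    return rows.fetchall()


def demo_connection():
    """Build a tiny in-memory database with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            doc_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE formulas (
            formula_id INTEGER PRIMARY KEY, mathml TEXT UNIQUE);
        CREATE TABLE formula_documents (formula_id INTEGER, doc_id INTEGER);
        INSERT INTO documents VALUES (1, 'Class note 1', 'Math professor');
        INSERT INTO documents VALUES (2, 'Class note 2', 'Math professor');
        INSERT INTO formulas VALUES (1, '<math><mi>x</mi></math>');
        INSERT INTO formula_documents VALUES (1, 1);
        INSERT INTO formula_documents VALUES (1, 2);
        """
    )
    return conn
```

Calling find_documents(demo_connection(), "<math><mi>x</mi></math>") returns the title and author of both placeholder class notes; a formula that is not indexed returns an empty list.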

[formula image: Gamma Function] = [formula image: Gamma Function]

She is now convinced that the two representations of the Gamma function above are equivalent.


Difficulties

At this point, one may ask: "What if the integrals in the above example use a variable t in some documents and a variable u in others?" Advanced search options are used to handle such cases, but issues like this one and those discussed in [6] still require more work. Some newer relational database systems, such as Microsoft SQL Server 2000, Oracle 9i, DB2, and Sybase, provide predicates that search columns of character-based data types for precise or fuzzy (less precise) matches to single words and phrases, for words within a certain distance of one another, or for weighted matches. For example, the SQL predicate CONTAINS [8] serves this purpose in the full-text search queries of SQL Server 2000. These predicates help us resolve some of the pattern-matching issues for mathematical formulas encoded in MathML [1]. A progress report on this project will be made available at [7].
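One way to make two formulas that differ only in the name of a variable compare equal is to rename identifiers before indexing and before searching. The sketch below is an assumption for illustration and not necessarily how the advanced search options of mathSearch work; the function normalize_identifiers and the placeholder names v1, v2, ... are hypothetical.

```python
import xml.etree.ElementTree as ET


def normalize_identifiers(mathml):
    """Rename <mi> identifiers to v1, v2, ... in order of first appearance.

    This is deliberately crude: it also renames function names and
    constants written as <mi>, and it leaves whitespace between elements
    untouched, so both inputs must be formatted the same way.
    """
    root = ET.fromstring(mathml)
    names = {}
    for elem in root.iter():
        # Drop a namespace prefix such as "{http://www.w3.org/1998/Math/MathML}".
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "mi" and elem.text:
            name = elem.text.strip()
            if name not in names:
                names[name] = "v%d" % (len(names) + 1)
            elem.text = names[name]
    return ET.tostring(root, encoding="unicode")
```

For example, the presentation markup for t squared plus t and for u squared plus u normalizes to the same string, while t + s and u + u stay different because the pattern of repeated identifiers differs.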


References
[1] MathML spec: http://www.w3.org/TR/MathML2/
[2] Universal MathML stylesheet: http://www.w3.org/Math/XSL/
[3] MathPlayer - A plug-in to display MathML in IE 5.5 or later: http://www.dessci.com/webmath/mathplayer/
[4] TechExplorer - A plug-in to display MathML in Netscape and IE 5.5 or later: http://www-3.ibm.com/software/network/techexplorer/
[5] Amaya - W3C's Editor/Browser: http://www.w3.org/Amaya/
[6] Indexing Mathematics with SearchFor: http://www.mathmlconference.org/2000/Talks/dalmas/
[7] Progress report on this project in future: http://www.eMathByRaza.com
[8] SQL Predicates for Full Text Search: In Books Online for MS SQL Server 2000, search for the keyword "Contains" in the Transact-SQL Reference