MathML Logo


#Presentations

#Schedule of Events

    

Indexing Mathematics with SearchFor

Stéphane Dalmas and Marc Gaëtano
INRIA Sophia Antipolos


Abstract

There is an ever-increasing number of electronic documents embedding mathematical formulas, such as electronic books, electronic journals, and online handbooks for mathematics. Classical text-oriented indexing and searching facilities can be used on such documents, but searching for the mathematical information encoded in the formulas requires new solutions. MathML is now providing a standard way to represent all these formulas and thus is enabling new generic tools to be built for indexing and searching.

This talk will describe the various problems one has to face when trying to build such a tool and the way in which MathML helps. We will explain the various algorithmic problems (what kind of pattern matching is needed) as well as more practical ones (presentation versus semantics). We will present the solutions we have implemented in SearchFor, a dedicated toolkit (independent from any encoding or representation of mathematics) that we have been developing for three years. SearchFor is really a toolkit to build search engines powered by mathematical formulas. SearchFor has been partially developed under the support of the OpenMath European project.

Searching for formulas can be done using a very well-known technique called pattern matching. The formulas appearing within a document to be indexed are collected in a database. An entry of the database is a formula, and the value associated to that index is one or more references to parts of the document where the formula appears. Searching for a mathematical expression (i.e., for the parts of the document where this expression occurs) consists of matching a pattern with all of the formulas of the database and selecting the ones for which the matching succeeded. Of course, a careful organization of the formulas using suitable data structures is needed to ensure acceptable performances.

Mathematical expressions can be viewed as trees (or terms), but basic tree pattern matching is usually not enough. Properties of operators such as associativity and commutativity complicate the algorithm and require you to represent terms in a normal form. Suppose, for example, that the periodic property of the sine function given by \(\sin(z) = \sin(z + 2 k \pi)\)appears somewhere inside a document. A student who doesn't recall the exact property may try to find the corresponding chapter in the document using the pattern \(\sin(\alpha \pi + \beta) = \gamma\) where \(\alpha\), \(\beta\), and \(\gamma\) are variables. In this case, the formula is an instance of the pattern with the substitutions \(\alpha \rightarrow 2 k\), \(\beta \rightarrow z\), and \(\gamma \rightarrow \sin(z)\). This is called associative/commutative pattern matching (or AC pattern matching).

Unfortunately, in some situations AC pattern matching may still fail in finding a correct match. Let's say that our student is looking for the formula \(\sin(k
\pi) = 0\) with the same pattern \(\sin(\alpha \pi + \beta) = \gamma\). Now the formula is an instance of the pattern with the substitutions \(\alpha
\rightarrow k\), \(\beta \rightarrow 0\), and \(\gamma \rightarrow 0\). To find this match, the pattern-matching algorithm has to take into account the identity element of the + operator. This is done by associative/commutative/identity pattern matching (or ACI pattern matching).

We will explain why ACI pattern matching is still not powerful enough in some situations, so that a kind of deduction is needed to extract from a document all the formulas close enough to a given mathematical expression.

SearchFor is a toolkit offering various components to organize, classify, and search for mathematical formulas. These components can be assembled and customized to build indexers and search engines operating on XML documents with embedded mathematical expressions. These expressions can be encoded in MathML or OpenMath. Operating on LaTeX documents and formulas is also possible but requires the use of some external tools.

SearchFor offers a number of different modules, most of them being parametrized (thanks to the implementation language, initially Standard ML and now Objective Caml).

  • The data manager module is basically an abstraction for a whole set of indexed formulas (stored in a data structure that is appropriate for the various pattern-matching and retrieval operations that could be performed).

  • The indexing module provides all the necessary facilities for indexing the mathematics in a set of XML documents.

  • The clause module provides a datatype for representing mathematical expressions with the needed basic algebraic operations, simplifications, and pattern-matching facilities (modulo associativity and commutativity). More generally, it actually implements a datatype where validity conditions can be associated with a mathematical relation.

  • The search engine module offers searching facilities that can be quite basic (using the simple mathematical knowledge and pattern matching provided by the clause module) or much more sophisticated. The engine is equipped with a parametrized deduction capability that can use particular instances of data managers to store known mathematical relations and use them to deduce new formulas (using a mechanism where a failed match generates new equations that are solved using the known relations). This really enables the search engine to be used as a clever handbook of mathematical formulas.

  • The query language module is the interface module that is used to express queries to the search engine. It implements a simple abstract syntax and a socket interface (where XML expressions are read and written) for communicating with another application (that could thus use its own concrete syntax for the queries).

The current implementation of SearchFor already gives promising results. We have been able to use it to build a prototype electronic handbook of formulas using approximately six hundred formulas taken from Abramovitz and Stegun's Handbook of Mathematical Functions. This application uses the full deduction capabilities with reasonable performances. An application to index a database of mathematical abstracts is currently being developed as part of the OpenMath European project. It will be very interesting to understand how far we can go with "real life" mathematical documents, where mathematics is mostly encoded as presentation.