Indexing Mathematics with SearchFor
Stéphane Dalmas and Marc Gaëtano
INRIA Sophia Antipolos
Abstract
There is an ever-increasing number of electronic documents embedding
mathematical formulas, such as electronic books, electronic journals, and
online handbooks for mathematics.
Classical text-oriented indexing and searching
facilities can be used on such documents, but searching for the mathematical
information encoded in the formulas requires new solutions.
MathML is now providing a standard way to represent all these formulas
and thus is enabling new generic tools to be built for indexing and searching.
This talk will describe the various problems one has to face when trying to
build such a tool and the way in which MathML helps. We will explain the various
algorithmic problems (what kind of pattern matching is needed) as well as
more practical ones (presentation versus semantics).
We will present the solutions we have implemented in
SearchFor, a dedicated
toolkit (independent from any encoding or representation of mathematics)
that we have been developing for three years. SearchFor is really a toolkit
to build search engines powered by mathematical formulas.
SearchFor has been partially developed under the support of the
OpenMath European project.
Searching for formulas can be done using a very well-known technique called
pattern matching. The formulas appearing within a document to be
indexed are collected in a database. An entry of the database is a formula,
and the value associated to that index is one or more references to parts
of the document where the formula appears. Searching for a mathematical
expression (i.e., for the parts of the document where this expression
occurs) consists of matching a pattern with all of the formulas of the database and
selecting the ones for which the matching succeeded. Of course, a careful
organization of the formulas using suitable data structures is needed to
ensure acceptable performances.
Mathematical expressions can be viewed as trees (or terms), but basic tree
pattern matching is usually not enough. Properties of operators such as
associativity and commutativity complicate the algorithm and require
you to represent terms in a normal form. Suppose, for example, that the periodic
property of the sine function given by
appears somewhere inside a document. A student who doesn't recall the exact
property may try to find the corresponding chapter in the document using
the pattern
where ,
,
and
are variables. In this case, the formula is an
instance of the pattern with the substitutions
,
,
and
.
This is called
associative/commutative pattern matching (or AC pattern matching).
Unfortunately, in some situations AC pattern matching may still fail in
finding a correct match. Let's say that our student is looking for the formula
with the same pattern
.
Now
the formula is an instance of the pattern with the substitutions
,
,
and
.
To
find this match, the pattern-matching algorithm has to take into account the
identity element of the + operator. This is done by
associative/commutative/identity pattern matching (or ACI pattern matching).
We will explain why ACI pattern matching is still not powerful enough in some
situations, so that a kind of deduction is needed to extract from a
document all the formulas close enough to a given mathematical
expression.
SearchFor is a toolkit offering various components to organize,
classify, and
search for mathematical formulas.
These components can be assembled and customized to build indexers and search
engines operating on XML documents with embedded mathematical expressions. These
expressions can be encoded in MathML or OpenMath. Operating on
LaTeX
documents and formulas is also possible but requires the use of some external
tools.
SearchFor offers a number of different modules, most of them being parametrized
(thanks to the implementation language, initially Standard ML and now
Objective Caml).
- The data manager module is basically an abstraction for a whole set
of indexed formulas (stored in a data structure that is appropriate for the
various pattern-matching and retrieval operations that could be performed).
- The indexing module provides all the necessary facilities for indexing
the mathematics in a set of XML documents.
- The clause module provides a datatype for representing mathematical
expressions with the needed basic algebraic operations, simplifications, and
pattern-matching facilities (modulo associativity and commutativity). More
generally, it actually implements a datatype where validity conditions can be
associated with a mathematical relation.
- The search engine module offers searching facilities that
can be quite basic (using the simple mathematical knowledge and
pattern matching provided by the clause module) or much more
sophisticated. The engine is equipped with a parametrized deduction
capability that can use particular instances of data managers to store
known mathematical relations and use them to deduce new formulas (using a
mechanism where a failed match generates new equations that are solved using
the known relations). This really enables the search engine to be used as a
clever handbook of mathematical formulas.
- The query language module is the interface module that is used to
express queries to the search engine. It implements a simple abstract syntax
and a socket interface (where XML expressions are read and written) for
communicating with another application (that could thus use its own
concrete syntax for the queries).
The current implementation of SearchFor already gives promising results. We have
been able to use it to build a prototype electronic handbook of formulas
using approximately six hundred formulas taken from Abramovitz and Stegun's Handbook
of Mathematical Functions. This application uses the full deduction
capabilities with reasonable performances. An application to index a database
of mathematical abstracts is currently being developed as part of the OpenMath
European project. It will be very interesting to understand how far we can
go with "real life" mathematical documents, where mathematics is mostly
encoded as presentation.
|