System Description: Statistical Parsing of Informalized Mizar Formulas

We describe a statistical system that learns parsing of ambiguous Mizar-like formulas from a large training corpus of aligned informal/formal formulas. We present the methodology and the overall ideas, evaluate the performance of the system, and provide a web interface for using it.


I. Introduction and Summary of the Statistical Parsing Approach
In this work we describe a system for statistical parsing of ambiguous Mizar-like [1] formulas, together with its implementation and evaluation. This is the next step in our larger project [12] of automatically formalizing informal mathematics by using statistical parsing methods and large-theory automated reasoning. The main components of this autoformalization approach were defined in [10], [11], and were used there on the Flyspeck corpus [5] based on the HOL Light [6] system. The general approach (applied here to parsing of Mizar-like formulas) is as follows:
1) Using corresponding (aligned) pairs of informal/formal formulas from a large corpus to train statistical disambiguation. While both informal (e.g. LaTeX) and formal math corpora are quite large today, there are not many formulas that have been written both informally and formally with a consistent alignment in mind. Therefore we produce the aligned pairs by informalizing (ambiguating) the formal corpus. In particular, for the formal Flyspeck corpus of about 20000 lemmas, we ambiguated the formal formulas by introducing overloaded symbols and by forgetting types, brackets and casting functors [10]. The ambiguated formulas (strings of characters) and the corresponding formal parse trees then provide a treebank [4] to train statistical parsing on.
2) Learning an (augmented [10]) probabilistic context-free grammar (PCFG) [13] from the treebank.
3) Using the grammar in our CYK [22] chart parser, which is modified by semantic checks and by more involved probabilistic (context-aware) processing [10], [11]. Fast discrimination trees [14] are used to match deeper subtrees from the treebank and to boost their probabilities when parsing new formulas.
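Step 2 above can be illustrated with a minimal sketch of plain relative-frequency PCFG estimation from a treebank. This is not the augmented grammar learning of [10]. The tree encoding, the function names, and the nonterminal names `Plus_real` and `Plus_set` are assumptions made only for this illustration.

```python
from collections import Counter

# Assumed encoding: a parse tree is (label, [children]);
# a leaf child is a plain string token.
def rules(tree):
    """Yield the (lhs, rhs) productions of a tree, top-down."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Relative frequency: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Toy treebank: the ambiguous token "+" annotated with two hypothetical senses.
treebank = [
    ("Term", [("Term", ["x"]), ("Plus_real", ["+"]), ("Term", ["y"])]),
    ("Term", [("Term", ["x"]), ("Plus_real", ["+"]), ("Term", ["y"])]),
    ("Term", [("Term", ["x"]), ("Plus_set", ["+"]), ("Term", ["y"])]),
]
grammar = estimate_pcfg(treebank)
```

In this toy grammar the rule that uses `Plus_real` receives twice the probability of the rule that uses `Plus_set`, which is the kind of preference a probabilistic chart parser can then exploit when ranking parses of an ambiguous string.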

II. Parsing Informalized Mizar
A distinctive feature of Mizar [1] is its human-style, natural-language-like representation of formal mathematics. This includes Jaśkowski-style natural deduction [7], soft-typing Prolog-like mechanisms [21], propagating implicit knowledge using Mizar adjectives and registrations [19], hidden arguments, syntactic macros, and ubiquitous parametric and ad-hoc overloading [1]. This poses very interesting challenges: for example, the symbol + is (re-)defined more than 100 times in the Mizar library (MML), see Table I. But it also takes our project closer to parsing true natural-language/LaTeX corpora such as ProofWiki. In the following subsections we briefly describe the components of this work.

A. Treebank creation for Mizar
For Flyspeck we have ambiguated the parsed HOL Light formulas by several transformations applied to the formal HOL representation, creating the training trees in a form suitable for treebank learning. To create a suitable treebank for Mizar we apply transformations to the Mizar internal XML layer [18], used previously to produce both the user-level HTML representation of the articles and the semantic (MPTP [17], [20]) representation used by automated theorem provers (ATPs). Already the XML-to-HTML transformation is complex and sometimes imperfect. It needs to recover the user-level syntax from the internal XML representation and align the two.
Our new code for creating a Mizar treebank (now producing about 60000 parse trees from the Mizar Mathematical Library, MML version 1147) is based on the XML-to-HTML code, mainly changing the hyperlinks into annotating nonterminals. As in the XML-to-HTML code, this annotation (alignment) is nontrivial, and there are interesting issues described below.
Initially we tried to directly use the semantic (constructor in the Mizar terminology) disambiguation layer, which is also used for theorem proving (both in Mizar and via the MPTP translation to TPTP [16]), and thus would be a suitable target for ATP experiments with the parsed statements. The first detailed evaluation of our statistical parser was done on this treebank. An example where this approach works well, connecting the user-level syntax directly with the semantic MPTP/TPTP layer, is the following theorem RCOMP_1:5:
for s,g being real number holds [.s,g.] is closed
which in TPTP becomes:
The internal Mizar XML representation is transformed to the parse tree shown in Fig. 1, which can be easily postprocessed into the TPTP format above. While this direct use of the semantic (constructor) layer can provide a lot of disambiguation, there are several issues when connecting it with the user-level syntax. First, there are syntactic macros like Mizar expandable modes (types) [1]. These macros do not exist in the semantic layer and are expanded by the Mizar processing into larger collections of adjectives and types, taking various parameters from the context. For example, the user-level type Function of X,Y is recursively expanded via the following macros (expandable modes):
This leads to the following semantic representation of Function of X,Y as:
It is quite a nontrivial requirement for the statistical parser to go from the input string Function of X,Y to the above representation. The original symbol Function needs to be replaced by the (possibly parameterized) adjectives, and the type Element of is applied to a single argument bool [:X,Y:] consisting of two functions applied to the arguments X and Y.
Furthermore, similar phenomena are often encountered in other situations as well. For example, the function composition * changes the order of arguments. To deal with such phenomena, in the second version of our export we have switched to the syntactic (pattern or notation) layer of Mizar [1], where our task is limited to symbol disambiguation. This layer is going to be mapped to the semantic layer through a large number of "syntactic processing" Prolog-like rules, such as those needed for the above examples. For instance, the change of the function composition arguments can be encoded in TPTP as follows:
This says that under appropriate type constraints (A and B being functions) the arguments of the syntactic pattern nk3_funct_1 should be swapped to obtain the proper order of arguments of its parent pattern nk6_relat_1. This pattern may again be mapped to some parent syntactic pattern, or to the semantic (constructor) level. This means that the theorem-proving phase will either have to be preceded by a phase that processes (expands) these Horn-like rules, or the TPTP encoding of such rules will have to be added to the generated ATP problems. Our initial experiments show that both these approaches should be feasible. Our theorem is then represented as the parse tree in Fig. 2 (with the different nonterminals in bold), which is easily postprocessed into the following "syntactic TPTP":
The parsing performance (Section III) takes some penalty when using a much higher number of syntactic-level patterns rather than a smaller number of constructors; however, there are still many possibilities for improving the statistical parsing.

B. Mizar types
In the HOL setting used by Flyspeck, types are unique and do not intersect. This allows their simple use in the PCFG setting as "semantic categories" corresponding, e.g., to word senses when using PCFG for word-sense disambiguation. Such categories are useful for learning parsing rules from the treebank [11]. In Mizar, each term has in general many adjectives (soft types [21]) like finite, natural, Function-like, non empty, etc., which are computed during the type analysis. Only some of them are usually needed to allow a term to be an argument of a particular function or predicate. Since it is more involved to learn such complex typing rules statistically, in the first version we use only the top of the type hierarchy (the Mizar type Set) as the result type of all terms, and types occur only [...]. A problem with this approach is that allowing any term to be an argument of any function/predicate may lead to a great proliferation of ill-typed terms during parsing. This was indeed an issue in our first untyped export of Flyspeck [10], [11], which used just raw HOL parse trees and only context-free parsing rules. It seems, however, that the recently introduced deeper (context-aware) parsing rules [11] reduce such proliferation quite significantly. When using a combination of subtrees of depth 4-8, the top-20 success rate (the number of examples where the correct parse appears among the 20 most probable parses proposed by the parser) in parsing Mizar (100-fold cross-validation) is already around 60%, while it is only about 30% for the simple context-free approach (Table II).

III. Evaluation
The machine-learning evaluation for Mizar is done in the same 100-fold cross-validation scenario as for Flyspeck in [10], [11]. The evaluation is done both for the (simpler and imperfect) semantic (constructor-level) encoding, and for the more complex (but necessary) pattern-level encoding.
In each case we create the disambiguated grammar trees and the corresponding ambiguous sentences from all (about 61400) top-level MML theorems and definitions. We split them randomly into 100 equally sized chunks of about 614 trees and their corresponding sentences. The grammar trees serve for training and the ambiguous sentences for evaluation. For each testing chunk C_i (i ∈ 1..100) of 614 sentences we learn the probabilistic grammar P_i on the union of the remaining 99 chunks of grammar trees. This can take considerable time (hundreds to thousands of minutes) for Mizar when using deeper subtrees for learning. There are roughly 100 million subtrees of depth 4-8 in the MML. The evaluation phase, i.e., the parsing of the held-out chunk C_i, is however typically fast, taking on average less than 1 second for each ambiguous sentence. The numbers of correctly parsed formulas and their average ranks across the several 100-fold cross-validations are shown in Table II. The relatively poorly performing context-free (subtree depth 2) method is evaluated only for the constructor-level encoding, in order to have a rough comparison with the performance on Flyspeck in [10], [11]. While there are still many ways to improve the performance, the top-20 and top-1 success rates, in the ranges of 60-64% and 32-37% respectively, are very encouraging.
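The evaluation protocol can be summarized by the following sketch of n-fold cross-validation with a top-k success measure. The callbacks `train` and `parse`, the helper names, and the fixed random seed are assumptions for illustration; they stand in for the grammar learning and the chart parser, which are not reproduced here.

```python
import random

def split_into_chunks(items, n_chunks, seed=0):
    """Shuffle the items and deal them into n_chunks near-equal chunks."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_chunks] for i in range(n_chunks)]

def top_k_success(ranked_parses, gold, k=20):
    """True if the gold tree is among the k most probable parses."""
    return gold in ranked_parses[:k]

def cross_validate(examples, n_chunks, train, parse, k=20):
    """examples: list of (ambiguous_sentence, gold_tree) pairs.
    train(trees) -> grammar; parse(grammar, sentence) -> ranked parse list.
    Returns the fraction of sentences whose gold tree is in the top k."""
    chunks = split_into_chunks(examples, n_chunks)
    hits = 0
    for i, test_chunk in enumerate(chunks):
        # Train on the trees of all chunks except the i-th one.
        train_trees = [tree for j, chunk in enumerate(chunks) if j != i
                       for _, tree in chunk]
        grammar = train(train_trees)
        hits += sum(top_k_success(parse(grammar, s), gold, k)
                    for s, gold in test_chunk)
    return hits / len(examples)
```

With `n_chunks=100` and about 61400 examples this corresponds to the setting described above, where each held-out chunk has about 614 sentences.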

IV. Online Parsing System
Similarly to the work done for Flyspeck, the parsing toolchain is deployed as an online service. Fig. 3 presents a screenshot of the system, which is available at http://grid01.ciirc.cvut.cz/~cek/parse_miz/. The service visualizes the overloading disambiguation as superscripts and further uses hyperlinking to the HTMLized MML. This allows Mizar users to write ambiguous formulas and see their most probable interpretations. This is similar to systems for "wikification" [15] of named entities in natural language texts, from which our project takes some inspiration [12]. To make the probabilistic parsing sufficiently fast, we again limit the number of required parses to 20, and preselect only the 1024 closest grammar trees for the grammar training. This is done by running a k-nearest neighbor (k-NN) filter using n-gram (unigram, bigram and trigram) representations of all Mizar theorems and definitions in their ambiguous form. With this preselection, the system takes on average 5 seconds for the complete processing of a query. Such processing includes the k-NN filtering, the grammar induction, the probabilistic parsing, and finally the HTML-ization.
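The preselection step can be sketched as a k-NN filter over n-gram counts, as below. The text does not specify the similarity measure, so cosine similarity over raw counts is an assumption here, as are the function names and the whitespace tokenization.

```python
from collections import Counter
from math import sqrt

def ngram_features(tokens, n_max=3):
    """Count all unigrams, bigrams and trigrams of a token sequence."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[tuple(tokens[i:i + n])] += 1
    return feats

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest(query, corpus, k=1024):
    """Indices of the k corpus sentences most similar to the query."""
    q = ngram_features(query.split())
    scored = [(cosine(q, ngram_features(s.split())), i)
              for i, s in enumerate(corpus)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Toy corpus of ambiguous sentences (illustrative only).
corpus = [
    "f * g is Homeomorphism of T",
    "x + y = y + x",
    "f * h is Function of X , Y",
]
selected = nearest("f * h is Homeomorphism of T", corpus, k=2)
```

On this toy corpus `selected` is `[0, 2]`. Only the trees of the selected sentences would then be used for grammar induction, which keeps the per-query training set small.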
The web interface shows several typical examples of queries, one of them being: "f * h is Homeomorphism of T". Note that the probability (p = −22.05) of * being partial function composition (PARTFUN1:NK1) is much higher than that of it being its specialized case on many-sorted sets (CLOSURE2:NK8) or a composition of category morphisms (CAT_1:NK5). This is very likely thanks to the presence of the type Homeomorphism (TOPGRP_1:NM3) in the context, because such combinations of symbols have been previously seen in theorems like TOPGRP_1:31. Unlike in the service for Flyspeck, where all variables were internally alpha-normalized, the names of variables are taken into account when disambiguating Mizar. For example, using f * g in the above example instead of f * h yields an even higher probability for function composition, likely because of the even greater similarity to theorem TOPGRP_1:31.
While such improvements pose additional challenges to the parsing system, we believe that the addition of features like this and of other natural-language-like Mizar mechanisms takes our work significantly closer to parsing human-level mathematics written in LaTeX. The immediate future work in this line of research includes parsing of the human-like Mizar proofs, and using automated theorem provers for full semantic understanding.

Figure 3: Screenshot of the online parsing system

Table II: Evaluation on MML. The top-20 success is the number of examples where the correct parse appears among the 20 most probable parses returned by the parser.