Published May 2, 2019 | Version v1
Software (Open Access)

Software supporting the paper: "Comparing machine learning models to choose the variable ordering for cylindrical algebraic decomposition"

Authors: M. England and D. Florescu (Coventry University)


This toolbox supports the results in the following publication:

M. England and D. Florescu.  Comparing machine learning models to choose the variable ordering for cylindrical algebraic decomposition.  To appear in Proc CICM 2019, Springer LNCS, 2019.
Preprint:  arXiv:1904.11061

The authors are supported by EPSRC Project EP/R019622/1: Embedding Machine Learning within Quantifier Elimination Procedures.

This toolbox requires the NLSAT database, which must be downloaded separately.
The first file in this toolbox was run in Matlab R2018b. The CAD routine was run in Maple 2018, with an updated version of the RegularChains Library (downloaded in February 2019) that contains bug fixes and additional functionality.
The training and evaluation of the machine learning models was done using the scikit-learn package v0.20.2 for Python 2.7.

I. << Load_NLSAT.m >>         
    converts the redlog files corresponding to all problems with 3 variables to .txt files 
    NLSAT database => \poly\poly1.txt - \poly\poly7200.txt
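The conversion itself is done by the Matlab script; purely as an illustration, the batch loop over the database could be sketched in Python as follows. The helper name `convert_redlog_dir` and the flat source layout are hypothetical, and the real script additionally parses the redlog syntax rather than copying files verbatim:

```python
import glob
import os

def convert_redlog_dir(src_dir, dst_dir):
    """Copy each redlog problem file in src_dir to a numbered plain-text
    file poly<N>.txt in dst_dir (hypothetical layout; the real Matlab
    script also translates the redlog content)."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for i, path in enumerate(sorted(glob.glob(os.path.join(src_dir, "*"))), start=1):
        with open(path) as fin:
            text = fin.read()
        out_path = os.path.join(dst_dir, f"poly{i}.txt")
        with open(out_path, "w") as fout:
            fout.write(text)
        written.append(out_path)
    return written
```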
II. <<>>
        Input:     <<poly\poly*.txt>> 
        Output:    <<comp_times\comp_times*.txt, comp_times\challenging_formulas_more_than_64s.txt>>
        # observation: comp_times is the folder containing the computation times for all problems, which are later split into training and testing

        - for each problem in \poly, the script copies the corresponding file to \poly\pol.txt and calls the following Maple script
            Input:    <<poly\pol.txt, poly\var_order.txt, poly\time_limit.txt>>
            Output:    <<comp_times\comp_times*.txt>>

        - for each problem, the time limit starts at 4s and doubles whenever all of the orderings time out
        - other files used/generated by the script:
        <<poly\rand_order_path.txt>> # pre-generated pickle file containing a fixed randomised order of the 7200 file names in poly\; it is used only for generating the testing data
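The doubling scheme above can be sketched as follows. The function `find_time_limit` and the `timed_out` callback are hypothetical stand-ins for invoking the Maple CAD script under a time limit:

```python
def find_time_limit(orderings, timed_out, start=4.0, cap=128.0):
    """Return the smallest limit (starting at `start` seconds and doubling)
    at which at least one variable ordering finishes, plus the per-ordering
    success flags.  `timed_out(ordering, limit)` stands in for running the
    CAD computation for one ordering under a time limit."""
    limit = start
    while limit <= cap:
        results = {o: not timed_out(o, limit) for o in orderings}
        if any(results.values()):  # at least one ordering finished in time
            return limit, results
        limit *= 2.0  # all orderings timed out: double the limit and retry
    return None, {}
```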

III. <<>>
        Input:    <<poly_test\poly*.txt>>
        Output:    <<poly\comp_times_test\comp_times*.txt>>

        - for each problem in poly_test, the script copies the corresponding file to \poly_test\pol.txt and calls the following Maple script
            Input:    <<poly_test\pol.txt, poly_test\var_order.txt, poly_test\time_limit.txt>>
            Output:    <<poly\comp_times2\comp_times.txt>>

        - for each problem, the time limit is fixed at 128s. 

IV.    <<data_gen_ml_test.ipynb>>
        Input:    <<'poly\poly*.txt', 'comp_times_test\comp_times*.txt'>>
        Output:    <<'ML_data\y_test.txt', 'ML_data\features_test.txt', 'ML_data\no_optimal_ordering_test.txt', 
                'ML_data\comp_times_test.txt', 'ML_data\formula_numbers_test.txt'>>
        - generates the input/output testing data for ML
        - generates formula_numbers_test.txt, which is later used to work out the training data
        - all formula numbers from poly_test are written to <<ML_data\formula_numbers_test.txt>> to be used later for ML

V.    <<data_gen_ml_brown_test.ipynb>>
        Input:    <<'poly\poly*.txt', 'comp_times_test\comp_times*.txt'>>
        Output:    <<'ML_data\y_brown_test.txt', 'ML_data\formula_numbers_brown_test.txt'>>
        - generates the Brown heuristic predictions on the testing data. 
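A minimal sketch of Brown's heuristic, assuming its usual formulation: variables are eliminated in ascending order of the key (maximum degree of the variable, maximum total degree of a term containing it, number of terms containing it). The representation below (each polynomial as a list of exponent tuples) is chosen for the sketch and is not the representation used by the notebook:

```python
def brown_order(polys, nvars):
    """Sketch of Brown's heuristic for CAD variable ordering.  Each
    polynomial is a list of monomials; each monomial is a tuple of
    exponents, one entry per variable.  Returns variable indices in
    suggested elimination order (lowest key first)."""
    monomials = [m for p in polys for m in p]
    def key(v):
        terms = [m for m in monomials if m[v] > 0]       # terms containing v
        max_deg = max((m[v] for m in terms), default=0)  # max degree of v
        max_tdeg = max((sum(m) for m in terms), default=0)  # max total degree
        return (max_deg, max_tdeg, len(terms))
    return sorted(range(nvars), key=key)
```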
VI.    <<data_gen_sotd_test.mpl>>
        Input:    <<'poly\poly*.txt', 'comp_times_test\comp_times*.txt'>>
        Output:    <<'ML_data\y_sotd_string.txt','ML_data/formula_numbers_sotd.txt'>>
        - generates the sotd predictions on the testing dataset
        - observation: the numbers in formula_numbers_sotd.txt are in a different order than in formula_numbers_test
VII. <<>>
        Input:    <<ML_data\formula_numbers_test.txt, comp_times\comp_times*.txt, poly\poly*.txt>>
        Output:    <<ML_data\y_train.txt, features_train.txt, no_optimal_ordering_train.txt, formula_numbers_train.txt>>
        - generates the input/output training data for ML (the heuristics have nothing to do with this step)
        - formula_numbers_train.txt includes {1,...,7200}\formula_numbers_test.txt
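The training/testing split described above is a plain set complement over the 7200 formula numbers; a sketch (the helper `training_numbers` is hypothetical):

```python
def training_numbers(test_numbers, total=7200):
    """Formula numbers not reserved for testing, in ascending order,
    i.e. {1,...,total} minus the test set."""
    test_set = set(test_numbers)
    return [n for n in range(1, total + 1) if n not in test_set]
```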

VIII. <<>>
        Input:    <<'ML_data\y_train.txt','ML_data\features_train.txt','ML_data\features_test.txt'>>
        Output:    <<'ML_data\y_'+model_class_name+'_test.txt'>>
        - generates ML predictions on the testing dataset for: DT, MLP, SVC and KNN
        Steps for training a model:
        - the ranges for performing CV are stored in a dictionary hard-coded in the script, in the part corresponding to model_class_name
        - after the ranges are defined, the model is selected by setting, e.g., model_class_name='DecisionTreeClassifier'
        - to train a model, the user should select the option to perform grid-search CV on the training dataset
        - the best parameters returned by the grid search should be entered manually in models_dictionary under the 'def' key of the dictionary
        - the ranges should then be reset and the process repeated until the user is confident that the CV performance cannot be improved by extending the range or using a finer resolution
        - after training, the script should be run again; this time the user should skip training and select 'y' when asked 'Continue with the previous default parameters from ""?'
        - this will generate the model predictions and save them in 'y_'+model_class_name+'_test.txt'
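The grid-search workflow above can be sketched with scikit-learn. The function `tune_and_predict` is a hypothetical condensation of the script's interactive steps, and the parameter ranges below are placeholders, not the ranges used in the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def tune_and_predict(X_train, y_train, X_test, param_grid=None):
    """Grid-search a DecisionTreeClassifier over `param_grid` with 5-fold
    CV, refit on the full training data with the best parameters found,
    and predict the test set.  Returns (best parameters, predictions)."""
    if param_grid is None:
        # placeholder ranges; the real ranges live in the script's dictionary
        param_grid = {"max_depth": [2, 4, 8, 16], "min_samples_split": [2, 8, 32]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_params_, search.predict(X_test)
```

The same pattern applies to the other model classes (MLP, SVC, KNN) by swapping the estimator and its parameter grid.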
IX.    <<test_comparative_performance.ipynb>>    
        Input:    <<'ML_data/y_sotd_string.txt', 'ML_data/formula_numbers_sotd.txt', 'ML_data/formula_numbers_test.txt', 
                'ML_data/y_brown_test.txt', 'ML_data/y_test.txt', 'ML_data/y_LinearSVC_test.txt', 'ML_data/y_SVC_test.txt',
                'ML_data/y_MLPClassifier_test.txt', 'ML_data/y_KNeighborsClassifier_test.txt', 'ML_data/y_DecisionTreeClassifier_test.txt',
                'ML_data/comp_times_test.txt'>> # yields time_... and accuracy_... for all methods
        Output: Prints the performance of each method
        - computes the accuracy and total time using the final prediction on the testing dataset with all methods
        - plots a histogram showing the percentage increase in computing time with every method, compared to the minimum time
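The accuracy/total-time comparison can be sketched as follows. The function `compare_methods` is hypothetical; it assumes each method outputs one chosen ordering index per problem and that `comp_times[i]` lists the CAD time for each ordering of problem i:

```python
def compare_methods(predictions, comp_times):
    """For each method, compute the accuracy (fraction of problems where
    the chosen ordering achieves the minimum time) and the total time of
    the chosen orderings over the whole testing set."""
    results = {}
    best = [min(times) for times in comp_times]  # fastest time per problem
    for name, choice in predictions.items():
        chosen = [comp_times[i][choice[i]] for i in range(len(comp_times))]
        acc = sum(t == b for t, b in zip(chosen, best)) / len(comp_times)
        results[name] = {"accuracy": acc, "total_time": sum(chosen)}
    return results
```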


Files (6.7 MB)


Additional details

Related works: this record is cited by arXiv:1904.11061.