SML-Bench – A Benchmarking Framework for Structured Machine Learning

The availability of structured data has increased significantly over the past decade, and several approaches to learn from structured data have been proposed. These logic-based, inductive learning methods are often conceptually similar, which would allow a comparison among them even though they stem from different research communities. However, so far no efforts have been made to define an environment for running learning tasks on a variety of tools covering multiple knowledge representation languages. With SML-Bench, we propose a benchmarking framework to run inductive learning tools from the ILP and semantic web communities on a selection of learning problems. In this paper, we present the foundations of SML-Bench, discuss the systematic selection of benchmarking datasets and learning problems, and showcase an actual benchmark run on the currently supported tools.


Introduction
With the growth of the number and size of data sources over the last years, there is an increasing demand for algorithms and tools to perform accurate analysis of these datasets. History in computer science has shown that the main driver of scientific advances, and in fact a core element of the scientific method as a whole, is the provision of benchmarks to make progress measurable. A famous example from database benchmarking (specifically TPC-A) is considered to have been the motor that improved transaction performance of relational databases by an order of magnitude on equal hardware in the 90s. Other, more recent benchmarking areas related to semantic technologies have been question answering (QALD), ontology matching (OAEI), as well as graph and triple store query performance (LDBC). All of those have led to significant performance improvements.
One area which is not yet extensively covered by benchmarks is symbolic supervised machine learning from structured data. In this task, background knowledge is modelled using RDF, OWL, Prolog, or other knowledge representation languages. Within this background knowledge, entities are selected as positive and negative examples (supervised learning). Based on those examples, algorithms learn logical formulas, e.g. Horn rules or OWL class expressions, which separate the positive from the negative examples. These formulas or rules are later used to classify further, unseen entities. For instance, given a dataset describing chemical compounds in which negative examples are compounds known to cause cancer and positive examples are compounds which do not cause cancer, the algorithms would induce formulas describing what causes cancer. Two major advantages of those methods are that 1.) they can work with complex background knowledge including inference and 2.) the results can be interpreted and understood by humans.
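To make the setting concrete, the following minimal sketch illustrates the learning problem just described: background knowledge, positive and negative examples, and a candidate hypothesis that separates them. All facts and names are hypothetical, and actual systems operate on Prolog programs or OWL ontologies, not Python sets.

```python
# Hypothetical background knowledge about chemical compounds,
# encoded as simple Python sets for illustration only.
has_aromatic_ring = {"c1", "c2", "c3"}
has_nitro_group = {"c2", "c3", "c4"}

pos_examples = {"c2", "c3"}  # e.g. compounds with the target property
neg_examples = {"c1", "c4"}  # e.g. compounds without it

def covers(compound):
    """A candidate hypothesis: 'has an aromatic ring AND a nitro group'.
    An ILP system would express this as a Horn rule, a DL learner as an
    OWL class expression."""
    return compound in has_aromatic_ring and compound in has_nitro_group

# A good hypothesis covers all positives and excludes all negatives.
print(all(covers(c) for c in pos_examples))   # True
print(any(covers(c) for c in neg_examples))   # False
```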
While a large body of research work has been devoted to this area, the evaluation scenarios are scattered and no generally accepted reference benchmarking platform exists. There are at least two major reasons for this: 1.) The use of different knowledge representation languages makes results very difficult to compare. For instance, logic programs are incomparable in terms of expressivity to the description logics underlying OWL. In practice, the knowledge modelling differs to such an extent that automatic conversion methods do not produce satisfactory results. 2.) The effort required to model the background knowledge using semantic technologies is considerable. Doing this for many learning problems and setting up a repository is a major undertaking. Overall, this has led to benchmarks being scattered across different publications and scientific communities.
To overcome this problem, we have performed a systematic scientific literature analysis in order to collect relevant benchmarks. Those were then translated into different knowledge representation languages where needed. For the execution of benchmarks, a framework has been implemented which allows running different learning systems over a given set of learning tasks and measuring performance metrics. Moreover, wrappers for the systems Progol, Golem, Aleph, FuncLog, TopLog, ProGolem, TreeLiker and DL-Learner have been written to include them in the framework. Overall, our main contributions are:
- A systematic survey of articles published in the last 10 years in relevant scientific conferences and journals, collecting benchmarks for structured machine learning.
- The preparation of nine learning tasks, which constitute our current benchmark suite, including translations of the background knowledge into OWL and different logic programming dialects.
- The creation of a framework (called SML-Bench), which allows comparing systems that differ in the knowledge representation languages they support and the programming languages they are written in.
- The creation of wrappers for eight learning systems for their inclusion in SML-Bench.
The paper is structured as follows: In section 2 we give an overview of related work. Section 3 discusses the challenges of structured machine learning, and section 4 describes our dataset review process. In section 5 we introduce our benchmark framework, and we describe our evaluation setup and results in section 6. After a discussion in section 7, we give an outlook and conclude our paper in section 8.

Related Work
This section presents a brief review of some of the prominent related benchmarking efforts in the machine learning community. Machine learning is a vast field with a variety of different domains and learning problems. Our focus in this paper is on symbolic machine learning approaches on expressive structured data. An analogous systematic benchmarking initiative has not been attempted so far, to the best of our knowledge. We describe related projects below.
In terms of benchmarking data collection, some of the well-known benchmark suites are UCI [5], Statlog [11] and Statlib [25]. Most of the datasets in these repositories are in tabular or CSV format, which can be categorised as structured data. However, the underlying structure of the data is often flat and simple, which is not the primary focus of this paper.
A noticeable effort for benchmarking, covering both datasets and algorithms, can be seen in libsvm [6]. The authors have collected datasets from various repositories and scaled them to make them compatible with the format required by libsvm. This work mainly focuses on tabular data presented in a preprocessed format, which renders it beyond the scope of SML-Bench.
BioBench [1] is a benchmark suite for bioinformatics-related problems. It contains different tools that perform a variety of tasks associated with numerous applications in the bioinformatics field. This benchmark particularly focuses on genomics applications.
There are some benchmarking efforts which emphasise a particular task, e.g. reinforcement learning. One approach in this direction aims at implementing reinforcement learning algorithms in a uniform manner so that new algorithms can be easily tested on a set of different problems. The learning problems in this benchmark are synthetically generated and contain reinforcement learning-specific parameters like states, rewards, actions, etc.
A benchmark focusing on neural networks is Shirley's Next-Generation Benchmark Suite. It includes numerous types of neural networks that perform different tasks on the provided data. The benchmark suite contains several datasets collected from different internet resources, together with results from simulations of different machine learning algorithms. It covers bitmap or numerical data tailored for certain neural network techniques, which is not the focus of our approach.
Most of the recent benchmarking and data collection attempts cover big data tools, in particular scalability tests. The ongoing effort Bench-ML provides a minimal benchmark of learning tools for preprocessing, visualisation and machine learning algorithms for commonly used open source implementations, e.g. in R, Python, H2O and Spark-MLlib. This benchmark can be used for testing the scalability, speed and accuracy of the above-mentioned machine learning tools. However, it uses only one dataset, from an airline, for the evaluation. This dataset is tabular and lacks any semantic structure.
Another big data tool comparison effort assesses Redshift, Hive, Shark, Impala and Stinger, with the goal of providing understandable and reproducible results. The benchmark data includes unstructured HTML documents and tables containing summary information.
Fox et al. [8] evaluate multiple existing big data benchmark suites with respect to their coverage of the so-called Ogres facets, derived from an analysis of real-world big data applications. The approach by Ming et al. [19] is more focused on the creation of a big data generator that creates different types of data for numerous big data-related tasks, reflecting properties of real-world datasets. It supports the generation of structured, semi-structured and unstructured data such as text, graphs, or tables.
In a Spark-specific benchmarking suite [18], a variety of different algorithms related to machine learning (logistic regression, SVM, matrix factorisation), graph computation (PageRank, SVD++, TriangleCount), SQL queries and streaming applications have been presented. The tests are carried out using synthetic datasets with a variety of workloads. Some other big data-related benchmarks are [9,10,4] and [3]. However, the provided datasets are not sufficiently structured for our benchmarking approach.

One of the noticeable benchmarks in the semantic web community assessing query performance is proposed by the Linked Data Benchmark Council (LDBC) [2]. It comprises two sets of benchmarks covering semantic publishing and social networks. RdfStoreBenchmarking is a repository that provides an exclusive collection of references to RDF benchmarks, benchmarking results and papers about RDF benchmarking. The main focus of most of these benchmarking efforts is to collect different types of RDF data or to provide meaningful schema information. Some of the tasks include evaluating the query performance of different semantic web repositories that provide a SPARQL endpoint. Other directions explored in these efforts include query performance on graphs with different properties (e.g. regarding their connectedness), or linked data translation and integration tasks. Some benchmarks also focus on measuring the performance of federated query engines, linked data quality assessment or data fusion systems.

In summary, most of these systems are concerned with basic triple storage and retrieval performance. We, on the other hand, are interested in datasets that are particularly suitable for supervised machine learning tasks. We require that the data allows deriving a classification problem as described in section 1. The benchmarking projects mentioned above do not deal with this type of data explicitly, making our benchmarking effort considerably different from these existing approaches.

Challenges of Structured Machine Learning
In this section, we describe some of the challenges of machine learning on structured data, grouped by category.

Background Language Expressivity A powerful property of machine learning algorithms operating on structured data is their ability to reason about the background knowledge. Generally, more expressive languages for the background knowledge, e.g. an expressive description logic such as SROIQ, have a higher (worst case) reasoning time complexity than a lightweight language. Structured machine learning algorithms usually either include a reasoner as part of their architecture or integrate reasoning capabilities into the core of their algorithms (sometimes without completeness and correctness guarantees for the results).
Size Larger datasets affect the efficiency of structured machine learning tools. There are three main aspects of size: (1) the size of the schema, (2) the size of the instance data and (3) the number of examples. The first aspect, schema size, mainly affects the hypothesis space, as the learned concepts are constructed from the available schema. The schema size is the number of predicates in the background knowledge (where OWL classes are unary and OWL properties are treated as binary predicates). In the absence of a particular language bias, e.g. restrictions on the length of learned concepts, nested concepts, etc., the size of the hypothesis space can become extremely large for big schemata. The second and third aspects, i.e. the instance data size and the number of examples, mainly affect the time for hypothesis checking, i.e. validating whether a concept fits the given examples well.
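The following back-of-the-envelope sketch (not any tool's actual search procedure) illustrates how quickly the hypothesis space grows with the schema size, here simply counting conjunctions of distinct class names up to a fixed length:

```python
from itertools import combinations

def count_conjunctions(num_classes, max_length):
    """Number of conjunctions of up to max_length distinct class names.
    Real hypothesis spaces are far larger, since they also include
    disjunction, negation, quantifiers and nesting."""
    return sum(
        len(list(combinations(range(num_classes), k)))
        for k in range(1, max_length + 1)
    )

# Doubling the schema size increases the candidate space by more than 16x.
print(count_conjunctions(10, 4))   # 385
print(count_conjunctions(20, 4))   # 6195
```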
Target Concept Language Given the same background knowledge, the performance of a structured machine learning algorithm still heavily depends on the target concept language, which is not necessarily the same as the background knowledge language. For instance, some algorithms can learn arbitrarily nested predicate structures (e.g. "parents having studied in Germany" could be represented via nesting of the predicates parent and studied). Moreover, in particular for description logics, available concept constructors, such as existential and universal quantification or qualified cardinality restrictions, can be included or excluded. Including them increases the search space, but potentially also allows finding better solutions.
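The nesting mentioned above can be pictured with a small hypothetical sketch; real systems would express the same concept as a Horn clause or an OWL class expression (roughly EXISTS parent.(EXISTS studied.{Germany})), not as Python functions:

```python
# Hypothetical relations for illustration only.
parent_of = {"bert": ["anna"]}        # person -> list of parents
studied_in = {"anna": "Germany"}      # person -> country of study

def studied_in_germany(person):
    return studied_in.get(person) == "Germany"

def parent_studied_in_germany(person):
    # Nesting: the inner predicate is evaluated on individuals reached
    # through the outer 'parent' relation.
    return any(studied_in_germany(p) for p in parent_of.get(person, []))

print(parent_studied_in_germany("bert"))  # True
```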
Within our evaluation, we will discuss how the tools perform on the selected datasets with respect to the above challenges (often also referred to as choke points in the benchmarking literature).

Datasets
To review and collect datasets already proposed or used in other works, we performed an extensive literature review.The five co-authors investigated publications that appeared in major conferences and journals related to structured machine learning in the past 10 years.An overview of the conferences and journals covered is given in Table 1.
The actual review was performed by either reviewing the accepted papers as linked in the conference schedule or in the corresponding proceedings and journal issues. The papers were scanned for relevant learning tasks involving datasets that are suitable for our benchmarking approach. The following selection criteria were used to determine whether a learning task is relevant:

Paper is available A first requirement was to be able to access an electronic version of the paper on the Web. This included PDF versions of the accepted submissions that were made available on the conference website, as well as papers provided by the publishers of the corresponding conference proceedings or journals. All considered publications met this criterion.
Availability of the Dataset A major requirement regarding the actual data was its accessibility on the Web, so that we could investigate its suitability. In the easiest case, downloads and further information were provided on dedicated Web pages. However, datasets were frequently just referred to by name. In such cases we considered a dataset available if we could find an entry via a search engine or in one of the major public machine learning dataset repositories with an unambiguously matching name, file name or description. If datasets were only indirectly referenced by pointing to other publications introducing or using them, we considered them available if we could find a corresponding Web site after (transitively) following and reviewing the referenced papers. Although the portion of datasets actually available varies among the considered literature sources and over time, we observed that only a small fraction, approximately 40%, of the datasets were accessible, and that the majority of those stems from benchmark dataset repositories.

Structure of the Dataset
Since the main aim of our framework is to provide benchmark scenarios for inductive learning tools working on structured logical representations, we focused on datasets that contain logical relations between single data entries or attributes. Hence, flat datasets mainly describing data with numeric attributes were not considered. However, data that does not comply with this requirement but could easily be enriched with other structured data was also further investigated in our review. An example of such a case would be data from clinical trials that could be linked to the Gene Ontology or phenotype ontologies like the Mammalian Phenotype Ontology.
Dataset Size A further requirement was that a candidate dataset should be sufficiently complex in terms of its size. The main aim behind this requirement was to not just provide small toy examples that would show only negligible differences between the benchmarked tools. Ideally, datasets should cover non-synthetic, real-world problems to prove the practical applicability of tools obtaining high SML-Bench scores.
Derivable Inductive Learning Problems The last requirement was that the described dataset represents an inductive learning scenario, or that such a scenario could be derived trivially. This means that a supervised machine learning task with positive and (optionally) negative examples is provided or can easily be constructed. This might not be the case if, for example, structured data is used in an unsupervised setting like clustering.
The review was performed in two rounds: In the literature review phase, candidate datasets were selected based on their description in the corresponding paper or after briefly checking the actual data. The publications were then marked as either not, maybe, or likely containing suitable datasets. In the candidate review round, all papers maybe or likely containing suitable datasets were examined in depth. Overall, 6 890 publications were reviewed and 160 candidate datasets selected.

For the datasets that were found and could be used for our framework, data conversions and adaptions were performed to make them work with all tools, where necessary. Besides common and simple formats like CSV or relational databases, data was also provided in special file formats like the Chemical Table file (CTfile). Where no converters were available, dedicated parsers and converters had to be written to transform such data into the different KR language formats. Apart from the conversion of the actual data, metadata was added. This additional information comprised TBox axioms in the case of OWL background knowledge, and mode declarations which had to be added for each Prolog-based learning system. In addition to these efforts, the usual testing cycles were performed to check whether the (constructed) learning problem contains enough and consistent information for inductive learning.

As of March 2017, 9 out of 78 datasets were converted and integrated into SML-Bench. Due to the high effort required to prepare benchmarks of good quality, including the configuration and verification in the participating inductive learning programs, this is an ongoing task for which we also welcome support from the community. The datasets were then labelled with the initial version tag v0.1 and added to our learning task repository. After internal reviews and discussions or external user feedback, this version number can be increased to allow a unique reference to the dataset and represent its maturity.
Apart from conference and journal publications, we also investigated public machine learning dataset repositories mentioned in the reviewed literature. The investigated repositories are summarised in Table 2. Where possible, we pre-filtered the repositories to classification datasets, ignoring regression or clustering use cases. However, we also examined all datasets that were not grouped into any of those categories or could not be pre-filtered. An overview of the datasets that are part of the SML-Bench framework is given in Tables 3 and 4.
The datasets Lymphography, Mammographic, Pyrimidine and Suramin are rather small in terms of their instance and schema data and have a very simple OWL representation. Mutagenesis, Hepatitis and Carcinogenesis can be considered 'medium size' datasets w.r.t. our benchmarking repository. While the OWL representation of the Mutagenesis dataset also shares the very simple DL family AL(D), Hepatitis and Carcinogenesis are more complex. However, with ALC(D) being the most expressive DL used, they can still be considered simple. In terms of their example sets, Mutagenesis provides the smallest number (84), followed by Carcinogenesis (298) and Hepatitis (500).
The most complex dataset w.r.t. the expressivity of its OWL representation is NCTRER. In size, however, it can also be considered medium, be it in terms of schema and instance data, or w.r.t. the number of examples.
The biggest dataset we provide is Premier League.It provides a lot of different statistics which are expressed through an extensive (though still simple) schema and comprehensive instance data.

SML-Bench Framework
With SML-Bench, our aim is to provide a framework which is open and extensible but already comes with predefined benchmark scenarios and presets for relevant tools, thus being ready for use. The core framework is developed in the Java programming language and is intended to be used via a command line interface. However, the system can easily be extended to support graphical user interfaces. SML-Bench is provided as free software and is accessible on the Web at http://github.com/AKSW/SML-Bench.
Architecture The overall architecture of SML-Bench is shown in Figure 1. The framework's main building blocks are the tools to execute during a benchmark run and the benchmark scenarios. SML-Bench provides means to connect a set of inductive learning tools with such scenarios to run the evaluation on. This overall setting is held in a benchmark configuration, and the framework will take care of providing the tools with the required data, performing the benchmark and collecting the results.
To support a wide range of tools and the introduction of one's own inductive learning implementations, the SML-Bench framework follows a lightweight extensibility approach. Based on the relations between benchmark scenarios, their background knowledge, the utilised KR languages, and the benchmarked inductive learning systems, we define some conventions which allow extending the framework with new use cases and tools without any changes to the code base or further wiring.
Benchmark scenarios To better structure scenarios and allow different benchmark variations based on the same data, we distinguish between learning tasks and learning problems. Learning tasks define the actual background knowledge the benchmark is run on for learning problems. Learning problems are thus learning task-specific and comprise a set of positive and (optionally) negative examples, as well as optional tool settings dedicated to the given example declarations (cf. Figure 1). Accordingly, varying example constellations or tool configurations are realised as separate learning problems.
In our 'convention over configuration' approach, the files containing the background knowledge for a learning task A, given in a knowledge representation language L, are expected to reside in the directory path learningtasks/A/L/data/ (relative to the framework's root directory). Examples for L already in use are owl and prolog. If additional tool-specific data is required (as in the case of the Prolog-based tools, which usually require particular mode declarations), e.g. for a tool X, this should be put into the directory learningtasks/A/L/data/X/.
Data might be spread across multiple files which are all read and merged during a benchmark run.
An individual learning problem P can be defined by adding a file containing positive examples and an optional file containing negative examples to a directory named learningtasks/A/L/lp/P/. A learning problem might also comprise tool-specific configurations, which are put into a file named after the respective tool with the file suffix .conf, e.g. learningtasks/A/L/lp/P/X.conf.
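A minimal sketch of these conventions follows; the path patterns are the ones given above, while the helper function itself and the concrete task and tool names in the example are hypothetical:

```python
def scenario_paths(task, lang, tool, problem):
    """Resolve the directories and files SML-Bench expects, per the
    'convention over configuration' layout described above."""
    base = f"learningtasks/{task}/{lang}"
    return {
        "data": f"{base}/data/",                    # background knowledge
        "tool_data": f"{base}/data/{tool}/",        # e.g. mode declarations
        "examples": f"{base}/lp/{problem}/",        # pos/neg example files
        "tool_conf": f"{base}/lp/{problem}/{tool}.conf",
    }

paths = scenario_paths("mutagenesis", "prolog", "aleph", "42")
print(paths["tool_conf"])
# learningtasks/mutagenesis/prolog/lp/42/aleph.conf
```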
Benchmarked Tools A similar approach is followed to integrate inductive learning tools into SML-Bench. To make a tool X available to the benchmark framework, a corresponding directory has to be created at learningsystems/X/. For a given learning system under assessment we are mainly interested in two things: 1) the learned rule or class expression and 2) a measure of how well this rule or expression performs on the provided examples. Accordingly, the benchmark process is divided into two phases: 1) the training phase and 2) the validation phase. In the training phase, the learning system generates the rules or class expressions for a particular learning problem, which are then assessed in the validation phase. Whereas the output of phase 1 might be a tool-specific representation of the learned description, the validation output has to follow a fixed pattern, quantifying the number of true positive, false positive, true negative and false negative examples covered. Since the integration of all these particularities into the core framework would render it inflexible and hardly extensible, we rely on a wrapper-based interface to the respective learning systems: for each phase, a dedicated executable has to be provided. One executable runs the learning system to generate the rules or class expressions, which are then assessed by a validation executable producing the standardised output. These executables have to be named run and validate, respectively, and can be written in any programming language (cf. Figure 1).
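As an illustration, a validate wrapper could be as simple as the following sketch. The two-executable contract (run and validate) and the four reported counts are taken from the description above; the concrete output syntax shown here is an assumption, since the text does not fix it.

```python
#!/usr/bin/env python3
# Sketch of a hypothetical 'validate' executable for a learning system wrapper.

def confusion_counts(covered, positives, negatives):
    """Compare the examples covered by the learned rule or class
    expression against the gold-standard example sets."""
    tp = len(covered & positives)
    fp = len(covered & negatives)
    fn = len(positives - covered)
    tn = len(negatives - covered)
    return tp, fp, tn, fn

if __name__ == "__main__":
    # In a real wrapper, 'covered' would be computed by applying the
    # hypothesis learned in the training phase to the validation examples.
    tp, fp, tn, fn = confusion_counts({"e1", "e2"}, {"e1", "e2", "e3"}, {"e4"})
    print(f"tp: {tp}")
    print(f"fp: {fp}")
    print(f"tn: {tn}")
    print(f"fn: {fn}")
```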

Benchmark settings
To generate a custom benchmark on a selection of tools and learning problems, a global configuration has to be provided defining which tools to run, which learning problems to tackle, and, optionally, additional benchmark-specific tool configurations. The framework then executes the run and validate executables with the corresponding configurations of all selected tools. SML-Bench supports arbitrary train-test splits, n-fold cross validation, as well as running the training and validation on the whole set of examples. The actual execution can be performed in parallel threads or sequentially. A simple benchmark configuration snippet is shown in Listing 1. SML-Bench supports the generation of semantic descriptions of a benchmark setup based on the MEX vocabulary [7]. Such descriptions do not only comprise general configuration aspects as shown in Listing 1, but cover all the details needed to comprehend the benchmark settings. This includes detailed specifications of the executed tools together with their runtime configurations, details about the data used in the benchmark, and the actual evaluation results.
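The n-fold cross validation mode can be pictured with the following sketch (SML-Bench's actual implementation is in Java; the fold-assignment strategy shown here is only illustrative):

```python
def cross_validation_folds(examples, n):
    """Partition the examples into n folds; each fold serves once as the
    validation set while the remaining folds form the training set."""
    chunks = [examples[i::n] for i in range(n)]
    for i in range(n):
        train = [e for j, c in enumerate(chunks) if j != i for e in c]
        yield train, chunks[i]

# 10 examples, 5 folds: each run trains on 8 examples and validates on 2.
for train, test in cross_validation_folds(list(range(10)), 5):
    print(len(train), len(test))
```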

Available learning systems
In its current state, SML-Bench supports eight inductive learning tools, collected during our literature review. A tool was introduced into SML-Bench as a learning system if it implements a published inductive learning algorithm, is freely available, and is sufficiently documented.
The oldest of the available learning systems is the classic ILP tool Golem [21], which was published in the year 1990 and implements an induction approach based on Relative Least General Generalisations. Golem supports a Prolog-based knowledge representation language. Another, slightly more recent ILP tool called Progol [20] uses inverse entailment to derive covering clauses based on examples and background knowledge given as Prolog-like logic programs. An ILP tool completely implemented in Prolog is Aleph, which supports a number of ILP algorithms. Similarly, the General Inductive Logic Programming System (GILPS) comprises several Prolog programs realising different inductive learning approaches. FuncLog [24], one tool of the GILPS collection, is specialised in learning on Head Output Connected learning problems. Besides this, the tools TopLog [23] and ProGolem [22], which are based on Top Directed Hypothesis Derivation and Asymmetric Relative Minimal Generalisations, respectively, are also part of GILPS and supported in SML-Bench.
In the field of description logics-based knowledge representation, several algorithms are integrated via one tool: DL-Learner [16] is a framework for inductive learning on RDF and OWL-based background knowledge. It supports a wide range of algorithms, including refinement operator-based algorithms and evolution-inspired approaches, as well as different OWL profiles.
In terms of Statistical Relational Learning (SRL) tools, we considered RapidMiner (https://rapidminer.com) and TreeLiker (http://ida.felk.cvut.cz/treeliker/TreeLiker.html) [13]. Unfortunately, we were not able to integrate RapidMiner since its server component imposed requirements that were not fulfilled in our overall workflow (see https://github.com/AKSW/SML-Bench/issues/14 for more details). However, we provide experimental support for the TreeLiker tool, which also works on background knowledge expressed in a Prolog-like syntax and contains implementations of different SRL algorithms.

Evaluation
To evaluate our framework, we ran SML-Bench on the available learning problems with 10-fold cross validation, excluding Suramin due to its small number of examples. In the case of DL-Learner and TreeLiker, we executed a set of available algorithms, introduced in the following. The OWL Class Expression Learner (OCEL) algorithm, which is part of DL-Learner, is a refinement operator-based learning algorithm using heuristics to guide the search. An evolution of OCEL which is more biased towards short and human-readable concepts is the Class Expression Learning for Ontology Engineering (CELOE) algorithm [17]. A third implementation provided by the DL-Learner framework is the EL Tree Learner (ELTL) algorithm, which is restricted to OWL EL as target concept language.
The TreeLiker tool can be configured to utilise a block-wise construction of tree-like relational features (RelF) [15], a hierarchical feature construction (HiFi) [14], or a Gaussian Logic-based algorithm (Poly) [12] for classification. These three algorithms can also be run in a grounding-counting setting (GC), considering the number of examples covered by a generated feature during learning. Since TreeLiker works on a Prolog-like knowledge representation language that does not support certain Prolog expressions, we could not assess it on the Mutagenesis and NCTRER datasets.
All learning systems were executed using default configurations, except for DL-Learner running OCEL, which requires a noise value to be set to allow a certain number of misclassifications. Thus, we set the noisePercentage parameter to 30.
We set an overall maximum execution time of 300 seconds and executed all tools sequentially. The benchmark was performed on a machine with 2 Intel Xeon 'Broadwell' CPUs with 8 cores running at 2.1 GHz and with 128 GB of RAM. A benchmark description based on the MEX RDF vocabulary can be found at http://aksw.org/Projects/SMLBench.html. The benchmark results are summarised in Table 5 and Table 6. Besides their average accuracies and F-scores, we also report when nothing (or just the trivial solution listing all the input examples) could be learned (no results), when learning systems could not finish within the given 300 seconds (timeout), or when learning systems ran out of memory (out of mem.).
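For reference, the reported measures follow the standard definitions over the validation counts; a small sketch (the framework's own reporting code is in Java):

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of correctly classified examples."""
    return (tp + tn) / (tp + fp + tn + fn)

def f_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(40, 5, 45, 10))  # 0.85
print(f_score(40, 5, 45, 10))   # ~0.842
```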
Since we executed all learning systems in their default settings, the results might not show a tool's optimal performance. Rather, they reflect whether a tool performs well 'out of the box' and whether the executed algorithms fit the particular learning scenarios. This also means that highly specialised algorithms might not work well on all of the learning scenarios, or even that our learning problems lack certain expected characteristics an algorithm requires to function well. Thus, to provide a meaningful benchmarking environment, we would expect to see that 1) not all learning problems can be 'solved' easily in default settings and 2) the learning problems are able to distinguish the field of competitors, i.e. that they point out certain strengths and weaknesses of the benchmarked systems.

Table 5
Evaluation results of an SML-Bench benchmark run. All tools were run with a maximum execution time of 5 minutes. Reported are the average accuracy and its standard deviation of 10-fold cross validation.

A first observation we made is that we seem not to have a suitable learning scenario which would benefit from FuncLog's specialisation in learning with head output connected predicates. Since it did not return learned rules on any of the learning problems, we did not list FuncLog in Table 5 and Table 6.
Another observation is that Aleph, CELOE and OCEL already provide default settings which work well on the learning problems and lead to very good results on mutagenesis/42, premierleague/1 and pyrimidine/1.
Taking into consideration that Golem was implemented in the 1990s, one possible explanation for its lower performance could be that its default settings reflect hardware expectations, in terms of available memory and computing power, that have since been superseded. This might also apply to constants defined in Golem's source code. Hence, adjusting settings to current hardware capabilities might make a considerable difference here. A similar argument might apply to Progol. Besides this, Golem's mode declarations do not provide means to express explicitly which predicates should appear in the head of a learned rule. This might be an explanation for some of the results that do not provide a description for the given examples at all, as in the case of nctrer/1, where we observed learned rules like bound_atom(A,B) :- first_bound_atom(A,atom_232_2) that should actually characterise molecules, i.e. positive examples like molecule(molecule13).
Progol and ProGolem appear to be overly curtailed by the restricted execution time. In those cases, the algorithm itself may be very suitable for the learning problems, but increasing the time limit further would lead to prohibitive runtimes of the evaluation scenario. Currently, the maximum runtime is approximately 100 hours (10 folds × 8 learning problems × 15 configured learning systems × 5 minutes).
In their default settings, the GILPS tools ProGolem and TopLog often returned identical results. Even though we can only report a low performance on all the learning problems, the authors of ProGolem and TopLog published experiments showing much better results on Carcinogenesis, Pyrimidine and Mutagenesis [23,22]. This might emphasise that a proper configuration substantially impacts the tools' performance, or suggest that differing versions of the respective datasets were in use.
For the TreeLiker algorithms, we can also observe that different settings might give identical results. The low performance can be attributed to the fact that TreeLiker is a collection of feature construction algorithms, and to the way we use it in our benchmark framework. Since the TreeLiker algorithms usually produce a high number of features, we currently only consider the best one for our evaluation. This might not fully exploit TreeLiker's capabilities, and we are in contact with one of the tool authors to improve this.
Even though the premierleague/1 learning problem is large in terms of background knowledge, with more than 200 thousand axioms, Aleph was able to learn almost perfect results. For TreeLiker, we gradually increased the maximum available JVM heap size up to 10 GB. Increasing it even further might also yield results for its algorithms. However, since other Java implementations could generate results with 2 GB of available maximum heap size, we stopped there.
Overall, the evaluation supports our initial expectations. However, we also have to admit that some of the learning systems need to be adjusted properly to provide competitive performance. This will be discussed in the following.

Discussion
With SML-Bench, we built a benchmarking framework that is extensible and comes with a set of initial scenarios to evaluate arbitrary inductive learning tools. As shown in the previous section, the provided learning problems are able to discriminate the performance of different learning systems, but the suite is not complete in the sense that we are lacking some datasets tailored to particular capabilities of certain tools (in particular FuncLog). We also believe, and verified this in some cases manually, that most of the results can be improved by spending more effort on configuring the learning systems, hence generating more competitive results. We already tried to get in contact with the tool authors to support this, but only got a reply from one of the TreeLiker developers. In the future, we may allow an explicit parameter tuning phase, e.g. via nested cross validation or explicit tuning examples, in our benchmark for systems which are capable of this functionality.

Apart from the issues revolving around system configuration, the literature survey and the actually converted datasets have shown that datatype properties are widely used in many learning scenarios. However, the tools currently do not fully exploit this part and focus more on the structured components. In this sense, the tools would benefit from deeper structures in the datasets. To this end, we will work on further enriching the datasets in this direction where possible. Of course, doing so requires considerable effort and extensive domain knowledge (e.g. in chemistry or genetics). Through the community feedback we obtain, we will continuously extend and refine the learning problem library. SML-Bench is part of a funded research programme, and benchmarking challenges will be presented in a series of (yet to be finally determined) venues. We will use those as a feedback channel.
A further issue that needs to be discussed is the representation of knowledge in different KR languages. Most of the design decisions for the dataset conversions were made individually, since an automatic conversion would not yield a satisfactory modelling result exploiting the strengths of Prolog or different OWL profiles, even in cases where it is theoretically possible. This can potentially lead to a bias, since particular modelling choices may lead to different solutions provided by the tools. While we acknowledge this problem, we do not see a straightforward solution, and we also believe that to some extent this could spur competition in terms of finding appropriate KR languages or dialects to support in inductive learning.
In our opinion, the availability of SML-Bench will improve the state of the art for symbolic machine learning from expressive background knowledge in the coming years. While many efforts in this field date back to the early 90s, only in the past years has the availability of data increased significantly. However, it was still challenging for individual researchers or small groups to perform a comprehensive benchmark. We believe to have closed this gap, which could in turn lead to similar improvements as we have seen for question answering, link discovery and query performance for RDF.

Conclusion and Outlook
In this paper, we have presented SML-Bench, a benchmarking framework for structured machine learning. We performed a systematic literature survey to obtain relevant benchmark datasets. Overall, we analysed 6 890 papers, which led to 160 candidate learning problems. Nine of those were converted across the used KR languages and set up for all learning systems. Currently, 8 learning systems are integrated, with two further inclusion requests in the works. The first analysis presented here has identified some shortcomings of individual tools. Generally, we believe that a mature research area requires a benchmark to evolve further. In particular, we want to contribute to bringing the Semantic Web and machine learning areas closer together. We also aim to reduce the boundaries between knowledge representation languages as well as between the research communities behind them. We further envision that SML-Bench could in the future evolve into a central hub for comparing suggested tool settings, learning problems, and performances.
We will perform further analyses and regular benchmarking runs in the scope of the HOBBIT project, which will fund this benchmarking activity until the end of 2018, after which it will be taken over by the HOBBIT association. SML-Bench-based challenges are planned in further workshops, e.g. the Know@LOD workshop, to which we contributed for the past 5 years. In the future, it is likely that we will support further languages, e.g. full first-order logic based systems, combinations of rules and description logics, as well as fuzzy and probabilistic description logics and statistical relational learning systems.
Another direction for future work is the integration of means to import MEX machine learning experiment descriptions to generate benchmark configurations. In combination with our MEX export function, this would allow loading experiments from other researchers or sharing one's own benchmark settings.

Fig. 1
Overview of the SML-Bench framework. Learning problems (lp_P, red) are defined on a learning task (task_A, yellowish) and contain positive/negative examples and optional learning system configurations. An overall benchmark configuration defines which learning systems (learnsys_X, green) to run on which learning problem to produce benchmark results.

Table 1
Conferences and journals with number of surveyed papers (#P) and candidate datasets (#C)

Table 2
Surveyed benchmark dataset repositories with their number of datasets (#D) as of March 2017, and the number of candidates (#C) that could be used in our framework.

Table 3
Overview of the datasets that are part of the SML-Bench framework

Table 6
Evaluation results of an SML-Bench benchmark run. All tools were run with a maximum execution time of 5 minutes. Reported are the average F-score and its standard deviation of 10-fold cross validation.