1. For what purpose was the dataset created?
It was created to test heuristics for choosing a variable ordering in cylindrical algebraic decomposition (CAD). As noted in the answer to question 6, the dataset may not be balanced for classification purposes, so it should not be used directly for machine learning.

2. Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?
Tereso Del Río and Matthew England at Coventry University.

3. Who funded the creation of the dataset?
Tereso Del Río is funded by Coventry University.

4. Any other comments?

5. What do the instances that comprise the dataset represent (for example, documents, photos, people, countries)?
They represent sets of polynomials.

6. How many instances are there in total (of each type, if appropriate)?
There are 1019 instances in total.
There are 406 instances for which the best ordering is the one labeled 0, 93 for 1, 135 for 2, 51 for 3, 202 for 4, and 132 for 5.
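The class counts above can be checked and turned into fractions with a few lines of Python (the counts are copied directly from this answer; the snippet is illustrative, not part of the dataset):

```python
# Class counts as reported in this datasheet: ordering label -> number of instances
counts = {0: 406, 1: 93, 2: 135, 3: 51, 4: 202, 5: 132}

total = sum(counts.values())
print(total)  # 1019

# Fraction of instances per best ordering, which makes the imbalance visible:
# label 0 covers roughly 40% of the instances while label 3 covers about 5%.
fractions = {label: n / total for label, n in counts.items()}
for label, frac in sorted(fractions.items()):
    print(f"ordering {label}: {frac:.1%}")
```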

7. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
It does not contain all possible instances. It was built from the sets of polynomials describing the three-variable problems in the QF_NRA category of the SMT-LIB library as of July 2021. In addition, instances that have the same number of cells in the CADs for all possible variable orderings are clustered and only one instance from each cluster is included in the dataset, and instances that timed out for all possible variable orderings are excluded.

8. What data does each instance consist of?
The possible orderings are referred to as follows: 0 means x_1>x_2>x_3, 1 means x_1>x_3>x_2, 2 means x_2>x_1>x_3, 3 means x_2>x_3>x_1, 4 means x_3>x_1>x_2, and 5 means x_3>x_2>x_1, where the greatest variable is projected first.
Each instance has four features. 
The first is a list of all six possible projections of the set of polynomials describing the problem. Each projection is described by the three sets of polynomials that form its three levels, one per variable. Each set of polynomials is given as a list of polynomial descriptions, where each polynomial is described as a list of monomials. Each monomial is described as a list whose first element is the coefficient and whose remaining three elements are the exponents of the three variables (x_1, x_2, x_3).
The second indicates the best variable ordering for computing a CAD of such a list of polynomials.
The third is a list of the timings for each of the possible orderings.
The fourth is a list of the extra time that an expensive heuristic such as sotd or mods would have spent computing the projections for each possible ordering.
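The ordering labels and the monomial encoding described above can be sketched in Python. The label-to-ordering mapping follows directly from the enumeration given in this answer; the polynomial example and the `evaluate` helper are hypothetical illustrations, not part of the dataset format:

```python
from itertools import permutations

# The six labels enumerate the permutations of (x_1, x_2, x_3) in
# lexicographic order, biggest variable first: label 0 is x_1>x_2>x_3,
# label 1 is x_1>x_3>x_2, ..., label 5 is x_3>x_2>x_1.
ORDERINGS = {i: perm for i, perm in enumerate(permutations((1, 2, 3)))}

# A polynomial is encoded as a list of monomials; each monomial is
# [coefficient, exponent of x_1, exponent of x_2, exponent of x_3].
# For example, 3*x_1^2*x_3 - x_2 would be encoded as:
poly = [[3, 2, 0, 1], [-1, 0, 1, 0]]

def evaluate(poly, x1, x2, x3):
    """Evaluate a monomial-list polynomial at a point (illustrative helper)."""
    return sum(c * x1**e1 * x2**e2 * x3**e3 for c, e1, e2, e3 in poly)

print(ORDERINGS[0])             # (1, 2, 3), i.e. x_1 > x_2 > x_3
print(evaluate(poly, 1, 2, 3))  # 3*1^2*3 - 2 = 7
```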

9. Is there a label or target associated with each instance?
The second feature could be used as a label or target, as it describes the best ordering.

10. Is any information missing from individual instances?
No.

11. Are relationships between individual instances made explicit (for example, users’ movie ratings, social network links)?
No.

12. Are there recommended data splits (for example, training, development/validation, testing)?
No.

13. Are there any errors, sources of noise, or redundancies in the dataset?
No.

14. Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?
It is self-contained.

15. Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)? 
No.

16. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? 
No.

17. Does the dataset identify any subpopulations (for example, by age, gender)?
No.

18. Is it possible to identify individuals (that is, one or more natural persons), either directly or indirectly (that is, in combination with other data) from the dataset?
No.

19. Does the dataset contain data that might be considered sensitive in any way (for example, data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
No.

20. Any other comments?
No.

21. How was the data associated with each instance acquired? Was the data directly observable (for example, raw text, movie ratings), reported by subjects (for example, survey responses), or indirectly inferred/derived from other data (for example, part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified?
The data was collected from the SMT-LIB library.

22. What mechanisms or procedures were used to collect the data (for example, hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?
The data was collected from the SMT-LIB library.

23. If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?
The filtering described in the answer to question 7 is deterministic; how the underlying SMT-LIB benchmarks themselves were collected is unknown.

24. Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?
SMT-LIB collected the data that was processed to create this dataset.

25. Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (for example, recent crawl of old news articles)?
Before July 2021.

26. Were any ethical review processes conducted (for example, by an institutional review board)?
Yes, Coventry University conducted an ethical review.

37. Has the dataset been used for any tasks already?
Yes, it has been used to compare human-made heuristics in a paper that is awaiting publication.

38. Is there a repository that links to any or all papers or systems that use the dataset?


39. What (other) tasks could the dataset be used for?
The dataset was built specifically for comparing heuristics, but with careful manipulation it could also be used to compare machine learning models instead of heuristics.

40. Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (for example, stereotyping, quality of service issues) or other risks or harms (for example, legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
Instances that have the same number of cells in the CADs for all possible variable orderings are clustered, and only one instance from each cluster is included in the dataset; instances that timed out for all possible variable orderings are not included.

41. Are there tasks for which the dataset should not be used? If so, please provide a description.
As mentioned in the answer to question 6, the dataset is not balanced across classes, so it should not be used for machine learning without preprocessing.
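One standard way to compensate for such imbalance is to weight each class inversely to its frequency. A minimal sketch, using the counts from the answer to question 6 (the weighting scheme is a common convention, not something prescribed by this dataset):

```python
# Per-class weights inversely proportional to class frequency,
# a common preprocessing step for imbalanced classification.
# Counts are taken from the answer to question 6.
counts = {0: 406, 1: 93, 2: 135, 3: 51, 4: 202, 5: 132}
total = sum(counts.values())
n_classes = len(counts)

# weight = total / (n_classes * count): rarer classes get larger weights
weights = {label: total / (n_classes * n) for label, n in counts.items()}
print(weights[3] > weights[0])  # True: label 3 (51 instances) is the rarest
```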

42. Any other comments?
No.

43. Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created?
No.

44. How will the dataset be distributed (for example, tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
45. When will the dataset be distributed?

46. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
No.

47. Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
No.

48. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
No.

49. Any other comments?
No.

50. Who will be supporting/hosting/maintaining the dataset?
Tereso Del Río.

51. How can the owner/curator/manager of the dataset be contacted (for example, email address)?
By email: delriot@uni.coventry.ac.uk

52. Is there an erratum?
No.

53. Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)?
No.

54. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?
It does not relate to people.

55. Will older versions of the dataset continue to be supported/hosted/maintained?
There are no older versions.

56. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers?
Contributors should refer to the SMT-LIB library.

57. Any other comments?
No.


Datasheet created according to "Datasheets for datasets" by Gebru et al. https://dl.acm.org/doi/10.1145/3458723