This readme.txt file was generated on 2021-12-08 by William J. Foster

GENERAL INFORMATION

1. Title of Dataset: Machine learning identifies ecological selectivity patterns across the end-Permian mass extinction

2. Author Information
	A. Principal Investigator Contact Information
		Name: William J. Foster
		Institution: Universität Hamburg
		Address: Bundestraße 55, 20146 Hamburg, Germany
		Email: william.foster@uni-hamburg.de

3. Date of data collection: 2018-03, last updated 2021-03

4. Geographic location of data collection: South China

5. Information about funding sources that supported the collection of the data: This project was funded by Geo.X grant SO_087_GeoX

SHARING/ACCESS INFORMATION

1. The data and Python scripts are publicly available in a usable format via GitHub (https://github.com/PaleoML/permian-selectivity), and non-programmers can easily run the code and access the data via Binder (https://mybinder.org/v2/gh/PaleoML/permian-selectivity/main).

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: We used a database of all genera of marine invertebrates, conodonts, and calcareous algae recorded from the late Permian to the Middle Triassic in South China, downloaded from the Paleobiology Database and supplemented with additional data from the literature.

2. Methods for processing the data: During vetting, undetermined and informal genera were excluded from the database. In addition, individual species were checked to ensure that they were not represented within multiple genera through taxonomic synonymy, and the most up-to-date generic identification of each species was followed.

3. Instrument- or software-specific information needed to interpret the data: Scripts were coded in Python and can be run using Jupyter Notebook.

DATA-SPECIFIC INFORMATION

1.
Datasets: There are four datasets, labelled Time_Interval1, TimeInterval2, TimeInterval3, and TimeInterval4, which correspond to the Wuchiapingian, the pre-extinction Changhsingian, the extinction interval, and the post-extinction Griesbachian, respectively.

2. Dataset variables:
	*taxa: the name of each genus included in the analysis (one genus per row).
	*extinct: whether the genus went extinct (1) or survived (0) in that time interval.
	*occurrences: the number of occurrences in the database for the respective genus in that time interval.
	*NoSpecies: the number of species the respective genus has in that time interval.
	*NoSpeciesLN: 'NoSpecies' after natural log transformation.
	*K_Numeric: the physiology of each respective genus. See Table 1 in the article.
	*Min_Numeric: the mineralogy of each respective genus. See Table 1 in the article.
	*C_Numeric: the carbonate load of each respective genus. See Table 1 in the article.
	*C_Cnumeric: the original calculated value of the carbonate load for each respective genus.
	*S_Numeric: the body size of each respective genus. See Table 1 in the article.
	*SC_Numeric: the original calculated value of body size for each respective genus.
	*O_Numeric: the ornamentation of each respective genus. See Table 1 in the article.
	*T_Numeric: the tiering of each respective genus. See Table 1 in the article.
	*M_Numeric: the motility of each respective genus. See Table 1 in the article.
	*R_Numeric: the respiratory protein of each respective genus. See Table 1 in the article.
	*Re_Numeric: the reproduction mode of each respective genus. See Table 1 in the article.
	*Phylum: the phylum of each respective genus.
	*MinD_Numeric: the minimum depth each respective genus ranges to. See Table 1 in the article.
	*MaxD_Numeric: the maximum depth each respective genus ranges to. See Table 1 in the article.
	*B_Numeric: the bathymetric range, i.e. the number of depositional settings each respective genus ranges into.
	*System_Numeric: the primary lithology in which each genus is found: (1) carbonate, (2) siliciclastic, (3) silica.

3. Python scripts:
	01_machine_learning_gradient_boosting.ipynb: the machine learning (gradient boosting) code used for this study.
	01_machine_learning_logistic_regression.ipynb: uses the logistic regression algorithm to investigate its utility as an alternative to CatBoost.
	01_machine_learning_random_forest.ipynb: uses the random forest algorithm to investigate its utility as an alternative to CatBoost.
	02_visualization.ipynb: code to visualise the results of the machine learning algorithm.
	03_visualization_individual_TimeIntervals.ipynb: code to visualise the results of the machine learning algorithm for the individual time intervals.
	04_recursive_feature_elimination.ipynb: code to test whether the model can be improved by removing features from the algorithm.
	05_multicollinearity_analysis.ipynb: code to determine whether variables correlate with one another.
	06_explaining_tree_based_model_predictions.ipynb: code for the SHAP summary and decision plots.
	07_dataset_imbalance.ipynb: code to assess the class imbalance in the data and investigate the quality of the model predictions.