googlenews-vectors-negative300 Stock Market Data #Charlottesville 40-year AVHRR record of visible channel Rrs and coccolithophorid blooms, links to netCDF files datasets.zip RAW files X-ray crystallography data Figure S2 Ferndale Bog loss-on-ignition dataset NN training and testing Air_Quality Google-Landmarks Dataset Biodiversity in National Parks New York Times Best Sellers VDiscovery Geothermal Geodatabase for Routt Hot Springs, Routt County, Colorado International airline passengers CLAAS-2: CM SAF CLoud property dAtAset using SEVIRI - Edition 2 Datasets for outlier detection cell differentiation tree Appendix S2 Images of Lego Bricks Single locus analysis Excel Dataset Table_S2 Clustering analysis of microarray data Radiocarbon in CO2 and Soil Organic Matter from Laboratory Incubations, Barrow, Alaska, 2014 USDA plant database GloVe 6B Russian Troll Tweets Genomic datasets Figure3A Chlorophyll a and Chlorophyll c all-words Salivary sTREM-1 and PGLYRP-1 PCA dataset Intraday-Data Solar System Features Powerlifting Database Amazon Fine Food Reviews World Cities Spatial distribution of a flying seabird (Antarctic petrel) and penguins (Adélie penguin, Emperor penguin) in the wider Weddell Sea (Antarctica) with links to ArcGIS map packages Raw MSE data Cycling Metrics Human proteins interactions Interactive Locomotion OFC data Data.xlsx Adaptive Incremental Mixture Markov Chain Monte Carlo Yelp 2015 Star Cluster Simulations Breast Histopathology Images Crypto Currencies Additional File 4: RuStance Chicago Crime Strong Rotational Anisotropies Affect Nonlinear Chiral Metamaterials What we are trying to do New draft item Fig 3a. 
sales of shampoo Python-scripts Rooftop Energy Potential of Low Income Communities in America REPLICA Supplementary Table 9 Empathic accuracy Bollywood Movie Dataset The Movies Dataset synthetic_dataset ACTINN THzSecurityImageDataset BOLD5000 horses for courses Flickr Image dataset 15/2 pollen surface sample dataset The Social Life of Data What people purchase Metadata record for: Reference gene and small RNA data from multiple tissues of Davidia involucrata Baill THz Security Image Dataset Orthologous groups Seattle Pet Licenses 100 phylogenies High resolution global grids of revised Priestley-Taylor and Hargreaves-Samani coefficients for assessing ASCE-standardized reference crop evapotranspiration and solar radiation, links to ESRI-grid files, supplement to: Aschonitis, Vassilis G; Papamichail, Dimitris; Demertzi, Kleoniki; Colombani, Nicolo; Mastrocicco, Micol; Ghirardini, Andrea; Castaldelli, Giuseppe; Fano, Elisa-Anna (2017): High-resolution global grids of revised Priestley-Taylor and Hargreaves-Samani coefficients for assessing ASCE-standardized reference crop evapotranspiration and solar radiation. Earth System Science Data, 9(2), 615-638 Disk Space Data Wikidata Property Ranking', 'Relevance judgments for properties of 350 Wikidata entities Study 1 stimuli How many samples are needed to prove the absence of contamination - an example using arsenic? 
G-03 dataset 4 Understanding of Pain NTU Dataset JournalInformation Queries: DBpedia New York City population Traffic accident severity sex-classification PatientDiagnosis Supplemental Table 6 CO2 and CH4 Production and CH4 Oxidation in Low Temperature Soil Incubations from Flat- and High-Centered Polygons, Barrow, Alaska, 2012 Snow Cover Fraction (SCF) and snow depth obtained using terrestrial photography (2009-2013) in the control area Refugio Poqueira (Sierra Nevada, Spain), supplement to: Pimentel, Rafael; Herrero, Javier; Polo, María José (2017): Subgrid parameterization of snow distribution at a Mediterranean site using terrestrial photography. Hydrology and Earth System Sciences, 21(2), 805-820 New draft item New draft item MNIST Digit Recognition Computational Imaging Relevance assessment Optimized implementations of voxel-wise degree centrality and local functional connectivity density mapping in AFNI The Genomes Mapserver Phylogenomic Supermatrices 15/1 pollen surface sample dataset Incidence dataset Student Feedback Dataset Chat 80 smsspamcollection The E2E Challenge Dataset Speed Dating Experiment Minimal dataset Creating Customer Segments Model output data Linked Data Platforms GPCP Version 2.2 Combined Precipitation Data Set Captcha Images Spam Text Message Classification Data for both species raw mapping data IMDB Movie Review Jester Collaborative Filtering Dataset YouTube Comedy Slam UCI Cardiotocography person.csv Smart Home Scenarios Example dataset Baby data Micro-Loans Hospital Charges for Inpatients Supplementary Data 1 FocaLens Medium Articles Bank Marketing Dataset Who starts and who debunks rumors', 'Webpages cited by rumor trackers Style Color Images thyroid CT images No Data Sources train.csv Improved estimate of global gross primary production for reproducing its long-term variation, 1982-2017 Richness in ecosystem services Data discovery and re-use Query reformulation chromosome number polymorphism ISCO-08 A dataset of 30-meter annual 
vegetation phenology indicators (1985-2015) in urban areas of the conterminous United States The Million Song Dataset Dataset Descriptions Minutiae Sample Arabic Handwritten Characters Dataset AnimalDataset Blood donation in Brazil An example of PSCP files in which multiple microarray datasets were analyzed simultaneously Malicious and Benign Websites wsdream Additional File 10: Meteorological dataset GoodReads Dataset Transect database Fungi Dataset Movie Industry Question Answering Data LSDO Prospective Harmonization concatenated mtDNA dataset GloVe: Global Vectors for Word Representation Data3 Spearman correlation coefficient matrix Global Fuelbed Dataset Black Friday Failure dataset Employee Attrition EmBeD data Behavioral Risk Factor Surveillance System MNIST FASHION Thermal soccer dataset Supplementary Data 7 Medical Appointment Mortality Curves SMS dataset Lahman Baseball Database Piezoelectric Tensor Data DueCredit: automated collection of citations for software, methods, and data Spearman correlation coefficient analysis Japanese-English Bilingual Corpus Supplemental Table 5 Office Supply Sales Dataset I CelebA resized Using CALIOP to estimate cloud-field base height and its uncertainty: the Cloud Base Altitude Spatial Extrapolator (CBASE) algorithm and dataset LORIS: DICOM anonymizer population genetic dataset Our overall approach Electronic supplementary material, Dataset S2 Developmental trajectories in brain development IUGR dataset WOCE-Argo Global Hydrographic Climatology (WAGHC Version 1.0) 2009 data Visualization 2 Supraglacial Debris Cover Two loci analysis New draft item Kaggle Datasets Indirect Food Additives Venues in New York City HR 114 pollen surface sample dataset A global gridded data set on tillage Air Passengers Movie Dataset Fruits 360 dataset New draft item Geology datasets in North America, Greenland and surrounding areas for use with ice sheet models, supplement to: Gowan, Evan J; Niu, Lu; Knorr, Gregor; Lohmann, Gerrit (2019): 
Geology datasets in North America, Greenland and surrounding areas for use with ice sheet models. Earth System Science Data, 11(1), 375-391 R-code and datasets Scottish Borders Roads Indian Pines Hyperspectral Dataset Suicides in India Horse Racing Data Museum of Modern Art Collection Snow cover data across Nordenskiöldland, Svalbard, from point measurements during 2014-2016, supplement to: Möller, Marco; Möller, Rebecca (in review): Snow cover variability across glaciers in Nordenskiöldland (Svalbard) from point measurements in 2014–2016. Earth System Science Data Discussions, 1-16 Concatenated multigene aignments BASE calculations, supplement to: Searchinger, Tim D; Wirsenius, Stefan; Beringer, Tim; Dumas, Patrice (2018): Assessing the efficiency of changes in land use for mitigating climate change. Nature, 564(7735), 249-253 The transcriptome assembly Journal Data Shifted cumulative regulation dataset from TCGA Additional File 2: Names Corpus Metadata record for: CommonMind Consortium provides transcriptomic and epigenomic data for Schizophrenia and Bipolar Disorder Searches and filters Soil hydrological data GWAS summary statistics Supplementary table 2 Random Forest Code USArrests Yelp 2013 PSets Dielectric Constant Data Steady state simulation and operation planning of integrated energy supply systems RumourEval 2019 data Non-synchronized HeLa cells Twitter Threads Crime in Atlanta ENEM 2015 Lung Nodule Malignancy nba draft Company Reviews Dataset Summary Bug Triaging Prediction of Pathological Stage in Patients with Prostate Cancer: A Neuro-Fuzzy Model New York Stock Exchange Cat and Dog Nobel Prize winners China Merged Surface Temperature, supplement to: Yun, Xiang; Huang, Boyin; Cheng, Jiayi; Xu, Wenhui; Qiao, Shaobo; Li, Qingxiang (in review): A new merge of global surface temperature datasets since the start of the 20th Century. 
Earth System Science Data Discussions, 1-44 NOAA-CIRES-DOE Twentieth Century Reanalysis Version 3 Figurnoye Lake pollen dataset World Values Survey Latitudinal gradient in seed dispersal distance MIT-BIH Arrhythmia Database Crimes in Boston Yelp 2014 RASH evaluation New draft item Aretha Franklin Reconciliation of quantum local master equations with thermodynamics A cortical surface-based geodesic distance package for Python DataA1 NCEP ADP ETA / NAM Upper Air Observation Subsets The International Surface Pressure Databank version 3 loghub Physical Characteristics of Comets Geostrophic Currents in the northern Nordic Seas - A Combined Dataset of Multi-Mission Satellite Altimetry and Ocean Modeling (data), supplement to: Müller, Felix L; Dettmering, Denise; Wekerle, Claudia; Schwatke, Christian; Bosch, Wolfgang; Seitz, Florian (in review): Geostrophic Currents in the northern Nordic Seas - A Combined Dataset of Multi-Mission Satellite Altimetry and Ocean Modeling. Earth System Science Data Discussions Transplant experiment Accounting Network female survival Molluscan Shell Matrix Proteins Yelp Dataset: A trove of reviews, businesses, users, tips, and check-in data!
Supplementary Figure 2 Toothaker Pond pollen surface sample dataset Dataset for "The effect of acute hypohydration on glycemia in healthy adults" The simulated sEMG signals Water Analyte Concentrations Global Shark Attack File Carsales Solar and Lunar Eclipses Imagenet32 Study 3 data Metadata record for: The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies Fisherman Lake pollen surface sample dataset 20 Newsgroups R Codes Post-Operative Patient Data Set Sets of omnidirectional images ABC dataset Figure S4 Image Dataset for Object Recognition Supplemental Dataset 1 R data file Aufeis (naleds) of the North-East of Russia: GIS catalogue for the Indigirka River basin, supplement to: Makarieva, Olga; Shikhov, Andrey; Nesterova, Nataliia; Ostashov, Andrey (2019): Historical and recent aufeis in the Indigirka River basin (Russia). Earth System Science Data, 11(1), 409-420 Appendix S1 Lower Back Pain Symptoms Dataset Bike Sharing Democrat Vs. Republican Tweets', '200 tweets of Dems and Reps GTZAN music/speech collection Boston housing dataset Pima Indians Diabetes Database', 'Predict the onset of diabetes based on diagnostic measures Caricature Image Demand Dataset rawdata M3500 Supplementary Table 1-2 Multi-source global wetland maps combining surface water imagery and groundwater constraints IMDB data GRUN : Global Runoff Reconstruction Handwritten Names Bitcoin Dataset Wikipedia Article Titles GRACE-REC: A reconstruction of climate-driven water storage changes over the last century Motifs Data SPEECH-COCO Chiangsaen Metadata record for: De novo transcriptome assembly and analysis of the freshwater araphid diatom Fragilaria radians, Lake Baikal ArXiV Archive Adult income dataset BBC News Summary Ships in Satellite Imagery A global database of radiogenic Nd and Sr isotopes in marine and terrestrial samples (V. 
2.0) A novel nanozyme assay utilising the catalytic activity of silver nanoparticles and SERRS ACL Accepted Papers Microsatellites dataset Multivariate alternating decision tree 80 Cereals Student Marks Input and Output Files Community phylogeny Analysis of complete dataset Data.xlsx Air quality Dataset Histogram Inputs Uncompressed version NFL Statistics FMC dataset Reconciliation Vocabulary Olivetti Faces 10Knots Retail Sales Forecasting Study 1 data Melbourne Housing Supplemental Dataset 8 Cryptocurrency Historical Prices datasets.tar.gz Dataset2 Finding and Measuring Lungs in CT Data train6 Batch Effects Correction with Unknown Subtypes Time series of streamflow occurrence from 182 sites in ephemeral, intermittent and perennial streams in the Attert catchment, Luxembourg Readme.txt Iris dataset A schematic view of the procedure bank-marketing Additional File 1: Table S1. ENEM 2016 movie lens Color Rendition Characteristics Multidimensional Poverty Measures testcar Wisconsin Breast Cancer Dataset Pima Indian Diabetes Data InteractiveSegmentation Facial keypoints African Elephants GPCP Version 2.3 Monthly Analysis Product BRFSS 2001-2010 Data Sharing, Distribution and Updating Using Social Coding Community Github and LaTeX Packages in Graduate Research Metadata record for: Longitudinal dataset of human-building interactions in U.S. 
offices Observations of sea turtles Supplemental Dataset Bibliographic Dataset Data4 Reference in the dataset Boston Housing Visualization 3 Lecture capture survey Social Network Ads ChinaCropPhen1km: A high-resolution crop phenological dataset for three staple crops in China during 2000-2015 based on LAI products Mitochondrial DNA sequences Allele files Analyzing data from the digital healthcare exchange platform for surveillance of antibiotic prescriptions in primary care in urban Kenya: a mixed-methods study Avocado Prices Metadata record for: A dataset of cetacean occurrences in the Eastern North Atlantic LiveStreaming Frames from video The phenotype gap Introduction to Machine Learning Dataset S1 Hay Lake pollen dataset Deriving Canada-wide soils dataset for use in Soil and Water Assessment Tool (SWAT), supplement to: Cordeiro, Marcos R C; Lelyk, Glenn; Kröbel, Roland; Legesse, Getahun; Faramarzi, Monireh; Masud, Mohammad Badrul; McAllister, Tim (2018): Deriving a dataset for agriculturally relevant soils from the Soil Landscapes of Canada (SLC) database for use in Soil and Water Assessment Tool (SWAT) simulations. Earth System Science Data, 10(3), 1673-1686 European Soccer Database Overwatch PPG-BP Database Submerged sand deposits data from Western Sardinia, Mediterranean Sea organised in an interoperable Spatial Data Infrastructure, supplement to: Brambilla, Walter; Conforti, Alessandro; Simeone, Simone; Carrara, Paola; Lanucara, Simone; De Falco, Giovanni (2019): Data set of submerged sand deposits organised in an interoperable spatial data infrastructure (Western Sardinia, Mediterranean Sea). 
Earth System Science Data, 11(2), 515-527 JS Database Wikipedia Edits Bitcoin Historical Data London Crime LinkedIn Profile Data Global Causes of Mortality Elastic Tensor Data Metadata record for: Temporary dense seismic network during the 2016 Central Italy seismic emergency for microzonation studies New draft item International Financial Statistics Fashion Mnist Supplementary table 5 The Global Energy Balance Archive (GEBA) version 2017: A database for worldwide measured surface energy fluxes. Link to database files, supplement to: Wild, Martin; Ohmura, Atsumu; Schär, Christoph; Müller, Guido; Folini, Doris; Schwarz, Matthias; Hakuba, Maria Z; Sanchez-Lorenzo, Arturo (2017): The Global Energy Balance Archive (GEBA) version 2017: a database for worldwide measured surface energy fluxes. Earth System Science Data, 9(2), 601-613 Electronic supplementary material, Dataset S3 iris.csv Quotes Dataset Online Job Postings Missing People Nigeria dishes Top 100 2017 Users Data Datasets used in this study Snow cover maps (C1) of Guadalfeo Monitoring Network (Sierra Nevada, Spain), supplement to: Polo, María José; Herrero, Javier; Pimentel, Rafael; Pérez-Palazón, María José (2019): The Guadalfeo Monitoring Network (Sierra Nevada, Spain): 14 years of measurements to understand the complexity of snow dynamics in semiarid regions. Earth System Science Data, 11(1), 393-407 Predict Angina Floating Island Lake pollen dataset FROM-GLC-Hierarchy Raw Data.xlsx Video Games Review Intel Xeon Scalable Processors Fish Relatedness Seattle Office for Civil Rights Megapool RAW_DATA Loans data Growth characteristics of Dahurian larch (Larix gmelinii) in northeast China during 1965-2015, supplement to: Jia, Bingrui; Zhou, Guangsheng (2018): Growth characteristics of natural and planted Dahurian larch in northeast China. 
Earth System Science Data, 10(2), 893-898 Data for: "A synthetic map of the northwest European Shelf sedimentary environment for applications in marine science" Properties of PPI networks Keras Models Generating music with resting-state fMRI data Pavia University Hyperspectral dataset Crowding and Subitizing ERA5 Reanalysis ESM file Game of Thrones Iris Data Set SNP datasets Fecal hormones sift data Modern dataset Metadata record for: De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups Glacier inventory of Pamir and Karakoram, link to GIS files, supplement to: Mölg, Nico; Bolch, Tobias; Rastner, Philipp; Strozzi, Tazio; Paul, Frank (2018): A consistent glacier inventory for Karakoram and Pamir derived from Landsat data: distribution of debris cover and mapping challenges. Earth System Science Data, 10(4), 1807-1827 Visualization 4 Hotel review Daily temperature data from the Foothills Climate Array Mesonet, Canadian Rocky Mountains, 2005-2010, supplement to: Wood, Wendy H; Marshall, Shawn J; Fargey, Shannon E; Whitehead, Terri L (2018): Daily temperature records from a mesonet in the foothills of the Canadian Rocky Mountains, 2005-2010. 
Earth System Science Data, 10(1), 595-607 JSON File Membrane feeding assays morphological_data heterozygosity-fitness Pedestrian Dataset Data for 27 countries Geothermal Geodatabase for Rico Hot Springs Area and Lemon Hot Springs, Dolores and San Miguel Counties, Colorado Bedroom air temperatures A national dataset of annual urban extent (1985-2015) in the conterminous United States using Landsat time series data Metadata record for: The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies Football Events Internal Cases Bag of Words Meets Bags of Popcorn Dataset and code (Matlab) for recoloring images Arabic Natural Audio Dataset Small mammal dataset Data Table B 20 Newsgroups Supplementary Table 7 Visualization 1 IRLA-CL Simulation dataset Indian License Plates The exon diversity RAxML Concatenated Predicting a Biological Response Twitter sentiment analysis Lake catchment A gridded dataset of belowground autotrophic respiration from 1980 to 2012 in global terrestrial ecosystems upscaling of observations ESI-FTICR-MS Molecular Characterization of DOM Degradation under Warming in Tundra Soils from Barrow, Alaska Microsatellite genotype data web log dataset Data file S1 Job Recommendation Chinese Characters Generator Toxic Words Predicting Movie Revenue PM in Kunming Dataset II Metadata record for: Time series of heat demand and heat pump efficiency for energy system modeling input_data US Mass Shootings Video Game Sales Breast Histology Images Baseline surface radiation data (1992-2017), supplement to: Driemel, Amelie; Augustine, John; Behrens, Klaus; Colle, Sergio; Cox, Christopher J; Cuevas-Agulló, Emilio; Denn, Fred M; Duprat, Thierry; Dutton, Ellsworth G; Fukuda, Masato; Grobe, Hannes; Haeffelin, Martial; Hodges, Gary; Hyett, Nicole; Ijima, Osamu; Kallis, Ain; Knap, Wouter; Kustov, Vasilii; Lanconelli, Christian; Long, Charles; Longenecker, David; Lupi, Angelo; Maturilli, Marion; Mimouni, Mohamed; Ntsangwane, 
Lucky; Ogihara, Hiroyuki; Olano, Xabier; Olefs, Marc; Omori, Masao; Passamani, Lance; Pereira, Enio Bueno; Schmithüsen, Holger; Schumacher, Stefanie; Sieger, Rainer; Tamlyn, Jonathan; Vogt, Roland; Vuilleumier, Laurent; Xia, Xiangao; Ohmura, Atsumu; König-Langlo, Gert (2018): Baseline Surface Radiation Network (BSRN): structure and data description (1992-2017). Earth System Science Data, 10(3), 1491-1501 Hyperspectral images of tea Spatial distribution of zoobenthos (sponges, echinoderms) in the wider Weddell Sea (Antarctica) with links to ArcGIS map packages Additional-data Iris flower dataset W-band gyro-TWA Interactive Hand Gesture Charities in the United States World Marathon Majors Laptop Prices Sokoto Coventry Fingerprint Dataset (SOCOFing)', 'Sokoto Coventry Fingerprint Dataset (SOCOFing) A3130 Data.xlsx Facilitating big data meta-analyses for clinical neuroimaging through ENIGMA wrapper scripts Continuous meteorological monitoring at Cape Posillipo (Denza Institute weather station - Naples - Campania Region - Italy) during the period January 2014 - December 2018 Additional File 5 Scheduling In Cloud Computing Mnist Data Mango Transcriptome Assembly Natural Speech Dataset Dataset.xlsx Advancing open science through NiData Gender Recognition by Voice', 'Identify a voice as male or female Music notes IBI duration DM Authors TrainingInstitute Hubway Data World of Warcraft Avatar History NCAR CESM Global Bias-Corrected CMIP5 Output to Support WRF/MPAS Research Lending Club Loan Data Abalone Dataset Black Carbon measurements in Germany between 1994 and 2014, link to netCDF files, supplement to: Kutzner, Rebecca D; von Schneidemesser, Erika; Kuik, Friderike; Quedenau, Jörn; Weatherhead, Betsy; Schmale, Julia (2018): Long-term monitoring of black carbon across Germany. 
Atmospheric Environment, 185, 41-52 house price prediction oregon education APS dataset The Blazing Signature Filter Amino acid dataset Supplementary Table 3 Identifiers on the Rise in Germany Coder A Lobbying Data dataset cleaned Data available for each species TRIOML US PRESIDENTS Comic Books Images Building Management System Analysis Raw Data.xlsx House Prices dataset r/mexico Gowalla Checkins data_packet PRIMAP-crf: UNFCCC CRF data in IPCC 2006 categories Donald Trump Tweets Board Game Data Insult sets painter test Population Time Series Data Movie Genre from its Poster Data for figures Gene matrix Periodic table of the elements Wikipedia Sentences Passenger Satisfaction Tatoeba Sentences thyroid image data Taxa partition Terrestrial Water Budget Data Archive Los Angeles Addresses Dataset References Consumer Complaints MAS 5 MovieLens+IMDb Cars Data New draft item Fantasy Premier League CITES Wildlife Trade Database', 'A year in the international wildlife trade Movie Reviews Study 2 data Fig 4a. 16/1 pollen surface sample dataset car prediction Drosophila Melanogaster Genome Netflix Prize data', "Dataset from Netflix's competition to improve their reccommendation algorithm Chest X-Ray Images (Pneumonia)', '5,863 images, 2 categories Arabic Handwritten Digits Dataset Wikisource Internet Archive DBpedia OpenStreetMap Wikidata National Register of Historic Places UNESCO World Heritage Site MusicBrainz ChemIDplus Project Gutenberg MEROPS AlloCiné AllMusic Internet Broadway Database IUCN Red List Integrated Authority File Internet Movie Database Catalogue of Life GameRankings The Oxford English Dictionary International Union for Conservation of Nature Virtual International Authority File Find a Grave Who Named It? 
Fortune 500 Encyclopedia of Life National Center for Biotechnology Information Integrated Taxonomic Information System Aozora Bunko Rotten Tomatoes Beilstein database Bonn Conventio Gene Ontology ZEMA Geographic Names Information System Transporter Classification database Metacritic OmegaWiki Australian Plant Name Index Biodiversity Heritage Library PubMed Last.fm tz database WikiMapia Medical Subject Headings Google Books Research Papers in Economics BirdLife International The Zoological Record Anime News Network Box Office Mojo PubMed Central Europeana British National Corpus The European Library Online Mendelian Inheritance in Man Persée VD 17 Gracenote Collins English Dictionary archINFORM Pauline epistles Artnet Jeuxvideo.com Eurogamer freedb Fortune 1000 Polity data series AGROVOC AMIS Plus Aquatic Sciences and Fisheries Abstracts Abandonia eMedicine Abbreviationes Urban Dictionary Bibliotheca Augustana AcademiaNet Aminet Scopus Corbis ARKive Karlsruher Virtueller Katalog Windows Registry American Battle Monuments Commission World Digital Library National Diet Library AllMovie Korean Movie Database NASA/IPAC Extragalactic Database Tatoeba Discogs Choral Public Domain Library International Music Score Library Project Coffin Texts Linguist List MedlinePlus BRENDA listed building in the United Kingdom Bibliographic Ontology Ordbog over det danske Sprog MyHeritage World Register of Marine Species Art & Architecture Thesaurus Deutsche Digitale Bibliothek CIDOC Conceptual Reference Model Arachne Dublin Core The Plant List Perseus Project EURODAC Rodovid SIMBAD Deutsche Fotothek The Merck Index MetaCyc arthistoricum.net German Medical eLibrary LibraryThing Astrophysics Data System ResearchGate Registry of Toxic Effects of Chemical Substances Netherlands Institute for Art History Protein Data Bank Australian National Heritage List Austrian Literature Online Instituto Nacional de Estadística y Geografía Fossilworks Joconde Mathematics Genealogy Project GeoNames The 
Freesound Project FishBase Dictionary of Canadian Biography YouPorn WorldCat Bridgeman Art Library OKATO Bildarchiv Foto Marburg Bildindex Registry of Open Access Repositories Baseball-Reference.com Biographical Portal Contenta BoardGameGeek GenBank ChEBI Digital Literary Academy UniProt Grooveshark Kyoto Encyclopedia of Genes and Genomes International Plant Names Index Atlas of the World's Languages in Danger Project Runeberg Pornhub Encyclopaedia Metallum PANGAEA Linguee International Children's Digital Library Brown Corpus International Shark Attack File BugMeNot Generally recognized as safe Israeli Central Bureau of Statistics Censimento nazionale delle edizioni italiane del XVI secolo Structurae Charity Navigator Microsoft Academic Search LibriVox WikiTree Cochrane Library DrugBank Society for American Baseball Research Jamendo World Atlas of Language Structures Swedish Film Database Crew United Plena Ilustrita Vortaro de Esperanto Current Index to Statistics UbuWeb Rate Your Music Cyc International HapMap Project Hungarian Electronic Library Online Etymology Dictionary GEOnet Names Server Pandora Radio LyricWiki Open Library Web of Science German Reference Corpus Deutsches Textarchiv dblp computer science bibliography Digitale Bibliothek Directory of Open Access Journals DODIS VirTheo Pinakes Shanghai Interbank Offered Rate RedTube e-rara.ch Academic Search GetInfo Operabase Schengen Information System Encyclopedia of Triangle Centers English Short Title Catalogue Ensembl genome database project MP3.com ST 16 EudraVigilance NNDB Extrasolar Planets Encyclopaedia FOAF Max Planck Digital Library Foundational Model of Anatomy Filmdienst JSTOR Gesamtkatalog der Wiegendrucke statistical business register Flora of North America Freebase SIMAP Web Gallery of Art GEO-LEO Gallica George Eastman Museum Orphanet Unifrance Getty Thesaurus of Geographic Names Inter-Active Terminology for Europe VD 16 Reaxys Global Biodiversity Information Facility Mammal Species of the 
World MEDLINE Mutopia Project Vitis International Variety Catalogue Norsk biografisk leksikon PostGIS Virtuelle Fachbibliothek Germanistik Reptile Database Immune Epitope Database and Analysis Resource Index theologicus VizieR Virtual Manuscript Room World Database on Protected Areas VD 18 Stanford Physics Information Retrieval System bibliographic database Dicionário Houaiss da Língua Portuguesa LIBRIS Hessian Regional History Information System Handbook of the Birds of the World Les Classiques des sciences sociales Missouri Botanical Garden Index Fungorum Marxists Internet Archive Mouse Genome Informatics Moviepilot National Center for Education Statistics online public access catalog Geheugen van Nederland WordReference.com Personenstandsregister Filmweb Anefo Stationers' Register Reactome Web of Knowledge Cambridge Structural Database SABIO-Reaction Kinetics Database Victorian Heritage Register Nederlands Soortenregister Simple Knowledge Organization System Social Security Death Index Tree of Life Web Project ChemSpider Lib.ru Swissbib Systems Biology Ontology KinoPoisk Tenders Electronic Daily Animal Sound Archive Topten Semantically-Interlinked Online Communities Transfermarkt Digital Library for Dutch Literature UNILEX Union List of Artist Names AntWeb Virtual Laboratory Virtual Library Eastern Europe Jewish Virtual Library Greenstone MycoBank Tropicos SUDOC Swiss-Prot FilmAffinity Ameco Sherdog World of Spectrum IntEnz CiteSeerX Fauna Europaea academia.edu ACME Newspictures AlgaeBase Alsatica Amphibian Species of the World Animal Diversity Web BALaT BASOL European Cultivated Potato Database Anatomography Education Resources Information Center British Trust for Ornithology Digital Public Library of America Common Locale Data Repository Compendex Dialnet Orotariko Euskal Hiztegia E-corpus GeneReviews EDGAR Equasis FamilySearch FNAEG Flora of Australia FranceTerme FRANCIS GISAID Global Invasive Species Database HathiTrust Hyper Articles en Ligne Inspec Center 
for Biological Diversity NYPL Digital Gallery National Science Digital Library Open Food Facts Powder Diffraction File Proteopedia Genius Reverso Digital Library of Slovenia Canadian Register of Historic Places Saccharomyces Genome Database Schema.org Dictionary of the Scots Language International Nuclear Information System SIRENE Tela Botanica Terrorist Identities Datamart Environment The Arabidopsis Information Resource National Digital Library of India Harvard University Center for Italian Renaissance Studies ČSFD World Spider Catalog WormBase Rfam AnimeClick SciFinder Library of Congress Online Catalog XVideos Grand Comics Database IntraText Liber Liber Sardegna Digital Library ODIS JPL Small-Body Database Kepler Input Catalog MyAnimeList Portuguese Web Archive magazines.russ.ru GOLD Multitran Russian National Corpus ontology alignment Speech corpus Mushroom Observer Dictionary of Algorithms and Data Structures CDDB Fundamental electronic library CRIStin BIBSYS Filmweb AGRICOLA ATLA Religion Database Adverse Event Reporting System AgBase Allele frequency net database America's Most Endangered Historic Places American Birding Association American National Corpus AmoebaDB Analytical Sciences Digital Library Anemi, Digital Library of Modern Greek Studies AnimalTFDB Animal Genome Size Database Aquatic Commons Archaeology Data Service Arnetminer Oxford Dictionaries AusStage AustLit: The Australian Literature Resource Automated Similarity Judgment Program Aviation Safety Reporting System BIOSIS Previews BISC BabelNet Baen Free Library EPPO code Islamonline.net Biblioteca Virtual Miguel de Cervantes MAINWAY BindingDB Bio2RDF BioGRID BioModels Database BioOne Biographical Directory of Federal Judges BirdLife Australia BitterDB Bookshare British Humanities Index Brix BugGuide CAB Direct CATH Protein Structure Classification database CINAHL COSMIC cancer database California Digital Library California Ethnic and Multicultural Archives California Native Plant Society 
California Register of Historical Resources Chinese Text Project National Commission for the Knowledge and Use of Biodiversity Contemporary Authors Copac Core Historical Literature of Agriculture Corpus of Contemporary American English Corpus of Electronic Texts Crossref Cuneiform Digital Library Initiative Cylinder Audio Archive DAD-IS DOAP DPVweb Database of Interacting Proteins Death Master File Department of Defense Serum Repository Dietary Supplements DigitalNZ Digital Comic Museum Digital Himalaya Digital Library of Georgia DisProt Disease Ontology Domínio Público DroID Drug Industry Document Archive eBird Embase East London Theatre Archive EcoCyc eggNOG Eighteenth Century Collections Online Ekşi Sözlük English-Arabic Parallel Corpus of United Nations Texts Ensembl Genomes Europarl corpus Europe PubMed Central European Nucleotide Archive Exoplanet Archive FADO FRBRoo Filmow FloraBase Flora of China Fusarium graminearum genome database Global Administrative Areas GEISA GSHHG GWASdb Gazetteer of Australia GeneCards Gene Wiki Generic Model Organism Database Genetic codes GeoRef PsycINFO GALILEO global microbial identifier H-Invitational HIV Drug Resistance Database AQL Hazardous Substances Data Bank Historypin Hispana Human Metabolome Database Human Protein Reference Database IEEE Xplore IGRhCellID INSPIRE-HEP Index Copernicus information schema Intercontinental Dictionary Series International Protein Index Interstate Identification Index Invasive Species Compendium Iraqi Virtual Science Library IsoBase RENAP ChEMBL Tebeosfera KUPS Kujawsko-Pomorska Digital Library LIDB LLMDB Lancaster-Oslo-Bergen Corpus Latindex Lattes Platform List of Prokaryotic names with Standing in Nomenclature Latin American and Caribbean Center on Health Sciences Information Lyon-Meudon Extragalactic Database MAREC MICAD MICdb MISLE Making of America Mapper(2) Mapping the Practice and Profession of Sculpture in Britain and Ireland 1851–1951 Melvyl MetroLyrics miRBase MiRTarBase MimoDB 
ModBase Mouse Phenome Database Mouse gene expression database Munk's Roll MuseData MusicDNA Musixmatch NAPP NCBI Epigenomics NGSmethDB NIAID ChemDB NaPTAN National Biodiversity Network National Bridge Inventory National Corpus of Polish National Driver Register National Elevation Dataset National Software Reference Library Natural Earth neXtProt New Advent New Zealand Electronic Text Centre Nolot Norsk Ordbok OER Commons Online Books Page OpenCorporates Ordnett OriDB OrthoDB Orthologous MAtrix Oxford English Corpus P2CS PATRIC PCRPi-DB PDBsum PHI-base Panjab Digital Library Pathway Commons Penn World Table Pennsylvania Sumerian Dictionary PhilPapers Phosida Phospho3D PhylomeDB Planetary Data System PPDB Plant ontology Plazi Post-Reformation Digital Library ProGlycProt ProRepeat Project Vote Smart Protein circular dichroism data bank Proteomics Identifications Database Pseudogene PubMed Central Canada Publication of Archival, Library & Museum Materials Quilt Index REPAIRtoire RNA-binding protein database Registry of Open Access Repositories Mandates and Policies REBASE Redalyc Regional Planetary Image Facility Register of the National Estate RxNorm SAGE KE SciELO Screenonline SeaLifeBase SedDB Sequence Ontology Shiron.net Sibley Music Library Social Science Research Network Spike Atlas StarBase Synthetic gene database TRICS Tanums store rettskrivningsordbok Taxatio Ecclesiastica Technical Report Archive & Image Library textfiles.com Small Molecule Pathway Database TopFIND Toxin and Toxin-Target Database UK Biobank Uberon VINITI Database RAS VIOLIN Viral Bioinformatics Resource Center Virginia Landmarks Register VoiD WhoSampled WikiPathways Wikifonia Women Writers Project World Checklist of Selected Plant Families World Guide to Covered Bridges Xeno-canto YAGO ZINC database ZooBank ZoomInfo Catalog of Fishes e-teatr.pl Glosbe China Academic Library and Information System AdoroCinema Corpus Documentale Latinum Gallaeciae Guia dos Quadrinhos Internet Movie Script 
Database AISLP PlaymakerStats.com DisGeNET Spreadthesign Svensk mediedatabas CiNii Crunchbase Event Log Mtime J-STAGE MRDB NACSIS-CAT Weblio Czech Terminology Database of Library and Information Science Automatic Fingerprint Identification System Polona Index Herbariorum National Police Information System Malopolska Digital Library Narodowe Archiwum Cyfrowe NUKAT Documenta Catholica Omnia Finnish Historical Newspaper Library Terhikki Finnish Social Science Data Archive Naturbase The Norwegian Patient Registry Oncolex Register of inhabitants Kramerius Visitors Location Register BioCyc database collection Euskalterm Inguma Liburuklik Pomet Scope Statistikbanken BioLib Reta Vortaro HebrewBooks RAMBI National Digital Science Library Hungarian Periodicals Table of Contents Database Corpus Scriptorum Historiae Byzantinae New English-Irish Dictionary DINOloket Library of Congress Authorities Manuscriptorium Czech National Corpus China Biographical Database LithoLex Tilastopaja Muséofile Taxonomy database of the U.S. National Center for Biotechnology Information Hollandse Hoogte FoundationDB The National Map Canadian Geographical Names Data Base LncRNAdb MinDat Handbook of Mineralogy rruff webmineral.com Sycomore World Database of Happiness Wikilivres ViralZone filmportal.de Cochrane Database of Systematic Reviews Database for Spoken German GESTIS database register of objects of cultural heritage Ishim Free Music Archive The LiederNet Archive Experimental Factor Ontology BNAber Netpath Basisregistratie Personen Buruxkak Galiciana xHamster Crossroads Bank for Enterprises Visual Novel Database CyberLeninka Helsinki Annotated Corpus Ghana Club 100 Statutory List of Buildings of Special Architectural or Historic Interest Data Catalog Vocabulary National Automated Fingerprint Identification System Matter of England Barcode of Life Data Systems Digital Image Archive of Medieval Music COMBINE National Biomedical Imaging Archive iNaturalist Frances G. 
Spencer Collection of American Sheet Music Welsh Newspapers Online Dictionary of Scottish Architects Lester S. Levy Collection of Sheet Music National Technical Reports Library IMVDb Queensland Heritage Register Australian Organ Donor Register Figshare FlyExpress Avicenna Directories Poetry Archive Analysis & Policy Observatory Human Phenotype Ontology BIBFRAME MAQAM Influenza Research Database MNIST database The Numbers PROSESS RefDB International Tree-Ring Data Bank CollecTF Y Chromosome Haplotype Reference Database IUPHAR/BPS Guide to PHARMACOLOGY Library of Congress Linked Data Service Anopress VetBact National Vital Statistics System PREDITOR Medical Heritage Library Glottolog Japan Center for Asian Historical Records AUSTLANG COMPLUDOC hymnary.org Global Terrorism Database GDELT Project SILVA ribosomal RNA database PharmGKB Volume Area Dihedral Angle Reporter BioStor National electronic library Trove OpenStreetMap Wiki Open Science Framework Finnish Population Information System Doria ArtFacts.Net Finna Digital Repository of Scientific Institutes Calflora Postcode data National Pipe Organ Register Aviation Safety Network accident description AFL Tables The Internet Hockey Database Database of Vascular Plants of Canada database of Genotypes and Phenotypes GRIN Taxonomy for Plants Driver Database Center for Turkish Cinema Studies Normannia Archeological Information System GrassBase Bach Digital Genealogics Australian Bibliographic Network BnF authorities NIOSH pocket guide to chemical hazards ClinVar Internet Game Database Bangumi Gymnosperm Database Revistes Catalanes amb Accés Obert Dyntaxa CERN Document Server NCBI Gene Shipyards Medeltidens bildvärld Fartyg ACToR database Avibase swMATH The Movie Database KNApSAcK LIPID MAPS NDF-RT Digital Archaeological Archive of Comparative Slavery CLOCKSS The Peerage The Academic Family Tree Database of Dutch first names by Meertens institute FANTOM Social Security Applications and Claims Index Re-Member Directory of 
Open Access Books Qatar Digital Library Austrian Parliament personal database AGORHA Library Genesis Early Canadiana Online Trismegistos Renaper Ontology Lookup Service Basketball-Reference.com Phenocarta Zenodo WomenWriters PhDTree Archives Service Center OpenFDA Semantic Scholar Theoi Project Knowledge Web European Reference Index for the Humanities TrEMBL KuLaDig Manioc Croatian Scientific Bibliography Fondazione Federico Zeri UCI ChemDB Panama Papers Catalog of the German National Library SciCrunch OpenNeuro New Zealand Organisms Register Nutrient Tables for use in Australia Index Hepaticarum 3DMet British Nursing Index Loop Australasian Pollen and Spore Atlas Gazetteer of Planetary Nomenclature MIAR MassBank XMetDB SureChEMBL UniChem NMRShiftDB MetaboLights eNanoMapper NCBI Nucleotide GreeNC PATRIC SuperCYP BARCdb NCBI Protein caNanoLab Nanomaterial Registry BiGG Models STITCH diXa Data Warehouse CrocBITE MetaboAnalyst RettBASE euL1db MethBank General Internet Corpus of Russian Jisho Cellosaurus ECARTICO BioCarta CeCaFDB Library of Apicomplexan Metabolic Pathways Metabolomics Workbench PeroxisomeDB JASPAR FunCat ZINC15 Adlr.link Maria Austria Instituut White Rose Research Online ImageNet Corpus Corporum US Census Bureau International Data Base PomBase BacDive Irama Nusantara DIGAR British Book Trade Index Teuchos Common Core of Data Standards for Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names Spenserians Lord Byron and his Times Cranach Digital Archive MitoAge Legacies of British Slave-ownership Catalogue of Life in Taiwan CosIng database Global Plants OpenCitations Corpus WikiGenomes ENZYME Mapillary database MUSEFREM NIOSHTIC-2 National Inventory of Dams TAXREF China National GeneBank CompTox Chemistry Dashboard open data portal ratings.fide.com South African Natural Compounds Database Open PHACTS Discovery Platform Gramene Kunst im Stadtraum CEUR Workshop Proceedings earthquake.usgs.gov Open Spectral Database General Finnish 
Ontology World Waterfall Database CIViC database Archive of Digital Art DrugCentral GRID IntAct protein interaction database kPath UNdata Ensembl Plants Klosterdatenbank OpenTrials CycleBase Library History Database California Death Index UniProt-GOA The File Room Mix'n'match Panarctic Flora Developmental FunctionaL Annotation at Tufts LifeDB Proteome Inc. Syscilia Congress.gov Vagalume Microsoft Academic Evidence & Conclusion Ontology Biblioteca de la UOC Monuments database Hockey-Reference.com PROSPERO European Genome-phenome Archive Euro+Med Plantbase MathSciNet Global Ingredient Archival System VICNAMES Species-ID MinIO Harran Census AuthorClaim Zebrafish Model Organism Database Open TG-GATEs Devri Stanford Natural Language Inference corpus Finnish MP database SNAC Catalogus Philologorum Classicorum GeoNames ontology Critical Assessment of Protein Function Annotation Open Metadata Registry Bolin Centre for Climate Research Infrafrontier COCO Egyptian Knowledge Bank PsyArXiv The Himalayan Database Overnia lobid-organisations lobid-resources Environment Ontology WordSim-353 Dixi Athenaeum Google analogy test set Inventories of American Painting and Sculpture Stanford Question Answering Dataset CuratedTREC WebQuestions WikiMovies RFC Editor Repository Catalogue of Illuminated Manuscripts Canadian Civil Aircraft Register The World's Airlines: Past, Present & Future Sefaria WikiPapers The Monarch Initiative Gary Kessler's File Signature Table Movie Review Data Australian Women's Register Customer Review Datasets Amazon product data Stanford Sentiment Treebank Large Movie Review Dataset MPQA Opinion Corpus Nominis Ulysses database MyVariant.info FB15K Brent corpus NPS Internet Chatroom Conversations, Release 1.0 E-Theses Online Service SecondHandSongs SwissLipids BioAssay Ontology Extracellular RNA Atlas Leipzig Corpora Collection SocArxiv Registry of Births, Deaths and Marriages Victoria Queensland Registry of Births, Deaths & Marriages Paradise Papers ACM Digital 
Library The Camelot Project Semantic Wiki Vocabulary and Terminology Botanico Periodicum Huntianum ClinGen Allele Registry Getty Iconographic Authority WikiQA Archives de littérature du Moyen Âge SimLex-999 Rubenstein-Goodenough dataset LinkedGeoData GeoLinkedData Common Voice Early Modern Letters Online Semantic Publishing and Referencing Ontologies FRBR-aligned Bibliographic Ontology Citation Typing Ontology Bibliographic Reference Ontology Skyscraper Center CIFAR-10 CIFAR-100 Samla SHERPA/Juliet E-Periodica CBCL Face Database Commonwealth War Graves Commission database Fine Arts Heritage Register Altered States Database 20 Newsgroups data set Cinema Treasures BridgeReports.com UMBC corpus Ontobee Inventory of evaluations performed by the Joint Meeting on Pesticide Residues Ecocrop Complex Portal AmphibiaWeb Valka North Carolina Violent Death Reporting System imSitu Digital Repository of Ireland Det Norske Akademis ordbok Directory of Open Access Scholarly Resources California Data Exchange Center Mycology Collections data Portal Genetic and Rare Diseases Information Center Clojars Plants of the World Online UK Medical Heritage Library donor register Green's Dictionary of Slang Biological Magnetic Resonance Data Bank GEPRIS Dutch War Memorial Database Album of the Year Central Library of National Technical University of Athens MuIS DNAtraffic New York Public Library Digital Collections Architectuurgids datos.bne.es 50 Salads dataset YouTube-8M Parliamentary Information System 4TU.Centre for Research Data Missouri Cancer Registry and Research Center Nature Index Royal College of Surgeons biographical database govinfo Interim Register of Marine and Nonmarine Genera Sistema de Información Cultural BEIC Digital Library Merriam-Webster online dictionary SofaScore TFRRS TESEO WikiSQL Food-10k OER World Map History of Geology and Mining B.R.A.H.M.S. 
DASH Repository (Harvard University) Sol Genomics Network Media Art Database LC-QuAD Kielitoimiston sanakirja HOLLIS Silent Era Collective Biographies of Women Bgee Operone Harvard Dataverse ACGT Master Ontology Adverse Event Reporting Ontology Apollo Structured Vocabulary Bacterial Clinical Infectious Diseases Ontology Behavior Perspective Model Battle Management Ontology BioAssay Ontology Biological Collections Ontology Biomedical Ethics Ontology Biomedical Grid Terminology Blood Ontology Bone Dysplasia Ontology Cancer Cell Ontology Cancer Chemoprevention Ontology Cardiovascular Disease Ontology Cell Behavior Ontology Cell Culture Ontology Cell Expression; Localization; Development and Anatomy Ontology Cell Line Ontology Cell Ontology Cellular Microscopy Phenotype Ontology Chemical Entities of Biological Interest Chemical Information Ontology Chemical Methods Ontology Clusters of Orthologous Groups Analysis Ontology Cognitive Paradigm Ontology Common Anatomy Reference Ontology Common Core Ontologies Communication Standards Ontology Comparative Data Analysis Ontology Computational Neuroscience Ontology Computer-Based Patient Record Ontology Conceptual Model Ontology Coriell Cell Line Ontology Drug Interaction Ontology Drug Ontology Evidence & Conclusion Ontology Drug-drug Interaction Ontology Emotion Ontology Epidemiology Ontology Epilepsy and Seizure Ontology Evolution Ontology EXperimental ACTions Biomedical Protocol Ontology Fission Yeast Phenotype Ontology Food Ontology Gene Regulation Ontology General Information Model Genomic Epidemiology Ontology Health Data Ontology Trunk Hemocomponents and Hemoderivatives Ontology Host Pathogen Interactions Ontology Human Interaction Network Ontology Human Physiology Simulation Ontology Infectious Disease Ontology Information Artifact Ontology Informed Consent Ontology Interaction Network Ontology Interdisciplinary Prostate Ontology Project Knowledge Base Of Biomedicine Lipid Ontology Malaria Ontology Materials Ontology 
Mental Disease Ontology Mental Functioning Ontology Minimum Information Model for Patient Safety Middle Layer Ontology for Clinical Care Military Scenario Ontology MIRO and IRbase: IT Tools for the Epidemiological Monitoring of Insecticide Resistance in Mosquito Disease Vectors Model for Clinical Information Mouse Pathology Ontology Name Reaction Ontology Nanoparticle Ontology NeuroPsychological Testing Ontology Neuroscience Information Framework Standard Ontology Neural Electromagnetic Ontologies New Upper Level Ontology Non-Coding RNA Ontology Ontologized Minimum Information About BIobank data Sharing Ontology for Biobanking Ontology for Biomedical Investigations Ontology for Dengue Fever Ontology for Drug Discovery Investigations Ontology for Energy Investigations Ontology for General Medical Science Ontology for Genetic Interval Ontology for Laparoscopic Surgeries Ontology for MIcroRNA Target Prediction Ontology for Newborn Screening and Translational Research Ontology for Pain and Related Disability; Mental Health and Quality of Life Ontology for Periodontitis Ontology of Clinical Research Ontology of Biobanking Administration Ontology of Biological and Clinical Statistics Ontology of Data Mining Ontology of Datatypes Ontology of Experimental Variables and Values Ontology of Medically Related Social Entities Ontology of Vaccine Adverse Events Ontology-Based Data Access Oral Health and Disease Ontology Parasite Experiment Ontology Patient Safetry Categorial Structure Phenotypic Quality Ontology Plant Ontology Population and Community Ontology Population Health Record Porifera Ontology Proteomics data and process provenance ontology Protein Ontology Quality of Service Ontology RNA Ontology Role Ontology Saliva Ontology Schistosomiasis Process Ontology Scientific Evidence and Provenance Information Ontology Semanticscience Integrated Ontology Situation Awareness Ontology Sleep Domain Ontology Software Ontology Statistics Ontology Subcellular Anatomy Ontology of 
Suggested Ontology for Pharmacogenomics Surface Water Ontology Time Event Ontology Translational Medicine Ontology Tumour-Node-Metastasis Ontology Microbial Typing Ontology Universal Core Semantic Layer Vaccination Informed Consent Ontology Vaccine Ontology Vital Sign Ontology Xenopus Anatomy Ontology Zebrafish Anatomical Ontology ZOBODAT Signatures of Majorana fermions in hybrid superconductor-semiconductor nanowire devices YP130 SemEval 2012 Task 2 dataset Polish language corpus Mol-Instincts Trademark Electronic Search System VOCEDplus National Road Data Bank DigitalCommons@UMaine annuaire prosopographique: la France savante RailLexic FRBR-aligned Bibliographic Ontology RCSB protein data bank PDBe Number World Satyricon Automated Weather Data Network ChemInform Open Data Web EM-DAT SHARE Catalogue Morphbank Ġabra PAULING FILE Soybase Legume Information System PeanutBase MaizeGDB PlantGDB Guardiana Small Bodies Node National Coronial Information System TriviaQA Digitale Bibliothek Braunschweig EdShare NC DOCKS Diposit Digital de Documents de la UAB Digital Commons@Wayne State University FreiDok SoilGrids 10,000 Immunomes Statistics on income and living conditions The Federal Register of Legislation FEIS Systematic Catalog of Culicidae Allen Coral Atlas Map of Life Mitochondrial Disease Database Michigan Flora SpringerLink National Identity Register Online catalog Jordan Antiquities Database and Information System Missing and Murdered Indigenous Women and Girls eFloraSA Conservation and Art Materials Encyclopedia Online U.S. 
Geologic Names Lexicon National Geologic Map Database A Space of Their Own Manufacturer and User Facility Device Experience 4TU.Centre for Research Data (4TU.ResearchData) Data Series International Pharmaceutical Abstracts IndexCat Index-Catalogue of the Library of the Surgeon-General’s Office Applied Science & Technology Index Kepler Finance European Criminal Records Information System Biblioteca Virtual de Defensa Butterflies of India Moths of India Odonata of India Reptiles of India Birds of India Moths of North America Palynological Database Plantarium GONIAT Catalogue of the Lepidoptera of Belgium Leeds Robotic Commands C. V. Starr Virtual Herbarium Networked Digital Library of Theses and Dissertations Mineral Resource Data System International Fossil Plant Names Index Feminae: Medieval Women and Gender Index Epistolae: Medieval Women's Letters Litchfield Ledger PseudoCAP SynGO YuBioLab Psyl'list LibriSpeech TED-LIUM corpus 45worlds CREMA-D Gateway to Research Orlando Measurement Units Ontology Extensible Observation Ontology Library for Quantity Kinds and Units Semantic Web for Earth and Environmental Terminology Microsoft Academic Knowledge Graph BroadwayWorld Logeion Collection #1 Fleuron CERL Thesaurus Hall of Light Something About the Author Global Species ScaleNet Orphan Works Database Lucerna MuseumFinland Semantic Kalevala BookSampo Index to American Botanical Literature Microsoft Academic Graph JRC Names Time Ontology in OWL Corpus of Linguistic Acceptability Japan Search Comédie-Française Registers Project Genetics Home Reference Arabic Ontology SIUSA The Bhagavad-Gita Archaeology Data Service library Profiles in Science Register of Antarctic Marine Species Kalos FB15K-237 Pinakes Culture Collections Information Worldwide KBpedia Classify digilibLT PHI Latin Texts National Population Register Levidata Genomics England PanelApp CellMarker myschool Australian Stratigraphic Units Database Bionomia MVDBase Six Degrees of Francis Bacon Gambay 
International Labour Organization statistics database Libraries.org Find & Connect eFloraSA Musisque Deoque PathoPhenoDB Ukrainica NSW Beach Profile Database electrocd SciGraph AusPat Australian Food Composition Database Lawcodes JUSTfind Dcine.org Australian Marine Algal Name Index Global Names Index Vidwan SemCor AllTrails Cross-National Socio-Economic and Religion Data, 2011 Exceptional Experience Questionnaire General Social Survey, 1993 General Social Survey, 1994 General Social Survey, 1996 General Social Survey, 2002 General Social Survey, 2004 General Social Survey, 2006 National Survey of High School Biology Teachers Lilly Survey of Attitudes and Social Networks Spirit and Power: Survey of Pentecostals in Guatemala Biomedicina Slovenica OpenUp Religion among Academic Scientists Religion in Italy Spiritual Life Study of Chinese Residents PCORnet Calendrier électronique des spectacles sous l'Ancien Régime et la Révolution Endangered Archives Programme Carolina Digital Repository AMS Tesi di Dottorato American Memory American Mineralogist Crystal Structure Database ALT Open Access Repository Agritrop AgEcon Search AHERO ACMAC ARRT Archivio Istituzionale Archivio Giuliano Marini OpenScore FIA Results and Statistics National Digital Library of Theses and Dissertations in Taiwan Northernstars.ca ISSN Portal Contributor Role Ontology Proff Finnish Biodiversity Information Facility Power Reactor Information System Legends World World Values Survey, 2005 World Values Survey, 2010 District of Columbia Inventory of Historic Sites Museum of Modern Art online collection Library Publishing Directory TESEO University of Chicago Photographic Archive New Zealand Heritage List Cinema Context Neliti Taiwan Cinema Pleias Gender Studies Resources Database WorldCat Identities AACT Database The Kidney & Urinary Pathway Knowledge Base Newsroom dataset Sequence Database Setup: MSDB World Flora Online Scilit The Digital Archaeological Record Speech Accent Archive Garaph DOIBoost 
Dataset Dump Version 3 Comprehensive Aramaic Lexicon PubAg MicrobeDB ProKinO: Protein Kinase Ontology Browser AnAge Vision AI The Natural Products Atlas LiverTox Microworld Unified Cyber Ontology The Good Old Days Mapa da Cultura Sistema Cultura gene2phenotype Symptom Ontology Map of Early Modern London 8-bits VGMRips Roglo Civil registration District Digital Illinois Digital Heritage Hub Indiana Memory Kentucky Digital Library Minnesota Digital Library Missouri Hub North Carolina Digital Heritage Center PA Digital Plains to Peaks Collective The Portal to Texas History South Carolina Digital Library Floridata Exposome-Explorer Calaix The Vault at Pfaff's Bird tracking - GPS tracking of Lesser Black-backed Gulls and Herring Gulls breeding at the southern North Sea coast HotpotQA SearchQA Members of the European Parliament Open Super-large Crawled ALMAnaCH coRpus Poeti d'Italia in lingua latina DGA Member Directory Natural Questions WikiHop SynTagRus PanTHERIA QALD-9 Hemeroteca Nacional Digital de Mexico NCBI Genome NCBI Assembly Attic Inscriptions Online National Record of the Historic Environment Digital Library of South Dakota CIRIS Chinese Clinical Trial Registry Decoda COVID Tracking Project COVID-19 Community Mobility Reports ALCUIN Spanish National Catalog of Hospitals BBMRI-ERIC Directory FactGrid Bang! 
Hot Film CLEVR Héloïse Epistemonikos bab.la New Zealand Gazetteer Archaeology in Greece Online iDAI.gazetteer kb.nl EDBL Bibliopolis Common Vulnerabilities and Exposures Bioweb Ecuador Open Images Dataset Places database DataPile Elephant Encyclopedia Land Use Database World Checklist of Vascular Plants LSE Digital Library LSE Research Online LSE Theses Online Linked Stage Graph AccessAble 80 Million Tiny Images Icarus Films Tiaki Polish scientist Gender, Sex, and Sexual Orientation Ontology CephBase GlyGen ZivaHub UNESDOC The minority health & health equity archive USGS ScienceBase Google People Cards VertNet depositar Library Hub Discover Datastream

Apache License 2.0

Context
This dataset contains data from a list of Indian stocks on the NSE. It includes a collection of well-performing stocks with all the data necessary to predict which stocks to buy, hold, or exit.

Acknowledgements
I work at a stock research firm. This stock data is for all Kaggle users to play and experiment with in order to learn more about stock research.

Inspiration
The second column, "Category", lists whether a user should buy, hold, or exit each stock. We challenge you to develop an algorithm to see if your result matches ours.

On Friday, August 11th, 2017, a large group of racist white nationalists carrying torches marched on the University of Virginia campus in Charlottesville, VA as an intimidation tactic against proponents of the removal of Confederate statues of Robert E. Lee. The Friday evening march was held ahead of a much larger racist white nationalist rally in the center of Charlottesville planned for Saturday, August 12th, 2017. This dataset includes 100,000 tweet IDs collected using the DocNow tweet collection prototype: http://app.docnow.io/ The tweet IDs can be converted back into the original tweets using the DocNow Hydrator tool, which can be downloaded from here: https://github.com/DocNow/hydrator

A consistently calibrated 40-year record of visible channel remote sensing reflectances (Rrs), based on the Advanced Very High Resolution Radiometer (AVHRR) sensor global time-series. The dataset is derived from the top of atmosphere visible channel reflectances provided by the Pathfinder Atmospheres - Extended (PATMOS-x) V5.3 Climate Data Record (CDR), atmospherically corrected and masked according to quality flags. Temporal filtering and selective masking of the Rrs product is used to highlight regions of the global ocean affected by highly reflective blooms of the coccolithophorid Emiliania huxleyi over the past four decades. Both the Rrs and coccolithophorid bloom products are supplied at monthly resolution on a 0.1 x 0.1 degree global grid. Monthly mean and monthly maximum values are supplied for each product. Requests for daily files can be made to Plymouth Marine Laboratory.

Six datasets of vineyard thermal information, acquired by on-the-go thermal imaging, to predict water status:
- With thermal indices:
  - East side: train and test.
  - West side: train and test.
  - Global model: with both sides; train and test.
- Without thermal indices:
  - East side: train and test.
  - West side: train and test.
  - Global model: with both sides; train and test.

The RAW files in this dataset can be converted to .mzXML using Proteowizard (available at http://proteowizard.sourceforge.net) and then viewed using Skyline (available through MacCoss Lab at https://skyline.gs.washington.edu/labkey/project/home/begin.view).

Seven Crystallographic Information Files obtained from an EPSRC first grant project studying solvent-separated magnesium organohaloaluminates relevant to rechargeable battery electrolytes. The first dataset contributed to Dalton Trans., 2016, doi: 10.1039/C6DT00531D, published online as an accepted manuscript on 22/02/16.

Nexus and .tre files for the single-gene analyses of the Canarina dataset.

Raw data for the Ferndale Bog loss-on-ignition dataset obtained from the Neotoma Paleoecological Database.

Copyright information: Taken from "A procedure for identifying homologous alternative splicing events", http://www.biomedcentral.com/1471-2105/8/260. BMC Bioinformatics 2007;8:260. Published online 19 Jul 2007. PMCID: PMC1950890. In the figure we highlight these two processes with a different colour code, red for the training and blue for the testing. We followed a two-fold heterogeneous cross-validation scheme [50] in which the original dataset was split in two (training and test sets). A resampling protocol was applied to correct for class-imbalance effects [51], resulting in 100 training sets with the same proportion of correct and incorrect observations. Each training set was then utilised to train a NN. We applied the latter to the events in the test set and computed the success rate. The success rate given in the article is the average of the success rates for the 200 NN.

Context
I got this dataset from the UCI Machine Learning Repository. I am very interested in it because air quality in some big cities is a very serious problem, and it is one of the problems tied to global warming. UCI obtained the data from a sensor device located in Italy. You can also read about the dataset in the description.

Content
I got this data from the UCI Machine Learning Repository.
Here is a description of the rows and columns, along with further details. "The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. And Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200."

Acknowledgements
Thanks to UCI [https://archive.ics.uci.edu/ml/index.php][1]

Inspiration
I would like to see other methods for classifying or clustering this dataset as a time series.

[1]: https://archive.ics.uci.edu/ml/index.php

Did you ever go through your vacation photos and ask yourself: What is the name of this temple I visited in China? Who created this monument I saw in France? Landmark recognition can help! This technology can predict landmark labels directly from image pixels, to help people better understand and organize their photo collections. Today, a great obstacle to landmark recognition research is the lack of large annotated datasets. This motivated us to release Google-Landmarks, the largest worldwide dataset to date, to foster progress on this problem. The dataset is divided into two sets of images, to evaluate two different computer vision tasks: recognition and retrieval.
The data was originally described in [1], and published as part of the [Google Landmark Recognition Challenge](https://www.kaggle.com/c/landmark-recognition-challenge) and [Google Landmark Retrieval Challenge](https://www.kaggle.com/c/landmark-retrieval-challenge). Additionally, to spur research in this field, we have open-sourced Deep Local Features (DELF), an attentive local feature descriptor that we believe is especially suited for this kind of task. DELF's code can be found on GitHub via [this link](https://github.com/tensorflow/models/tree/master/research/delf). If you make use of this dataset in your research, please consider citing: `H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, "Large-Scale Image Retrieval with Attentive Deep Local Features", Proc. ICCV'17`

Challenges
The two challenges associated with this dataset can be found at the following links:
* [Google Landmark Recognition Challenge](https://www.kaggle.com/c/landmark-recognition-challenge)
* [Google Landmark Retrieval Challenge](https://www.kaggle.com/c/landmark-retrieval-challenge)

CVPR'18 Workshop
The [Landmark Recognition Workshop](https://landmarkscvprw18.github.io) at [CVPR 2018](http://cvpr2018.thecvf.com/program/workshops) will discuss recent progress on landmark recognition and image retrieval, taking into account the results of the above-mentioned challenges. Top submissions for the challenges will be invited to give talks at the workshop.

Content
The dataset contains URLs of images which are publicly available online (this [Python script](https://www.kaggle.com/tobwey/landmark-recognition-challenge-image-downloader) may be useful to download the images). Note that no image data is released, only URLs. The dataset contains test images, training images and index images. The test images are used in both tasks: for the recognition task, a landmark label may be predicted for each test image; for the retrieval task, relevant index images may be retrieved for each test image.
The training images are associated with landmark labels, and can be used to train models for the recognition and retrieval challenges (for a visualization of the geographic distribution of training images, see [2]). The index images are used in the retrieval task, composing the set from which images should be retrieved. Note that the test set for both the recognition and retrieval tasks is the same, to encourage researchers to experiment with both. We also encourage participants to use the training data from the recognition task to train models which could be useful for the retrieval task. Note, however, that there are no landmarks in common between the training/index sets of the two tasks. The images listed in the dataset are not directly in our control, so their availability may change over time, and the dataset files may be updated to remove URLs which no longer work. Dataset construction The training and index sets were constructed by clustering photos with respect to their geolocation and visual similarity using an algorithm similar to the one described in [3]. Matches between training images were established using local feature matching. Note that there may be multiple clusters per landmark, which typically correspond to different views or different parts of the landmark. To avoid bias, no computer vision algorithms were used for ground truth generation. Instead, we established ground truth correspondences between test images and landmarks using human annotators. License The images listed in this dataset are publicly available on the web, and may have different licenses. Google does not own their copyright. References [1] H. Noh, A. Araujo, J. Sim, T. Weyand, B. Han, "Large-Scale Image Retrieval with Attentive Deep Local Features", Proc. ICCV'17 [2] A. Araujo, T. 
Weyand, "Google-Landmarks: A New Dataset and Challenge for Landmark Recognition", Google Research blog post, available online [here](https://research.googleblog.com/2018/03/google-landmarks-new-dataset-and.html) [3] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, H. Neven, “Tour the World: Building a Web-Scale Landmark Recognition Engine,” Proc. CVPR’09 Context The National Park Service publishes a database of animal and plant species identified in individual national parks and verified by evidence — observations, vouchers, or reports that document the presence of a species in a park. All park species records are available to the public on the National Park Species portal; exceptions are made for sensitive, threatened, or endangered species when widespread distribution of information could pose a risk to the species in the park. Content National Park species lists provide information on the presence and status of species in our national parks. These species lists are works in progress and the absence of a species from a list does not necessarily mean the species is absent from a park. The time and effort spent on species inventories varies from park to park, which may result in data gaps. Species taxonomy changes over time and reflects regional variations or preferences; therefore, records may be listed under a different species name. Each park species record includes a species ID, park name, taxonomic information, scientific name, one or more common names, record status, occurrence (verification of species presence in park), nativeness (species native or foreign to park), abundance (presence and visibility of species in park), seasonality (season and nature of presence in park), and conservation status (species classification according to US Fish & Wildlife Service). 
Taxonomic classes have been translated from Latin to English for species categorization; order, family, and scientific name (genus, species, subspecies) are in Latin. Acknowledgements The National Park Service species list database is managed and updated by staff at individual national parks and the systemwide Inventory and Monitoring department. Source: https://irma.nps.gov/NPSpecies Users interested in getting this data via web services, please go to: http://irmaservices.nps.gov Source Gathered from the New York Times API for Hardcover Fiction best sellers from June 7, 2008 to July 22, 2018. The API can be found here: [https://developer.nytimes.com/][1] Collected data includes the book title, author, the date of the best seller list, the published date of the list, the book description, the rank (this week and last week), the publisher, number of weeks on the list, and the price. [1]: https://developer.nytimes.com/ This dataset has been uploaded from VDiscover on [Github](https://github.com/CIFASIS/VDiscover) This geodatabase was built to cover several geothermal targets developed by Flint Geothermal in 2012 during a search for high-temperature systems that could be exploited for electric power development. Several of the thermal springs and wells in the Routt Hot Spring and Steamboat Springs area have geochemistry and geothermometry values indicative of high-temperature systems. Datasets include: 1. Results of reconnaissance shallow (2 meter) temperature surveys 2. Air photo lineaments 3. Groundwater geochemistry 5. Georeferenced geologic map of Routt County 6. Various 1:24,000 scale topographic maps Context Dataset available at https://datamarket.com/data/set/22u3/international-airline-passengers-monthly-totals-in-thousands-jan-49-dec-60!ds=22u3&display=line The CLAAS-2 record provides cloud properties derived from the SEVIRI sensor onboard METEOSAT second generation (MSG) satellites. 
This second edition is the improved and extended follow-up of the first version of the record (Stengel et al., 2014; CLAAS-1 DOI:10.5676/EUM_SAF_CM/CLAAS/V001). In order to ensure a homogeneous data basis, the solar SEVIRI channels of MSG-1, MSG-2 and MSG-3 were intercalibrated (Meirink et al., 2013) with MODIS Aqua before applying the cloud retrievals. CLAAS-2 features 12 years (2004-2015) of cloud mask/type, cloud top temperature/pressure/height, cloud phase as well as cloud microphysical properties such as optical thickness, effective droplet radius and cloud water path. The data are available at native SEVIRI resolution, i.e. 15 minutes repeat cycle and 3 km (nadir) to 11 km (edge of the field of view) spatial resolution. In addition, spatio-temporal averages of the above-mentioned cloud properties are included: daily and monthly averages and monthly histograms on a 0.05° x 0.05° grid as well as monthly mean diurnal cycles on a 0.25° x 0.25° grid. The advancements compared to CLAAS-1 (DOI:10.5676/EUM_SAF_CM/CLAAS/V001) are for example: (1) extended MSG measurement record used with better calibration, (2) improvements made to the retrieval algorithm leading to products with higher quality and (3) increased temporal resolution (15 minutes). A summary of the CLAAS-2 characteristics and a comprehensive evaluation of the results are currently documented in Benas et al. (2016). Along with the data, comprehensive documentation, including user guide, algorithm descriptions, reprocessing layout and extensive validation studies, is provided. With CLAAS-2, regional and large scale cloud processes at temporal scales of minutes to years can be studied. SEVIRI-based surface radiation products, which were part of CLAAS-1, are now released in a separate dataset (SARAH-2). The zip files contain 12338 datasets for outlier detection investigated in the following papers: (1) Instance space analysis for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. 
Munoz, Kate Smith-Miles (2) On normalization and algorithm selection for unsupervised outlier detection. Authors: Sevvandi Kandanaarachchi, Mario A. Munoz, Rob J. Hyndman, Kate Smith-Miles. Some of these datasets were originally discussed in the paper: On the evaluation of unsupervised outlier detection: measures, datasets and an empirical study. Authors: G. O. Campos, A. Zimek, J. Sander, R. J.G.B. Campello, B. Micenkova, E. Schubert, I. Assent, M.E. Houle. Every replicate (individual or combined) in the three datasets H3K4me3, H3K27me3 and H3K36me3 and in dataset H3K27ac has a fixed number of cell types. H3K4me3 has two replicates: 1 and 2. H3K27 holds replicate 1 and replicates 1 and 2 together. Dataset H3K36 has only replicate 1. For combined analysis of H3K4me3 and H3K27me3 we have a folder named H3K4me3-H3K27me3(combined). In each dataset folder there are 4 subfolders named IQA, MLQA, ML and Overlap representation. The three folders named ML, MLQA and IQA contain the results from these three cell-type tree generation methods. All three folders contain the cell-type tree in newick format. The estimated quartet files which we generated for the MLQA and IQA methods are given in both the MLQA and IQA folders. Finally, the overlap representation data for the cell types are in the folder Overlap representation. In this folder we have a text file named Overlap_datarepresentaion in which the two numbers in the first row give the number of cell types and the data length. After that, each row identified by t1, t2 etc. carries the overlap data. The mapping from t1, t2 etc. to the original cell types is provided in the file_sequence text file. The folder contains all the R script files and data used in the paper. Two datasets are trimmed .csv versions of Table S1 and S3 and the third is functional data for North American mammals from the EltonTraits 1.0 database. Wilman, H., Belmaker, J., Simpson, J., de la Rosa, C., Rivadeneira, M. M., & Jetz, W. (2014). 
EltonTraits 1.0: species-level foraging attributes of the world’s birds and mammals. Ecology, 95(7), 2027. To recreate all analyses and figures used in the paper, open the script “Davis Disassembly Main Code Open First.R”, highlight all the text and click run. Make sure that all the script files and data .csv files are in the same folder and that that folder is set as your working directory. Context I was looking for a good dataset for learning and research purposes. I always kept in mind a collection which could be used for a sorting robot machine in later stage. Lego bricks are good candidates. At first thought I did some experimentation to photograph bricks from different angles but this was time consuming. That is why I turned to computer rendering of the bricks using Blender. Content In this dataset you will find 16 different lego bricks. Each brick is selected in Mecabricks.com and next imported in collada (.dae) format in Blender. I used an animator object to render the imported brick from 400 different angles. Acknowledgements Blender is free and Open 3D Creation Software. Mecabricks.com is a free online Lego modeling tool. Inspiration I hope you can take advantage of this simple set in your learning or research. Let me know if there is need to expand the dataset to more bricks. Enjoy! Datasets obtained from the single locus markov chain model + R scripts for producing the figures presented within the article. Recent dataset comprising clinical and US data Table_S2.xlsx: Presence/Absence of genes x taxa Genes present by taxon. 1 = present; 0 = absent. 
Genes are given by OG name (OrthoMCL) and are marked with an * if a member of the 150 most even gene dataset and with a ^ if identified as a gene affected by EGT. Copyright information: Taken from "Genome-wide identification of functionally distinct subsets of cellular mRNAs associated with two nucleocytoplasmic-shuttling mammalian splicing factors", Genome Biology 2006;7(11):R113-R113. Published online 30 Nov 2006. PMCID: PMC1794580. Unsupervised clustering of the microarray dataset was performed with the dChip software using standard settings considering all nonredundant probes with positive hybridization signal. The dataset includes microarray hybridization results from input and immunoprecipitation (IP) samples from three experiments with anti-U2AF antibody (U1 to U3) and two experiments with anti-PTB antibody (P1 and P2). Sample clustering defines a tree with two first level branches corresponding to input and IP samples. Re-clustering analysis after clearing transcripts that were over-represented either in the inputs or in all immunoprecipitation samples. Sample clustering defines a tree with three first level branches corresponding to input, U2AF, and PTB immunoprecipitation samples. For clustering analysis, the probe signal intensities for each mRNA are standardized to have mean 0 and standard deviation 1 across all samples. The color scale for mRNAs is presented as follows: red represents expression level above mean expression of a gene across all samples, black represents mean expression; and green represents expression lower than the mean. Because of the standardization, probe signal intensities most likely fall within [-3, 3]. PTB, polypyrimidine tract binding protein; U2AF, U2 small nuclear RNP auxiliary factor. Dataset includes 14C measurements made from soil organic matter and CO2 from paired anaerobic and aerobic laboratory soil incubations of active layer soils collected in Barrow, Alaska in 2014. 
In addition to 14CO2, dataset includes CO2 production rates and carbon and nitrogen concentrations. Samples were collected from intensive study site 1 areas A, B, and C, and the site 0 and AB transects, from specified positions in high-centered, flat-centered, and low-centered polygons. Context The USDA Plant database extraction from the Natural Resources Conservation Service. Content It contains a wide range of plant varieties in raw format. Inspiration There is currently no USDA plant information available via API. Using this data set I'm hoping that I can extract needed information for improved plant growth in controlled environments. Context GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations for words. Content Contains 4 files for 4 embedding representations. 1. glove.6B.50d.txt - 6 billion tokens and 50 features 2. glove.6B.100d.txt - 6 billion tokens and 100 features 3. glove.6B.200d.txt - 6 billion tokens and 200 features 4. glove.6B.300d.txt - 6 billion tokens and 300 features Acknowledgements https://nlp.stanford.edu/projects/glove/ 3 million Russian troll tweets This data was used in the FiveThirtyEight story [Why We’re Sharing 3 Million Russian Troll Tweets](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/). This directory contains data on nearly 3 million tweets sent from Twitter handles connected to the Internet Research Agency, a Russian "troll factory" and a defendant in [an indictment](https://www.justice.gov/file/1035477/download) filed by the Justice Department in February 2018, as part of special counsel Robert Mueller's Russia investigation. The tweets in this database were sent between February 2012 and May 2018, with the vast majority posted from 2015 through 2017. 
FiveThirtyEight obtained the data from Clemson University researchers [Darren Linvill](https://www.clemson.edu/cbshs/faculty-staff/profiles/darrenl), an associate professor of communication, and [Patrick Warren](http://pwarren.people.clemson.edu/), an associate professor of economics, on July 25, 2018. They gathered the data using custom searches on a tool called Social Studio, owned by Salesforce and contracted for use by Clemson's [Social Media Listening Center](https://www.clemson.edu/cbshs/centers-institutes/smlc/). The basis for the Twitter handles included in this data are the [November 2017](https://democrats-intelligence.house.gov/uploadedfiles/exhibit_b.pdf) and [June 2018](https://democrats-intelligence.house.gov/uploadedfiles/ira_handles_june_2018.pdf) lists of Internet Research Agency-connected handles that Twitter [provided](https://democrats-intelligence.house.gov/news/documentsingle.aspx?DocumentID=396) to Congress. This data set contains every tweet sent from each of the 2,752 handles on the November 2017 list since May 10, 2015. For the 946 handles newly added on the June 2018 list, this data contains every tweet since June 19, 2015. (For certain handles, the data extends even earlier than these ranges. Some of the listed handles did not tweet during these ranges.) The researchers believe that this includes the overwhelming majority of these handles’ activity. The researchers also removed 19 handles that remained on the June 2018 list but that they deemed very unlikely to be IRA trolls. In total, the nine CSV files include 2,973,371 tweets from 2,848 Twitter handles. Also, as always, caveat emptor -- in this case, tweet-reader beware: In addition to their own content, some of the tweets contain active links, which may lead to adult content or worse. 
The Clemson researchers used this data in a working paper, [Troll Factories: The Internet Research Agency and State-Sponsored Agenda Building](http://pwarren.people.clemson.edu/Linvill_Warren_TrollFactory.pdf), which is currently under review at an academic journal. The authors’ analysis in this paper was done on the data file provided here, limiting the date window to June 19, 2015, to Dec. 31, 2017. The files have the following columns:

Header | Definition
---|---------
`external_author_id` | An author account ID from Twitter
`author` | The handle sending the tweet
`content` | The text of the tweet
`region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
`language` | The language of the tweet
`publish_date` | The date and time the tweet was sent
`harvested_date` | The date and time the tweet was collected by Social Studio
`following` | The number of accounts the handle was following at the time of the tweet
`followers` | The number of followers the handle had at the time of the tweet
`updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
`post_type` | Indicates if the tweet was a retweet or a quote-tweet
`account_type` | Specific account theme, as coded by Linvill and Warren
`retweet` | A binary indicator of whether or not the tweet is a retweet
`account_category` | General account theme, as coded by Linvill and Warren
`new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018

If you use this data and find anything interesting, please let us know. Send your projects to oliver.roeder@fivethirtyeight.com or [@ollie](https://twitter.com/ollie). The Clemson researchers wish to acknowledge the assistance of the Clemson University Social Media Listening Center and Brandon Boatwright of the University of Tennessee, Knoxville. Description of the genomic datasets (nanopore and illumina). 
(A) Global surface heat flow measurements (International Heat Flow Commission) plotted by binning the point-measurement datasets to an adaptively-refined triangulation with a minimum resolution of 50 km. Color saturation represents the resolution of the triangulation which washes out progressively such that the coarsest resolution is shown with only 5% saturation. In order to allow full comparability with other ocean acidification data sets, the R package seacarb (Gattuso et al., 2016) was used to compute a complete and consistent set of carbonate system variables, as described by Nisumaa et al. (2010). In this dataset the original values were archived in addition to the recalculated parameters (see related PI). The date of carbonate chemistry calculation by seacarb is 2018-02-02. This dataset contains a txt file with all words from different languages, such as English or French. This dataset contains original and imputed data from the publication Each column represents a vector of 3720 elements that were obtained from the calculation of 20 PCA, for each RGB color matrix, to the image dataset US Stock Intra-day Dataset Columns 1. Name of Planet 2. Weight by World 3. Diameter (km) 4. Average Distance from Sun (km) 5. Gravity (Earth=1) 6. Time to Orbit Sun (days) 7. Time to Spin on Axis (minutes) 8. Number of Known Moons 9. Year of Discovery 10. Average Temperature (°C) 11. Contents of Atmosphere (more than 1%) Context This dataset is a snapshot of the [OpenPowerlifting](http://www.openpowerlifting.org/index.html) database as of February 2018. OpenPowerlifting is an organization which tracks meets and competitor results in the sport of powerlifting, in which competitors compete to lift the most weight for their class in three separate weightlifting categories. Content This dataset includes two files. `meets.csv` is a record of all meets (competitions) included in the OpenPowerlifting database. 
`competitors.csv` is a record of all competitors who attended those meets, and the stats and lifts that they recorded at them. For more on how this dataset was collected, see the [OpenPowerlifting FAQ](http://www.openpowerlifting.org/faq.html). Acknowledgements This dataset is republished as-is from the [OpenPowerlifting source](http://www.openpowerlifting.org/data.html). Inspiration * How much influence does overall weight have on lifting capacity? * How big of a difference does gender make? What are the demographics of lifters more generally? Context This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories. Contents - Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite - database.sqlite: Contains the table 'Reviews' Data includes: - Reviews from Oct 1999 - Oct 2012 - 568,454 reviews - 256,059 users - 74,258 products - 260 users with > 50 reviews [![wordcloud](https://www.kaggle.io/svf/137051/2ba35b1344041b4964fe12365b577999/wordcloud.png)](https://www.kaggle.com/benhamner/d/snap/amazon-fine-food-reviews/reviews-wordcloud) Acknowledgements See [this SQLite query](https://www.kaggle.com/benhamner/d/snap/amazon-fine-food-reviews/data-sample) for a quick sample of the dataset. If you publish articles based on this dataset, please cite the following paper: - J. McAuley and J. Leskovec. [From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews](http://i.stanford.edu/~julian/pdfs/www13.pdf). WWW, 2013. Utility Data The data is extracted from [geonames][1], a very exhaustive list of worldwide toponyms. 
**It can be joined with datasets containing geographic fields to facilitate geospatial analysis including mapping.** This [datapackage][2] only lists cities above 15,000 inhabitants. Each city is associated with its country and subcountry to reduce the number of ambiguities. Subcountry can be the name of a state (e.g., in the United Kingdom or the United States of America) or the major administrative section (e.g., "region" in France). See the `admin1` field on the [geonames website][3] for further info about subcountry. Notice that: * Some cities like Vatican City or Singapore are a whole state so they don't belong to any subcountry. Therefore subcountry is `N/A`. * There is no guarantee that a city has a unique name in a country and subcountry (at the time of writing, there are about 60 ambiguities). But for each city, the source data primary key `geonameid` is provided. Preparation You can run the script yourself to update the data and publish them to GitHub/Kaggle: see [scripts README][4] Acknowledgments and License All data is licensed under the Creative Commons Attribution License as is the original data from [geonames][5]. This means you have to credit [geonames][6] when using the data. And while no credit is formally required, a link back or credit to [Lexman][7] and the [Open Knowledge Foundation][8] is much appreciated. 
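The ambiguity note above (several cities can share the same name, country and subcountry, while `geonameid` stays unique) can be checked with a short sketch like the following. It assumes each row is a dict with the keys `name`, `country`, `subcountry` and `geonameid`; the key `name` and the helper name `find_ambiguous_cities` are assumptions for illustration, not something the description confirms.

```python
from collections import defaultdict


def find_ambiguous_cities(rows):
    """Group rows by (name, country, subcountry) and keep groups with more than one geonameid.

    `rows` is any iterable of dicts, e.g. what csv.DictReader yields.
    The column name "name" is an assumption; the others are named in the description.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["name"], row["country"], row["subcountry"])
        groups[key].append(row["geonameid"])
    # Only triples that map to several distinct records are ambiguous.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up rows for illustration only; these are not values from the dataset.
example_rows = [
    {"name": "Springfield", "country": "X", "subcountry": "A", "geonameid": "1"},
    {"name": "Springfield", "country": "X", "subcountry": "A", "geonameid": "2"},
    {"name": "Springfield", "country": "X", "subcountry": "B", "geonameid": "3"},
]
ambiguous = find_ambiguous_cities(example_rows)
```

When joining against other geographic data, keying on `geonameid` rather than on the name triple avoids these collisions.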
*This dataset description is reproduced here from [its original source][9] with slight modifications.* [1]: http://www.geonames.org/ [2]: http://dataprotocols.org/data-packages/ [3]: http://www.geonames.org/ [4]: http://data.okfn.org/data/core/scripts/README.md [5]: http://www.geonames.org/ [6]: http://www.geonames.org/ [7]: http://github.com/lexman [8]: http://okfn.org/ [9]: http://data.okfn.org/data/core/world-cities Here we provide four ArcGIS map packages with georeferenced files on the spatial distribution of Antarctic petrels, Adélie penguins (breeders and non-breeders) and Emperor penguins in the wider Weddell Sea (Antarctica), which were created in the context of the development of a marine protected area in the Weddell Sea. Antarctic petrel (Thalassoica antarctica): We approximated potential foraging habitats of T. antarctica according to existing literature by ice coverage from AMSR-E sea ice maps, bathymetric data from the International Bathymetric Chart of the Southern Ocean (IBCSO), and seawater temperature data from the Finite Element Sea Ice - Ocean Model (FESOM) provided by R. Timmermann (AWI). Subsequently, we combined our Antarctic petrel model with the kernel utilization distribution model from Descamps et al. (2016). The authors kindly provided us with shape files showing the kernel utilization summer and winter distribution of Antarctic petrel breeding at Svarthamaren. Breeding locations and estimated number of breeding pairs were taken from van Franeker et al. (1999). Favourable habitat conditions for Antarctic petrels were predicted for the Lazarev Sea and along the eastern coast of the Weddell Sea, particularly for the area off the Fimbul Ice Shelf and along the coast between approx. 15°E to 10°W within a water depth range from approx. 500 m to 2500 m. Breeding Adélie penguins (Pygoscelis adeliae): The map of potential foraging habitats of breeding P. 
adeliae is based on British Antarctic Survey (BAS) Inventory data from Phil Trathan (ID 754) and Mike Dunn and P. Trathan (ID 764, 773, 779), a dataset from BAS (P. Trathan) and Instituto Antártico Argentino (Mercedes Santos) (ID 753) and a dataset from the US AMLR Program from Jefferson Hinke and Wayne Trivelpiece (NOAA) (ID 910), which are stored in BirdLife International's Seabird Tracking Database (data request: 20-10-2015). Suitable foraging habitats for breeding Adélies from colonies for which no tracking data were available were approximated by a 50 km buffer and a 50-100 km ring buffer around each colony according to the recommendations of a CCAMLR MPA planning workshop. Breeding locations and estimated abundance of breeding pairs were taken from Lynch and LaRue (2014). The tracking data were processed with a state-space model described by Johnson et al. (2008), as implemented in the R package crawl (Johnson 2011). Jefferson Hinke (NOAA) kindly provided us with support running the R script. Highly suitable foraging habitats occurred about 50 km away from the colonies on King George Island, the colony in Hope Bay (Graham Land) and the colonies on the South Orkney Islands. Non-breeding Adélie penguins (Pygoscelis adeliae): The map of potential foraging habitats of non-breeding P. adeliae is based on British Antarctic Survey (BAS) Inventory data from Phil Trathan (ID 754) and Mike Dunn and P. Trathan (ID 773, 779), a dataset from BAS (P. Trathan) and Instituto Antártico Argentino (Mercedes Santos) (ID 753) and a dataset from the US AMLR Program from Jefferson Hinke and Wayne Trivelpiece (NOAA) (ID 910), which are stored in BirdLife International's Seabird Tracking Database (data request: 20-10-2015). The tracking data were processed with a state-space model described by Johnson et al. (2008), as implemented in the R package crawl (Johnson 2011). Jefferson Hinke (NOAA) kindly provided us with support running the R script. 
Highest habitat utilisation was concentrated in relatively small areas (e.g., close to King George Island). However, the non-breeding Adélies seemed to roam through large parts of the Weddell Sea. Emperor penguins (Aptenodytes forsteri): The probability map of A. forsteri occurrence was developed as a function of distance to colony and colony size from Fretwell et al. (2012, 2014) as well as from sea ice concentration from AMSR-E sea ice maps. Our model of emperor penguin foraging distribution during breeding season showed that the probability of occurrence is highest at the Halley and Dawson colony near Brunt Ice Shelf and at the Atka colony near Ekstrøm Ice Shelf. More information on the spatial analysis is given in working paper WG-EMM-16/03 and WG-SAM-17/30 (for T. antarctica) submitted to the CCAMLR Working Group on Ecosystem Monitoring and Management (EMM) and the CCAMLR Working Group on Statistics, Assessments and Modelling (SAM), respectively (available at https://www.ccamlr.org/en/wg-emm-16 and https://www.ccamlr.org/en/wg-sam-17). Dataset used for correlation analyses and PLS. These are some key metrics behind indoor and outdoor road cycling. Context This is a modified version of the BioGrid Homo sapiens dataset. I removed some columns from the original dataset in order to make it easier to use and understand. Content This dataset contains four columns. The first two specify an interaction between two proteins (Official Symbol Interactor A and Official Symbol Interactor B). The third column contains the PMID of an article that describes an experiment that gives information about the interaction. The fourth column contains information about the throughput of the interaction. You can download the original version of the dataset [here](https://thebiogrid.org/). 
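As a rough illustration of how the two interactor columns of the protein interaction data described above might be used, the sketch below builds an undirected adjacency map from pairs of protein symbols. It assumes the file is a delimited text file whose header uses exactly the names `Official Symbol Interactor A` and `Official Symbol Interactor B`; the delimiter, the helper names and the file layout are assumptions rather than facts from the description.

```python
import csv
from collections import defaultdict

# Column names as given in the description; assumed to match the file header exactly.
COL_A = "Official Symbol Interactor A"
COL_B = "Official Symbol Interactor B"


def read_pairs(path, delimiter=","):
    """Yield (interactor A, interactor B) tuples from the file at `path` (hypothetical layout)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            yield row[COL_A], row[COL_B]


def build_interaction_graph(pairs):
    """Return {protein: set of partners}, treating every interaction as undirected."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


# Placeholder symbols, not real records; the repeated pair collapses into one edge.
graph = build_interaction_graph([("P1", "P2"), ("P2", "P3"), ("P1", "P2")])
```

Using sets means interactions reported by several articles count once per partner; keeping the PMID column alongside would be needed to preserve that multiplicity.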
The dataset is made up of full-body kinematics data of 6 subjects performing two main tasks:
- walking alone (Solo Trial);
- walking together with another subject while mechanically coupled through a stretcher-like object (Paired Trial).

Data were collected using the VICON system with 14 infrared Bonita cameras. Subjects were fully instrumented with 34 passive markers placed according to the Plug-In-Gait marker placement. The stretcher-like object used to mechanically couple a pair of subjects was also instrumented with 6 markers in order to detect its position during the trials. The experiment parameters and subject metadata are provided in the files Subject_Disposition.xlsx and Database_Subjects_CoMTrajectory.xlsx. The trajectory of each marker placed on each subject within a pair is stored in Pair*.mat, while the trajectory and the velocity of the CoM for each subject evaluated at each gait cycle can be found in CoM_trajectories and CoM_velocities. In detail:
- Pair*.mat: For each pair (*=1,2,3,...,7), the cells represent the trials performed by the pair. By opening one cell (trial) it is possible to find several fields. The relevant one is the ‘DATA’ field (hdl*{1,trial}.DATA), where it is possible to find several parameters related to the Subject in front of the stretcher (1) or behind it (2). For simplicity we will refer only to the Subject in front to explain the rest of the fields (the same can be said for the Subject behind just by replacing 1 with 2 or A with B). The main fields to be analyzed are:
  o hdl*{1,trial}.DATA.Pos1->it contains all the markers’ trajectories, related to markers on the right (R) or on the left (L) or in the center (C).
  o hdl*{1,trial}.DATA.CoMs01A.C.CoM->it contains CoM trajectory
  o hdl*{1,trial}.DATA.CoM_Table->it contains Table CoM trajectory.
- CoM_trajectories: each field is a Subject that is identified by a number (the pair to which he/she belongs) and a letter A/B to indicate whether the subject is in front of the stretcher (A) or behind (B).
  o COM_GCT.Subject**.Single->contains a number of cells corresponding to the total number of trials that the Subject** did during the ‘Solo Trial’. Since some of the 6 analyzed subjects can appear in different pairs or in the same pair but in a different position, in order to map the subject inside a pair to the real subject (whose parameters are stored in Database_Subjects_CoM_Trajectory.xlsx) please refer to Subject_Disposition.xlsx. In each trial one can access the CoM trajectory along the forward direction;
  o COM_GCT.Subject**.Coupled->contains a number of cells corresponding to the total number of trials that the Subject** did during the ‘Paired Trial’; in each trial one can access the CoM trajectory along the forward direction;
  o COM_GCT.Subject**.Media->it is the mean CoM trajectory for both Single and Paired trials.
- CoM_velocities: same scheme as CoM_trajectories.

This dataset contains macaque single unit recordings from the orbitofrontal cortices (subjects J and T), for stop signal task and economic choice task. The data is stored in .mat format, and has separate files for neural activity aligned to go signal and stop signal. Quanyin Hu et al. Dataset for [Conjugation of haematopoietic stem cells and platelets decorated with anti-PD-1 antibodies augments anti-leukaemia efficacy]. We propose adaptive incremental mixture Markov chain Monte Carlo (AIMM), a novel approach to sample from challenging probability distributions defined on a general state-space. While adaptive MCMC methods usually update a parametric proposal kernel with a global rule, AIMM locally adapts a semiparametric kernel. AIMM is based on an independent Metropolis–Hastings proposal distribution which takes the form of a finite mixture of Gaussian distributions. 
Central to this approach is the idea that the proposal distribution adapts to the target by locally adding a mixture component when the discrepancy between the proposal mixture and the target is deemed to be too large. As a result, the number of components in the mixture proposal is not fixed in advance. Theoretically, we prove that there exists a stochastic process that can be made arbitrarily close to AIMM and that converges to the correct target distribution. We also illustrate that it performs well in practice in a variety of challenging situations, including high-dimensional and multimodal target distributions. Finally, the methodology is successfully applied to two real data examples, including the Bayesian inference of a semiparametric regression model for the Boston Housing dataset. Supplementary materials for this article are available online.
This dataset is a subset of the Yelp Challenge; it contains all the reviews from 2015.
Context Stars mostly form in clusters and associations rather than in isolation. Milky Way star clusters are easily observable with small telescopes, and in some cases even with the naked eye. Depending on a variety of conditions, star clusters may dissolve quickly or be very long lived. The dynamical evolution of star clusters is a topic of very active research in astrophysics. Some popular models of star clusters are the so-called direct N-body simulations [1, 2], where every star is represented by a point particle that interacts gravitationally with every other particle. This kind of simulation is computationally expensive, as it scales as O(N^2) where N is the number of particles in the simulated cluster. In the following, the words "particle" and "star" are used interchangeably. Content This dataset contains the positions and velocities of simulated stars (particles) in a direct N-body simulation of a star cluster.
In the cluster there are initially 64000 stars distributed in position-velocity space according to a King model [3]. Each .csv file named c_xxxx.csv corresponds to a snapshot of the simulation at time t = xxxx. For example, c_0000.csv contains the initial conditions (positions and velocities of stars at time t=0). Times are measured in standard N-body units [4]. This is a system of units where G = M = −4E = 1 (G is the gravitational constant, M the total mass of the cluster, and E its total energy). **x, y, z** Columns 1, 2, and 3 of each file are the x, y, z positions of the stars. They are also expressed in standard N-body units [4]. You can switch to units of the median radius of the cluster by finding the cluster center, calculating the median distance of stars from it, and then dividing x, y, and z by this number. In general, the median radius changes in time. The initial conditions are approximately spherically symmetric (you can check), so there is no particular physical meaning attached to the choice of x, y, and z. **vx, vy, vz** Columns 4, 5, and 6 contain the x, y, and z velocity, also in N-body units. A scale velocity for the stars can be obtained by taking the standard deviation of velocity along one direction (e.g. z). You may check that the ratio between the typical radius (see above) and the typical velocity is of order unity. **m** Column 7 is the mass of each star. For this simulation this is identically 1.5625e-05, i.e. 1/64000. The total mass of the cluster is initially 1. More realistic simulations (coming soon) have a spectrum of different masses and live stellar evolution, which results in changes in the mass of stars. This simulation is instead a pure N-body problem. **Star id number** The id numbers of each particle are listed in the last column (8) of the files under the header "id". The ids are unique and can be used to trace the position and velocity of a star across all files. There are initially 64000 particles.
At the end of the simulation there are 63970. This is because some particles escape the cluster. Acknowledgements This simulation was run on a Center for Galaxy Evolution Research (CGER) workstation at Yonsei University (Seoul, Korea), using the NBODY6 software (https://www.ast.cam.ac.uk/~sverre/web/pages/nbody.htm). Inspiration Some stars hover around the center of the cluster, while others get kicked out to the cluster outskirts or even leave the cluster altogether. Can we predict where a star will be at any given time based on its initial position and velocity? Can we predict its velocity? How correlated are the motions of stars? Can we predict the velocity of a given star based on the velocity of its neighbours? The size of the cluster can be measured by defining a center (see below) and finding the median distance of stars from it. This is called the three-dimensional effective radius. Can we predict how it evolves over time? What are its properties as a time series? What can we say about other quantiles of the radius? How should the cluster center be defined? Simply as the mode of a KDE of the distribution of stars? How does it move over time, and how can the properties of its fluctuations be quantified? Is the cluster symmetric around this center? Some stars leave the cluster: over time they exchange energy in close encounters with other stars and reach the escape velocity. This can be seen by comparing later snapshots with the initial one: some IDs are missing and there is overall a lower number of stars. Can we predict which stars are more likely to escape? When will a given star escape? References [1] Heggie, D., Hut, P. 2003, The Gravitational Million-Body Problem: A Multidisciplinary Approach to Star Cluster Dynamics, Cambridge University Press [2] Aarseth, S. J. 2003, Gravitational N-Body Simulations, Cambridge University Press [3] King, I. 1966, AJ, 71, 64 [4] Heggie, D. C., Mathieu, R. D. 1986, Lecture Notes in Physics, Vol.
267, The Use of Supercomputers in Stellar Dynamics, Berlin, Springer
Context Invasive Ductal Carcinoma (IDC) is the most common subtype of all breast cancers. To assign an aggressiveness grade to a whole mount sample, pathologists typically focus on the regions which contain the IDC. As a result, one of the common pre-processing steps for automatic aggressiveness grading is to delineate the exact regions of IDC inside of a whole mount slide. Content The original dataset consisted of 162 whole mount slide images of Breast Cancer (BCa) specimens scanned at 40x. From that, 277,524 patches of size 50 x 50 were extracted (198,738 IDC negative and 78,786 IDC positive). Each patch’s file name is of the format u_xX_yY_classC.png, for example 10253_idx5_x1351_y1101_class0.png, where u is the patient ID (10253_idx5), X is the x-coordinate of where this patch was cropped from, Y is the y-coordinate of where this patch was cropped from, and C indicates the class, where 0 is non-IDC and 1 is IDC. Acknowledgements The original files are located here: http://gleason.case.edu/webdata/jpi-dl-tutorial/IDC_regular_ps50_idx5.zip Citation: https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872 Inspiration Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error.
"Datasets for comparing movements and patterns between the main Spanish stock indices and cryptocurrencies" Context The aim here is to detect or forecast the movements caused by a range of factors, both internal (buying and selling) and external (political or economic events, etc.), in the main Spanish stock indices and in cryptocurrencies. We selected several data sources to generate CSV files that store different values over the same time period. Note that we are mainly interested in the upward or downward trends that can be calculated or retrieved over these periods. Content The content consists of several CSV files. For cryptocurrency movements, one file was generated per day of the period studied. For the main stock indices, one folder was generated per day of the period, each containing one file per index, named after the index. For this reason the latter were compressed before being published in the kaggle.com open data directory. As for the fields, we are interested in detecting upward and downward movements, or at least those that follow a similar pattern in cryptocurrencies and indices. The most relevant fields are:
• Nom (name): name of the company or cryptocurrency;
• Preu (price): value in euros of a share or a cryptocurrency;
• Volum (volume): in euros / 24-hour volume, the accumulated daily transactions in millions of euros;
• Simbol (symbol): symbol or acronym of the currency;
• Cap de mercat (market cap): total value of all coins at the current moment;
• Oferta circulant (circulating supply): value as a business opportunity;
• % 1h, % 2h and % 7d: percentage change in the value of the currency over 1h, 2h or 7d relative to the other cryptocurrencies.
Acknowledgements The data sources used to build the datasets are: - http://www.eleconomista.es - https://coinmarketcap.com Accordingly, the stock market and cryptocurrency data are ultimately subject to the licences of the respective websites. For financial terminology, see the renta4banco glossary [https://www.r4.com/que-necesitas/formacion/diccionario].
Inspiration An earlier study gives a first idea of how the algorithms have been approached: - https://arxiv.org/pdf/1410.1231v1.pdf Cryptocurrency trading is relatively new and quite popular because of its formulation as a digital medium of exchange, using a protocol that guarantees the security, integrity and balance of its ledger through a network of agents. The community will be able to answer, among other questions: - Do cryptocurrency prices and Spain's main stock market affect each other or share common patterns? - Do external effects or agents affect shares and cryptocurrencies equally? - Are there cause-effect relationships between shares and cryptocurrencies? Project repository https://github.com/acostasg/scraping Datasets The generated CSV files that make up the dataset have been published on kaggle.com: * https://www.kaggle.com/acostasg/stock-index/ * https://www.kaggle.com/acostasg/crypto-currencies The stock-index files are compressed into folders named with the extraction date, with each file named after its stock index. The cryptocurrency data, by contrast, is split into one file per extraction date containing all the currencies.
Comparisons of this dataset to muscle and neural expression datasets. Included in this file are: 1- total muscle enriched genes [32] that are also SGP-biased or hmc-biased, 2- larval pan-neural enriched genes [31] that are also SGP-biased or hmc-biased, 3- genes that are expressed in muscle, neuron, and hmc, 4- GO terms for genes that are expressed in muscle and hmc, 5- GO terms for genes that are expressed in neuron and hmc, 6- genes that are involved in the synaptic vesicle cycle [59], 7- genes that encode components of thin and thick filaments of body wall muscle [60], and 8- genes that encode FMRF-like and insulin-like peptides.
(XLSX 159 kb)
The dataset described in https://arxiv.org/abs/1809.01574 DOI: 10.13140/RG.2.2.15252.76161/1 Approx. 1000 entities for stance detection in Russian.
Context Approximately 10 people are shot on an average day in Chicago. http://www.chicagotribune.com/news/data/ct-shooting-victims-map-charts-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-chicago-homicides-data-tracker-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-homicide-victims-2017-htmlstory.html Content This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. Update Frequency: Daily Fork [this kernel][1] to get started. Acknowledgements https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_crime https://cloud.google.com/bigquery/public-data/chicago-crime-data Dataset Source: City of Chicago This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source (https://data.cityofchicago.org) and is provided "AS IS" without any warranty, express or implied, from Google.
Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Banner Photo by [Ferdinand Stohr from Unsplash][2]. Inspiration What categories of crime exhibited the greatest year-over-year increase between 2015 and 2016? Which month generally has the greatest number of motor vehicle thefts? How does temperature affect the incident rate of violent crime (assault or battery)? [1]: https://www.kaggle.com/paultimothymooney/starter-kernel-for-chicago-crime-dataset [2]: https://unsplash.com/photos/EK8DxK_7IwY ![](https://cloud.google.com/bigquery/images/chicago-scatter.png)
This is the dataset for the paper titled "Strong Rotational Anisotropies Affect Nonlinear Chiral Metamaterials". Produced by David Hooper (d.c.hooper@bath.ac.uk) and Joel Collins (j.collins@bath.ac.uk)
• The folders TI-539A and TI-540A contain SHG continuous polarization data for the left- and right-handed nanohelices, respectively. The data found in these folders is used to produce figures 2c-f, 3, 4 in the manuscript and figures S2-S6 in the supporting information.
o The subfolders are split into measurements performed at normal incidence and 45 degrees incidence.
o The subfolders give information about the sample geometry such as the polarizer-analyzer configuration and which parts were rotated.
• The data files are then named such as 20160218_Pol_0_Ana_0_Sample_0_QWP_0to360Step5_17h07m18s which should be read as: DateStamp_PolarizerAngle_AnalyzerAngle_SampleAngle_QWP_Range&Step_TimeStamp
o There are two columns within each data file. The 1st column contains the angle (in degrees) of the Quarter-wave plate (QWP) over the range and in steps given in the file name. The 2nd column contains the SHG counts per second recorded by the photon counting system. (The zero degree angle means that the fast axis of a component is horizontal with respect to the bench.
90 degrees then means the fast axis is vertical (normal to the bench). The angle of the components (not sample) is recorded from the frame of reference looking against the direction of propagation, zero on the left-hand side horizontal and the fast axis rotated clockwise. This information should be enough to reconstruct the polarization state incident on the sample.)
• The folder “Linear Polarization Anisotropy P-in P-out TI540A” contains the data for the continuous polarization measurement displayed in Figure 2a of the paper. These measurements are performed at 45 degrees incidence.
o The file format is the same as explained above.
• The “Linear Spectrum” folder contains the data for figures 2b and S1. It contains its own read me file.
Talk given during the "Harmonise This! Analyzing Diverse Neuroimaging Datasets" workshop at the 2015 Organization for Human Brain Mapping (OHBM) conference in Hawaii, 14-18 June.
This is a sample dataset that includes 29,437 full text articles for testing SparkText (SparkText: biomedical text mining on big data framework).
Number of papers per category for ten key entropy concepts. The concepts were selected according to their frequency of appearances in all abstracts in our dataset.
Dataset title: Sales of shampoo over a three year period
Last updated: 1 Feb 2014, 19:52
Last updated by source: 20 Jun 2012
Provider: Time Series Data Library
Provider source: Makridakis, Wheelwright and Hyndman (1998)
Source URL: http://datamarket.com/data/list/?q=provider:tsdl
Units: (none listed)
Dataset metrics: 36 fact values in 1 timeseries.
Time granularity: Month
Time range: Jan 1 – Dec 3
Language: English
License: Default open license
License summary: This data release is licensed as follows: You may copy and redistribute the data. You may make derivative works from the data. You may use the data for commercial purposes. You may not sublicense the data when redistributing it. You may not redistribute the data under a different license.
Source attribution on any use of this data: Must refer source. Description: Sales, Source: Makridakis, Wheelwright and Hyndman (1998), in file: data/shampoo, Description: Sales of shampoo over a three year period
Archive of python scripts used in data analysis. See Readme in that archive and COMMANDS.txt
The Rooftop Energy Potential of Low Income Communities in America (REPLICA) data set provides estimates of residential rooftop solar technical potential at the tract level, with emphasis on estimates for Low and Moderate Income (LMI) populations. In addition to technical potential, REPLICA comprises 10 additional datasets at the tract level that provide socio-demographic and market context. The model year vintage of REPLICA is 2015. The LMI solar potential estimates are made at the tract level, grouped by Area Median Income (AMI), income, tenure, and building type. These estimates are based on LiDAR data of 128 metropolitan areas, statistical modeling, and ACS 2011-2015 demographic data. The remaining datasets are supplemental datasets that can be used in conjunction with the technical potential data for general LMI solar analysis, planning, and policy making. The core dataset is a wide-format CSV file, seeds_ii_replica.csv, that can be tagged to a tract geometry using the GEOID or GISJOIN fields. In addition, users can download geographic shapefiles for the main or supplemental datasets. This dataset was generated as part of the larger NREL-led SEEDSII (Solar Energy Evolution and Diffusion Studies) project, and specifically for the NREL technical report titled "Rooftop Solar Technical Potential for Low-to-Moderate Income Households in the United States" by Sigrin and Mooney (2018). This dataset is intended to give researchers, planners, advocates, and policy-makers access to credible data to analyze low-income solar issues and potentially perform cost-benefit analysis for program design. To explore the data in an interactive web mapping environment, use the NREL SolarForAll app.
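The tagging step described for seeds_ii_replica.csv (matching rows to tract geometries through the GEOID field) can be sketched as follows. This is a minimal illustration under stated assumptions, not the NREL workflow: the in-memory CSV snippets stand in for the real files, the columns `example_potential` and `tract_name` are hypothetical, and the helper names are invented for this example. The one domain fact relied on is that US Census tract GEOIDs are 11-digit strings, so a leading zero can be lost if a tool parses the field as a number.

```python
import csv
import io


def normalize_geoid(value: str) -> str:
    """Return an 11-character tract GEOID, restoring leading zeros if they were dropped."""
    return value.strip().zfill(11)


def join_on_geoid(replica_rows, geometry_rows):
    """Inner-join two lists of dict rows on their normalized GEOID field."""
    geometry_by_geoid = {normalize_geoid(r["GEOID"]): r for r in geometry_rows}
    joined = []
    for row in replica_rows:
        key = normalize_geoid(row["GEOID"])
        match = geometry_by_geoid.get(key)
        if match is not None:
            merged = {**match, **row}
            merged["GEOID"] = key  # keep the normalized identifier
            joined.append(merged)
    return joined


# Hypothetical stand-ins for the real files; column names other than GEOID are assumptions.
REPLICA_CSV = "GEOID,example_potential\n1001020100,12.5\n08013012100,7.0\n"
GEOMETRY_CSV = "GEOID,tract_name\n01001020100,Tract A\n08013012100,Tract B\n"

replica = list(csv.DictReader(io.StringIO(REPLICA_CSV)))
geometry = list(csv.DictReader(io.StringIO(GEOMETRY_CSV)))
joined = join_on_geoid(replica, geometry)

for row in joined:
    print(row["GEOID"], row["tract_name"], row["example_potential"])
```

Reading GEOID as text rather than as an integer avoids the problem in the first place; the `zfill` step is only a safeguard for files that have already been through a spreadsheet or a numeric parser.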
List and characteristics of the different public transcriptomic datasets from ovarian cancer used to establish the potential impact of CDR2 and CDR2L expression on overall survival. The Stage column is early/late/unknown. The Histology column is ser/clearcell/endo/mucinous/other/unknown.
The dataset contains information on empathic accuracy, personality, and health status of chronic pain patients and informal caregivers.
Context Indian Hindi Cinema, popularly known as Bollywood, has witnessed exponential growth in terms of volume of business, manpower employed, number of movies produced each year and also global reach. Hence, it could be of great commercial importance to develop a model which could predict the success of a movie before its release. However, it is not easy to forecast demand for a movie. There are a number of factors like Actors, Directors, Time of Release, Genre, Production house etc. which affect the outcome of a movie. The primary requirement to develop such a model would be the availability of Bollywood movie data. Thus, I created this dataset while working on my senior year research project, titled 'Predicting success of upcoming Bollywood movies'. Content The data has been created manually by visiting different websites, the primary ones being Wikipedia, boxofficeindia.com and IMDB. The data contains 1285 rows with movies released between 2001 and 2014. The hitFlop column contains values from 1 to 9: 1 - Disaster, 2 - Flop, 3 - Below Average, 4 - Average, 5 - Semi Hit, 6 - Hit, 7 - Super Hit, 8 - Blockbuster, 9 - All-Time Blockbuster. Acknowledgements Research Guide - Dr. S.K. Saha Inspiration Can we save the time and money wasted by movie viewers on viewing flop and disaster movies? Can we suggest must-watch movies to movie viewers even before the movies are released? Can we classify upcoming movies into 1 of 9 categories even before their release?
Context These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset.
The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages. This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website. Content This dataset consists of the following files: **movies_metadata.csv:** The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies. **keywords.csv:** Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object. **credits.csv:** Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object. **links.csv:** The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset. **links_small.csv:** Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset. **ratings_small.csv:** The subset of 100,000 ratings from 700 users on 9,000 movies. The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed [here](https://grouplens.org/datasets/movielens/latest/) Acknowledgements This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself [here](https://www.themoviedb.org/documentation/api). 
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available [here](https://grouplens.org/datasets/movielens/latest/) ![](https://www.themoviedb.org/assets/static_cache/9b3f9c24d9fd5f297ae433eb33d93514/images/v4/logos/408x161-powered-by-rectangle-green.png) Inspiration This dataset was assembled as part of my second Capstone Project for Springboard's [Data Science Career Track](https://www.springboard.com/workshops/data-science-career-track). I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems. Both my notebooks are available as kernels with this dataset: [The Story of Film](https://www.kaggle.com/rounakbanik/the-story-of-film) and [Movie Recommender Systems](https://www.kaggle.com/rounakbanik/movie-recommender-systems) Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
A synthetic dataset containing the fighting rate values for each of the familial lineages that were represented in the male combinations for every "all unrelated" vial; this was accomplished using our observations of males from the corresponding "males related" vials. The three values (corresponding to the three males in an unrelated trio) in this synthetic dataset were ranked from most aggressive to least aggressive. The fighting rate in the corresponding "all unrelated" vial is also included.
This folder contains datasets and codes used in this paper: ACTINN: Automated identification of Cell Types in Single Cell RNA Sequencing
This is a THz Security Image Dataset.
Possible research on this dataset may include the development of THz quality standards, the selection of the best display mode, the enhancement of images, the modeling of image noise, and the detection of prohibited goods (ground truth is included in the "data" file). If you have any questions, you can send a request to humenghan@sjtu.edu.cn. Please cite the following paper if you wish to use our dataset: Menghan Hu, Guangtao Zhai, Rong Xie, Xiongkuo Min, Qingli Li, Xiaokang Yang, Wenjun Zhang, "A Wavelet-Predominant Algorithm can Evaluate Quality of THz Security Image and Identify its Usability," IEEE Transactions on Broadcasting, 2019, accepted.
Brain, Object, Landscape Dataset. Vision science - particularly machine vision - is being revolutionized by large-scale datasets. State-of-the-art artificial vision models critically depend on large-scale datasets to achieve high performance. In contrast, although large-scale learning models (e.g., AlexNet) have been applied to human neuroimaging data, the stimuli for such neuroimaging experiments include significantly fewer images. The small size of these stimulus sets also translates to limited image diversity. Here we dramatically increase the stimulus set size deployed in an fMRI study of visual scene processing. We scanned four participants in a slow event-related design that incorporated 4,916 unique scenes. Data was collected over 16 sessions, 15 of which were task-related sessions, plus an additional session for acquiring high resolution anatomical scans. In 8 of the 15 task-related sessions, a functional localizer was run in order to independently define scene-selective cortex. In each scanning session, participants filled out a questionnaire (Daily Intake) about their daily routine, including: current status regarding food and beverage intake, sleep, exercise, ibuprofen, and comfort in the scanner.
During BOLD scanning, physiological data (heart rate and respiration) was also acquired. The experiment included 4,803 images presented on a single trial, 112 images repeated four times, and one image repeated three times, yielding a total of 5,254 stimulus trials. The stimuli were drawn from three datasets: 1) 1000 images from Scene Images (250 scene categories, based on SUN categories, with four exemplars each); 2) 2000 images from the COCO dataset; and 3) 1916 images from the ImageNet dataset. In the experiment, images were presented for 1 second, with 9 seconds of fixation between trials. Participants were asked to judge whether they liked, disliked, or were neutral about the image. In sum, our dataset is unique in three ways: it is 1) larger than existing slow event-related neural datasets by an order of magnitude, 2) extremely diverse in stimuli, 3) considerably overlapping with existing computer vision datasets. Our large-scale dataset enables novel neural network training and novel exploration of benchmark computer vision datasets through neuroscience. Finally, the scale advantage of our dataset and the use of a slow event-related design enable, for the first time, joint computer vision and fMRI analyses that span a significant and diverse region of image space using high-performing models. Please refer to our website for more details and future news and releases: BOLD5000.org. arXiv preprint in references below: https://arxiv.org/abs/1809.01281 v2: Added BOLD5000_ROIs.zip (9/7/18). v3: Added BOLD5000_MRI-Protocols.zip (9/11/18). v4: Added Austin Marcus as author and image stimuli files moved to a different location (see bold5000.org).
**Context:** Daily horse racing (thoroughbred) information that is being actively collected and aggregated from a variety of sources. The only year covered is 2016; country is irrelevant to the dataset.
**Acknowledgements:** This data is being actively collected and aggregated from a variety of sources, all in the public domain.
**Past Research:** None of merit; the data is currently used to influence some betting decisions, but no solid machine learning model(s) have been developed. I have thrown various versions of the data into:
- Google Prediction
- Amazon Machine Learning
- Azure Machine Learning
- Watson Analytics

as a way to learn how these systems work.
**Inspiration:** Probably one of the hardest things to do is pick stocks and horses. I have been involved in the stocks and horses industry for many years, and through publishing previous libraries and software I have met many interesting people and also one of my long term clients/friends. I am currently trying to enhance my software development skills by learning data science / machine learning. I have done a few tutorials and I am hoping that by publishing this data I can learn and collaborate with members of the Kaggle Community.
**Content:**

**markets.csv**
- id
- start_time - what time did the race start, datetime in UTC
- venue_id
- race_number
- distance(m)
- condition_id - track condition, see conditions.csv
- weather_id - weather on day, see weathers.csv
- total_pool_win_one - rough $ amount wagered across all runners for win market
- total_pool_place_one - rough $ amount wagered across all runners for place market
- total_pool_win_two
- total_pool_place_two
- total_pool_win_three
- total_pool_place_three

**runners.csv**
- id
- collected - what time was this row created/data collected, datetime in UTC
- market_id
- position - **THIS IS THE FIELD WE WANT TO PREDICT!!!!**
  - Will either be 1,2,3,4,5,6 etc or 0/null if the horse was scratched or failed to finish
  - If all positions for a market_id are null it means we were unable to match up the positional data for this market
- place_paid
  - Will either be 1/0 or null
  - If you see a race that only has 2 booleans of 1 it means that the race only paid out places on the first two positions
- margin
  - If the runner didn't win, how many lengths behind the 1st place was it
- horse_id - see horses.csv
- trainer_id
- rider_id - see riders.csv
- handicap_weight
- number
- barrier
- blinkers
- emergency - did it come into the race at the last minute
- form_rating_one
- form_rating_two
- form_rating_three
- last_five_starts
- favourite_odds_win - from one of the odds sources, will it win - true/false
- favourite_odds_place - from one of the odds sources, will it place - true/false
- favourite_pool_win
- favourite_pool_place
- tip_one_win - from a tipster, will it win - true/false
- tip_one_place - from a tipster, will it place - true/false
- tip_two_win
- tip_two_place
- tip_three_win
- tip_three_place
- tip_four_win
- tip_four_place
- tip_five_win
- tip_five_place
- tip_six_win
- tip_six_place
- tip_seven_win
- tip_seven_place
- tip_eight_win
- tip_eight_place
- tip_nine_win
- tip_nine_place

**odds.csv (collected for every runner 10 minutes out from race start until race starts)**
- runner_id
- collected - what time was this row created/data collected, datetime in UTC
- odds_one_win - from odds source, win odds
- odds_one_win_wagered - from odds source, rough $ amount wagered on win
- odds_one_place - from odds source, place odds
- odds_one_place_wagered - from odds source, rough $ amount wagered on place
- odds_two_win
- odds_two_win_wagered
- odds_two_place
- odds_two_place_wagered
- odds_three_win
- odds_three_win_wagered
- odds_three_place
- odds_three_place_wagered
- odds_four_win
- odds_four_win_wagered
- odds_four_place
- odds_four_place_wagered

**forms.csv**
- collected - what time was this row created/data collected, datetime in UTC
- market_id
- horse_id
- runner_number
- last_twenty_starts - `e.g. f9x726x753x92222x35`
  - f = failed to finish, a digit = finishing position (7 = finished 7th, 6 = finished 6th), x = runner was scratched
- class_level_id
  - 1 = eq (in same class as other horses)
  - 2 = up (up in class)
  - 3 = dn (down in class)
- field_strength
- days_since_last_run
- runs_since_spell
- overall_starts
- overall_wins
- overall_places
- track_starts
- track_wins
- track_places
- firm_starts
- firm_wins
- firm_places
- good_starts
- good_wins
- good_places
- dead_starts
- dead_wins
- dead_places
- slow_starts
- slow_wins
- slow_places
- soft_starts
- soft_wins
- soft_places
- heavy_starts
- heavy_wins
- heavy_places
- distance_starts
- distance_wins
- distance_places
- class_same_starts
- class_same_wins
- class_same_places
- class_stronger_starts
- class_stronger_wins
- class_stronger_places
- first_up_starts
- first_up_wins
- first_up_places
- second_up_starts
- second_up_wins
- second_up_places
- track_distance_starts
- track_distance_wins
- track_distance_places

**conditions.csv**
- id
- name

**weathers.csv**
- id
- name

**riders.csv (jockeys)**
- id
- sex

**horses.csv**
- id
- age
- sex_id - see horse_sexes.csv
- sire_id - not related to horses.id, there is another table called horse_sires that is not
present here - dam_id - not related to horses.id, there is another table called horse_dams that is not present here - prize_money - total aggregate prize money **horse_sexes.csv** - id - name The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research. Raw data for the 15/2 pollen surface sample dataset obtained from the Neotoma Paleoecological Database. Citizen science is a participatory research practice whereby members of the public contribute to research through sensing, collecting and analysing data. Often citizen science is facilitated by the internet and digital technology including apps and web based games. There are many examples of citizen science initiatives across many disciplines, including projects that address societal or environmental challenges. The rise of citizen science and the increasing use of interactive and emerging technologies to collect, analyse and share data presents new opportunities and challenges for researchers, their institutions and the public creators of these datasets. 
This presentation looks to the future of publicly engaged research practices and encourages speculation around the challenges, opportunities for impact and potential innovations when data becomes playable and social.

This dataset gives us information about the things people purchase when they go to a shop. The rows are the people who buy specific things when they go to a shop. By looking at the patterns in what they buy, we can learn how to rearrange the items in the shop so that people can find what they want more easily.

This dataset contains key characteristics about the data described in the Data Descriptor Reference gene set and small RNA set construction with multiple tissues from Davidia involucrata Baill. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder)

This is a THz Security Image Dataset. Possible research on this dataset may include the development of THz quality standards, the selection of the best display mode, the enhancement of images, the modeling of image noise, and the detection of prohibited goods (ground truth is included in the "data" file). If you have any questions, you can send a request to humenghan@sjtu.edu.cn. Please cite the following paper if you wish to use our dataset: Menghan Hu, Guangtao Zhai, Rong Xie, Xiongkuo Min, Qingli Li, Xiaokang Yang, Wenjun Zhang, "A Wavelet-Predominant Algorithm can Evaluate Quality of THz Security Image and Identify its Usability," IEEE Transactions on Broadcasting, 2019, accepted.

Orthologous groups predicted by OrthoMCL. Datasets included: 35 Sceloporus species, plus the Anolis carolinensis, Gallus gallus, and human proteins.

Context The city of Seattle makes available its database of pet licenses issued from 2005 to the beginning of 2017 as part of the city's ongoing [Open Data Initiative](https://data.seattle.gov/).
The data is also obtainable from the [Socrata Open Data Access (SODA)](https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb) portal in either CSV or JSON formats. It is also made available here (unofficially, I have no official affiliation with the city of Seattle or the Seattle Animal Shelter) to help spread awareness of the dataset and Seattle's Pet Licensing initiative. Content Seattle Pet Licenses Dataset The data set contains information on licenses issued as far back as 2005 to the end of January 2017. **Dataset Columns:** * License Issue Date: Floating Timestamp - Date and time of when the pet license was issued. * License Number: Integer - Unique ID for each issued license. * Animal's Name: String - Name of the licensed pet. * Species: String - Species of the licensed pet. Will be either 'Dog,' 'Cat,' or 'Livestock.' * Primary Breed: String - Primary breed of the licensed pet. * Secondary Breed: String - Secondary breed (if any) of the licensed pet. Washington Zip Codes Tax Returns by Income Bracket As part of an analysis done to see if there is a relationship between the volume of pet licenses and the affluence of the particular area, the data also includes the [Statistics of Income 2015 dataset](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi) that features the number of tax returns received by the IRS from each Seattle zip code broken out by several income brackets. The uploaded data represents a clean set of data for analysis use. Acknowledgements The Seattle Pet Licenses dataset was compiled by the City of Seattle Department of Finance and Administrative Services through the city of Seattle's Open Data initiative, and all credit goes to the original creators and maintainers of the data, the [Seattle Animal Shelter](http://www.seattle.gov/animalshelter). I am merely trying to make the data available to a broader audience to help spread awareness. 
The [Statistics of Income (SOI) dataset](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi) is owned and maintained by the IRS. The data presented here is a clean representation of the Washington Zip Code SOI 2015 dataset. Inspiration The dataset shows there were almost no pets licensed from 2005 up until mid-2014 when volume began rising drastically, for reasons as yet unknown (in that I wasn't able to find any sources mentioning any news that would cause such a significant increase). There also appears to be a massive disparity in the number of dogs licensed compared to cats, even though there are approximately 5 million more owned cats in the United States over dogs. Thus, I hope that by making this data more available, users who analyze the data can find insights and recommendations for the Seattle Animal Shelter to increase pet licensing numbers and help show pet owners who haven't licensed their pets why it is essential. Extra An [analysis of the Seattle Pet Licenses dataset](https://aaronschlegel.me/extract-analyze-seattle-pet-licenses-dataset.html) with Python can also be found on my website. About Seattle Pet Licenses The [city of Seattle requires pets over eight weeks old be licensed](https://library.municode.com/wa/seattle/codes/municipal_code?nodeId=TIT9AN_CH9.25ANCO_9.25.050ANLIPEGE). There are several benefits to [licensing one's pet](https://www.seattle.gov/animal-shelter/license), including a return ride home if your pet is lost, and easier contact from a veterinarian if your pet is unfortunately injured. If the licensing is performed at the Seattle Animal Shelter on the third Saturday of any given month, a free rabies vaccine is included, as well as other vaccines and a microchip for a small additional fee. The 100 trees used as phylogenetic hypotheses in this study were subsets of those developed by Jetz et al. (2012) and available through birdtree.org. 
Several species are represented by more than one record in this dataset, thus phylogeny tips for those taxa were transformed into multichotomies (i.e., polytomies) in each phylogeny. Here, 101 phylogenies are included. Models using the 74th phylogeny failed to converge so an additional phylogeny (i.e., phylogeny 101) was added to complete a set of 100 for the analyses. For all phylogenies the tip labels are coded to match the "record_id" field code in the corresponding dataset. The objective of the study is to provide global grids (0.5°) of revised annual coefficients for the Priestley-Taylor (P-T) and Hargreaves-Samani (H-S) evapotranspiration methods after calibration based on the ASCE (American Society of Civil Engineers)-standardized Penman-Monteith method (the ASCE method includes two reference crops: short-clipped grass and tall alfalfa). The analysis also includes the development of a global grid of revised annual coefficients for solar radiation (Rs) estimations using the respective Rs formula of H-S. The analysis was based on global gridded climatic data of the period 1950-2000. The method for deriving annual coefficients of the P-T and H-S methods was based on partial weighted averages (PWAs) of their mean monthly values. This method estimates the annual values considering the amplitude of the parameter under investigation (ETo and Rs) giving more weight to the monthly coefficients of the months with higher ETo values (or Rs values for the case of the H-S radiation formula). The method also eliminates the effect of unreasonably high or low monthly coefficients that may occur during periods where ETo and Rs fall below a specific threshold. The new coefficients were validated based on data from 140 stations located in various climatic zones of the USA and Australia with expanded observations up to 2016. 
The validation procedure for ETo estimations of the short reference crop showed that the P-T and H-S methods with the new revised coefficients outperformed the standard methods, reducing the estimated root mean square error (RMSE) in ETo values by 40 and 25 %, respectively. The estimations of Rs using the H-S formula with revised coefficients reduced the RMSE by 28 % in comparison to the standard H-S formula. Finally, a raster database was built consisting of (a) global maps for the mean monthly ETo values estimated by ASCE-standardized method for both reference crops, (b) global maps for the revised annual coefficients of the P-T and H-S evapotranspiration methods for both reference crops and a global map for the revised annual coefficient of the H-S radiation formula and (c) global maps that indicate the optimum locations for using the standard P-T and H-S methods and their possible annual errors based on reference values. The database can support estimations of ETo and solar radiation for locations where climatic data are limited and it can support studies which require such estimations on larger scales (e.g. country, continent, world). The datasets produced in this study are archived in the PANGAEA database (this data set) and in the ESRN database (http://www.esrn-database.org or http://esrn-database.weebly.com).

Context Disk space captured for several months for a set of Windows servers. Content Contents are the server name, disk drive, total disk space, free disk space and percentage of free space. Acknowledgements Thanks to Kaggle for providing this development environment. Inspiration My initial goal is to add a column with the moving average of free disk space (for 7 days), to be used for forecasting.

A set of preference judgements among generated random property pairs for 350 random Wikidata persons. For each (entity, property1, property2) record, 10 annotators judged which of the two properties is more interesting for the respective entity.
The goal is then to predict the annotator judgments as good as possible. Current state-of-the-art methods (Wikidata Property Suggester and others) achieve 61% precision in this task, while methods based on linguistic similarity get to 74%, still significantly below annotator agreement (87.5%). Further details are in the paper "Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties", ADMA 2017, available at http://www.simonrazniewski.com/2017_ADMA.pdf Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra. 
If a change of use for industrial land is proposed in the UK, there is usually a requirement to demonstrate that the change of use will not result in the land becoming Contaminated Land, as defined under the Environmental Protection Act 1990. Under certain circumstances, this demonstration can be made by showing that the mean concentration of contaminants of potential concern is below a suitable assessment level appropriate to the proposed new use. How much sampling effort is required for this purpose? Using a relatively large dataset for arsenic in soil, an approach is developed for determining the number of measurements required for a clearance investigation to demonstrate absence of contamination. The approach minimizes the expected financial loss, taking into account both the actual cost of investigation and the possible cost of incorrectly determining that contamination is still present and undertaking unnecessary remediation. Abstract probabilities are discussed in terms of money spent and money potentially saved.

Flowering time data from a 2003 greenhouse study performed at the Kellogg Biological Station near Kalamazoo, MI. Populations: 9. Individuals per population: 22-46. Total individuals in dataset: 306. These are also the parental plants from Sahli et al., (2008.

dataset 4, Pterostichus Melanarius (biological variable) x Corn (landscape variable)

Dataset for paired data using the Neurophysiology of Pain Questionnaire and the HC-PAIRS

NTU Dataset ReadMe file: We had to remove our data temporarily for privacy reasons.

This dataset summarizes the information for 11 journals, including the year for which the p-values of the published articles were extracted, the impact factor for the respective year and the acronym used in the R code for this study.

250 Benchmark queries for the DBpedia dataset

Content More details about each file are in the individual file descriptions.
Context This is a dataset hosted by the City of New York. The city has an open data platform found [here](https://opendata.cityofnewyork.us/) and they update their information according to the amount of data that is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York [organization page](https://www.kaggle.com/new-york-city)!

* Update Frequency: This dataset is updated annually.

Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. This dataset is distributed under the following licenses: Public Domain

Starting from the Verkehrsunfaelle_train dataset, your algorithm should be able to predict the severity (minor, serious, fatal) of a traffic accident. You also receive a second dataset (Verkehrsunfaelle_test.csv), which is used to validate the predictive performance of your algorithm. To do so, you apply your algorithm and submit the predictions in .csv format. The file must have exactly 2 columns and 1000 rows plus one header row containing Unfall_ID and Unfallschwere. The first column contains the accident ID in ascending order; the second contains the predicted accident severity (1 = minor, 2 = serious, 3 = fatal).

Sex classification dataset from [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifierSex_classification), for the purpose of Naive Bayes classifier demonstration.

Individual diagnosis history in the dataset cohort. This dataset supports the manuscript "WAHDA: a Data Source to Promote the Impact of Checkup in Better Quality of Care", submitted to Scientific Data.

Values of Jaccard dissimilarity based on incidence datasets measured within and between communities. Identical = 0; completely dissimilar = 1.
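The Jaccard dissimilarity scale described above (0 for identical communities, 1 for completely dissimilar ones) can be illustrated with a small sketch. It assumes each community is represented as a set of taxon names read from an incidence (presence/absence) table; the function name and the example taxa are hypothetical and are not part of the dataset.

```python
def jaccard_dissimilarity(community_a, community_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of taxa."""
    a, b = set(community_a), set(community_b)
    union = a | b
    if not union:
        # Assumption: two empty communities are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical incidence data for three sampling zones.
littoral = {"taxon_1", "taxon_2", "taxon_3"}
sublittoral = {"taxon_2", "taxon_3", "taxon_4"}
offshore = {"taxon_9"}

print(jaccard_dissimilarity(littoral, littoral))     # 0.0 (identical)
print(jaccard_dissimilarity(littoral, sublittoral))  # 0.5 (2 shared of 4 total)
print(jaccard_dissimilarity(littoral, offshore))     # 1.0 (nothing shared)
```

Because only presence or absence enters the calculation, abundance information would need a different index.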
Abbreviations: MT, morphotypes; EE, evolutionary independent entities obtained with GMYC; OTU, operational taxonomic units obtained with single individuals; eOTU, operational taxonomic units obtained with environmental samples; SV, sequence variants. L, littoral; SL, sublittoral; O, offshore. Dissimilarity values were estimated using both the focal phyla (SV; eOTU) and whole meiofauna dataset (SV2; eOTU2). The dataset consists of respiration and methane production rates and methane oxidation potential obtained from soil microcosm studies carried out under controlled temperature and incubation conditions. Soils cores collected in 2012 represent the flat- and high-centered polygon active layers and permafrost (when present) from the NGEE Arctic Intensive Study Site 1, Barrow, Alaska. Subgrid variability introduces non-negligible scale effects on the GIS-based representation of snow. This heterogeneity is even more evident in semiarid regions, where the high variability of the climate produces various accumulation melting cycles throughout the year and a large spatial heterogeneity of the snow cover. This variability in a watershed can often be represented by snow depletion curves (DCs). In this study, terrestrial photography (TP) of a cell-sized area (30 x 30 m) was used to define local snow DCs at a Mediterranean site. Snow cover fraction (SCF) and snow depth (h) values obtained with this technique constituted the two datasets used to define DCs. A flexible sigmoid function was selected to parameterize snow behaviour on this subgrid scale. It was then fitted to meet five different snow patterns in the control area: one for the accumulation phase and four for the melting phase in a cycle within the snow season. Each pattern was successfully associated with the snow conditions and previous evolution. 
The resulting DCs were able to capture certain physical features of the snow, which were used in a decision-tree and included in the point snow model formulated by Herrero et al. (2009). The final performance of this model was tested against field observations recorded over four hydrological years (2009-2013). The calibration and validation of this DC-snow model was found to have a high level of accuracy with global RMSE values of 84.2 mm for the average snow depth and 0.18 m²/m² for the snow cover fraction in the control area. The use of DCs on the cell scale proposed in this research provided a sound basis for the extension of point snow models to larger areas by means of a gridded distributed calculation.

Resolution: 512 × 512 × 244; unsigned short; 0.25 mm resolution. This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) at 110 kV and 10 mAs, with a 15.36-cm FOV.

Dataset for practice in data preprocessing and unsupervised learning in the Introduction to Bioinformatics course.

Using SVMs, KNN, and Random Forests on the MNIST dataset. I want to see which algorithm performs best.

Context The data is based on images I have taken with my Lytro Illum camera (https://pictures.lytro.com/ksmader); they have been exported as image data and depth maps. The idea is to make and build tools for looking at Lytro Image data and improving the results. Content The data are from the Lytro Illum and captured as 40MP images which are then converted to 5MP RGB+D images. All of the required data for several test images is provided. The second dataset comes from the Lenovo Phab2 (Project Tango), which utilizes dual image sensors to recreate point clouds of large 3D structures. These are provided as .ply and .obj datasets. Acknowledgements The data is based on images I have taken with my Lytro Illum camera (https://pictures.lytro.com/ksmader). Inspiration 1.
Build a neural network which automatically generates depth information from 2D RGB images 2. Build a tool to find gaps or holes in the depth images and fix them automatically 3. Build a neural network which can reconstruct 3D pixel data from RGBD images

This task is designed to test the differences between novices' and experts' relevance assessments. We employ the formulated queries obtained from the <em>query formulation</em> task to build a single system ranking of candidate relevant documents. Crowd workers were then provided with a medical case (one of 113 topics) and the list of the top 10 candidate relevant documents. For each query and its list of top 10 results, we obtained relevance judgements on a 3-point scale from 2 experts and 2 novices. More details of this task can be found in our paper in the references. Fields of the csv file:

- <strong>dataset</strong>: <em>CLEF_eHealth</em> or <em>OHSUMED</em>
- <strong>topic_id</strong>: ID of the topic (from 1 to 50 for CLEF_eHealth, from 50 to 113 for OHSUMED)
- <strong>answerer_type</strong>: <em>expert</em> or <em>novice</em>
- <strong>answerer_id</strong>: ID of crowd worker
- <strong>doc_id</strong>: ID of the candidate document in the dataset
- <strong>relevance_score</strong>: the relevance rating given by the crowd worker
- <strong>information_need</strong>: the clues about the desired content of relevant documents
- <strong>task_context</strong>: the medical case that triggers the information need

Degree centrality (DC) and local functional connectivity density (lFCD) are statistics calculated from brain connectivity graphs that measure how important a brain region is to the graph. DC (a.k.a. global functional connectivity density) is calculated as the number of connections a region has with the rest of the brain (binary DC), or the sum of weights for those connections (weighted DC). lFCD was developed to be a surrogate measure of DC that is faster to calculate by restricting its computation to regions that are spatially adjacent.
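The two degree centrality variants defined above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it takes a small, symmetric region-by-region correlation matrix and a fixed correlation threshold to decide which connections count, and the function name, threshold and matrix values are hypothetical. It is not the AFNI or C-PAC implementation, and it does not cover lFCD.

```python
import numpy as np


def degree_centrality(corr, threshold):
    """Return (binary_dc, weighted_dc) for a symmetric correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    # Keep only supra-threshold connections and ignore self-connections.
    mask = corr > threshold
    np.fill_diagonal(mask, False)
    binary_dc = mask.sum(axis=1)                          # number of connections
    weighted_dc = np.where(mask, corr, 0.0).sum(axis=1)   # sum of their weights
    return binary_dc, weighted_dc


# Hypothetical correlations between three regions.
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
])
binary_dc, weighted_dc = degree_centrality(corr, threshold=0.5)
print(binary_dc)    # [1 2 1]
print(weighted_dc)  # [0.8 1.4 0.6]
```

A dense matrix like this is only practical for a modest number of regions; voxel-wise data usually calls for blockwise computation.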
Although both of these measures are popular for investigating inter-individual variation in brain connectivity, efficient neuroimaging tools for computing them are scarce. The goal of this Brainhack project was to contribute optimized implementations of these algorithms to the widely used, open source, AFNI software package. Tools for calculating DC (3dDegreeCentrality) and lFCD (3dLFCD) were implemented by modifying the C source code of AFNI’s 3dAutoTcorrelate tool. 3dAutoTcorrelate calculates the voxel-by-voxel correlation matrix for a dataset and includes most of the functionality we require, including support for OpenMP multithreading to improve calculation time, the ability to restrict the calculation using a user-supplied or auto-calculated mask, and support for both Pearson’s and Spearman correlation. Outputs from the newly developed tools were benchmarked against Python implementations of these measures from the Configurable Pipeline for the Analysis of Connectomes (C-PAC) using the publicly shared Intrinsic Brain Activity Test-Retest (IBATRT) dataset from the Consortium for Reliability and Reproducibility.

Copyright information: Taken from "Megx.net—database resources for marine ecological genomics", Nucleic Acids Research 2005;34(Database issue):D390-D393. Published online 28 Dec 2005. PMCID: PMC1347433. © The Author 2006. Published by Oxford University Press. All rights reserved. Marine genomes and metagenomic fragments can be browsed and searched on a world map on our web-based system. An example showing a Geographic-BLAST search for genes encoding proteorhodopsins in the currently available dataset.

Translated homologous gene alignments from transcriptome data. There are two datasets with the partition files: the 70% complete supermatrix and the 80% complete supermatrix. See text for more details.

Raw data for the 15/1 pollen surface sample dataset obtained from the Neotoma Paleoecological Database.

Incidence dataset for Costa Rica.
This file must be located in the same folder as rcode.

Context This dataset was collected from the students of a prominent university in North India. It should be used to create the overall Institutional Report on the basis of student feedback data. Content This dataset comprises 6 categories: teaching, course content, examination, lab work, library facilities and extracurricular activities. Data for each category includes two columns, where each column can have any of three labels, i.e. 0 (neutral), 1 (positive) and -1 (negative). Acknowledgements I am thankful to the students of the institution for sharing their opinions, which helped me to create this dataset. Inspiration You should try to create the overall institutional report in all disciplines (categories) by analyzing the text-based responses using sentiment analysis methods.

Context Chat-80 was a natural language system which allowed the user to interrogate a Prolog knowledge base in the domain of world geography. It was developed in the early '80s by Warren and Pereira; see http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf for a description and http://www.cis.upenn.edu/~pereira/oldies.html for the source files. The canonical metadata on NLTK:

Context The SMS Spam Collection. Content Based on the text of an SMS message, we should predict whether it is spam or not. Acknowledgements Thanks to the Machine Learning Repository.

The E2E data is a new dataset for training end-to-end, data-driven natural language generation systems in the restaurant domain, which is ten times bigger than existing, frequently used datasets in this area (>5k distinct meaning representations with >50k corresponding natural language reference texts). The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection.

What influences love at first sight?
(Or, at least, love in the first four minutes?) This [dataset][1] was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper [Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment][2]. Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details. For more analysis from Iyengar and Fisman, read [Racial Preferences in Dating][3].

Data Exploration Ideas
----------------------

- What are the least desirable attributes in a male partner? Does this differ for female partners?
- How important do people think attractiveness is in potential mate selection vs. its real impact?
- Are shared interests more important than a shared racial background?
- Can people accurately predict their own perceived value in the dating market?
- In terms of getting a second date, is it better to be someone's first speed date of the night or their last?
[1]: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/
[2]: http://faculty.chicagobooth.edu/emir.kamenica/documents/genderDifferences.pdf
[3]: http://faculty.chicagobooth.edu/emir.kamenica/documents/racialpreferences.pdf

The attached file contains the minimal dataset for the above-mentioned study, including the data from the (1) preliminary studies (hemolysis and flow measurements), (2) combined laser Doppler flowmetry and remission spectroscopy (O2C), (3) rate of necrosis and (4) blood gas analysis. All files are Microsoft Excel sheets. A legend explaining all abbreviations used in the data sheets is attached to each file as a tab.

The customer segments data is included as a selection of 440 data points collected on data found from clients of a wholesale distributor in Lisbon, Portugal. More information can be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). Note (m.u.) is shorthand for *monetary units*.

**Features**

1) `Fresh`: annual spending (m.u.) on fresh products (Continuous);
2) `Milk`: annual spending (m.u.) on milk products (Continuous);
3) `Grocery`: annual spending (m.u.) on grocery products (Continuous);
4) `Frozen`: annual spending (m.u.) on frozen products (Continuous);
5) `Detergents_Paper`: annual spending (m.u.) on detergents and paper products (Continuous);
6) `Delicatessen`: annual spending (m.u.) on delicatessen products (Continuous);
7) `Channel`: {Hotel/Restaurant/Cafe - 1, Retail - 2} (Nominal)
8) `Region`: {Lisbon - 1, Oporto - 2, or Other - 3} (Nominal)

The dataset includes the model output data shown in Figs. 2-9, and S1-S3, together with the GMT scripts used to generate the plots.

Platforms (or applications) used to process RDF datasets.

NOTE: This dataset has been superseded by GPCP Version 2.3, which is available in RDA dataset ds728.4 [https://rda.ucar.edu/datasets/ds728.4/]. Users are advised to transition to this updated dataset.
This dataset contains Version 2.2 of the Global Precipitation Climatology Project (GPCP) combined satellite-gauge precipitation estimate and combined satellite-gauge error estimate. The data are monthly analyses defined on a global 2.5 degree by 2.5 degree longitude/latitude grid and cover the period January 1979 to (delayed) present. A monthly climatology (1979-2011) is also available. Please note that the original binary data were written using the big endian representation of unformatted binary words. Users reading this data on little endian platforms, therefore, will need to byte swap the data. The GPCP was established by the World Climate Research Program (WCRP) and subsequently attached to the Global Energy and Water Exchange program (GEWEX) to address the problem of quantifying the distribution of precipitation around the globe over many years. The general approach is to combine the precipitation information available from each of several sources into a final merged product, taking advantage of the strengths of each data type. The passive microwave estimates are based on Special Sensor Microwave/Imager (SSM/I) and Special Sensor Microwave Imager/Sounder (SSMIS) data from the series of Defense Meteorological Satellite Program (DMSP, United States) satellites that fly in sun-synchronous low-earth orbits at 6am / 6pm. The infrared precipitation estimates are computed primarily from geostationary satellites (United States, Europe, Japan), and secondarily from NOAA series polar-orbiting satellites (United States). Additional low-Earth orbit estimates include Atmospheric Infrared Sounder (AIRS) data from the NASA Aqua, and Television Infrared Observation Satellite Program (TIROS) Operational Vertical Sounder (TOVS) and Outgoing Longwave Radiation Precipitation Index (OPI) data from the NOAA series satellites. The precipitation gauge data are assembled and analyzed by the Global Precipitation Climatology Centre (GPCC) of the Deutscher Wetterdienst. 
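The byte-swapping note above can be illustrated with a minimal sketch. It assumes the binary words are 4-byte IEEE floats, which the description does not state, and the helper name `decode_big_endian_floats` is hypothetical. It covers only the byte-order handling, not the GPCP file layout or any record headers.

```python
import struct


def decode_big_endian_floats(raw: bytes) -> list:
    """Decode a buffer of big-endian 4-byte floats into Python floats.

    Assumption: each word is a 4-byte IEEE float; check the dataset
    documentation for the real word size and record structure.
    """
    count = len(raw) // 4
    # ">" forces big-endian decoding regardless of the host byte order.
    return list(struct.unpack(f">{count}f", raw[: count * 4]))


# Synthetic stand-in for bytes read from a big-endian file.
raw = struct.pack(">3f", 1.5, -2.0, 0.25)
values = decode_big_endian_floats(raw)

# Reading the same bytes as little-endian gives different numbers.
misread = list(struct.unpack("<3f", raw))
```

Declaring the byte order in the format string makes the same code behave identically on little-endian and big-endian hosts, so no separate swap step is needed.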
The Version 2.2 Data Set contains data from the following contributing centers:

* GPCP Polar Satellite Precipitation Data Centre - Emission (SSM/I and SSMIS emission estimates)
* GPCP Polar Satellite Precipitation Data Centre - Scattering (SSM/I and SSMIS scattering estimates)
* GPCP Geostationary Satellite Precipitation Data Centre (GPI and OPI estimates)
* NASA/GSFC Sounder Research Team (TOVS and AIRS estimates)
* GPCP Global Precipitation Climatology Centre (precipitation gauge analyses)

Request to users from the data authors: The GPCP datasets are developed and maintained with international cooperation and are used by the worldwide scientific community. To better understand the evolving requirements across the GPCP user community and to increase the utility of the GPCP product suite, the dataset authors request that a citation be provided for each publication that uses the GPCP products. Please email the citation to george.j.huffman@nasa.gov or david.t.bolvin@nasa.gov. Your help and cooperation will provide valuable information for making future enhancements to the GPCP product suite.

Context
This dataset contains CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) images. CAPTCHAs were built in 1997 as a way to identify and block bots (in order to prevent spam, DDoS attacks, etc.). They have since been replaced by reCAPTCHA because they can be broken using artificial intelligence (as I encourage you to do).

Content
The images are 5-letter words that can contain numbers. The images have had noise applied to them (blur and a line). They are 200 x 50 PNGs.

Acknowledgements
The dataset comes from [Wilhelmy, Rodrigo & Rosas, Horacio. (2013).
captcha dataset.][1]

[1]: https://www.researchgate.net/publication/248380891_captcha_dataset

Thumbnail image from [Accessibility of CAPTCHAs][2]

[2]: http://www.bespecular.com/blog/accessibility-of-captchas/

Inspiration
This dataset is a perfect opportunity to try building Optical Character Recognition algorithms.

Context
Coming soon

Content
Coming soon

Acknowledgements
Special thanks to: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

Inspiration
Coming soon

Complete dataset used in this article

Genotypes used to construct linkage maps used in paper. First sheet contains the data from the smolt dataset. Second sheet contains the data from the meristics dataset.

Context
The funniness of a joke is very subjective. With more than 70,000 users rating jokes, can an algorithm be written to identify the universally funny joke?

Content

- The data files are in **.csv** format.
- The complete dataset is 100 rows and 73422 columns.
- The complete dataset is split into 3 **.csv** files.
- **JokeText.csv** contains the Id of the joke and the complete joke string.
- **UserRatings1.csv** contains the ratings provided by the first 36710 users.
- **UserRatings2.csv** contains the ratings provided by the last 36711 users.
- The dataset is arranged such that the initial users have rated a higher number of jokes than the later users.
- The rating is a real value between **-10.0** and **+10.0**.
- The **empty values** indicate that the user has not provided any rating for that particular joke.

Acknowledgements
The dataset is associated with the research paper below. [Eigentaste: A Constant Time Collaborative Filtering Algorithm.](http://www.ieor.berkeley.edu/~goldberg/pubs/eigentaste.pdf) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
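The ratings layout described for the joke files (one row per joke, one column per user, empty cells for missing ratings) can be sketched as follows. This is a rough illustration on a tiny in-memory sample, not the real files: the header row, the `JokeId` column name, and the user column names are assumptions made for the example.

```python
import csv
import io

# Hypothetical miniature of a ratings file: first column is the joke id,
# remaining columns are users; an empty cell means "not rated".
SAMPLE = """JokeId,User1,User2,User3
0,5.1,,-8.79
1,,,3.2
"""


def load_ratings(handle):
    """Return (user_names, {joke_id: [rating or None, ...]})."""
    reader = csv.reader(handle)
    header = next(reader)
    ratings = {}
    for row in reader:
        joke_id = int(row[0])
        # Keep missing ratings as None instead of coercing them to 0.0,
        # since 0.0 is a valid rating on the -10.0 to +10.0 scale.
        ratings[joke_id] = [float(cell) if cell.strip() else None for cell in row[1:]]
    return header[1:], ratings


def mean_rating(values):
    """Average only the ratings that were actually provided."""
    rated = [v for v in values if v is not None]
    return sum(rated) / len(rated) if rated else None


users, ratings = load_ratings(io.StringIO(SAMPLE))
```

Treating empty cells as missing, not as zero, matters here because later users rated fewer jokes, so a zero-fill would pull their averages toward the middle of the scale.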
More information and datasets can be found at [http://eigentaste.berkeley.edu/dataset/](http://eigentaste.berkeley.edu/dataset/)

Inspiration
Since funniness is a very subjective matter, it will be very interesting to see if data science can bring out the details of what makes something funny.

Context
This dataset provides user vote data on which video from a pair of videos was funnier. YouTube Comedy Slam was a discovery experiment that ran on YouTube in 2011 and 2012. In the experiment, pairs of videos were shown to users, and the users voted for the video that they found funnier.

Content
The dataset includes roughly 1.7 million votes recorded chronologically. The first 80% are provided here as the training dataset and the remaining 20% as the testing dataset. Each row in this text file represents one anonymous user vote and has three comma-separated fields.

- The first two fields are YouTube video IDs.
- The third field is either 'left' or 'right'.
- Left indicates the first video from the pair was voted to be funnier than the second. Right indicates the opposite preference.

Acknowledgements
Sanketh Shetty, 'Quantifying comedy on YouTube: why the number of o's in your LOL matter,' Google Research Blog, [https://research.googleblog.com/2012/02/quantifying-comedy-on-youtube-why.html][1]. Dataset was downloaded from UCI ML repository: [https://archive.ics.uci.edu/ml/datasets/YouTube+Comedy+Slam+Preference+Data][2]

[1]: https://research.googleblog.com/2012/02/quantifying-comedy-on-youtube-why.html
[2]: https://archive.ics.uci.edu/ml/datasets/YouTube+Comedy+Slam+Preference+Data

Inspiration
Predict which videos are going to be funny!

Context
2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C. ...)
and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments. Acknowledgements Source: Marques de Sá, J.P., jpmdesa '@' gmail.com, Biomedical Engineering Institute, Porto, Portugal. Bernardes, J., joaobern '@' med.up.pt, Faculty of Medicine, University of Porto, Portugal. Ayres de Campos, D., sisporto '@' med.up.pt, Faculty of Medicine, University of Porto, Portugal. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. ECG-ViEW II sample dataset person table In this study, we asked crowd workers to generate IoT scenarios by showing them list of trigger (input) and action (output) devices. Each crowd worker created 3 scenarios. Currently, the following attributes are included in the dataset for each scenario: <i>Category, Trigger Devices and their Triggers, Action Devices and their Actions, word count, word per sentences, Long words, unique words, difficult words, Mean Originality, Mean Practicality, Mean Creativity, Sum Creativity, Creative (dichotomous: 0 or 1). and some worker related information such as: Has Smart home experience (Boolean)? 
Total Experience (Months), Gender, Age, Family Size, Programming Experience, choices of input and output devices.</i> The example dataset for use in the example analyses babies.txt bwt - birth weight in ounces (999 unknown) gestation - gestation days parity - 0 means first born age - mom age in years height - mom height in inches weight - mom pre-pregnancy weight in pounds smoke - mom smoke, 0 means no, 1 means yes, 9 means unknown babies23.txt id - id number pluralty - 5 means single fetus outcome - 1 for live birth that survived at least 28 days date - birth date 1096=January 1, 1961 (this might be a timestamp, not very sure) gestation - gestation days sex - infant sex, 1=male, 2=female, 9=unknown wt - birth weight in ounces parity - 0 means first born race - mom race, 0-5=white, 6=mex, 7=black, 8=asian, 9=mix, 99=unknown age - mom age in years ed - mom education, 0=(&lt;8), 1=(8-&lt;12), 2=12, 3=12+trade, 4=12+some college, 5=16, 7=trade (hs unclear), 9=unknown ht - mom height in inches wt - mom pre-pregnancy weight in pounds (notice that this column name will be renamed to wt.1, since there are two duplicate wt column names) drace - dad race dage - dad age ded - dad education dht - dad height dwt - dad weight marital - 1=married, 2-4=sep, div, wid, 5=never married, blank inc - total income in 2500 increments, 0=under 2500, 1=2500-4999, ..., 9=22500+, 98=unknown, 99=not asked smoke - mom smoke, 0=never, 1=yes now, 2=until pregnancy, 3=once did not now, 9=unknown time - how long ago quit, 0=never, 1=still, 2=during preg, 3=up to 1 yr, 4=up to 2 yr, 5=up to 3 yr, 6=up to 4 yr, 7=5 to 9 yr, 8=10+ yr, 9=quit and don't know, 98=unknown number - number of cigs smoke a day for past and current smokers, 0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19, 5=20-29, 6=30-39, 7=40-60, 8=60+, 9=smoke but don't know, 98=unknown Context We do have a dataset with given loans and its arrears rate which allow a supervised machine learning. 
Content
This is a dataset of given loans with their default rate.

Variation of hospital charges across hospitals in the US for the top 100 diagnoses. The dataset is owned by the US government. It is freely available on [data.gov](https://data.gov). The dataset is updated periodically [here](https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3). This dataset shows how the price for the same diagnosis and the same treatment in the same city can vary across providers. It might help you or your loved one find a better hospital for your treatment. You can also analyze it to detect fraud among providers.

Dataset used for the analyses

This dataset is a simplified version of the FocaLens dataset. Since the online space is limited, we had to resize the images to 480x320, which is the input image size in our proposed model. The full-size dataset will be published soon.

Context
Medium is one of the most famous tools for spreading knowledge about almost any field. It is widely used to publish articles on ML, AI, and data science. This dataset is a collection of about 350 articles in such fields.

Content
The dataset contains articles, their titles, the number of claps they have received, their links and their reading time.

Acknowledgements
This dataset was scraped from [Medium](https://medium.com/). I created a Python script to scrape all the required articles from Medium using just their tags. Check out the script [here](https://github.com/Hsankesara/medium-scrapper)

Inspiration
How to write a good article? How to inform the reader in an interesting way? What sort of title attracts a larger audience? How long should an article be?

Context
Find the best strategies to improve the next marketing campaign. How can the financial institution be more effective in future marketing campaigns?
In order to answer this, we have to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions in order to develop future strategies.

Source
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Context
[Emergent.info](http://www.emergent.info/) was a major rumor tracker, created by veteran journalist [Craig Silverman](https://twitter.com/CraigSilverman). It has been defunct for a while, but its well-structured format and well-documented content provide an opportunity for analyzing rumors on the web. [Snopes.com](http://www.snopes.com/) is one of the oldest rumor trackers on the web. Originally launched by Barbara and David Mikkelson, it is now run by a team of editors who investigate urban legends, myths, viral rumors and fake news. The investigators try to provide a detailed explanation for why they have chosen to confirm or debunk a rumor, often citing several web pages and other external sources. [Politifact.com](http://www.politifact.com/) is a fact-checker that is focused on statements made by politicians and claims circulated by political campaigns, blogs and similar websites. Politifact's labels range from "true," to "pants on fire!"

---

Content
This dataset consists of three files. The first is a collection of all webpages cited in Emergent.info, the second is a collection of webpages cited in Snopes.com, and the third is a similar collection from Politifact.com. The webpages were often cited because they had started a rumor, shared a rumor, or debunked a rumor.

Emergent.info
Emergent.info often provides a clean timeline of the rumor's propagation on the web, and identifies which page was for the rumor, which page was against it, and which page was simply observing it. Please refer to the image below to learn more about the fields in this dataset.
![The image displays a sample post from Emergent.info and highlights the corresponding fields in emergent.csv.][1] Snopes.com The structure of posts on **Snopes.com** is not as well-defined. Please refer to the image below to learn more about the fields in the Snopes dataset. ![This image displays a sample post from Snopes.com and highlights the corresponding fields in snopes.csv.][2] Politifact.com Similar to Emergent.info, Politifact.com follows a well-structured format in reporting and documenting rumors. There is a sidebar on the right side of each page that lists all of the sources cited within the page. The top link is the likeliest to be the original source of the rumor. For this link, page_is_first_citation is set to true. ![This image displays a sample post from Politifact.com and highlights the corresponding fields in politifact.csv.][3] --- Inspiration I created this dataset in order to study domains that frequently start, propagate, or debunk rumors. By studying these domains and people who follow them, I hope to gain some insight into the dynamics of rumor propagation on the web, as well as social media. --- Notes/Disclaimer When using the Snopes dataset, please keep the following in mind: * In addition to debunking rumors, Snopes.com occasionally reports news and other types of content. This collection only includes data from "[Fact Check](http://www.snopes.com/category/facts)" posts on Snopes. * Snopes.com was launched years ago. Some of the older posts on the website do not follow the current format of the site, therefore some of the fields might be missing. * Snopes.com used to use a service named "[DoNotLink.com](https://twitter.com/donotlink?lang=en)" for citation purposes. That service is no longer active and as a result some of the links are missing from older posts on Snopes. * In addition, some of the shortened links would time-out prior to resolution, in which case they would not be added to the dataset. 
* Occasionally, a website that has been cited has not maliciously started a rumor. For instance, Andy Borowitz is a humorist who writes for *The New Yorker*. His satirical column is sometimes mistaken for real news; as a result, *The New Yorker* may be cited as a source of fake news on [Snopes.com](http://www.snopes.com/trump-blasts-media-for-reporting-things-he-says/). This does not mean that *The New Yorker* is a fake news website. When using the Politifact dataset, please keep the following in mind: * The data included in this dataset are collected from the "[truth-o-meter](http://www.politifact.com/punditfact/statements/)" page of Politifact.com. * Politifact often fact-checks statements made by politicians. Since this dataset is focused on websites, I have ignored all the posts in which the rumor was attributed to a person, a political party, a campaign, or an organization. Instead, I have only included rumors attributed explicitly to websites or blogs. --- Useful Tips for Using the Snopes collection As opposed to the Emergent collection where each page is flagged with whether it was for or against a rumor, no such information is available for the Snopes dataset. To avoid manually labeling the data, you may use the following heuristics to identify which page started a rumor: * Webpages that are cited in the "Examples" section of a post are often "observing" the rumor, i.e. they have not started it, but they are repeating it. In the snopes.csv file, these webpages have been flagged as "page_is_example." * Webpages that are cited in the "Featured Image" section of a post are often not related to the rumor. The editors on Snopes have simply extracted an image from those pages to embed in their posts. In the snopes.csv file, these webpages have been flagged as "page_is_image_credit." * Webpages that are cited through a secondary service (such as [archive.is](http://archive.is/)) are likelier to be rumor-propagators. 
Editors do not link to them directly so that a record of their page is available, even if it is later deleted.
* If none of these hints helps, very often (but not always) the first link cited on the page (for which "page_is_example" and "page_is_image_credit" are false) is the link to a page that started the rumor. This link is identified by the "page_is_first_citation" field. Pages for which both "page_is_first_citation" and "page_is_archived" are true are very likely to be rumor propagators.
* To identify satirical websites that are mistaken for real news, it's useful to inspect the way they are cited on Snopes. To demonstrate that a website contains satire or humor, Snopes writers often cite the "about us" page of the site. Therefore it's useful to see which domains often contain a URI to their "about" page (e.g. "http://politicops.com/about-us/").

[1]: http://imgur.com/JZPExar.png
[2]: http://i.imgur.com/jFT6Vdb.png
[3]: http://i.imgur.com/Z83JP7c.png

History
I have made a database of photos sorted by products and brands. Screenshots were taken only on official brand websites.

Content
The main dataset (style.zip) is 2184 color images (150x150x3) with 7 brands and 10 products, and the file with labels `style.csv`. Photo files are in the `.png` format and the labels are integers and values. The file `StyleColorImages.h5` consists of preprocessed images of this set: image tensors and targets (labels).

Acknowledgements
I have published the data for free use by any site visitor. But this database contains the names of famous brands, so it cannot be used for commercial purposes.

Usage
Classification, image recognition, colorizing, etc. with a small number of images are useful exercises. The main question we can try to answer with the help of the data is whether the algorithms can recognize a unique design style well enough. To facilitate the task, I chose the most easily recognizable brands with a bright style.
[The example of usage](https://github.com/OlgaBelitskaya/deep_learning_projects/blob/master/DL_PP4/DL_PP4_Solutions.ipynb)

Improvement
There are many ways to improve this set and the machine learning algorithms applied to it. First, the number of photos needs to increase.

150 normal and 134 nodule images in the dataset.

This isn't a dataset, it is a collection of kernels written on Kaggle that use no data at all.

This dataset provides 8-day global gross primary production (GPP) at 0.05° latitude by 0.05° longitude for 1982-2017.

(1) Model description and performance
The GPP dataset was generated by the revised EC-LUE model by integrating the regulations of several major environmental variables: atmospheric CO2 concentration, radiation components (i.e., direct and diffuse radiation), and atmospheric vapor pressure deficit (VPD). The revised EC-LUE performed well in simulating the spatial, seasonal, and interannual variations in global GPP. In particular, it is especially good at reproducing the interannual variations in GPP at both site and global scales.

(2) Dataset information
Each .zip file contains all the 8-day GPP values of a year, expressed as daily values. To obtain the total for each 8-day (or 5-day or 6-day) period, multiply the GPP value by the corresponding number of days (8 for the first 45 values, and 5 or 6 for the last value).
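The conversion from stored daily values to period totals for the GPP dataset can be sketched as below. This is a minimal illustration under stated assumptions: the stored values are taken to be integers that use the fill value (65535) and scale factor (0.01) given in the dataset's format notes, and the last period is assumed to be 6 days long in leap years and 5 days otherwise. The function names are hypothetical.

```python
import calendar

FILL_VALUE = 65535    # from the dataset's format notes
SCALE_FACTOR = 0.01   # from the dataset's format notes


def period_lengths(year):
    """Days covered by each of the 46 values in one year.

    Assumption: 45 periods of 8 days, then 5 days (or 6 in a leap year).
    """
    last = 6 if calendar.isleap(year) else 5
    return [8] * 45 + [last]


def period_totals(raw_values, year):
    """Convert stored daily GPP values into per-period totals (g C m^-2).

    Fill values are returned as None so they are not mistaken for data.
    """
    lengths = period_lengths(year)
    if len(raw_values) != len(lengths):
        raise ValueError("expected 46 values for one year")
    totals = []
    for raw, days in zip(raw_values, lengths):
        if raw == FILL_VALUE:
            totals.append(None)
        else:
            totals.append(raw * SCALE_FACTOR * days)
    return totals


# Synthetic pixel: 2.5 g C m^-2 day^-1 all year, with one missing period.
pixel = [250] * 46
pixel[10] = FILL_VALUE
totals_2001 = period_totals(pixel, 2001)
```

Summing the non-missing entries of `totals_2001` then gives an annual total for the pixel, with gaps left explicit.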
Data format: HDF
Spatial extent: 90S-90N 180W-180E
Fill value: 65535
Scale factor: 0.01
Unit: g C m<sup>-2</sup> day<sup>-1</sup>

Questions about the GPP dataset can be sent to yuanwpcn@126.com (Wenping Yuan).

This file contains GIS information on the aggregated distribution of ecosystem services (ES) over the study area (French Alps). Sixteen ES were included as binary datasets (presence/absence, with the threshold at the third quartile) to calculate the number of ES present at a resolution of 1 x 1 km.

This presentation, part of the RDS-in-Flight series, explores one of the prime motivators for sharing data openly, namely the prospect of its reuse by other researchers. Various open data platforms store, describe and cite their datasets differently, with a range of practices around providing data citations, licensing, and the use of metadata schemas. This presentation, aimed at librarians, focuses on various open data repositories, aggregators, and data tools and how they can be used to find, gather, and cite open data for future studies, and how to search them for open data sets relevant to specific disciplines. The presentation also describes how to create a collection on ZivaHub, UCT's institutional data repository, and how to link open datasets to theses/dissertations or other kinds of research outputs using a data availability statement.

This task is designed to study the impact of domain expertise on query formulation. The crowd workers, divided into two types (experts and novices in the medical domain), were asked to build an appropriate query for the search task using a pair of facets: (1) search context and (2) information need. A total of 113 topics (from 2 datasets, CLEFeHealth and OHSUMED) were submitted to 3 experts and 3 novices for query generation; therefore 6 queries were formulated for each topic. More details of this task can be found in our paper in the references.
Fields of the csv file:
<strong>dataset</strong>: <em>CLEF_eHealth</em> or <em>OHSUMED</em>
<strong>topic_id</strong>: ID of the topic (from 1 to 50 for CLEF_eHealth, from 50 to 113 for OHSUMED)
<strong>answerer_type</strong>: <em>expert</em> or <em>novice</em>
<strong>answerer_id</strong>: ID of crowd worker
<strong>composed_query</strong>: query formulated by the crowd worker
<strong>information_need</strong>: the clues about the desired content of relevant documents
<strong>task_context</strong>: the medical case that triggers the information need

Dataset of the chromosome number polymorphism in the Asteraceae family.

Context
ISCO is a tool for organizing jobs into a clearly defined set of groups according to the tasks and duties undertaken in the job.

Content
Occupational codes broken down by Major, Sub-Major, Minor and Unit groups.

Acknowledgements
The International Labour Organization - [http://www.ilo.org/][1]

Inspiration
A simple breakdown of the occupational codes provided by the ILO. Country-specific information to be added soon.

[1]: http://www.ilo.org/

Data description: This product provides annual (1985-2015) 30-m vegetation phenology (i.e., start of season, SOS; end of season, EOS) in urban areas of the conterminous United States, including:

(1) Information about urban clusters
*** uCluster_USA_gt500.zip (format: ESRI shapefile): bounding box of each urban cluster with the cluster ID and cityName. The ESRI file can be opened by many open-source software packages (e.g., QGIS)
*** US_uCluster_UrbanRuralExtents.zip: spatial extent of urban clusters (‘US_uCluster_label.tif’) and urban and surrounding rural clusters (‘US_uCluster_label_withRural.tif’).

(2) Phenology dataset
Each zip file includes three phenology datasets:
*** annual SOS from 1985-2015 for each urban cluster
*** annual EOS from 1985-2015 for each urban cluster
*** COR: correlation of fitted double logistic curve to the observed EVIS for each pixel.
The COR should be divided by 10000, and it serves as an uncertainty layer to indicate the fitting performance of the double logistic model. Questions about this data can be sent to Prof. Yuyu Zhou (zhouyuyu@gmail.com)

We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.

Despite the fact that extensive lists of linked open datasets are available in catalogues, most data publishers still connect their datasets to the most popular ones, such as DBpedia, Freebase and Geonames. Although linking to popular datasets allows us to explore external resources, it fails to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of update periodicity, authors, SPARQL endpoints, and linksets, amongst others, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process. Searching techniques can rank available datasets according to the likelihood that links can be found between them and a given target dataset, so that most of the links, if not all, could be found by inspecting the most relevant datasets in the ranking. This dataset contains dataset descriptions using the VoID vocabulary for supporting the evaluation of searching techniques.
The descriptions of each dataset include their linksets, classes, properties and topic categories, which were harvested from the Datahub catalogue, dataset dumps, void files and the DBpedia knowledge graph. DBpedia Spotlight allowed the detection of named entities in textual literals and thereafter of the DBpedia topic categories of each entity, which were taken as the topic categories of the datasets containing the entities.

Context
This data is a representation of a minutia in a 16x16 image, flattened into a 256-element vector.

Arabic Handwritten Characters Dataset

Abstract
Handwritten Arabic character recognition systems face several challenges, including the unlimited variation in human handwriting and large public databases. In this work, we model a deep learning architecture that can be effectively applied to recognizing Arabic handwritten characters. A Convolutional Neural Network (CNN) is a special type of feed-forward multilayer network trained in supervised mode. The CNN was trained and tested on our database, which contains 16,800 handwritten Arabic characters. In this paper, optimization methods are implemented to increase the performance of the CNN. Common machine learning methods usually apply a combination of a feature extractor and a trainable classifier. The use of a CNN leads to significant improvements across different machine-learning classification algorithms. Our proposed CNN gives an average 5.1% misclassification error on testing data.
Context
The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten character recognition. In recent years, Arabic handwritten character recognition has also had to handle different handwriting styles, making it important to find and work on new and advanced solutions for handwriting recognition. Deep learning systems need a large number of images to be able to make good decisions.

Content
The dataset is composed of **16,800** characters written by 60 participants; the age range is 19 to 40 years, and 90% of participants are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms as shown in Fig. 7(a) & 7(b). The forms were scanned at a resolution of 300 dpi. Each block is segmented automatically using Matlab 2016a to determine the coordinates of each block. The database is partitioned into two sets: a training set (13,440 characters, 480 images per class) and a test set (3,360 characters, 120 images per class). The writers of the training set and the test set are exclusive. The order in which writers were added to the test set was randomized to make sure that the writers of the test set are not from a single institution (to ensure variability of the test set). In the experimental section we showed that the results were promising, with a **94.9%** classification accuracy rate on testing images. In future work, we plan to work on improving the performance of handwritten Arabic character recognition.

Acknowledgements
Ahmed El-Sawy, **Mohamed Loey**, Hazem EL-Bakry, **Arabic Handwritten Characters Recognition using Convolutional Neural Network**, WSEAS, 2017. Our proposed CNN gives an average **5.1%** misclassification error on testing data.

Inspiration
Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, and the number and position of dots.
Some characters have different shapes while written in the same position. For example, the teh character has different shapes in the isolated position.

Benha University http://bu.edu.eg/staff/mloey https://mloey.github.io/

The dataset was taken from the [HackerEarth deep learning competition][1].

[1]: https://www.hackerearth.com/challenge/competitive/deep-learning-beginner-challenge/machine-learning/predict-the-energy-used-612632a9-9de79188

The data was collected in October and November 2017.

Copyright information: Taken from "WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data", BMC Bioinformatics 2006;7():30-30. Published online 19 Jan 2006. PMCID: PMC1388242. Copyright © 2006 Yi et al; licensee BioMed Central Ltd. The PSCP file for the "Cholesterol Synthesis" pathway, analyzed for the data from 11 different microarray datasets or CRI files representing a time course experiment (see for description of material and data preparation). Pooled hepatic mRNA was isolated from female wild type mice sacrificed at the indicated time points during fetal and post-natal development. The time point at 9 days before birth was used as the reference level of mRNA. "Day-5" and "Day-3" indicate 5 days or 3 days prior to birth, respectively.

Editing in progress.

Context
Malicious websites are of great concern because it is impractical to analyze them one by one and to index each URL in a blacklist. Unfortunately, there is a lack of datasets with malicious and benign web characteristics. This dataset is a research product of my bachelor's students, which aims to fill this gap. *This is the first dataset version obtained from our web security project; we are working to improve its results.*

Content
The project consisted of evaluating different classification models to predict malicious and benign websites, based on application layer and network characteristics.
The data were obtained by using different verified sources of benign and malicious URLs, in a low-interaction client honeypot to isolate network traffic. We used additional tools to get other information, such as the server country with Whois. This is the first version and we have some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data processing and the data description can be found in the article below.

URL Dataset
This is an important topic and one of the most difficult things to process. Following other articles and another open resource, we used three blacklists:
+ machinelearning.inginf.units.it/data-andtools/hidden-fraudulent-urls-dataset
+ malwaredomainlist.com
+ zeuztacker.abuse.ch

From them we got around 185181 URLs. We assumed that they were malicious according to their information; as a next research step we recommend verifying them through another security tool, such as VirusTotal. We got the benign URLs (345000) from https://github.com/faizann24/Using-machinelearning-to-detect-malicious-URLs.git; as in the previous step, a verification process through other security systems is also recommended.

Framework
First we wrote different scripts in Python in order to systematically analyze and generate the information of each URL (**In the coming months we will release them to the open source community on GitHub**). We then verified that each URL was available through Python libraries (such as request); we started with around 530181 samples, but as a result of this step the samples were filtered and we got 63191 URLs.

![Framework to detect malicious websites][1]

Feature generator: During the research process we found that one way to study a malicious website was the analysis of features from its application layer and network layer; in order to get them, the idea is to apply dynamic and static analysis.
In the dynamic analysis some articles used high-interaction web application honeypots, but these resources have not been updated in recent months, so some important vulnerabilities may not have been mapped.

Data Description
+ URL: the anonymous identifier of the URL analyzed in the study
+ URL_LENGTH: the number of characters in the URL
+ NUMBER_SPECIAL_CHARACTERS: the number of special characters identified in the URL, such as “/”, “%”, “”, “&”, “. “, “=”
+ CHARSET: a categorical value giving the character encoding standard (also called character set)
+ SERVER: a categorical value giving the operating system of the server, obtained from the packet response
+ CONTENT_LENGTH: represents the content size of the HTTP header
+ WHOIS_COUNTRY: a categorical variable whose values are the countries we got from the server response (specifically, our script used the Whois API)
+ WHOIS_STATEPRO: a categorical variable whose values are the states we got from the server response (specifically, our script used the Whois API)
+ WHOIS_REGDATE: Whois provides the server registration date, so this variable has date values in the format DD/MM/YYYY HH:MM
+ WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through Whois
+ TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client
+ DIST_REMOTE_TCP_PORT: the number of ports detected that are different from TCP
+ REMOTE_IPS: the total number of IPs connected to the honeypot
+ APP_BYTES: the number of bytes transferred
+ SOURCE_APP_PACKETS: packets sent from the honeypot to the server
+ REMOTE_APP_PACKETS: packets received from the server
+ APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server
+ DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server
+ TYPE: a categorical variable whose values represent the type of web page analyzed; specifically, 1 is for malicious websites and 0 is for benign websites

Conclusions and future works

Acknowledgements
If your papers or other works use our dataset, please cite our paper: Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings. Vol 1950, 14-17. If you need a review article on the state of the art of website cybersecurity (in English and Spanish): Urcuqui, C., Peña, M. G., Quintero, J. L. O., & Cadavid, A. N. (2017). Antidefacement.
Sistemas & Telemática, 14(39), 9-27. If you have any question or feedback, please contact me: ccurcuqui@icesi.edu.co. Thank you for your comments; your feedback is very important for our future work. - deardle GitHub https://github.com/urcuqui/WhiteHat/tree/master/Research/Web%20security

[1]: https://github.com/urcuqui/WhiteHat/blob/master/Research/Web%20security/frameworks/framework%20to%20detect%20malicious%20websites.jpg

Download the dataset from our project page: https://github.com/wsdream/wsdream-dataset

Non-redundant dataset of ncRNA from ovary, pituitary and hypothalamus in fasta format. (TXT 64614 kb)

Comprehensive hydrometeorological dataset collected at experimental farms in the University of Melbourne's Dookie Campus.

This dataset was scraped from the Goodreads website. It has the ratings of the first 99 users of the website. The Bookfeatures csv contains the features of the books read and rated by these users.

Transect dataset. This Excel document lists all transects, their location, how much they were walked, and the number of flocks seen on them. The number of flocks includes incomplete flocks and is divided into Orange-billed Babbler (OBBA) led flocks and all other flocks, which are DISTANCE-adjusted to determine flock density.

Fungi spectral dataset in MATLAB format. Please cite:
- Costa FSL, Silva PP, Morais CLM, et al. Attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy as a new technology for discrimination between Cryptococcus neoformans and Cryptococcus gattii. Anal Methods 2016; 8: 7107–7115.
- Morais CLM, Costa FSL, Lima KMG. Variable selection with a support vector machine for discriminating Cryptococcus fungal species based on ATR FTIR spectroscopy. Anal Methods 2017; 9: 2964–2970.

Context
Is the movie industry dying? Is Netflix the new entertainment king? Those were the first questions that led me to create a dataset focused on movie revenue and analyze it over the last decades. But, why stop there?
There are more factors that intervene in this kind of thing, like actors, genres, user ratings and more. And now, anyone with experience (you) can ask specific questions about the movie industry, and get answers. Content There are 6820 movies in the dataset (220 movies per year, 1986-2016). Each movie has the following attributes: - budget: the budget of a movie. Some movies don't have this, so it appears as 0 - company: the production company - country: country of origin - director: the director - genre: main genre of the movie. - gross: revenue of the movie - name: name of the movie - rating: rating of the movie (R, PG, etc.) - released: release date (YYYY-MM-DD) - runtime: duration of the movie - score: IMDb user rating - votes: number of user votes - star: main actor/actress - writer: writer of the movie - year: year of release Acknowledgements This data was scraped from IMDb. Contribute You can contribute via [GitHub](https://github.com/Juanets/movie-stats). A large scale dataset for complex Question Answering. This is primary and secondary event record file of LSDO dataset. Please see reference link to get access to full dataset with solar images. Talk given during the "Harmonise This! Analyzing Diverse Neuroimaging Datasets" workshop at the 2015 Organization for Human Brain Mapping (OHBM) conference in Hawaii, 14-18 June. NEXUS file containing the 'concatenated mtDNA' dataset alignment, which includes 171 mtDNA subsamples sequenced for the mitochondrial cytochrome b and/or cytochrome oxidase 1 genes. The below information is from the project page: https://nlp.stanford.edu/projects/glove/ Context GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Content Due to size constraints, only the 25 dimension version is uploaded. 
Please visit the project page for GloVe of other dimensions. This dataset (https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation) contains GloVe extracted from Wikipedia 2014 + Gigaword 5. 1. Nearest neighbors The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. 2. Linear substructures The similarity metrics used for nearest neighbor evaluations produce a single scalar that quantifies the relatedness of two words. This simplicity can be problematic since two given words almost always exhibit more intricate relationships than can be captured by a single number. For example, man may be regarded as similar to woman in that both words describe human beings; on the other hand, the two words are often considered opposites since they highlight a primary axis along which humans differ from one another. In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words. Acknowledgements Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Inspiration The dataset specifically includes tokens extracted from Twitter, which unlike tokens from Wikipedia, include many abbreviations that have interesting content. resolution: 512*512*230; unsigned short; a 0.3mm resolution.This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) 
at 110 kV and 10 mAs, with a 15.36-cm FOV.

The file contains a correlation analysis of the skill requirements for software testers. The dataset comes from 400 job advertisements. We use the file to look for correlated skills, in our quest to find whether preset profiles of software testers are emerging from the demands formulated by employers at hiring.

This dataset presents the first global fuel map, containing all the parameters required as input to the Fuel Characteristic Classification System (FCCS). The dataset was developed from different spatial variables, based on both satellite Earth observation products and fuel databases, and comprises a global fuelbed map and a database that includes the parameters of each fuelbed that affect fire behavior and effects. A total of 274 fuelbeds were created and parameterized, and can be input into FCCS to obtain fire potentials, surface fire behavior and carbon biomass for each fuelbed. The global fuel dataset can be used for a varied range of applications, including fire danger assessment, fire behavior estimations, fuel consumption calculations and emissions inventories.

Description
The dataset here is a sample of the transactions made in a retail store. The store wants to better understand customer purchase behaviour across different products. Specifically, the problem here is a regression problem in which we try to predict the dependent variable (the amount of purchase) with the help of the information contained in the other variables. Classification problems can also be posed on this dataset, since several variables are categorical; other approaches could be "Predicting the age of the consumer" or even "Predict the category of goods bought". This dataset is also particularly convenient for clustering, to find different clusters of consumers within it.

Acknowledgements
The dataset comes from a competition hosted by Analytics Vidhya.
This failure dataset contains the injected faults, the workload, the effects of failure (both the user-side impact and our own in-depth correctness checks), and the error logs produced by the OpenStack cloud management system. Please refer to the paper "Empirical analysis of software failures in the OpenStack cloud computing platform" (ESEC/FSE '19).

Context
The key to success in any organization is attracting and retaining top talent. I’m an HR analyst at my company, and one of my tasks is to determine which factors keep employees at my company and which prompt others to leave. I need to know what factors I can change to prevent the loss of good people. Watson Analytics is going to help.

Content
I have data about past and current employees in a spreadsheet on my desktop. It has various data points on our employees, but I’m most interested in whether they’re still with my company or whether they’ve gone to work somewhere else. And I want to understand how this relates to workforce attrition.

**Education**
1 'Below College'
2 'College'
3 'Bachelor'
4 'Master'
5 'Doctor'

**EnvironmentSatisfaction**
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

**JobInvolvement**
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

**JobSatisfaction**
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

**PerformanceRating**
1 'Low'
2 'Good'
3 'Excellent'
4 'Outstanding'

**RelationshipSatisfaction**
1 'Low'
2 'Medium'
3 'High'
4 'Very High'

**WorkLifeBalance**
1 'Bad'
2 'Good'
3 'Better'
4 'Best'

Acknowledgements
https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-for-hr-retaining-valuable-employees/

Inspiration
Which factors led to employee attrition?

EmBeD, Energy-based anomaly detector in the cloud, is an approach to detect anomalies at runtime based on the free energy of a Restricted Boltzmann Machine (RBM) model. The free energy is a stochastic function that can be used to efficiently score anomalies for detecting outliers.
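To make the free-energy score concrete, here is a minimal sketch of the textbook free energy of a binary RBM, F(v) = -a·v - Σ_j log(1 + exp(b_j + Σ_i v_i W_ij)). This is an assumption for illustration only: the description does not say which RBM variant EmBeD uses, and the function names, weights and biases below are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def free_energy(v, W, a, b):
    """Free energy of visible vector v for a binary RBM.

    v: visible values, W[i][j]: weight between visible i and hidden j,
    a: visible biases, b: hidden biases (all hypothetical parameters).
    """
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = 0.0
    for j, bj in enumerate(b):
        activation = bj + sum(vi * W[i][j] for i, vi in enumerate(v))
        hidden_term += softplus(activation)
    return -visible_term - hidden_term


# Toy parameters: 2 visible units, 2 hidden units.
W = [[1.0, -1.0], [0.5, 0.5]]
a = [0.2, 0.3]
b = [0.0, 0.0]
print(free_energy([1, 0], W, a, b))
```

Because the model probability of v is proportional to exp(-F(v)), a sample with a higher free energy is less likely under the trained model, which is why such a score can serve as an outlier measure.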
EmBeD analyzes the system behavior from raw metric data, does not require extensive training with seeded faults, and classifies the relation of anomalous behaviors with future failures with very few false positives. The file data.zip contains the dataset used for validating EmBeD.

The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Content
- Each year contains a few hundred columns. Please see one of the [annual code books][1] for complete details.
- These CSV files were converted from a SAS data format using pandas; there may be some data artifacts as a result.
- If you like this dataset, you might also like the data for 2001-2010.

Acknowledgements
This dataset was released by the CDC. You can find the original dataset and [additional years of data here][2].
[1]: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf
[2]: https://www.cdc.gov/brfss/annual_data/annual_data.htm

Context
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." Zalando seeks to replace the original MNIST dataset.

Content
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (see Labels), and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image. To locate a pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix. For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top.
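The index arithmetic described above can be sketched in a few lines of Python. The helper names `pixel_position` and `split_row` are hypothetical and are not part of any Fashion-MNIST tooling; the sketch only assumes the layout stated in the text (label first, then 784 pixel values in row-major order).

```python
from typing import List, Sequence, Tuple

WIDTH = 28  # images are 28 x 28 pixels


def pixel_position(x: int) -> Tuple[int, int]:
    """Return the zero-based (row, column) of flattened pixel index x."""
    if not 0 <= x < WIDTH * WIDTH:
        raise ValueError("pixel index must be between 0 and 783")
    return divmod(x, WIDTH)  # x = row * 28 + column


def split_row(values: Sequence[int]) -> Tuple[int, List[List[int]]]:
    """Split one 785-value CSV row into its label and a 28 x 28 grid."""
    label, pixels = values[0], list(values[1:])
    grid = [pixels[r * WIDTH:(r + 1) * WIDTH] for r in range(WIDTH)]
    return label, grid


# pixel31 sits on the second row and in the fourth column (zero-based 1, 3).
print(pixel_position(31))  # (1, 3)
```

Using `divmod` keeps the row and column consistent with the decomposition x = i * 28 + j, so `grid[i][j]` of a split row always holds the value of pixel x.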
Labels
Each training and test example is assigned to one of the following labels:
0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

TL;DR
Each row is a separate image.
Column 1 is the class label.
Remaining columns are pixel numbers (784 total).
Each value is the darkness of the pixel (0 to 255).

Acknowledgements
Original dataset was downloaded from https://github.com/zalandoresearch/fashion-mnist
Dataset was converted to CSV with this script: https://pjreddie.com/projects/mnist-in-csv/

License
The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Context
Simultaneous tracking of multiple people is still a very challenging computer vision problem. This is especially true for sports activities, for which people often wear similar uniforms, move quickly and erratically, and have close interactions with each other.
This dataset is captured with thermal cameras, which enables easier segmentation and ensures privacy of people in public facilities, but at the same time we are left with no distinct appearance information to guide our tracking algorithms.

Content
This dataset contains four 30-second video sequences of eight people playing soccer in an indoor arena (court size 40*20 metres). The video is captured by thermal cameras of type AXIS Q1922 with a resolution of 640*480 pixels and 25 fps. The three images are stitched to one image of 1920*480 pixels. The videos are manually annotated for tracking.

Acknowledgements
Gade, R. & Moeslund, T.B.: Constrained multi-target tracking for team sports activities. IPSJ Transactions on Computer Vision and Applications (2018) 10: 2. https://doi.org/10.1186/s41074-017-0038-z

Supplementary Data 7. Dataset and best tree obtained considering the 9 configurations. The script used to run the analysis in TNT (land_searches.run) is also included.

Context
The no-show problem is one of the biggest in the health industry: about 30% of patients miss their appointments.

Content
61K data points, from 2017.01.01 to 2017.04.30, and 19 features to work with.

Data Dictionary
1. especialidad : the kind of specialist the patient is going to see, e.g. dermatologist
2. edad : age
3. sexo : sex, 1: Male, 2: Female
4. reserva_mes_d : discrete value for the month of the appointment, 1: Jan, 2: Feb...
5. reserva_mes_c : continuous value for the month of the appointment, the formula is COS(2*reserva_mes_d*Pi/12)
6. reserva_dia_d : day of the week for the appointment, 1: Mon... 7: Sun
7. reserva_dia_c : continuous value for the day of the week, the formula is COS(2*reserva_dia_d*Pi/7)
8. reserva_hora_d : discrete value for the hour of the appointment
9. reserva_hora_c : continuous value for the hour of the appointment, the formula is COS(2*reserva_hora_d*Pi/24)
10. creacion_mes_d : discrete value for the month when the appointment was created
11.
creacion_mes_c : continuous value for the month when the appointment was created, the formula is COS(2*creacion_mes_d*Pi/12)
12. creacion_dia_d : same as reserva_dia_d, but considering the day when the appointment was created
13. creacion_dia_c : same as reserva_dia_c, but considering the day when the appointment was created
14. creacion_hora_d : hour when the appointment was created
15. creacion_hora_c : continuous value for creacion_hora_d, the formula is COS(2*creacion_hora_d*Pi/24)
16. latencia : number of days between the appointment and the date when it was created
17. canal : channel used for the creation of the appointment, 1: call center, 2: Personal, 3: Web
18. tipo : type of appointment, 1: medical, 2: procedures
19. show : 0: no show, 1: show

Inspiration
Can we use it to predict whether a patient is going to show up for an appointment?

Acknowledgements
The datasets were downloaded from mortality.org at http://www.mortality.org/cgi-bin/hmd/country.php?cntr=CHE&level=1

Context
I am a beginner in the data science world and decided to work on Natural Language Processing questions, so I decided to build my own dataset by collecting SMS spam.

Content
This dataset has 2 columns: Label (ham or spam) and Message, which is simply the full message.

Acknowledgements
I guess AirDroid; otherwise it would have been pretty tedious to type out everything.

The Lahman Baseball Database
2012 Version
Release Date: December 31, 2012
----------
0.1 Copyright Notice & Limited Use License
This database is copyright 1996-2013 by Sean Lahman.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/ For licensing information or further information, contact Sean Lahman at: seanlahman@gmail.com
----------------------------------------------------------------------
0.2 Contact Information
Web site: http://www.baseball1.com
E-Mail : seanlahman@gmail.com
If you're interested in contributing to the maintenance of this database or making suggestions for improvement, please consider joining our mailing list at: http://groups.yahoo.com/group/baseball-databank/ If you are interested in similar databases for other sports, please visit the Open Source Sports website at http://OpenSourceSports.com
----------------------------------------------------------------------
1.1 Introduction
This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2012. It includes data from the two current leagues (American and National), the four other "major" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. This database was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. What started as a one man effort in 1994 has grown tremendously, and now a team of researchers have collected their efforts to make this the largest and most accurate source for baseball statistics available anywhere. (See Acknowledgements below for a list of the key contributors to this project.) None of what we have done would have been possible without the pioneering work of Hy Turkin, S.C. Thompson, David Neft, and Pete Palmer (among others). All baseball fans owe a debt of gratitude to the people who have worked so hard to build the tremendous set of data that we have today.
Our thanks also to the many members of the Society for American Baseball Research who have helped us over the years. We strongly urge you to support and join their efforts. Please visit their website (www.sabr.org). This database can never take the place of a good reference book like The Baseball Encyclopedia. But it will enable people to do the kind of queries and analysis that those traditional sources don't allow. If you have any problems or find any errors, please let us know. Any feedback is appreciated.
----------------------------------------------------------------------
1.2 What's New in 2012
There has been significant cleanup in the master file.
MLB's addition of wildcard games in 2012 adds two new types of records to the post-season files. The abbreviations ALWC and NLWC are used to denote each league's wild card game.
Added the MLB "Comeback Player of the Year" award to the awards table.
Florida Marlins changed their name to the Miami Marlins, new team abbr is MIA.
----------------------------------------------------------------------
1.3 Acknowledgements
Much of the raw data contained in this database comes from the work of Pete Palmer, the legendary statistician, who has had a hand in most of the baseball encyclopedias published since 1974. He is largely responsible for bringing the batting, pitching, and fielding data out of the dark ages and into the computer era. Without him, none of this would be possible. For more on Pete's work, please read his own account at: http://sabr.org/cmsfiles/PalmerDatabaseHistory.pdf Two people have been key contributors to the work that followed, first by taking the raw data and creating a relational database, and later by extending the database to make it more accessible to researchers. Sean Lahman launched the Baseball Archive's website back before most people had heard of the world wide web. Frustrated by the lack of sports data available, he led the effort to build a baseball database that everyone could use.
Baseball researchers everywhere owe him a debt of gratitude. Lahman served as an associate editor for three editions of Total Baseball and contributed to five editions of The ESPN Baseball Encyclopedia. He has also been active in developing databases for other sports. The work of Sean Forman to create and maintain an online encyclopedia at "baseball-reference.com" has been remarkable. Recognized as the premier online reference source, Forman's site provides an outstanding interface to the raw data. His efforts to help streamline the database have been extremely helpful. Most importantly, Forman has spearheaded the effort to provide standards that enable several different baseball databases to be used together. He was also instrumental in launching the Baseball Databank, a forum for researchers to gather and share their work. Since 2001, these two Seans have led a group of researchers who volunteered to maintain and update the database. A handful of researchers have made substantial contributions to maintain this database in recent years. Listed alphabetically, they are: Derek Adair, Mike Crain, Kevin Johnson, Rod Nelson, Tom Tango, and Paul Wendt. These folks did much of the heavy lifting, and are largely responsible for the improvements made in the last decade. Others who made important contributions include: Dvd Avins, Clifford Blau, Bill Burgess, Clem Comly, Jeff Burk, Randy Cox, Mitch Dickerman, Paul DuBois, Mike Emeigh, F.X. Flinn, Bill Hickman, Jerry Hoffman, Dan Holmes, Micke Hovmoller, Peter Kreutzer, Danile Levine, Bruce Macleod, Ken Matinale, Michael Mavrogiannis, Cliff Otto, Alberto Perdomo, Dave Quinn, John Rickert, Tom Ruane, Theron Skyles, Hans Van Slootenm, Michael Westbay, and Rob Wood. Many other people have made significant contributions to the database over the years. The contribution of Tom Ruane's effort to the overall quality of the underlying data has been tremendous.
His work at retrosheet.org integrates the yearly data with the day-by-day data, creating a reference source of startling depth. It is unlikely that any individual has contributed as much to the field of baseball research in the past five years as Ruane has. Sean Holtz helped with a major overhaul and redesign before the 2000 season. Keith Woolner was instrumental in helping turn a huge collection of stats into a relational database in the mid-1990s. Clifford Otto & Ted Nye also helped provide guidance to the early versions. Lee Sinnis, John Northey & Erik Greenwood helped supply key pieces of data. Many others have written in with corrections and suggestions that made each subsequent version even better than what preceded it. The work of the SABR Baseball Records Committee, led by Lyle Spatz, has been invaluable. So has the work of Bill Carle and the SABR Biographical Committee. David Vincent, keeper of the Home Run Log and other bits of hard to find info, has always been helpful. The recent addition of colleges to player bios is the result of much research by members of SABR's Collegiate Baseball committee. Salary data has been supplied by Doug Pappas, who passed away during the summer of 2004. He was the leading authority on many subjects, most significantly the financial history of Major League Baseball. We are grateful that he allowed us to include some of the data he compiled. His work has been continued by the SABR Business of Baseball committee. Thanks is also due to the staff at the National Baseball Library in Cooperstown who have been so helpful -- Tim Wiles, Jim Gates, Bruce Markusen, and the rest of the staff. A special debt of gratitude is owed to Dave Smith and the folks at Retrosheet. There is no other group working so hard to compile and share baseball data. Their website (www.retrosheet.org) will give you a taste of the wealth of information Dave and the gang have collected.
The 2012 database benefited from the work of Ted Turocy and his Chadwick Baseball Bureau. For more details on his tools and services, visit: http://chadwick.sourceforge.net/doc/index.html

Thanks to all contributors great and small. What you have created is a wonderful thing.

2.0 Data Tables

The design follows these general principles. Each player is assigned a unique number (playerID). All of the information relating to that player is tagged with his playerID. The playerIDs are linked to names and birthdates in the MASTER table.

The database is comprised of the following main tables:

MASTER - Player names, DOB, and biographical info
Batting - batting statistics
Pitching - pitching statistics
Fielding - fielding statistics

It is supplemented by these tables:

AllStarFull - All-Star appearances
Hall of Fame - Hall of Fame voting data
Managers - managerial statistics
Teams - yearly stats and standings
BattingPost - post-season batting statistics
PitchingPost - post-season pitching statistics
TeamFranchises - franchise information
FieldingOF - outfield position data
FieldingPost - post-season fielding data
ManagersHalf - split season data for managers
TeamsHalf - split season data for teams
Salaries - player salary data
SeriesPost - post-season series information
AwardsManagers - awards won by managers
AwardsPlayers - awards won by players
AwardsShareManagers - award voting for manager awards
AwardsSharePlayers - award voting for player awards
Appearances
Schools
SchoolsPlayers

Piezoelectric tensor data.

Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page.

Dataset is described in the following: De Jong M, Chen W, Geerlings H, Asta M, Persson K (2015) A database to enable discovery and design of piezoelectric materials. Scientific Data 2: 150053.
https://doi.org/10.1038/sdata.2015.53

Data adapted from JSON files available here: De Jong M, Chen W, Geerlings H, Asta M, Persson K (2015) Data from: A database to enable discovery and design of piezoelectric materials. Dryad Digital Repository. https://doi.org/10.5061/dryad.n63m4

Data analysis software and canonical datasets are the driving force behind many fields of empirical sciences. Despite being of paramount importance, those resources are most often not adequately cited. Although some can consider this a “social” problem, its roots are technical: Users of those resources often are simply not aware of the underlying computational libraries and methods they have been using in their research projects. This in turn fosters inefficient practices that encourage the development of new projects, instead of contributing to existing established ones. Some projects (e.g. FSL) facilitate citation of the utilized methods, but such efforts are not uniform, and the output is rarely in commonly used citation formats (e.g. BibTeX). DueCredit is a simple framework to embed information about publications or other references within the original code or dataset descriptors. References are automatically reported to the user whenever a given functionality or dataset is being used. DueCredit is currently available for Python, but we envision extending support to other frameworks (e.g., Matlab, R). Until DueCredit gets adopted natively by the projects, it provides the functionality to “inject” references for 3rd party modules. For the developer, DueCredit implements a decorator @due.dcite that links a method or class to a set of references that can be specified through a DOI or BibTeX entry. The initial release of DueCredit (0.1.0) was implemented during the OHBM 2015 hackathon, uploaded to PyPI, and is freely available. DueCredit provides a concise API to associate a publication reference with any given module or function.
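The decorator idea described above can be illustrated with a small standard-library sketch. This is not DueCredit's implementation or API: the names `cite` and `report`, the in-memory registries, and the second DOI are hypothetical and exist only to show how a decorator can attach a reference to a function and report it only when that function is actually used. The first DOI is the piezoelectric database paper cited earlier in this section.

```python
import functools
from collections import defaultdict

# Hypothetical in-memory registries (not part of DueCredit).
_references = defaultdict(set)  # function key -> set of reference strings
_used = set()                   # keys of functions that were actually called


def cite(reference):
    """Attach a reference (e.g. a DOI string) to the decorated function."""
    def decorator(func):
        key = f"{func.__module__}.{func.__qualname__}"
        _references[key].add(reference)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Record usage at call time, so unused code is never reported.
            _used.add(key)
            return func(*args, **kwargs)

        return wrapper
    return decorator


def report():
    """Return the sorted references of every function that has been called."""
    return sorted({ref for key in _used for ref in _references[key]})


@cite("doi:10.1038/sdata.2015.53")
def load_piezoelectric_data():
    return "loaded"


@cite("doi:10.0000/hypothetical-unused")  # placeholder DOI for illustration
def never_called():
    return None
```

Calling `load_piezoelectric_data()` and then `report()` would list only the first reference, which mirrors the behaviour the text describes: references are reported for functionality that is being used, not for everything that happens to be importable.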
DueCredit comes with a simple demo code, which demonstrates its utility. DueCredit is in its early stages of development, but two days of team development at the OHBM hackathon were sufficient to establish a usable prototype implementation. Since then, the code-base was further improved and multiple beta-releases followed, expanding the coverage of citable resources (e.g., within scipy, sklearn modules via injections and PyMVPA natively).

The file contains a correlation analysis of the skill requirements for software testers. The dataset comes from 400 job advertisements. We use the file to look for correlated skills, in our quest to find out whether there are preset profiles of software testers emerging from the demands formulated by employers at hiring.

Background NLP is a hot topic currently! Team AI really wants to leverage the NLP research, and this is an attempt for all the NLP researchers to explore exciting insights from bilingual data. “The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles” aims mainly at supporting research and development relevant to high-performance multilingual machine translation, information extraction, and other language processing technologies. Unique Features A precise and large-scale corpus containing about 500,000 pairs of manually-translated sentences. Can be exploited for research and development of high-performance multilingual machine translation, information extraction, and so on. The three-step translation process (primary translation -> secondary translation to improve fluency -> final check for technical terms) has been clearly recorded. Enables observation of how translations have been elaborated so it can be applied for uses such as research and development relevant to translation aids and error analysis of human translation. Translated articles concern Kyoto and other topics such as traditional Japanese culture, religion, and history.
Can also be utilized for tourist information translation or to create glossaries for travel guides. The Japanese-English Bilingual Kyoto Lexicon is also available. This lexicon was created by extracting the Japanese-English word pairs from this corpus. Sample One Wikipedia article is stored as one XML file in this corpus, and the corpus contains 14,111 files in total. The following is a short quotation from a corpus file titled “Ryoan-ji Temple”. Each tag has different implications. For example: ` Values of richness estimation (underlined) obtained using Chao and Jackknife algorithms on morphotypes (MT) and sequence variants (SV). The richness is estimated for each phylum and for the total sampled area. Actual richness identified with each method is shown for a direct comparison. Species were estimated using both the focal phyla (SV) and whole meiofauna dataset (SV2). “*” indicates phyla which richness estimated using metabarcoding is lower than number of actual morphotypes identified with morphological taxonomy. Context Dummy data to demo matplotlib Content 43 CSV rows of sales (qty and price) of 5 products in 3 regions by 11 reps Acknowledgements https://www.wintellect.com https://www.superdatascience.com/ Inspiration Thanks! Dataset I used for the analyses of DNA barcoding gaps, species identification efficiency, sequence length and GC content Resized and compressed CelebA dataset by http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html Project: Cloudbase: satellite-derived cloud base heights - The Cloudbase project aims to provide datasets of satellite-derived cloud base heights and their uncertainties. Project website: https://home.uni-leipzig.de/~jmuelmen/projects/ and https://home.uni-leipzig.de/~jmuelmen/projects/precerf.html This research was funded by the European Union under ERC Starting Grant QUAERERE, grant agreement 306284, and by the United States National Science Foundation under grant agreements AGS-1013423 and AGS-1048995. 
Summary: Attenuated backscatter profiles from the CALIOP satellite lidar are used to estimate cloud base heights of lower-troposphere liquid clouds (cloud base height below approximately 3 km). Even when clouds are thick enough to attenuate the lidar beam (optical thickness > 5), the technique provides cloud base heights by treating the cloud base height of nearby thinner clouds as representative of the surrounding cloud field. Using ground-based ceilometer data, uncertainty estimates for the cloud base height product at retrieval resolution are derived as a function of various properties of the CALIOP lidar profiles. Evaluation of the predicted cloud base heights and their predicted uncertainty using a second, statistically independent, ceilometer dataset shows that cloud base heights and uncertainties are biased by less than 10%. CBASE provides two files for each CALIOP VFM input file: one using a 40 km window to detect the cloud field base height, and one using a 100 km window. (The input CALIOP VFM dataset is organized by the daytime/nighttime half of each orbit.) The file name pattern is CBASE_T.nc (identical to the input CALIOP VFM file name with the exception of the product name). Files are organized into subdirectories by half-orbit start date.

The purpose of this Brainhack project was to create a simple application, with the least dependencies, for anonymization of DICOM files directly on a workstation. Anonymization of DICOM datasets is a requirement before an imaging study can be uploaded to a web-based database system, such as LORIS. Currently, a simple and efficient interface for the anonymization of such imaging datasets, which works on all operating systems and is very light in terms of dependencies, is not available. Here, we created a DICOM anonymizer that is a simple graphical tool that uses the PyDICOM package to anonymize DICOM datasets easily on any operating system, with no dependencies except for the default Python and NumPy packages.
DICOM anonymizer is available for all UNIX systems (including Mac OS) and can be easily installed on Windows computers as well (see PyDICOM installation). The GUI (using tkinter) and the processing pipeline were designed in Python. Executing the anonymizer_gui.py script with a Python interpreter will start the program. Figure 1 illustrates how to use the program to anonymize a DICOM study. The DICOM anonymizer is a simple standalone graphical tool that facilitates anonymization of DICOM datasets on any operating system. These anonymized studies can be uploaded to a web-based database system, such as LORIS, without compromising the patient or participant’s identity.

Population genetic dataset input file (119 individuals, 2,111 SNPs)

Copyright information: Taken from "From co-expression to co-regulation: how many microarray experiments do we need?" Genome Biology 2004;5(7):R48-R48. Published online 28 Jun 2004. PMCID: PMC463312. Copyright © 2004 Yeung et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. We applied different clustering algorithms to cluster the genes in yeast microarray datasets with different sizes to identify co-expressed genes. The level of co-regulation is evaluated using yeast transcription factor databases (SCPD and YPD) and ChIP data. The clustering results are then evaluated by determining the fraction of gene pairs from the same clusters that share at least one known common transcription factor.

Mean body sizes, approximated from the natural logarithm of the lower first or second molar area, and the proposed evolutionary relationships between mammalian genera from the middle and late Clarkforkian (Cf2 to Cf3) of the Bighorn and Clarks Fork Basins, Wyoming, USA. For details of the dataset see caption for electronic supplementary material, dataset S1.
The dataset includes fMRI raw data related to the paper entitled Lerner, Scherf, Katkov, Hasson and Behrmann (2018). Age-related differences in reliability of cortical activity under naturalistic viewing conditions. Dataset used for the article Factors associated with Intrauterine Growth Restriction in Zimbabwean women: A Secondary Data Analysis Project: WOCE-Argo Global Hydrographic Climatology - The WAGHC is a full-depth one-fourth degree resolution temperature and salinity climatology describing the mean state of the World Ocean between 1996 and 2011, with monthly gridded fields available between the surface and 1778 m depth. World Ocean Database 2013 (Locarnini et al., 2013) provided the majority of the temperature and salinity profiles, whereas data from the Alfred-Wegener-Institute, Bremerhaven, and from several Canadian Institutes helped to improve the data basis for the North Polar region considerably. A rigorous data quality control procedure was applied to the original profiles to exclude erroneous and highly untypical data. The spatial interpolation of the quality-controlled data was performed both on isobaric and isopycnal levels, so that essentially two climatologies are available. The isopycnally-averaged climatology mimics the process of isopycnal mixing in the real ocean and therefore is less prone to the production of artificial water masses. The WAGHC climatology represents the update of the WOCE Global Hydrographic Climatology (Gouretski and Koltermann, 2004). The name of the new climatology was chosen to highlight both the outstanding role of the WOCE hydrographic data for the historical global hydrographic archive and the importance of the more recent data from the Argo floats. Web-link: http://icdc.cen.uni-hamburg.de/1/daten/ocean/waghc/ References: Locarnini, R. A., A. V. Mishonov, J. I. Antonov, T. P. Boyer, H. E. Garcia, O. K. Baranova, M. M. Zweng, C. R. Paver, J. R. Reagan, D. R. Johnson, M. Hamilton, D. 
Seidov (2013) World Ocean Atlas 2013, Volume 1: Temperature. S. Levitus, Ed., A. Mishonov Technical Ed.; NOAA Atlas NESDIS 73, 40 pp. Gouretski, V., Koltermann, K. (2004) WOCE Global Hydrographic Climatology, Berichte des BSH, 35, 52 pp., ISSN: 0946-6010. Funder: The work was conducted as part of the Excellence Initiative CLISAP at the Universität Hamburg, funded through the German Science Foundation (Grant EXC 177/2) Summary: The WOCE/ARGO Global Hydrographic Climatology (WAGHC) is conceived as the update of the previous WOCE Global Hydrographic Climatology (WGHC) (Gouretski and Koltermann, 2004). The following improvements have been made compared to the WGHC: 2) finer spatial resolution (0.25 degrees Lat/Lon compared to 0.5 degrees for WGHC); 3) finer vertical resolution (65 compared to 45 WGHC standard levels); 4) monthly temporal resolution compared to the all-data-mean WGHC parameters; 5) narrower overall time period; 6) calculation of the mean year corresponding to the optimally interpolated temperature and salinity values; 7) depth of the upper mixed layer. Similar to the WGHC, the optimal spatial interpolation is performed on the local isopycnal surfaces. This approach diminishes the production of artificial water masses. In addition to the isopycnally interpolated parameters, parameter values interpolated on the isobaric levels are also provided. The monthly gridded vertical profiles extend to the depth of 1898 m; below that depth only annual mean parameter values are available. Additionally, there is a dataset and a map available providing indexes for selected regions of the world ocean. Finally, a comparison with the last update of the NOAA World Ocean Atlas (Locarnini et al., 2013) was done.

This dataset contains: gene expression values, physiological parameters, Q_values, Minisatellites length, MHCIIb alleles, parasitological parameters.

Reconstructed slices of a 3D tomographic dataset after ring artifacts suppression using the wavelet-FFT-based method.
This dataset is supplementary to the article of Scherler et al. (submitted), in which the global distribution of supraglacial debris cover is mapped and analyzed. For mapping supraglacial debris cover, we combined glacier outlines from the Randolph Glacier Inventory (RGI) version 6.0 (RGI consortium, 2017) with remote sensing-based ice and snow identification. Areas that belong to glaciers but that are neither ice nor snow were classified as debris cover. This dataset contains the outlines of the mapped debris-covered glaciers areas, stored in shapefiles (.shp). For creating this dataset, we used optical satellite data from Landsat 8 (for the time period 2013-2017), and from Sentinel-2A/B (2015-2017). For the ice and snow identification, we used three different algorithms: a red to short-wavelength infrared (swir) band ratio (RATIO; Hall et al., 1988), the normalized difference snow index (NDSI; Dozier, 1989), and linear spectral unmixing-derived fractional debris cover (FDC; e.g., Keshava and Mustard, 2002). For a detailed description of the debris-cover mapping and an analysis of the data, please see Scherler et al. (2019) to which these data are supplementary material. This dataset includes debris cover outlines based on either Landsat 8 (LS8; 30-m resolution) or Sentinel 2 (S2; 10-m resolution), and the three algorithms RATIO, NDSI, FDC. In total, there exist six different zip-files that each contain 19 shapefiles. The structure of the shapefiles follows that of the RGI version 6.0 (RGI consortium, 2017), with one shapefile for each RGI region. The original RGI shapefiles provide each glacier as one entry (feature) and include a variety of ancillary information, such as area, slope, aspect (RGI Consortium 2017a, Technical Note p. 12ff). 
Because the debris-cover outlines are based on the RGI v6.0 glacier outlines, all fields of the original shapefiles, which refer to the glacier, are retained, and expanded with four new fields:
- DC_Area: Debris-covered area in m². Note that this unit for area is different from the unit used for reporting the glacier area (km²).
- DC_BgnDate: Start of the time period from which satellite imagery was used to map debris cover.
- DC_EndDate: End of the time period from which satellite imagery was used to map debris cover.
- DC_CTSmean: Mean number of observations (CTS = COUNTS) per pixel and glacier. This number is derived from the number of available satellite images for the respective time period, reduced by filtering pixels due to cloud and snow cover.
The dataset has a global extent and covers all of the glaciers in the RGI v. 6.0, but it exhibits poor coverage in the RGI region Subantarctic and Antarctic, where the debris cover extents are based on very few observations.

Datasets obtained from the simulator of gametic phase disequilibrium between two loci + R script for producing the figures presented in the paper

This presentation describes our current research to a lay audience. It describes the National Health and Nutrition Examination Survey (NHANES) and our use of this publicly available dataset for the automatic discovery of associations between study abstracts and variables in the NHANES. This approach can be generalized to other scientific domains to gain insight into published literature.

Inspiration What did we all upload to Kaggle, actually? And how did the community respond? We can find out by looking at this dataset of the datasets. Content This dataset is in CSV format, where each column is a feature or attribute of a dataset on Kaggle (e.g. tags, filetype, no. of Kernels, etc.) and each row is a dataset on Kaggle. Acknowledgements Thanks Kaggle for the super easy API endpoint design!
Context The vast majority of food and food ingredients eaten today is processed in some way before they arrived at the kitchen or dinner table. Food processing equipment may leave trace amounts of various industrial chemical compounds in the foods we eat, and these chemicals, classed **indirect food additives**, are regulated by the United States Food and Drug Administration. This dataset is a list of indirect food additives approved by the FDA. Content This dataset contains the names of chemical compounds and references to the federal government regulatory code approving and controlling their usage. Acknowledgements This dataset is published by the FDA and available [online](https://www.accessdata.fda.gov/scripts/fdcc/?set=IndirectAdditives) as a for-Excel `CSV` file. A few errant header columns have been cleaned up prior to upload to Kaggle, but otherwise the dataset is published as-is. Inspiration * What tokens most commonly appear amongst the names contained in this list? * Any identifiable elements or compounds? Context Venue names and geo-coordinates of venues in New York City Content Venue names, latitude and longitude of venues in New York City Acknowledgements The venue names in New York City are fetched from : https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/locrec/gowalla-dataset.zip The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009. Raw data for the HR 114 pollen surface sample dataset obtained from the Neotoma Paleoecological Database. Tillage is a central element in agricultural soil management and has direct and indirect effects on processes in the biosphere. Effects of agricultural soil management can be assessed by soil, crop, and ecosystem models but global assessments are hampered by lack of information on type and spatial distribution. 
This dataset is the result of a study on global classification of tillage practices and the spatially explicit mapping of crop-specific tillage systems for around the year 2005. This global gridded tillage system data set is dedicated to modeling communities interested in the quantitative assessment of biophysical and biogeochemical impacts of land use and soil management on cropland. The data set is complemented by the publication of the R code, which can be used to reproduce the results and to build scenarios, including the expansion of sustainable soil management practices such as Conservation Agriculture (Porwollik et al. 2018, http://doi.org/10.5880/PIK.2018.013). Both the data set and the R code are described in detail in Porwollik et al. (2018, ESSD). We present the mapping result of six tillage systems for 42 crop types and potentially suitable Conservation Agriculture area as variables:
1 = conventional annual tillage
2 = traditional annual tillage
3 = reduced tillage
4 = Conservation Agriculture
5 = rotational tillage
6 = traditional rotational tillage
7 = potential suitable Conservation Agriculture area
Reference system: WGS84
Geographic extent: Longitude (min, max) (-180, 180), Latitude (min, max) (-56, 84)
Resolution: 5 arc-minutes
Time period covered: around the year 2005
Type: NetCDF
Dataset sources (with indication of reference):
1. Grid cell allocation key to country: IFPRI/IIASA (2017, cell5m_allockey_xy.dbf.zip)
2. Crop-specific physical cropland: IFPRI/IIASA (2017, spam2005v3r1_global_phys_area.geotiff.zip)
3. SoilGrids depth to bedrock: Hengl et al. (2014)
4. Aridity index: FAO (2015)
5. Conservation Agriculture area: FAO (2016)
6. Income level: World Bank (2017)
7. Field size: Fritz et al. (2015)
8. Water erosion: Nachtergaele et al.
(2011) The dataset contains Number of Air passengers of each month from the year 1949 to 1960. We can use this data to forecast the future values and help the business. Movie Data Set This is a movie data set consisting of 3886 films scraped from [Hydra Movies full collection of movies.][1] Although there is a data dump available via their API - the data they release does not include cast, writers, directors or the short summary. Content For each of the 3886 movies you will find the following data: - Movie Title - Release Year - Summary Long - Summary Short - IMDB ID - Runtime - YouTube Trailer Code - IMDB Rating - Movie Poster (URL path) - Directors - Writers - Cast Inspiration This is a more complete data set than the public data dump via the [Hydra Movies API.][2] Hopefully you will find it more useful. [1]: https://hydramovies.com/ [2]: https://hydramovies.com/api/ Fruits 360 dataset: A dataset of images containing fruits Version: 2018.09.07.0 Content The following fruits are included: Apples (different varieties: Golden, Golden-Red, Granny Smith, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red), Cactus fruit, Cantaloupe (2 varieties), Carambula, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Clementine, Cocos, Dates, Granadilla, Grape (Pink, White, White2), Grapefruit (Pink, White), Guava, Huckleberry, Kiwi, Kaki, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine, Orange, Papaya, Passion fruit, Peach, Pepino, Pear (different varieties, Abate, Monster, Williams), Physalis (normal, with Husk), Pineapple (normal, Mini), Pitahaya Red, Plum, Pomegranate, Quince, Rambutan, Raspberry, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red), Walnut. Dataset properties Total number of images: 55244. Training set size: 41322 images (one fruit per image). Test set size: 13877 images (one fruit per image). 
Multi-fruits set size: 45 images (more than one fruit (or fruit class) per image). Number of classes: 81 (fruits). Image size: 100x100 pixels. Filename format: image_index_100.jpg (e.g. 32_100.jpg) or r_image_index_100.jpg (e.g. r_32_100.jpg) or r2_image_index_100.jpg. "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels). Different varieties of the same fruit (apple for instance) are stored as belonging to different classes. How we made it Fruits were planted in the shaft of a low speed motor (3 rpm) and a short movie of 20 seconds was recorded. A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available. Behind the fruits we placed a white sheet of paper as background. However, due to the variations in the lighting conditions, the background was not uniform, so we wrote a dedicated algorithm that extracts the fruit from the background. This algorithm is of flood fill type: we start from each edge of the image and we mark all pixels there, then we mark all pixels found in the neighborhood of the already marked pixels for which the distance between colors is less than a prescribed value. We repeat the previous step until no more pixels can be marked. All marked pixels are considered as being background (which is then filled with white) and the rest of the pixels are considered as belonging to the object. The maximum value for the distance between 2 neighbor pixels is a parameter of the algorithm and is set (by trial and error) for each movie. Published research papers Horea Muresan, [Mihai Oltean](https://mihaioltean.github.io), [Fruit recognition from images using deep learning](https://www.researchgate.net/publication/321475443_Fruit_recognition_from_images_using_deep_learning), Acta Univ. Sapientiae, Informatica Vol. 10, Issue 1, pp. 26-42, 2018.
The paper introduces the dataset and an implementation of a Neural Network trained to recognize the fruits in the dataset. Alternate download This dataset is also available for download from GitHub: [Fruits-360 dataset](https://github.com/Horea94/Fruit-Images-Dataset) History Fruits were filmed at the dates given below (YYYY.MM.DD):
2017.02.25 - Apple (golden).
2017.02.28 - Apple (red-yellow, red, golden2), Kiwi, Pear, Grapefruit, Lemon, Orange, Strawberry, Banana.
2017.03.05 - Apple (golden3, Braeburn, Granny Smith, red2).
2017.03.07 - Apple (red3).
2017.05.10 - Plum, Peach, Peach flat, Apricot, Nectarine, Pomegranate.
2017.05.27 - Avocado, Papaya, Grape, Cherry.
2017.12.25 - Carambula, Cactus fruit, Granadilla, Kaki, Kumsquats, Passion fruit, Avocado ripe, Quince.
2017.12.28 - Clementine, Cocos, Mango, Lime, Lychee.
2017.12.31 - Apple Red Delicious, Pear Monster, Grape White.
2018.01.14 - Ananas, Grapefruit Pink, Mandarine, Pineapple, Tangelo.
2018.01.19 - Huckleberry, Raspberry.
2018.01.26 - Dates, Maracuja, Salak, Tamarillo.
2018.02.05 - Guava, Grape White 2, Lemon Meyer.
2018.02.07 - Banana Red, Pepino, Pitahaya Red.
2018.02.08 - Pear Abate, Pear Williams.
2018.05.22 - Lemon rotated, Pomegranate rotated.
2018.05.24 - Cherry Rainier, Cherry 2, Strawberry Wedge.
2018.05.26 - Cantaloupe (2 varieties).
2018.05.31 - Melon Piel de Sapo.
2018.06.05 - Pineapple Mini, Physalis, Physalis with Husk, Rambutan.
2018.06.08 - Mulberry.
2018.06.16 - Walnut, Tomato Cherry Red.
2018.06.17 - Cherry Wax (Yellow, Red, Black).
2018.08.19 - Tomato Maroon, Tomato 1-4.
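The flood-fill background removal described under "How we made it" can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the list-of-rows image representation, the 4-pixel neighborhood, and the Euclidean RGB distance are all assumptions, since the description does not specify the color metric or the neighborhood.

```python
from collections import deque

WHITE = (255, 255, 255)


def color_distance(a, b):
    """Euclidean distance between two RGB tuples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def remove_background(image, max_distance):
    """Return a copy of `image` with the flood-filled background set to white.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    `max_distance` is the per-movie threshold on the color distance
    between two neighboring pixels.
    """
    height, width = len(image), len(image[0])
    marked = [[False] * width for _ in range(height)]
    queue = deque()

    # Step 1: mark every pixel on the four edges of the image.
    for r in range(height):
        for c in range(width):
            if r in (0, height - 1) or c in (0, width - 1):
                marked[r][c] = True
                queue.append((r, c))

    # Step 2: grow the marked region into similar-colored neighbors
    # until no more pixels can be marked.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and not marked[nr][nc]:
                if color_distance(image[r][c], image[nr][nc]) < max_distance:
                    marked[nr][nc] = True
                    queue.append((nr, nc))

    # Step 3: marked pixels are background (white); the rest is the object.
    return [
        [WHITE if marked[r][c] else image[r][c] for c in range(width)]
        for r in range(height)
    ]
```

A queue is used here so that each pixel is visited at most once, which is equivalent to repeating the marking step until nothing changes. The threshold would still need to be tuned per movie, as the description notes.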
License MIT License Copyright (c) 2017-2018 [Mihai Oltean](https://mihaioltean.github.io), Horea Muresan Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This dataset contains the stimuli, response data, SPSS and Excel data files belonging to the paper entitled "Visualizing uncertainty in individual growth predictions in population charts", which has been submitted for publication in PeerJ.

This archive contains data files for the sediment properties (distribution and grain size) and bedrock geology for the areas covered by North American ice sheets (including Greenland and Iceland). These datasets are distributed as shapefiles and NetCDF files. These files are intended for use in ice sheet models.

The datasets were obtained by Tilman et al. (2001: Tilman, D., Knops, J., Wedin, D., Reich, P., Ritchie, M. & Siemann, E. Diversity and productivity in a long-term grassland experiment. Science, 294, 843-846) and Langenheder et al. (2010: Langenheder, S., Bulling, M. T., Solan, M. & Prosser, J. I.
Bacterial Biodiversity-Ecosystem functioning Relations Are Modified by Environmental Complexity. PLoS ONE, 5, e10834. doi:10.1371/journal.pone.0010834). The R-files were written by Camille Richon and Benoît Jaillard. Details of the road network in the Scottish Borders area of the UK, formatted using Resource Description Framework (RDF) stored in a Jena TDB dataset (see http://jena.apache.org/documentation/tdb/). Data format is based on the OpenStreetMap (http://www.openstreetmap.org/map=5/51.500/-0.100) representation of ways as a series of nodes. Context There is a numpy array of Indian pine open source dataset with its ground truth numpy array. The data is small in size (145x145x220) , and is good introduction to Hyperspectral Remote Sensing. **Connect/Follow me on [LinkedIn](http://link.rajanand.org/linkedin) for more updates on interesting dataset like this. Thanks.** Context This data set contains yearly suicide detail of all the states/u.t of India by various parameters from 2001 to 2012. Content Time Period: 2001 - 2012 Granularity: Yearly Location: States and U.T's of India Parameters: a) Suicide causes b) Education status c) By means adopted d) Professional profile e) Social status Acknowledgements National Crime Records Bureau (NCRB), Govt of India has shared this [dataset](https://data.gov.in/dataset-group-name/accidental-deaths-and-suicides) under [Govt. Open Data License - India](https://data.gov.in/government-open-data-license-india). NCRB has also shared the historical data on their [website](http://ncrb.nic.in/StatPublications/ADSI/PrevPublications.htm) Context This is a dataset put together to allow data scientists to put their skills to the test against the efficiency of the horse racing betting market. It will be a great challenge for all data scientists to find out whether they are able to create a model that outperforms the market prices. 
Content The data includes results for all races, starting prices and 101 explanatory variables for each runner. Inspiration Is it possible to beat the horse racing betting market? Where is the market more efficient or less efficient? Is it possible to use the market price in conjunction with other variables to come up with a more accurate prediction? Context The Museum of Modern Art (MoMA) acquired its first artworks in 1929, the year it was established. Today, the Museum’s evolving collection contains almost 200,000 works from around the world spanning the last 150 years. The collection includes an ever-expanding range of visual expression, including painting, sculpture, printmaking, drawing, photography, architecture, design, film, and media and performance art. Content MoMA is committed to helping everyone understand, enjoy, and use our collection. The Museum’s website features 72,706 artworks from 20,956 artists. The artworks dataset contains 130,262 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not curator approved.” The artists dataset contains 15,091 records, representing all the artists who have work in MoMA's collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, and death year. Inspiration Which artist has the most works in the museum collection or on display? What is the largest work of art in the collection? How many pieces in the collection were made during your birth year? What gift or donation is responsible for the most artwork in the collection? 
Snow depths and bulk densities of the annual snow layer were measured at 69 different locations on glaciers across Nordenskiöldland, Svalbard, during the spring seasons of the period 2014–2016. Sampling locations lie along nine transects extending over 17 individual glaciers. Several of the locations were visited repeatedly, leading to a total of 109 point measurements, on which we report in this study. Snow water equivalents were calculated for each point measurement. In the dataset, snow depth and density measurements are accompanied by appropriate uncertainties which are rigorously transferred to the calculated snow water equivalents using a straightforward Monte Carlo simulation-style procedure. The final dataset can be downloaded from the Pangaea data repository (https://www.pangaea.de; https://doi.org/10.1594/PANGAEA.896581). Snow cover data indicate a general and statistically significant increase of snow depths and water equivalents with terrain elevation. A significant increase of both quantities with decreasing distance towards the east coast of Nordenskiöldland is also evident, but shows distinct interannual variability. Snow density does not show any characteristic spatial pattern.
Fasta files containing the concatenated multigene alignments of the datasets listed in Supplemental Table S4.
Content of Excel files:
-- BASE calculation.xlsx: This is the core dataset for "Assessing the Efficiency of Land Use Changes for Mitigating Climate Change" using a 4% discount rate for all calculations.
-- Sensitivity variant DISC 2%.xlsx, Sensitivity variant DISC 6%.xlsx: Sensitivity calculations using 2% and 6% discount rates.
-- Sensitivity variant GAIN.xlsx: Calculations using the carbon gain method.
-- Sensitivity variant HIGH.xlsx, Sensitivity variant LOW.xlsx: Calculations using +/- 20% variations for native vegetation estimates and +/- 30% for soil carbon estimates under native vegetation.
The HIGH variant uses 20% and 30% higher values for vegetation and soil carbon stocks in native vegetation, respectively, than the central scenario. The LOW variant uses 20% and 30% lower values for vegetation and soil carbon stocks in native vegetation, respectively, than the central scenario.
---------------
Description of raster files:
-- lpjml_anpp_avg_2001-2010.asc: Annual net primary productivity of potential native vegetation under current climate simulated with the LPJmL model.
-- lpjml_vegc_avg_cor_2001-2010.asc: Above- and below-ground carbon stocks of potential natural vegetation under current climate simulated with the LPJmL model and adjusted at the biome level according to reference values from the literature (see Supplementary Information).
-- lpjml_soilc_1m_avg_cor_2001-2010.asc: Soil carbon stocks of potential natural vegetation under current climate simulated with LPJmL and adjusted at the biome level according to reference values from the literature (see Supplementary Information).
The dataset contains the unigenes from the longest contigs per transcript generated by Trinity. The fb.flower bud.Unigene.fa file contains unigenes from the flower of P. equestris, the L5.root.Unigene.fa file contains unigenes from the root of P. equestris, the L6.stem.Unigene.fa file contains unigenes from the stem of P. equestris, and the PHA.leaf.Unigene.fa file contains unigenes from the leaf of P. equestris. The 12_day.unigene.fasta, 7_day.unigene.fasta and 4_day.unigene.fasta files contain unigenes from seeds taken 12 days, 7 days and 4 days, respectively, after sowing on 1/2 MS medium. The sepal.unigene.fasta, petal.unigene.fasta, lip.unigene.fasta and column.unigene.fasta files contain unigenes from sepal, petal, lip and column.
This dataset includes journal specific information aggregated from the mydata dataset including the journal specific delta values for all 114 journals in consideration.
Illustration of the concept that transcription expression profiles (non-normalized) of regulator YML027W (YOX1, red line) and regulator YMR016C (SOK2, blue line) are dynamically combined. This demonstrates a significant match between the combinatorial expression profile and the expression of the target gene YOR039W (CKB2) in the studied dataset. The conversion efficiency, which indicates the ratio between the number of functional activated binding regulators and the number of available transcription factor transcripts, is presented as a percentage (10% and 70% here).Copyright information:Taken from "Dynamic cumulative activity of transcription factors as a mechanism of quantitative gene regulation"http://genomebiology.com/2007/8/9/R181Genome Biology 2007;8(9):R181-R181.Published online 4 Sep 2007PMCID:PMC2375019. endometrial cancer dataset from TCGA Supplementary Dataset 1. Sequence alignment used for phylogenetic analysis of C3 complement sequences. (FASTA 47 kb) Context This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line created by Mark Kantrowitz and redistributed in NLTK. The `names.zip` file includes - README: The readme file. - female.txt: A line-delimited list of words. - male.txt: A line-delimited list of words. License/Usage Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line. You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable. If you have any additions to the lists of names, I would appreciate receiving them. 
Mark Kantrowitz
This dataset contains key characteristics about the data described in the Data Descriptor CommonMind Consortium provides transcriptomic and epigenomic data for Schizophrenia and Bipolar Disorder. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format
Copyright information: Taken from "Osprey: a network visualization system", Genome Biology 2003;4(3):R22-R22. Published online 27 Feb 2003. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC153462. Copyright © 2003 Breitkreutz et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Network containing 2,245 vertices and 6,426 edges from combined datasets of Gavin [10], shown in red, and Ho [11], shown in yellow. A source filter reveals only those interactions shared by both datasets, namely 212 vertices and 188 edges.
Comprehensive hydrometeorological dataset collected at experimental farms in the University of Melbourne's Dookie Campus.
Dataset contains summary results data from the 2013 meta-analysis of Genome-wide Association data in Alzheimer's disease produced by the International Genomics of Alzheimer's Project (IGAP). Data set corresponds to the meta-analysis results of the 11,632 SNPs that were genotyped and tested for association in an independent set of 8,572 Alzheimer's disease cases and 11,312 controls with the combined stage1/stage2 P-values. Details are described in the publication https://www.ncbi.nlm.nih.gov/pubmed/?term=24162737.
List of the genes composing the merged matrix of all transcriptomic data from datasets and PCD samples (n=9939).
Code

    # Imports
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Data import
    train = pd.read_csv("titanic_data/train.csv", dtype={"Age": np.float64})
    test = pd.read_csv("titanic_data/test.csv", dtype={"Age": np.float64})

    # Convert the male and female groups to integer form
    train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
    test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

    # Impute the Age and Fare variables with their medians
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    test["Age"] = test["Age"].fillna(test["Age"].median())
    test["Fare"] = test["Fare"].fillna(test["Fare"].median())

    # We want the Pclass, Age, Sex, Fare, SibSp and Parch variables
    feature_names = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch"]
    features_forest = train[feature_names].values
    target = train["Survived"].values
    print(features_forest[0])

    # Building and fitting my_forest
    forest = RandomForestClassifier(max_depth=25, min_samples_split=10, min_samples_leaf=10, n_estimators=1000, random_state=1)
    my_forest = forest.fit(features_forest, target)

    # Print the score of the fitted random forest (accuracy on the training data)
    print(my_forest.score(features_forest, target))

    # Compute predictions on our test set features, then print the length of the prediction vector
    test_features = test[feature_names].values
    pred_forest = my_forest.predict(test_features)
    PassengerId = np.array(test["PassengerId"]).astype(int)
    my_solution = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred_forest})
    pd.set_option("display.max_rows", 500)
    my_solution.to_csv("titanic_random-forest.csv", index=False)
    print(len(pred_forest))

Context Violent Crime Rates by US State Content This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. Acknowledgements World Almanac and Book of facts 1975. (Crime rates).
Statistical Abstracts of the United States 1975. (Urban rates). References McNeil, D. R. (1977) *Interactive Data Analysis*. New York: Wiley. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
This dataset is a subset of the Yelp Challenge; it contains all the reviews from the year 2013.
GDSC: Genomics of Drug Sensitivity in Cancer
CCLE: Cancer Cell Line Encyclopedia
gCSI: Genentech Cell Screening Initiative
GRAY: A pharmacogenomic dataset of 70 breast cancer cell lines from the Joe Gray lab
UHN: A pharmacogenomic dataset of 84 breast cancer cell lines from University Health Network
Dataset of material properties used to predict dielectric constants. Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page. Dataset described in the following publication: Petousis I, Mrdjenovich D, Ballouz E, Liu M, Winston D, Chen W, Graf T, Schladt TD, Persson KA, Prinz FB (2017) High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials. Scientific Data 4: 160134. https://doi.org/10.1038/sdata.2016.134 Dataset was adapted by Hacking Materials group from json files originally sourced from Dryad (see references 3-4 below). Petousis I, Mrdjenovich D, Ballouz E, Liu M, Chen W, Graf T, Schladt TD, Persson KA, Prinz FB (2017) Data from: High-throughput screening of inorganic compounds for dielectric and optical properties to enable the discovery of novel materials. Dryad Digital Repository. https://doi.org/10.5061/dryad.ph81h
The data relates to the PhD thesis submitted to Cardiff University in candidature for the degree of Doctor of Philosophy by Muditha Abeysekera. The thesis presents research undertaken to develop a model for the combined steady state simulation and operation planning of integrated energy supply systems.
As part of the thesis, three key components of the model were developed i.e: 1) Optimal power dispatch of an integrated energy system: A real case study was used to demonstrate the economic benefits of considering the interactions between different energy systems in their design and operation planning. This work is presented in Chapter 3 of the PhD thesis. The data related to this work is available in the XLS file titled 'Chapter 3_Optimal power dispatch_Dataset’. These data provide, for the period 1/4/2014 - 31/3/2015, half-hourly figures for electricity demand (kW) and from two centres, the heat demand (kW), and optimal gas input to gas boiler (kW), optimal gas input to CHP unit (kW), optimal electricity input (kW), optimal electric chiller electricity input (kW), optimal heat input to absorption chiller (kW), marginal cost of electricity (£), marginal cost of heat (£), marginal cost of cooling (£) in terms of electricity demand (kW) and heat demand (kW) in 500kW bins. 2) Simultaneous steady state analysis of coupled energy networks: An example of a coupled electricity, gas, district heating and district cooling network system was used to illustrate the formulation of equations and the iterative solution method. A case study was carried out to demonstrate the application of the method for integrated energy network analysis. This work is presented in Chapter 5 of the PhD thesis. The data related to this work is available in the XLS file titled ‘Chapter 5_case study_Data set’. 
The data provide, for 3 cases: electricity network results - for bus bar, energy demands (MW), power generation (MW) and voltage magnitude and voltage angle, and branch results comprising active power and reactive power to and from bus injection, and real and reactive power losses; gas network information - for gas node, gas demand, CHP gas demand, boiler gas demand, total gas supply and gas pressure, and pipe data for gas flow rate and pressure drop; heat network information - for heat node, fixed heat demand, absorption chiller heat demand, total heat demand, heat supply, mass flow into node, supply-line and return-line temperatures, supply-line and return-line pressures, and pumping power, and branch information of mass flow rate, supply-line and return-line heat loss, line heat loss, supply-line and return-line temperature loss, line heat loss, supply-line and return-line temperature drop, branch pressure drop and branch specific pressure drop; cooling network information - for cooling node, fixed cooling demand, cooling supply, mass flow into node, supply-line and return-line temperature, supply-line and return-line pressure, pumping power, and for branch, mass flow rate, supply-line heat gain, return-line heat loss and branch pressure drop. 3) Steady state analysis of gas networks with the distributed injection of alternative gases: A case study was carried out to demonstrate the impact of alternative gas injections on the pressure delivery and gas quality in the network. This work is presented in Chapter 6 of the PhD thesis. The data related to this work is available in the XLS file titled ‘Chapter 5_case study_Data set’.
Data provides: node pressure (mbar) and branch flow rate (m^3/hr); pressure (mbar) for the nodes under hydrogen-enriched natural gas mixture and for upgraded biogas mixture; for the nodes the actual energy demand (kJ/s) and available energy; pressure (mbar) and Wobbe index at nodes for two methods and flow rate in branches, for the two methods, over various periods.
Manual stance and veracity judgments over a large set of English social media conversations. Judgments are expertsourced and crowdsourced with extensive quality control as detailed in the referenced paper. Conversations are around a central claim; claims are grouped into news themes, each theme centering on a current event. Social media platforms include Twitter and Reddit. This is the dataset for the SemEval-2019 task, RumourEval.
Copyright information: Taken from "A new computational approach to analyze human protein complexes and predict novel protein interactions" http://genomebiology.com/2007/8/12/R256 Genome Biology 2007;8(12):R256-R256. Published online 4 Dec 2007. PMCID: PMC2246258. The number of complexes with a best value equal to or lower than the corresponding one on the x-axis is plotted for three non-synchronized and stressed HeLa datasets at a fixed FDR: dithiothreitol (DTT); heat shock; tunicamycin.
Context When Twitter introduced its thread functionality, a debate emerged: "If you're gonna write a f*ck ton of tweets at once, why not write a blog post instead of cluttering my feed?"... "It's easier and user-friendlier to share ideas in a single app"... I'm not getting into that debate. Both blog posts and Twitter threads have their own advantages. But I noticed a phenomenon while reading threads on Twitter: **the engagement (*retweets, likes and replies*) drops with each subsequent tweet!** Now, this has some logical explanations. Like, people don't want to retweet or like *every* tweet in a thread, because that'd be annoying. But this trend kept appearing in every single thread I read.
It was bugging me, so I had to gather some data.
Content The dataset is divided into **five** parts:
- `five_ten.csv`: data of threads 5-10 tweets long
- `ten_fifteen.csv`: data of threads 10-15 tweets long
- `fifteen_twenty.csv`: data of threads 15-20 tweets long
- `twenty_twentyfive.csv`: data of threads 20-25 tweets long
- `twentyfive_thirty.csv`: data of threads 25-30 tweets long

They all contain the same data:
- `id`: Tweet ID (maybe I should remove it to anonymize the data?)
- `thread_number`: Thread identifier, used for grouping each thread and its tweets
- `timestamp`: Creation date of each tweet
- `text`: The content of each tweet
- `retweets`: Retweet count for each tweet
- `likes`: Like count for each tweet
- `replies`: Reply count for each tweet

Each "bin" contains around 100 threads... so in total there are ~500 threads.
Acknowledgements The threads were manually gathered using [Thread Reader][1] (both the web page and the [bot][2]).
Disclaimer The content of the threads/tweets **did not** have any influence on whether a thread was chosen. The only parameter was the length of the thread (5-30 tweets tops). The tweets collected date from October 2017 to May 2018.
Inspiration One thing I noticed while gathering the data was that political threads have a steadier engagement than, say, art threads. So **context might influence thread engagement**, and it'd be interesting to do some NLP to figure that out. Also it'd be cool to find a "formula" for better engagement in Twitter threads, like how long should a thread be? Or maybe a probability of engagement based on the success of the initial tweet? Finally, this whole issue reminds me of [the headline problem][3]: most people don't go beyond the headline. Maybe Twitter threads suffer from that too.
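As a rough sketch of how the engagement drop described above could be checked, the following averages total engagement by a tweet's position within its thread. It assumes that rows of the same thread appear in posting order within each CSV file and that the count columns hold plain integers; neither is confirmed by the description. The helper names `read_rows` and `mean_engagement_by_position` are hypothetical, and only the standard library is used.

```python
import csv
from collections import defaultdict
from statistics import mean

# Count columns named in the dataset description.
ENGAGEMENT_COLUMNS = ("retweets", "likes", "replies")


def read_rows(path):
    """Read one of the thread CSV files into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def mean_engagement_by_position(rows):
    """Average total engagement at each position (1 = first tweet of a thread).

    Assumption: rows of the same thread appear in posting order.
    """
    seen = defaultdict(int)          # tweets seen so far, per thread
    by_position = defaultdict(list)  # position -> list of engagement totals
    for row in rows:
        seen[row["thread_number"]] += 1
        position = seen[row["thread_number"]]
        total = sum(int(row[column]) for column in ENGAGEMENT_COLUMNS)
        by_position[position].append(total)
    return {pos: mean(totals) for pos, totals in sorted(by_position.items())}


if __name__ == "__main__":
    # Tiny made-up sample for illustration; these numbers are not from the dataset.
    sample = [
        {"thread_number": "1", "retweets": "10", "likes": "30", "replies": "2"},
        {"thread_number": "1", "retweets": "4", "likes": "12", "replies": "1"},
        {"thread_number": "2", "retweets": "6", "likes": "20", "replies": "0"},
        {"thread_number": "2", "retweets": "1", "likes": "5", "replies": "0"},
    ]
    print(mean_engagement_by_position(sample))
```

With the real files, something like `mean_engagement_by_position(read_rows("five_ten.csv"))` would give one curve per length bin; comparing bins separately avoids mixing short and long threads at the later positions.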
[1]: http://threadreaderapp.com/
[2]: https://twitter.com/threadreaderapp
[3]: https://www.washingtonpost.com/news/the-fix/wp/2014/03/19/americans-read-headlines-and-not-much-else/
Context This dataset is an exported version of the [Atlanta Crime Data Report](http://www.atlantapd.org/i-want-to/crime-data-downloads), a dataset on crimes in the city of Atlanta, Georgia published by the city's police department. Content This data is regarding crime data from the City of Atlanta. This area contains weekly crime reports commanders use to best deploy Atlanta officers to combat crime. It also contains a raw crime data dump that is updated weekly. Crime data in this area is counted by incident in the area. Acknowledgements The original source for this dataset is located [on the Atlanta PD website](http://www.atlantapd.org/i-want-to/crime-data-downloads). Inspiration What can you learn about crime in Atlanta from this dataset? How does it compare to crimes committed in other cities with data on Kaggle, like [New York City](https://www.kaggle.com/adamschroeder/crimes-new-york-city)?
Context This dataset was downloaded from INEP, a department of the Brazilian Education Ministry. It contains data from the applicants for the 2015 National High School Exam. Content Inside this dataset there are not only the exam results, but also the social and economic context of the applicants. Acknowledgements The original dataset is provided by INEP (http://portal.inep.gov.br/microdados). I removed some information from the original files to fit the file size into the Kaggle constraints. Inspiration The objective is to explore the dataset to achieve a better understanding of how the social and economic context of the applicants relates to the exam results.
Context The DataScienceBowl covered the whole process of diagnosing lung cancer, and I aim to make the individual steps clearer. After segmenting lungs and identifying suspicious nodes, it is important to classify them as malignant or benign.
Content This dataset consists of several thousand examples formatted in multipage TIFF (for use with tools like ImageJ and KNIME) and HDF5 (for Python and R). Acknowledgements The data were preprocessed and extracted partially from the LUNA16 competition (https://luna16.grand-challenge.org/description/) and should be used with the same policy that data has. Inspiration The dataset is more for practice with medical images and CNN's but it would be interesting to see how the best manually created features (HoG, SIFT, ...) perform against various Deep Learning approaches. It would also be quite interesting to try and visualize exactly which parts of an image made the algorithm guess malignant or benign. Context There's a story behind every dataset and here's your opportunity to share yours. NBA Draft history 2012-2017 Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered? This is a dataset from https://webhose.io/datasets/ containing company reviews. A portion of it is extracted to get a balanced number of positive and negative reviews as well as to reduce the size of the dataset. Summary of the dataset with information on human subjects involved and indication of relevant file locations Bug triage. The compressed file contains the dataset and the source-code for the paper: Prediction of Pathological Stage in Patients with Prostate Cancer: A Neuro-Fuzzy ModelGeorgina Cosma, Giovanni Acampora, David Brown, Robert C. Rees,Masood Khan, A. Graham Pockley. PLoS ONE, 2016. Context This dataset is a playground for fundamental and technical analysis. 
It is said that 30% of traffic on stocks is already generated by machines. Can trading be fully automated? If not, there is still a lot to learn from historical data.
Content The dataset consists of the following files:
- **prices.csv**: raw, as-is daily prices. Most of the data spans from 2010 to the end of 2016; for companies new on the stock market the date range is shorter. There were approx. 140 stock splits in that time, and this set doesn't account for them.
- **prices-split-adjusted.csv**: same as prices, but with adjustments for splits added.
- **securities.csv**: general description of each company, with division into sectors
- **fundamentals.csv**: metrics extracted from annual SEC 10-K filings (2012-2016); should be enough to derive most of the popular fundamental indicators.

Acknowledgements Prices were fetched from Yahoo Finance; fundamentals are from Nasdaq Financials, extended by some fields from EDGAR SEC databases.
Inspiration Here are a couple of things one could try out with this data:
- One day ahead prediction: Rolling Linear Regression, ARIMA, Neural Networks, LSTM
- Momentum/Mean-Reversion Strategies
- Security clustering, portfolio construction/hedging

Which company has the biggest chance of going bankrupt? Which one is undervalued (how did prices behave afterwards)? What is the Return on Investment?
This dataset is for running the code from this site: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8. This is how to show a picture from the training set: display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg')) From the test set: display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg')) See an example of using this dataset.
https://www.kaggle.com/tongpython/nattawut-5920421014-cat-vs-dog-dl Dataset of Nobel Prize winners for the articleA novel bibliometric index with a simple geometric interpretationTo be published in PloS OneAuthored byTrevor Fenner, Martyn Harris, Mark Levene and Judit Bar-Ilan A new global ST dataset, the China Merged Surface Temperature (CMST) dataset is developed recently. CMST is created by merging the China-Land Surface Air Temperature (C-LSAT) with the sea surface temperature (SST) data from the Extended Reconstructed Sea Surface Temperature version 5 (ERSSTv5). readme of the data files CMST The CMST(China Merged Surface Temperature) is produced by merging data from the C-LSAT land surface air temperature dataset and the ERSSTv5 sea-surface temperature dataset. ------------------------------------------------------------------------------------------------------------------- All values are stored as temperature anomalies in degrees celsius Missing data are set to the value -999.99 Grids are 5x5deg monthly reference_period = [1961 1990] Time:190001-201812(size:1416) Data Array (36x72) Item (1,1) stores the value for the 5-deg-area centred at 0° and 87.5°S Item (36,72) stores the value for the 5-deg-area centred at 360° and 87.5°N The Twentieth Century Reanalysis Project, produced by the Earth System Research Laboratory Physical Sciences Division from NOAA and the University of Colorado Cooperative Institute for Research in Environmental Sciences using resources from Department of Energy supercomputers, is an effort to produce a global reanalysis dataset spanning a portion of the nineteenth century and the entire twentieth century (1836 - 2015), assimilating only surface observations of synoptic pressure into an 80-member ensemble of estimates of the Earth system. Boundary conditions of pentad sea surface temperature and monthly sea ice concentration and time-varying solar, volcanic, and carbon dioxide radiative forcings are prescribed. 
Products include 3 and 6-hourly ensemble mean and spread analysis fields and 6-hourly ensemble mean and spread forecast (first guess) fields on a global Gaussian T254 grid. Fields are accessible in yearly time series (1 file per parameter). The Twentieth Century Reanalysis Version 3 uses the NCEP Global Forecast Model that was operational in autumn 2017, with differences as described in (Slivinski et al. 2019). Sea ice boundary conditions are specified from HadISST 2.3 (Slivinski et al. 2019). Sea surface temperature fields prior to 1981 are prescribed from the 8-member ensemble of pentad Simple Ocean Data Assimilation with sparse input (SODAsi.3, Giese et al. 2016) and from the 8-member ensemble of pentad HadISST 2.2 for 1981 to 2015. Observations from ISPD version 4.7 are assimilated using an ensemble Kalman filter. The Twentieth Century Reanalysis Project version 3 used resources of the National Energy Research Scientific Computing Center managed by Lawrence Berkeley National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Version 3 is a contribution to the international Atmospheric Circulation Reconstructions over the Earth initiative. Support for the Twentieth Century Reanalysis Project is provided by the Physical Sciences Division of the NOAA Earth System Research Laboratory, the U.S. Department of Energy Office of Science (BER), and the NOAA Climate Program Office MAPP program. Raw data for the Figurnoye Lake pollen dataset obtained from the Neotoma Paleoecological Database. Context The WVS consists of nationally representative surveys conducted in almost 100 countries which contain almost 90 percent of the world’s population, using a common questionnaire.
The WVS is the largest non-commercial, cross-national, time series investigation of human beliefs and values ever executed, currently including interviews with almost 400,000 respondents. Content The World Value Survey data grouped by country and wave. Question codes are matched with the mean for the subgroup if numeric, and else the mode. Also, standard deviation of answers in subgroup are given in columns with code name plus suffix '_SD'. Attached Code File links the variables to their original questionnaire content, including the possible reactions. All negative, and thus missing, responses have been indicated as NA. Acknowledgements The entire dataset has been created and is maintained by the World Values Survey organisation. Find the entire dataset at [their official website][1]. Please note the following disclaimer: These data files are available without restrictions, provided a) that they are used for non-profit purposes; and b) correct citations are provided and sent to the World Values Survey Association for each publication of results based in part or entirely on these data files. This citation will be made freely available; and c) the data files themselves are not redistributed. Inspiration [Quote:][2] The WVS seeks to help scientists and policy makers understand changes in the beliefs, values and motivations of people throughout the world. Thousands of political scientists, sociologists, social psychologists, anthropologists and economists have used these data to analyze such topics as economic development, democratization, religion, gender equality, social capital, and subjective well-being. These data have also been widely used by government officials, journalists and students, and groups at the World Bank have analyzed the linkages between cultural factors and economic development. 
[1]: http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp [2]: http://www.worldvaluessurvey.org/WVSContents.jsp Seed dispersal distance dataset for the Ecology Letters paper Context ECG data from mit-bih database from physionet Content Raw signals in .csv files and original annotations in .txt. Acknowledgements https://www.physionet.org/physiobank/database/mitdb/ **Context-** This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. **Content-** This dataset has 2,60,760 rows and 17 columns. - INCIDENT_NUMBER: - OFFENSE_CODE: - OFFENSE_CODE_GROUP: - OFFENSE_DESCRIPTION: - DISTRICT: - REPORTING_AREA: - SHOOTING: - OCCURRED_ON_DATE: - YEAR: - MONTH: - DAY_OF_WEEK: - HOUR: - UCR_PART: - STREET: - LATITUDE: - LONGITUDE: - LOCATION: **Acknowledgements-** I would like to thank the Boston Police Department for making this dataset available to everyone. **Inspiration** 1. How has crime changed over the years? 2. Is it possible to predict where or when a crime will be committed? 3. Which areas of the city have evolved over this time span? 4. In which area most crimes are committed? This dataset is a sub dataset of the Yelp Challenge. 
This dataset contains the data collected for the evaluation of RASH (Research Articles in Simplified HTML), which is presented in the paper 'Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles'. In particular, it includes: 1) Four CSV files reporting the surveys filled in by authors and reviewers of RASH papers published in the SAVE-SD 2015 and SAVE-SD 2016 workshops: - survey_authors_2015 - survey_authors_2016 - survey_reviewers_2015 - survey_reviewers_2016 2) Four CSV files presenting the more frequent entities and vocabularies included in the RASH papers published in SAVE-SD 2015 and SAVE-SD 2016 workshops: - entities_analysis_2015.csv - entities_analysis_2016.csv - vocabularies_analysis_2015.csv - vocabularies_analysis_2016.csv For any question about the data please contact francesco.osborne@open.ac.uk or silvio.peroni@unibo.it crastallization dataset On August 16, 2018 Aretha Franklin died in Detroit, Michigan at the age of 76. Franklin, also known as the Queen of Soul, had an award winning career as a singer, songwriter, actress and pianist while also being described as the voice of the civil rights movement. This item contains two tweet id datasets. The first was collected from the search API during the response to the announcement of her death, which includes tweets from August 8 - August 19 using the query "Aretha Franklin" OR "Queen of Soul". The second dataset was collected over August 24 to September 3, which includes the date of her funeral on August 31. This second dataset was collected from the search API using the query "Aretha Franklin" OR "Queen of Soul" OR ArethaHomegoing OR ArethaFranklinFuneral OR ArethaFranklin which includes hashtags that were trending at the time. The datasets contain 2,832,128 and 1,332,442 tweet identifiers respectively. Datasets for all the figures included in the paper - 'Non-Gaussian distribution of collective operators in quantum spin chains', G. De Chiara et al. New J. Phys. 2018. 
The human cerebral cortex, whether traced through phylogeny or ontogeny, emerges through expansion and progressive differentiation into larger and more diverse areas. While current methodologies address this analytically by characterizing local cortical expansion in the form of surface area, several lines of research have proposed that the cortex in fact expands along trajectories from primordial anchor areas and, furthermore, that the distance along the cortical surface is informative regarding cortical differentiation. We sought to investigate the geometric relationships that arise in the cortex based on expansion from such origin points. Towards this aim, we developed a Python package for measuring the geodesic distance along the cortical surface that restricts shortest paths from passing through nodes of non-cortical areas, such as the non-cortical portions of the surface mesh described as the ‘medial wall’. The calculation of geodesic distance along a mesh surface is based on the cumulative distance of the shortest path between two points. The first challenge that arises is the sensitivity of the calculation to the resolution of the mesh: the coarser the mesh, the longer the shortest path may be, as the distance becomes progressively less direct. This problem has been previously addressed and subsequently implemented in the Python package gdist, which calculates the exact geodesic distance along a mesh by subdividing the shortest path until a straight line along the cortex is approximated. The second challenge, for which there was no prefabricated solution, was ensuring that the shortest path only traverses territory within the cortex proper, avoiding shortcuts through non-cortical areas included in the surface mesh, most prominently the non-cortical portions along the medial wall. 
Were the shortest paths between two nodes to traverse non-cortical regions, the distance between nodes would be artificially decreased, which would have an artifactual impact on the interpretation of results. This concern would be especially relevant to the ‘zones analysis’ described below, where the boundaries between regions would be altered. It was therefore necessary to remove mesh nodes prior to calculating the exact geodesic distance, which requires reconstructing the mesh and assigning the respective new node indices for any seed regions-of-interest. Finally, to facilitate applications to neuroscience research questions, we enabled the loading and visualization of data from commonly used formats such as FreeSurfer and the Human Connectome Project (HCP). A Nipype pipeline for group-level batch processing has also been made available. The pipeline is wrapped in a command-line interface and allows for straightforward distance calculations of entire FreeSurfer-preprocessed datasets. Group-level data are stored as CSV files for each requested mesh resolution, source label and hemisphere, facilitating further statistical analyses. The resultant package, SurfDist, achieves the aforementioned goals of facilitating the calculation of exact geodesic distance on the cortical surface. We present here the distance measures from the central and calcarine sulci labels on the FreeSurfer native surfaces. The distance measure provides a means to parcellate the cortex using the surface geometry. Towards that aim, we also implement a ‘zones analysis’, which constructs a Voronoi diagram, establishing partitions based on the greater proximity to a set of label nodes. The SurfDist package is designed to enable investigation of intrinsic geometric properties of the cerebral cortex based on geodesic distance measures. 
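The effect of excluding non-cortical nodes, and the idea behind the zones analysis, can be illustrated with a small sketch. It is not the SurfDist or gdist API: all function names and the toy mesh are hypothetical, and distances are approximated along mesh edges with Dijkstra's algorithm rather than computed as exact geodesics. Because the graph is keyed by the original node indices, the re-indexing step described above is not needed here.

```python
import heapq
import math

def edge_graph(vertices, triangles, excluded=frozenset()):
    """Adjacency map built from triangle edges, skipping excluded nodes."""
    graph = {i: {} for i in range(len(vertices)) if i not in excluded}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            if u in graph and v in graph:
                w = math.dist(vertices[u], vertices[v])
                graph[u][v] = w
                graph[v][u] = w
    return graph

def distances_from(graph, sources):
    """Dijkstra from a set of seed nodes: shortest edge-path length per node."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def zones(graph, labels):
    """Assign every node to the label whose seed nodes are nearest."""
    per_label = {name: distances_from(graph, seeds)
                 for name, seeds in labels.items()}
    return {node: min(per_label,
                      key=lambda name: per_label[name].get(node, math.inf))
            for node in graph}

# Toy 2 x 3 strip of vertices; node 1 stands in for a non-cortical node.
vertices = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0), (1, 1, 0), (2, 1, 0)]
triangles = [(0, 1, 4), (0, 4, 3), (1, 2, 5), (1, 5, 4)]

open_graph = edge_graph(vertices, triangles)
masked_graph = edge_graph(vertices, triangles, excluded={1})

d_open = distances_from(open_graph, [0])      # path 0-1-2 is allowed
d_masked = distances_from(masked_graph, [0])  # path must detour via 4 and 5
zone_of = zones(open_graph, {"left": [0], "right": [2]})
```

In the toy mesh the distance from node 0 to node 2 grows once node 1 is excluded, which is the shortcut effect the text describes for paths crossing the medial wall.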
Towards the aim of enabling applications specific to neuroimaging-based research questions, we have designed the package to facilitate analysis and visualization of geodesic distance metrics using standard cortical surface meshes. Final dataset used to generate results of this publication. Note that you must take the medians for each taxon to reproduce the results. NCEP ADP ETA / NAM Upper Air Observation Subsets are composed of a regional synoptic set of upper air reports centered over North America, operationally collected by the National Centers for Environmental Prediction (NCEP). These include radiosondes, pibals and aircraft reports from the Global Telecommunications System (GTS) and satellite data from the National Environmental Satellite Data and Information Service (NESDIS). The reports can include pressure, geopotential height, temperature, dew point depression, wind direction and speed. Data may be available at up to 20 mandatory levels from 1000 millibars to 1 millibar, plus a few significant levels. Report time intervals range from 3 hourly to 12 hourly. These data are the primary input to the EDAS / NAM Data Assimilation System (NDAS starting January 23, 2005). DS351.0 [https://rda.ucar.edu/datasets/ds351.0/] provides global data coverage over the same time period. This data set is no longer updated. If you have a need for North American data that is not met by DS351.0 NCEP ADP Global Upper Air Observational Weather Data, October 1999 - continuing [https://rda.ucar.edu/datasets/ds351.0/], contact the RDA for alternatives. This dataset contains the International Surface Pressure Databank version 3.2.9 (ISPDv3), the world's largest collection of pressure observations. 
It has been gathered through international cooperation with data recovery facilitated by the ACRE Initiative and the other contributing organizations and assembled under the auspices of the GCOS Working Group on Surface Pressure and the WCRP/GCOS Working Group on Observational Data Sets for Reanalysis by NOAA Earth System Research Laboratory (ESRL), NOAA's National Climatic Data Center (NCDC), and the University of Colorado's Cooperative Institute for Research in Environmental Sciences (CIRES). The ISPDv3 consists of three components: station, marine, and tropical cyclone best track pressure observations. The station component is a blend of many national and international collections. In addition to the pressure observations and metadata, ISPDv3 contains feedback from the 20th Century Reanalysis version 2c (20CRV2c), including quality control information and uncertainty information. Support for the International Surface Pressure Databank is provided by the U.S. Department of Energy, Office of Science Biological and Environmental Research (BER), and by the National Oceanic and Atmospheric Administration Climate Program Office. The International Surface Pressure Databank version 3 and 20th Century Reanalysis version 2c used resources of the National Energy Research Scientific Computing Center which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This is a sample log of HDFS dataset. Please visit our project page for the full set of system logs: https://github.com/logpai/loghub This dataset is a compilation of observational measurements on comet morphology and magnitude covering just over a thousand apparitions of various comets from (UTC) -466 to 1975. The data set contains combined Dynamic Ocean Topography (DOT) and geostrophic velocity components for the northern Nordic Seas between 1995 and 2012. 
It was produced in the frame of the DFG project NEG-OCEAN: Variations in ocean currents, sea-ice concentration, and sea surface temperature along the North-East coast of Greenland. The data is provided as Format 4 Classic NetCDF files on an unstructured triangular, Finite Element formulated grid. The data are characterized by daily sampling between 18.5.1995 and 3.4.2012 including data gaps and a consistent spatial resolution up to 1 km. More details can be found in the related User Manual. The dataset is based on Dynamic Ocean Topography (DOT) elevations from a combination of along-track satellite altimetry measurements with simulated differential water heights from the Finite Element Sea-ice Ocean Model Version 1.4 (FESOM, Wekerle et al., 2017, doi:10.1002/2017JC012974). The combination approach is described in detail in the related publication. The altimetry data include observations of the ESA satellites Envisat and ERS-2. The high-frequent altimetry range observations are retracked using the ALES+ algorithm (Passaro et al., 2018, doi:10.1016/j.rse.2018.02.074) and are classified into open-water/sea-ice conditions by applying a classification algorithm (Müller et al., 2017, doi:10.3390/rs9060551). All applied atmospheric and geophysical altimetry corrections are listed in Müller et al., 2019 (doi:10.5194/tc-13-611-2019). Cumulative fitness (times flowered + 1 if the plant survived to the end of the experiment) in crosses within the campions Silene dioica (L.) Clairv. and S. latifolia Poiret, as well as their first- and second-generation hybrids, in a four-year transplant experiment at three sites of each species (data from Favre et al., 2017, New Phytologist 213, 1487-1499). Individuals that died soon after transplantation (transplant shock) or due to mole disturbance or could not be sexed were excluded from the dataset here (see Favre et al. 2017). 
Column names: Site.ID, identification of individuals across sites; site, transplant site; ID, identification number of individuals within sites; cross, cross type [SD: within S. dioica, SL: within S. latifolia, HD: SD female x SL male, HL: SL female x SD male, F2: second-generation hybrids (pooled)]; habitat, SD: transplanted within SD population, SL: transplanted within SL population; dblock, block within site; family, full-sib family; sex, plant sex; fitness.rev, cumulative fitness (times flowered + 1 if the plant survived to the end of the experiment). Datasets for "The Accounting Network: how financial institutions react to systemic crisis" Female survival. Observe that the territory identity does not match other datasets. The database contains fasta sequences from UniProt and associated metadata for molluscan shell matrix proteins (SMPs). The database only contains SMPs that have been experimentally validated to be present in molluscan shell matrices (based on the publication(s) attached to the UniProtID). Metadata includes information on functional domains present in the sequence, as detected by InterproScan. With the advent of Next Generation Sequencing technologies, it is computationally resource intensive to run sequence similarity algorithms on all published data. Moreover, it is impractical to sort through hundreds of sequence similarity search results when working with non-model organisms, since pre-established functional annotations of sequences are generally not available. Therefore, this database was created in order to provide a targeted molluscan biomineralization dataset for sequence similarity algorithms (such as BLAST). Context This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. 
In the dataset you'll find information about businesses across 11 metropolitan areas in four countries. Content This dataset contains seven CSV files. The original JSON files can be found in yelp_academic_dataset.zip. You may find this documentation helpful: [https://www.yelp.com/dataset/documentation/json][1] In total, there are: - 5,200,000 user reviews - Information on 174,000 businesses - The data spans 11 metropolitan areas Acknowledgements The dataset was converted from JSON to CSV format and we thank the team of the Yelp dataset challenge for creating this dataset. By downloading this dataset, you agree to the [Yelp Dataset Terms of Use][2]. Inspiration Natural Language Processing & Sentiment Analysis What's in a review? Is it positive or negative? Yelp's reviews contain a lot of metadata that can be mined and used to infer meaning, business attributes, and sentiment. Graph Mining We recently launched our Local Graph but can you take the graph further? How do users' relationships define their usage patterns? Where are the trend setters eating before it becomes popular? [1]: https://www.yelp.com/dataset/documentation/json [2]: https://s3-media2.fl.yelpcdn.com/assets/srv0/engineering_pages/af4b9cebfb4f/assets/vendor/dataset-challenge-dataset-agreement.pdf Forest plot showing the impact on overall survival of CDR2 expression in different public transcriptomic datasets, after adjusting by debulking surgery (residual tumor < 1 cm) and FIGO stage. Raw data for the Toothaker Pond pollen surface sample dataset obtained from the Neotoma Paleoecological Database. This dataset provides the data collected for a trial investigating the role of hydration status on glycaemic regulation in healthy adults (n = 16; n = 8 male). To our knowledge, the effect of hydration status on glycemia has never been causally investigated in healthy adults. Therefore, the goal was to explore how acute hypohydration impacts blood sugar control in healthy adults. 
The trial was a randomised crossover trial, with each trial arm lasting 5 days. The first 3 days were lifestyle monitoring, day 4 was a dehydration/rehydration day (including lifestyle monitoring), and day 5 was the full trial day. The trial arms were hypohydrated (HYPO), or rehydrated (RE). The guideline to use the dataset. LC-MS/MS water concentrations (EE2, DEET, diphenhydramine, fluoxetine and their mixture) dataset for the study: Untargeted Metabolomic Investigation of the Eastern Oyster Context The Global Shark Attack File contains a global log of all reported shark attacks from ~1700s to 2018. Content Each record, or shark attack, includes the location, date, time of attack, shark species, and other information about the incident. Acknowledgements Shark Research Institute International Shark Attack File http://www.sharkattackfile.net/incidentlog.htm Inspiration I am interested in exploring if the number of shark attacks is rising as human populations and temperatures are increasing. This is a car sales data set taken from Analytixlabs. It is a continuous data set that includes several predictors. The task is to predict car sales from this data set using machine learning techniques. So let's work on this dataset together and find out which machine learning technique is best suited for prediction. I am using R, but you can use any language for this. So go for it and enjoy. If you have any query related to the data, feel free to post it. Context Eclipses of the sun can only occur when the moon is near one of its two orbital nodes during the new moon phase. It is then possible for the Moon's penumbral, umbral, or antumbral shadows to sweep across Earth's surface thereby producing an eclipse. 
There are four types of solar eclipses: a partial eclipse, during which the moon's penumbral shadow traverses Earth and umbral and antumbral shadows completely miss Earth; an annular eclipse, during which the moon's antumbral shadow traverses Earth but does not completely cover the sun; a total eclipse, during which the moon's umbral shadow traverses Earth and completely covers the sun; and a hybrid eclipse, during which the moon's umbral and antumbral shadows traverse Earth and annular and total eclipses are visible in different locations. Earth will experience 11898 solar eclipses during the five millennium period -1999 to +3000 (2000 BCE to 3000 CE). Eclipses of the moon can occur when the moon is near one of its two orbital nodes during the full moon phase. It is then possible for the moon to pass through Earth's penumbral or umbral shadows thereby producing an eclipse. There are three types of lunar eclipses: a penumbral eclipse, during which the moon traverses Earth's penumbral shadow but misses its umbral shadow; a partial eclipse, during which the moon traverses Earth's penumbral and umbral shadows; and a total eclipse, during which the moon traverses Earth's penumbral and umbral shadows and passes completely into Earth's umbra. Earth will experience 12064 lunar eclipses during the five millennium period -1999 to +3000 (2000 BCE to 3000 CE). Acknowledgements Lunar eclipse predictions were produced by Fred Espenak from NASA's Goddard Space Flight Center. Images from Imagenet dataset (http://image-net.org/) resized to 32x32 Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. 
One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra. This dataset contains key characteristics about the data described in the Data Descriptor The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder) Raw data for the Fisherman Lake pollen surface sample dataset obtained from the Neotoma Paleoecological Database. Context This dataset is a collection of newsgroup documents. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. Content There is a file (list.csv) that contains a reference to the document_id number and the newsgroup it is associated with. There are also 20 files that contain all of the documents, one document per newsgroup. 
In this dataset, duplicate messages have been removed and the original messages only contain "From" and "Subject" headers (18828 messages total). Each new message in the bundled file begins with these four headers: Newsgroup: alt.newsgroup Document_id: xxxxxx From: Cat Subject: Meow Meow Meow The Newsgroup and Document_id can be referenced against list.csv Organization - Each newsgroup file in the bundle represents a single newsgroup - Each message in a file is the text of some newsgroup document that was posted to that newsgroup. This is a list of the 20 newsgroups: - comp.graphics - comp.os.ms-windows.misc - comp.sys.ibm.pc.hardware - comp.sys.mac.hardware - comp.windows.x - rec.autos - rec.motorcycles - rec.sport.baseball - rec.sport.hockey - sci.crypt - sci.electronics - sci.med - sci.space - misc.forsale - talk.politics.misc - talk.politics.guns - talk.politics.mideast - talk.religion.misc - alt.atheism - soc.religion.christian Acknowledgements Ken Lang is credited by the source for collecting this data. The source of the data files is here: http://qwone.com/~jason/20Newsgroups/ Inspiration - This dataset text can be used to classify text documents I have begun studying how to apply R towards the field of Market Research primarily through the use of this book https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3319144359 The book instructs that I build a lot of the models from scratch to get a better appreciation for how R works and this dataset is to serve as a portfolio for all that I have learned thus far. 1. Title: Postoperative Patient Data 2. Source Information: -- Creators: Sharon Summers, School of Nursing, University of Kansas Medical Center, Kansas City, KS 66160 Linda Woolery, School of Nursing, University of Missouri, Columbia, MO 65211 -- Donor: Jerzy W. Grzymala-Busse (jerzy@cs.ukans.edu) (913)864-4488 -- Date: June 1993 3. Past Usage: 1. A. Budihardjo, J. Grzymala-Busse, L. Woolery (1991). 
Program LERS_LB 2.5 as a tool for knowledge acquisition in nursing, Proceedings of the 4th Int. Conference on Industrial & Engineering Applications of AI & Expert Systems, pp. 735-740. 2. L. Woolery, J. Grzymala-Busse, S. Summers, A. Budihardjo (1991). The use of machine learning program LERS_LB 2.5 in knowledge acquisition for expert system development in nursing. Computers in Nursing 9, pp. 227-234. 4. Relevant Information: The classification task of this database is to determine where patients in a postoperative recovery area should be sent to next. Because hypothermia is a significant concern after surgery (Woolery, L. et al. 1991), the attributes correspond roughly to body temperature measurements. Results: -- LERS (LEM2): 48% accuracy 5. Number of Instances: 90 6. Number of Attributes: 9 including the decision (class attribute) 7. Attribute Information: 1. L-CORE (patient's internal temperature in C): high (> 37), mid (>= 36 and <= 37), low (< 36) 2. L-SURF (patient's surface temperature in C): high (> 36.5), mid (>= 35 and <= 36.5), low (< 35) 3. L-O2 (oxygen saturation in %): excellent (>= 98), good (>= 90 and < 98), fair (>= 80 and < 90), poor (< 80) 4. L-BP (last measurement of blood pressure): high (> 130/90), mid (<= 130/90 and >= 90/70), low (< 90/70) 5. SURF-STBL (stability of patient's surface temperature): stable, mod-stable, unstable 6. CORE-STBL (stability of patient's core temperature): stable, mod-stable, unstable 7. BP-STBL (stability of patient's blood pressure): stable, mod-stable, unstable 8. COMFORT (patient's perceived comfort at discharge, measured as an integer between 0 and 20) 9. decision ADM-DECS (discharge decision): I (patient sent to Intensive Care Unit), S (patient prepared to go home), A (patient sent to general hospital floor) 8. Missing Attribute Values: Attribute 8 has 3 missing values 9. 
Class Distribution: I (2) S (24) A (64) SET 1: The dataset is composed of a set of omnidirectional images captured in an indoor environment (Quorum V building, ground floor, ARVC Laboratory) at Miguel Hernández University, Spain. This database is intended to test visual mapping and localization algorithms for mobile robots. The images have been captured using an Imaging Source DFK 21BF04 camera, which takes pictures of a hyperbolic mirror (Eizoh Wide 70). The mirror is mounted over the camera, with its axis aligned with the camera optic axis. The whole database contains 400 images that were captured while the robot went through a previously defined trajectory in a laboratory area. The distance between each pair of consecutive images is equal to 20 cm and the environment where the images were captured is very prone to visual aliasing (the visual appearance of some images captured in different rooms may be very similar). SET 2: Set of omnidirectional images captured in an indoor environment (Quorum V building, 2nd floor) at Miguel Hernández University. The database includes a corridor, three offices, a library and an events room. It is composed of 872 omnidirectional colour images which have been captured on a dense regular 40x40 cm grid of points. A bird's-eye view of the grid points is included. The database was captured with an Imaging Source DFK 21BF04 camera, which takes pictures of a hyperbolic mirror (Eizoh Wide 70). The mirror is mounted over the camera, with its axis aligned with the camera optic axis. Fasta file containing alignment of sequences used to calculate observed summary statistics. So called ABC dataset in the publication. ddPCR dataset Subset of the "iCubWorld Transformations" dataset (https://robotology.github.io/iCubWorld/) to be used for the Deep Learning hands-on session at the Winter School on Humanoid Robot Programming (http://www.icub.org/winterschool/). Supplemental Data. Walker et al. (2017). 
Plant Cell 10.1105/tpc.16.00961. Supplemental Dataset 1. RMA-normalised Nimblegen microarray data for all transcripts measured. The table lists the TAIR10 Arabidopsis Genome Initiative (AGI) gene IDs represented on the array (see GEO record GPL18735 for the Nimblegen probe design) and their expression values in all 6 time series, averaged (“Mean”) for each replicate set; see GEO GSE91379 for complete raw and normalised individual replicate values. Gene symbols and gene descriptions are listed according to the TAIR10 annotation. If significantly differentially expressed within a time series (BATS), the cluster number is listed. If significantly differentially expressed between treated and untreated time series (CN GP2S etc), the cluster number is listed. Cluster numbers described in the manuscript text always refer to the within-CU/PU or N/Rhizobia vs. U clusters (orange columns). Empty cells indicate no evidence of DE within/between time series. Transcripts associated with genes that have previously been found to be affected by the protoplast generation treatment or FACS (as [18]) are marked (Proto-flagged) but not removed from the analysis. R environment file for 1000 simulated datasets under BM, and 1000 simulated datasets under BM with a trend (µ = 3). The GIS database contains the data of aufeis (naleds) in the Indigirka River basin (Russia) from historical and present-day sources, and complete ArcGIS 10.1/10.2 and QGIS 3* projects to view and analyze the data. All data and projects use the WGS 1984 coordinate system (without projection). The ArcGIS and QGIS projects contain two layers: Aufeis_kadastr (historical aufeis data collection, point objects) and Aufeis_Landsat (satellite-derived aufeis data collection, polygon objects). The historical data collection was created based on the Cadastre of aufeis (naleds) of the North-East of the USSR (1958). Each aufeis was digitized as a point feature from the inventory map (scale 1:2 000 000), or from topographic maps. 
Attribute data were obtained from the Cadastre of aufeis. According to the historical data, there were 896 aufeis with a total area of 2063.6 km² within the studied basin. The present-day aufeis dataset was created from Landsat-8 OLI images for the period 2013-2017. Each aufeis was delineated from satellite images as a polygon. Cloud-free Landsat images were obtained immediately after the snowmelt season (e.g. between May 15 and June 18), to detect the highest possible number of aufeis. Critical values of the Normalized Difference Snow Index (NDSI) were used for semi-automated aufeis detection. However, a detailed expert-based verification was performed after the automated procedure, to distinguish snow-covered areas from aufeis and to cross-reference the historical and satellite-based data collections. According to Landsat data, the number of aufeis reaches 1213, with a total area of about 1287 km². The difference between the Cadastre (1958) and the satellite-derived data may indicate significant changes of aufeis formation environments. Excel spreadsheet with the complete diet and body mass (from Dunning 2008) dataset, including associated metadata, listed by species, season and locality. 310 Observations, 13 Attributes (12 Numeric Predictors, 1 Binary Class Attribute - No Demographics) Lower back pain can be caused by a variety of problems with any parts of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Typical sources of low back pain include: - The large nerve roots in the low back that go to the legs may be irritated - The smaller nerves that supply the low back may be irritated - The large paired lower back muscles (erector spinae) may be strained - The bones, ligaments or joints may be damaged - An intervertebral disc may be degenerating An irritation or problem with any of these structures can cause lower back pain and/or pain that radiates or is referred to other parts of the body. 
Many lower back problems also cause back muscle spasms, which don't sound like much but can cause severe pain and disability. While lower back pain is extremely common, the symptoms and severity of lower back pain vary greatly. A simple lower back muscle strain might be excruciating enough to necessitate an emergency room visit, while a degenerating disc might cause only mild, intermittent discomfort. The goal of this data set is to identify whether a person is abnormal or normal using collected physical spine details/data. This is a piece of bike sharing system data analytics work using the CRISP-DM data mining process. Context Twitter gives the general public unfiltered direct access to the ideas and policies of politicians. This means that understanding the content and reach of these tweets can help us understand what connects with constituents. This dataset is meant to help with that exploration. By applying sentiment analysis (using an already trained system) we can apply sentiment context to these tweets. This will help us understand who responds to positive and negative content. Finally, this analysis may help to identify fake or hyperbolically polarized Twitter users. Content The dataset contains two files, both in .csv format. The first is a list of the political party and the representative handles, and the second contains the 200 latest tweets as of May 2018 from those Twitter users. Acknowledgements I would like to thank the following website and people who helped me get started Inspiration I was first inspired by trying to find out if the average person would be able to distinguish between political tweets if no context was given. I made a small website that you can try this on. I will use real user data to cross check and see if ML methods are actually better than the average person. Other uses are the following: Can we use this to detect Russian troll Twitter accounts? Do people respond to negative or positive political tweets? 
Context The need for music-speech classification is evident in many audio processing tasks which relate to real-life materials such as archives of field recordings, broadcasts and any other contexts which are likely to involve speech and music, concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms. Indeed, speech-music classification has received considerable attention from the research community (for a partial list, see references below) but many of the published algorithms are dataset-specific and are not directly comparable due to non-standardised evaluation. Content Dataset collected for the purposes of music/speech discrimination. The dataset consists of 120 tracks, each 30 seconds long. Each class (music/speech) has 60 examples. The tracks are all 22050Hz Mono 16-bit audio files in .wav format.

Domain: Real Estate
Difficulty: Easy to Medium
Challenges:
1. Missing value treatment
2. Outlier treatment
3. Understanding which variables drive the price of homes in Boston

Summary: The Boston housing dataset contains 506 observations and 14 variables. The dataset contains missing values.

Context This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. Content The dataset consists of several medical predictor variables and one target variable, `Outcome`. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Acknowledgements Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S.
(1988). [Using the ADAP learning algorithm to forecast the onset of diabetes mellitus][1]. In *Proceedings of the Symposium on Computer Applications and Medical Care* (pp. 261--265). IEEE Computer Society Press. Inspiration Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

[1]: http://rexa.info/paper/04587c10a7c92baa01948f71f2513d5928fe8e81

Context Simple Convolutional Neural Network (CNN) models work amazingly well in classifying the MNIST handwritten digits or differentiating dogs and cats, even with a small dataset of a few thousand images. It will be a fun project to test how well the same simple techniques work when classifying famous persons from their cartoon images or caricatures. Content The training dataset consists of cartoon images of six famous personalities (Abraham Lincoln, Albert Einstein, Barack Obama, Donald Trump, Mahatma Gandhi and Steve Jobs) downloaded from Google Image Search. The training dataset consists of 4942 images, the validation dataset of 2060 images and the test dataset of 856 images. The images are in jpg format.

Content The dataset contains different noise audio files corresponding to different environments. Acknowledgements This is a mirror of the database introduced by the authors of [DEMAND][1].

[1]: https://hal.inria.fr/hal-00796707/file/thiemann_demand.pdf

This is the raw dataset that generated the figures in the paper entitled "Characterization of the hemodynamic response function in white matter tracts for event-related fMRI."

A replicate dataset based on the A3130 sample set. The DNA isolates were re-processed in a different lab than the original samples using a different machine (ABI 3500 Genetic Analyzer). The RData object has two attributes:
- data - holds the count data of OTU abundances. Rows are samples and columns are OTUs.
- labels - a corresponding label for each sample in data.

Supplementary Table 1.
Primers used for quantitative real time PCR. Supplementary Table 2. Overlapped DEGs among five microarray datasets.

Many maps of open water and wetland have been developed based on three main methods: (i) compiling national/regional wetland surveys; (ii) identifying inundated areas by satellite imagery; (iii) delineating wetlands as shallow water table areas based on groundwater modelling. The resulting global wetland extents, however, vary from 3 to 21% of the land surface area, because of inconsistencies in wetland definitions and limitations in observation or modelling systems. To reconcile these differences, we propose composite wetland (CW) maps combining two classes of wetlands: (1) regularly flooded wetlands (RFW) which are obtained by overlapping selected open-water and inundation datasets; (2) groundwater-driven wetlands (GDW) derived from groundwater modelling (either direct or simplified using several variants of the topographic index). Wetlands are thus statically defined as areas with persistent near-saturated soil because of regular flooding or shallow groundwater. To explore the uncertainty of the proposed data fusion, seven CW maps were generated at the 15 arc-sec resolution (ca 500 m at the Equator) using geographic information system (GIS) tools, by combining one RFW and different GDW maps. They correspond to contemporary potential wetlands, i.e. the expected wetlands assuming no human influence under the present climate. To validate the approach, these CW maps were compared to existing wetland datasets at the global and regional scales: the spatial patterns are decently captured, but the wetland extents are difficult to assess against the dispersion of the validation datasets. Compared to the only regional dataset encompassing both GDWs and RFWs, over France, the CW maps perform well and better than all other considered global wetland datasets.
Two CW maps, showing the best overall match with the available evaluation datasets, are eventually selected. They give a global wetland extent of 27.5 and 29 million km², i.e. 21.1 and 21.6% of global land area, which is among the highest values in the literature, in line with recent estimates also recognizing the contribution of GDWs. This wetland class covers 15% of global land area, against 9.7% for RFWs (with an overlap ca 3.4 %), including wetlands under canopy/cloud cover leading to high wetland densities in the tropics, and small scattered wetlands, which cover less than 5% of land but are very important for hydrological and ecological functioning in temperate to arid areas. By distinguishing the RFWs and GDWs globally based on uniform principles, the proposed dataset is believed to be useful for large-scale land surface modelling (hydrological, ecological and biogeochemical modelling) and environmental planning. IMDB dataset for 5000 movies The dataset contains a gridded global reconstruction of monthly runoff timeseries. In-situ streamflow observations from the GSIM dataset are used to train a machine learning algorithm that predicts monthly runoff rates based on antecedent precipitation and temperature from the Global Soil Wetness Project Phase 3 (GSWP3) meteorological forcing dataset. We thank Prof. Dr. Hyungjun Kim for developing the GSWP3 dataset and providing us with early access to the data. The data are provided in NetCDFv4 format at monthly resolution covering the period 1902-2014. The GRUN reconstruction ("GRUN_v1_GSWP3_WGS84_05_1902_2014.nc" file) is provided on a 0.5 degrees (WGS84) grid in units of mm/day. The runoff time series correspond to the ensemble mean of 50 reconstructions obtained by training the machine learning model with different subsets of data. 
The individual ensemble members of the reconstruction are provided in the "Realizations_GRUN_v1_GSWP3_WGS84_05_1902_2014.zip" file.

When using this dataset, please cite: Ghiggi, G., Humphrey, V., Seneviratne, S. I., Gudmundsson (2019), GRUN: An observations-based global gridded runoff dataset from 1902 to 2014, Earth Syst. Sci. Data, 2019, DOI: https://doi.org/10.5194/essd-2019-32

The complete collection of in-situ streamflow observations from the GSIM archive can be found at:
- https://doi.pangaea.de/10.1594/PANGAEA.887477
- https://doi.pangaea.de/10.1594/PANGAEA.887470

For further information on the GSIM dataset see:
- https://doi.org/10.5194/essd-10-765-2018
- https://doi.org/10.5194/essd-10-787-2018

For further information on GSWP3, see:
- https://doi.org/10.20783/DIAS.501
- https://hyungjun.github.io/GSWP3.DataDescription
- http://hydro.iis.u-tokyo.ac.jp/GSWP3/exp1.html

**NOTE**: we're having some trouble uploading the actual images of the handwritten names. Stay tuned. This dataset contains links to images of handwritten names along with human contributors’ transcription of these written names. Over 125,000 examples of first or last names. Most names are French, making this dataset of particular interest for work on dealing with accent marks in handwritten character recognition. Acknowledgments Data was provided by the [Data For Everyone Library](https://www.crowdflower.com/data-for-everyone/) on [Crowdflower](https://www.crowdflower.com). Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever. The Data A file `handwritten_names.csv` that contains the following fields:
- **_unit_id**: a unique id for the image
- **image_url**: the path to the image; begins with "images/"
- **transcription**: the (typed) name
- **first_or_last**: whether it's a first name or a last name

A folder `images` that contains each of the image files.
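As a minimal sketch of how the four fields of `handwritten_names.csv` might be read, the example below parses a small in-memory sample with Python's standard `csv` module. Only the column names come from the description above; the sample rows, the values `first` and `last`, and the helper names `load_names` and `group_transcriptions` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows for illustration; only the four column names
# come from the dataset description.
SAMPLE_CSV = (
    "_unit_id,image_url,transcription,first_or_last\n"
    "1,images/0001.jpg,Élodie,first\n"
    "2,images/0002.jpg,Lefèvre,last\n"
    "3,images/0003.jpg,Benoît,first\n"
)


def load_names(handle):
    """Read all rows into a list of dicts keyed by the CSV header."""
    return list(csv.DictReader(handle))


def group_transcriptions(rows):
    """Group the typed names by the value of the first_or_last column."""
    groups = {}
    for row in rows:
        groups.setdefault(row["first_or_last"], []).append(row["transcription"])
    return groups


rows = load_names(io.StringIO(SAMPLE_CSV))
groups = group_transcriptions(rows)
```

With the real file, the same functions could be called on `open("handwritten_names.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption and may need adjusting, since most names contain accented characters.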
Raw Data Files **This data set contains Bitcoin data for years 2009-2011. For years 2011-2018 (~45GB), please see https://github.com/cakcora/CoinWorks/blob/master/data.MD** We provide input and output edges of transactions. This data is divided into yearly and monthly files. Each year's data is zipped together and contains 12 input edge files and 12 output edge files of transactions that were mined in the blocks of that year/month. Each line in the input edge file is tab separated with the format:
```
Unix time of transaction\thash of transaction\thash of first input transaction\tindex of output from first input transaction\thash of second input transaction\tindex of output from second input transaction\t(additional inputs, if exist)\r\n
```
Each line in the output edge file is tab separated with the format:
```
Unix time of transaction\thash of transaction\thash of first output address\tamount of first output bitcoins\thash of second output address\tamount of second output bitcoins\t(additional outputs, if exist)\r\n
```
![Bitcoin Graph][1]

Consider the Bitcoin graph in the figure above, where transactions and addresses are shown with rectangles and circles, respectively. This graph would be given in two files: inputsYear_Month.txt and outputsYear_Month.txt. Files would include these lines:

-- inputsYear_Month.txt
```
UnixTimeOft_1 HashOft_1 HashOft_x1 0 HashOft_x2 8
UnixTimeOft_2 HashOft_1 HashOft_x3 1 HashOft_x4 3 HashOft_x5 0
UnixTimeOft_3 HashOft_1 1
UnixTimeOft_4 HashOft_3 2 HashOft_2 0
```
-- outputsYear_Month.txt
```
UnixTimeOft_1 HashOft_1 HashOfa_6 10^8 HashOfa_7 0.8^0.8
UnixTimeOft_2 HashOft_2 HashOfa_8 3.8*10^8
UnixTimeOft_3 HashOft_3 HashOfa_9 0.2*10^8 HashOfa_10 0.2*10^8 HashOfa_11 0.3*10^8
UnixTimeOft_4 HashOft_4 HashOfa_12 3.7*10^8 HashOfa_13 0.3*10^8
```
- [2009 data (0.1MB)](https://utdallas.box.com/s/73i8q4g59ceoum9scc4kkbhi4ritmueg)
- [2010 data (15MB)](https://utdallas.box.com/s/6g2li4ls8zk2wfnf3tsl3gsr713r4pms)
- [2011 data (300MB)](https://utdallas.box.com/s/bu30643q4l0a79b4907c2a51tx31s16a)
- [2012 data (1.2GB)](https://utdallas.box.com/s/vb60kxanb2yifq2yaozviu6nsojnzm1c)
- [2013 data (3.2GB)](https://utdallas.box.com/s/t2w1dc4xbds377lfgxulzk44sr6fwj1t)
- [2014 data (5.2GB)](https://utdallas.box.com/s/xrh9bw8ctmy0kuvx24h6b8v53tdwi127)
- [2015 data (9.6GB)](https://utdallas.box.com/s/zl1n1wh1dqgcicj59qvd8cmas2iz936y)
- [2016 inputs (8.1GB)](https://utdallas.box.com/s/vuog5rneci364h4m6w5f8eursk2ym786)
- [2016 outputs (8.5GB)](https://utdallas.box.com/s/9wozbdip3yjkfxgnqkf6x3jww9v5rm3m)
- [2017 data until August (13.2GB)](https://utdallas.box.com/s/atscqz8cle50rc5abvbc4ct20qdqoyhi)

Please [visit the full dataset page][2] for your data related questions.

[1]: https://user-images.githubusercontent.com/6596905/38154759-80cbf57a-3439-11e8-8d84-9706e5825d5c.png
[2]: https://github.com/cakcora/CoinWorks/blob/master/data.MD

Context Wikipedia, the world's largest encyclopedia, is a crowdsourced open knowledge project and website with millions of individual web pages. This dataset is a grab of the title of every article on Wikipedia as of September 20, 2017.
Content This dataset is a simple newline (`\n`) delimited list of article titles. No distinction is made between redirects (like `Schwarzenegger`) and actual article pages (like `Arnold Schwarzenegger`). Acknowledgements This dataset was created by scraping [Special:AllPages](https://en.wikipedia.org/w/index.php?title=Special:AllPages) on Wikipedia. It was originally shared [here](https://www.reddit.com/r/datasets/comments/71954f/a_list_of_all_14mil_english_wikipedia_article/). Inspiration
* What are common article title tokens? How do they compare against frequent words in the English language?
* What is the longest article title? The shortest?
* What countries are most popular within article titles?

List of all files

*Readme file*
- 00_readme.txt

*Monthly grids - ensemble means*
- 01_monthly_grids_ensemble_means_allmodels.zip

*Monthly grids - ensembles, model 1 to 6*
- 02_monthly_grids_ensemble_JPL_MSWEP_1979_2016.zip
- 02_monthly_grids_ensemble_JPL_GSWP3_1979_2014.zip
- 02_monthly_grids_ensemble_JPL_ERA5_1979_2018.zip
- 02_monthly_grids_ensemble_GSFC_MSWEP_1979_2016.zip
- 02_monthly_grids_ensemble_GSFC_GSWP3_1979_2014.zip
- 02_monthly_grids_ensemble_GSFC_ERA5_1979_2018.zip

*Daily grids - ensemble means, model 1 to 6*
- 03_daily_grids_ensemble_means_JPL_MSWEP_1979_2016.zip
- 03_daily_grids_ensemble_means_JPL_GSWP3_1979_2014.zip
- 03_daily_grids_ensemble_means_JPL_ERA5_1979_2018.zip
- 03_daily_grids_ensemble_means_GSFC_MSWEP_1979_2016.zip
- 03_daily_grids_ensemble_means_GSFC_GSWP3_1979_2014.zip
- 03_daily_grids_ensemble_means_GSFC_ERA5_1979_2018.zip

*Global averages - daily and monthly time series*
- 04_global_averages_allmodels.zip

Content of readme

GRACE TWS Reconstruction (GRACE_REC_v03)

The dataset contains reconstructed time series of daily and monthly anomalies of terrestrial water storage (TWS) based on two different GRACE solutions and three different meteorological forcing datasets.
There is a total of 6 different models:
- JPL_MSWEP - trained with GRACE JPL mascons, forced with MSWEP forcing (1979-2016)
- JPL_GSWP3 - trained with GRACE JPL mascons, forced with GSWP3 forcing (1901-2014)
- JPL_ERA5 - trained with GRACE JPL mascons, forced with ERA5 forcing (1979-present)
- GSFC_MSWEP - trained with GRACE GSFC mascons, forced with MSWEP forcing (1979-2016)
- GSFC_GSWP3 - trained with GRACE GSFC mascons, forced with GSWP3 forcing (1901-2014)
- GSFC_ERA5 - trained with GRACE GSFC mascons, forced with ERA5 forcing (1979-present)

The reconstruction aims at reproducing the sub-decadal climate-driven variability observed in the GRACE data. Seasonal cycle and human impacts on TWS are not reconstructed. A GRACE-based seasonal cycle is provided for convenience and should be used with the awareness that in reality long-term changes in the shape of the seasonal cycle might potentially occur. Long-term signals (trends over a period >15 years) are removed during the model calibration procedure but are still present in the final dataset. The interpretation of the reconstructed long-term trends should be done with particular caution and the awareness that there can be some uncertainty in the reconstructed trends.

For most applications, uncertainty ranges can be derived from the 100 ensemble members available for each model.

The grids are stored in NetCDFv4 files in units of mm (kg m^-2). Although the data is provided on a 0.5 degrees grid, the effective spatial resolution should be considered to be 3 degrees, similar to the original resolution of the GRACE datasets. This might need to be taken into account when comparing this dataset against other sources.

The global means are stored as csv files in units of Gt of water. To convert back to mm of water, use the land area values given in the reference paper below.

When using this dataset, please cite: Humphrey V. & Gudmundsson L. (submitted).
GRACE-REC: A reconstruction of climate-driven water storage changes over the last century. Earth System Science Data Discussions.

Vincent Humphrey, May 2019
California Institute of Technology
Your feedback is always welcome: vincent.humphrey[-a-t-]caltech.edu (vincent.humphrey[-a-t-]bluewin.ch)

Abstract The amount of water stored on continents is an important constraint for water mass and energy exchanges in the Earth system and exhibits large inter-annual variability at both local and continental scales. From 2002 to 2017, the satellites of the Gravity Recovery and Climate Experiment mission (GRACE) observed changes in terrestrial water storage (TWS) with an unprecedented level of accuracy. In this paper, we use a statistical model trained with GRACE observations to reconstruct past climate-driven changes in TWS from historical and near real time meteorological datasets at daily and monthly scales. Unlike most hydrological models which represent water reservoirs individually (e.g. snow, soil moisture, etc.) and usually provide a single model run, the presented approach directly reconstructs total TWS changes and includes hundreds of ensemble members which can be used to quantify predictive uncertainty. We compare these data-driven TWS estimates with other independent evaluation datasets such as the sea level budget, large-scale water balance from atmospheric reanalysis and in-situ streamflow measurements. We find that the presented approach performs overall as well as or better than a set of state-of-the-art global hydrological models (Water Resources Reanalysis version 2). We provide reconstructed TWS anomalies at a spatial resolution of 0.5°, at both daily and monthly scales over the period 1901 to present, based on two different GRACE products and three different meteorological forcing datasets, resulting in 6 reconstructed TWS datasets of 100 ensemble members each.
Possible user groups and applications include hydrological modelling and model benchmarking, sea level budget studies, assessments of long-term changes in the frequency of droughts, the analysis of climate signals in geodetic time series and the interpretation of the data gap between the GRACE and the GRACE Follow-On mission. Check reference for additional details and caveats.

Reference: Humphrey V. & Gudmundsson L. (submitted). GRACE-REC: A reconstruction of climate-driven water storage changes over the last century. Earth System Science Data Discussions.

Datasets for Markov motif analysis, in inhibitory and excitatory clustered er networks, er networks, and increased weight networks.

SPEECH-COCO is an augmentation of the MS-COCO dataset where speech is added to image and text. Speech captions were generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (>600h) paired with images. Disfluencies and speed perturbation were added to the signal in order to sound more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode for each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks including speech input or output instead of text.

The Chiang Saen city GIS dataset including space syntax's integration of street and point data of housing.

This dataset contains key characteristics about the data described in the Data Descriptor De novo transcriptome assembly and analysis of the freshwater araphid diatom Fragilaria radians, Lake Baikal. Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
3. machine readable metadata file in ISA-Tab format (zipped folder)

This is a full archive of metadata about papers on arxiv.org from 1993-2018, including abstracts.
Data is tidy and packed in TSV files, in two different collections of the total dataset: per year (all categories) and per primary category (all years). This archive also includes Jupyter notebooks for unpacking and analyzing it in Python. See the README.md file and https://github.com/staeiou/arxiv_archive for more information.

An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc. This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices. **Fields** The dataset contains 16 columns. Target field: Income -- The income is divided into two classes: <=50K and >50K. Number of attributes: 14 -- These are the demographics and other features to describe a person. We can explore the possibility of predicting income level based on the individual’s personal information. **Acknowledgements** This dataset, named “adult”, is found in the UCI machine learning repository [http://www.cs.toronto.edu/~delve/data/adult/desc.html][1] The detailed description of the dataset can be found in the original UCI documentation [http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html][2]

[1]: http://www.cs.toronto.edu/~delve/data/adult/desc.html
[2]: http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html

Context Text summarization is a way to condense a large amount of information into a concise form by selecting important information and discarding unimportant and redundant information. With the amount of textual information present on the world wide web, the area of text summarization is becoming very important. Extractive summarization is the approach in which the exact sentences present in the document are used as summaries.
Extractive summarization is simpler and is currently the general practice among automatic text summarization researchers. The extractive summarization process involves giving scores to sentences using some method and then using the sentences that achieve the highest scores as summaries. As the exact sentences present in the document are used, the semantic factor can be ignored, which results in a less computationally intensive summarization procedure. This kind of summary is generally completely unsupervised and language independent too. Although this kind of summary does its job in conveying the essential information, it may not necessarily be smooth or fluent. Sometimes there can be almost no connection between adjacent sentences in the summary, resulting in text that lacks readability. Content This dataset for extractive text summarization has four hundred and seventeen political news articles of BBC from 2004 to 2005 in the News Articles folder. For each article, five summaries are provided in the Summaries folder. The first clause of the text of articles is the respective title. Acknowledgements This dataset was created using a dataset used for data categorization that consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005, used in the paper by D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC. More at http://mlg.ucd.ie/datasets/bbc.html

Context Satellite imagery provides unique insights into various markets, including agriculture, defense and intelligence, energy, and finance. New commercial imagery providers, such as [Planet](https://www.planet.com/), are using constellations of small satellites to capture images of the entire Earth every day.
This flood of new imagery is outgrowing the ability of organizations to manually look at each image that gets captured, and there is a need for machine learning and computer vision algorithms to help automate the analysis process. The aim of this dataset is to help address the difficult task of detecting the location of large ships in satellite images. Automating this process can be applied to many issues including monitoring port activity levels and supply chain analysis. Content The dataset consists of image chips extracted from Planet satellite imagery collected over the San Francisco Bay and San Pedro Bay areas of California. It includes 4000 80x80 RGB images labeled with either a "ship" or "no-ship" classification. Image chips were derived from PlanetScope full-frame visual scene products, which are orthorectified to a 3 meter pixel size. Provided is a zipped directory `shipsnet.zip` that contains the entire dataset as .png image chips. Each individual image filename follows a specific format: {label} __ {scene id} __ {longitude} _ {latitude}.png
- **label:** Valued 1 or 0, representing the "ship" class and "no-ship" class, respectively.
- **scene id:** The unique identifier of the PlanetScope visual scene the image chip was extracted from. The scene id can be used with the [Planet API](https://www.planet.com/docs/reference/data-api/) to discover and download the entire scene.
- **longitude_latitude:** The longitude and latitude coordinates of the image center point, with values separated by a single underscore.

The dataset is also distributed as a JSON formatted text file `shipsnet.json`. The loaded object contains **data**, **labels**, **scene_ids**, and **locations** lists. The pixel value data for each 80x80 RGB image is stored as a list of 19200 integers within the **data** list. The first 6400 entries contain the red channel values, the next 6400 the green, and the final 6400 the blue.
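The channel layout of one **data** entry can be sketched in plain Python as follows. This is only an illustration, not the dataset authors' loader: the function name `chip_to_rows` and the synthetic chip are hypothetical, and the sketch assumes that each 6400-value channel block is stored row by row (row-major).

```python
def chip_to_rows(flat, size=80):
    """Turn one flat 3*size*size list into rows of (red, green, blue) tuples.

    Assumes the three channel blocks are stored one after another, each in
    row-major order, as the dataset description states.
    """
    n = size * size
    if len(flat) != 3 * n:
        raise ValueError(f"expected {3 * n} values, got {len(flat)}")
    red, green, blue = flat[:n], flat[n:2 * n], flat[2 * n:]
    return [
        [(red[r * size + c], green[r * size + c], blue[r * size + c])
         for c in range(size)]
        for r in range(size)
    ]


# Synthetic chip (not real data): constant channels, with one red value
# changed at row 1, column 1 so the indexing can be checked.
flat = [1] * 6400 + [2] * 6400 + [3] * 6400
flat[1 * 80 + 1] = 99
image = chip_to_rows(flat)
```

Here `image[row][col]` gives the pixel as a `(red, green, blue)` tuple, so `image[1][1]` picks up the modified red value while its green and blue values stay unchanged.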
The image is stored in row-major order, so that the first 80 entries of the array are the red channel values of the first row of the image. The list values at index *i* in **labels**, **scene_ids**, and **locations** each correspond to the *i*-th image in the **data** list. Class Labels The "ship" class includes 1000 images. Images in this class are near-centered on the body of a single ship. Ships of different sizes, orientations, and atmospheric collection conditions are included. Example images from this class are shown below.

![ship](https://i.imgur.com/tLsSoTz.png)

The "no-ship" class includes 3000 images. A third of these are a random sampling of different landcover features - water, vegetation, bare earth, buildings, etc. - that do not include any portion of a ship. The next third are "partial ships" that contain only a portion of a ship, but not enough to meet the full definition of the "ship" class. The last third are images that have previously been mislabeled by machine learning models, typically caused by bright pixels or strong linear features. Example images from this class are shown below.

![no-plane](https://imgur.com/Q3daQMC.png)

Acknowledgements Satellite imagery used to build this dataset is made available through Planet's [Open California](https://www.planet.com/products/open-california/) dataset, which is [openly licensed](https://creativecommons.org/licenses/by-sa/4.0/). As such, this dataset is also available under the same CC-BY-SA license. Users can sign up for a free Planet account to search, view, and download their imagery and gain access to their API.

The database presented here contains radiogenic neodymium and strontium isotope ratios measured on both terrestrial and marine sediments. It was compiled to help assess sediment provenance and transport processes for various time intervals. This can be achieved by either mapping sediment isotopic signature and/or fingerprinting source areas using statistical tools (e.g.
Blanchet, 2018b, 2018a). The database has been built by incorporating data from the literature and the SedDB database and harmonizing the metadata, especially units and geographical coordinates. The original data were processed in three steps. Firstly, specific attention was devoted to providing geographical coordinates for each sample so that the data can be mapped. When available, the original geographical coordinates from the reference (generally DMS coordinates, with different precision standards) were converted to decimal degrees. When coordinates were not provided, an approximate location was derived from available information in the original publication. Secondly, all samples were assigned a set of standardized criteria that help split the dataset into specific categories. We defined categories associated with the sample location ("Region", "Sub-region", "Location", which relate to location at continental to city/river scale) or with the sample types (terrestrial samples – “aerosols”, “soil sediments”, “river sediments”, “rocks” - or marine samples –“marine sediment” or “trap sample”). Thirdly, samples were discriminated according to their deposition age, which made it possible to compute average values for specific time intervals (see attached table "Age_determination_Sediment_Cores_V2.txt"). A first version of the database was published in September 2018 and presented data for the African sector. A second version was published in April 2019, in which the dataset has been extended to reach a global extent. The dataset will be further updated bi-annually to increase the geographical resolution and/or add other types of samples. This dataset consists of two tab separated tables: "Dataset_Nd_Sr_isotopes_V2.txt" and "Age_determination_Sediment_Cores_V2.txt". "Dataset_Nd_Sr_isotopes_V2.txt" contains the assembled dataset of marine and terrestrial Nd and/or Sr concentration and isotopes, together with sorting criteria and geographical locations.
"Age_determination_Sediment_Cores_V2.txt" contains all background information concerning the determination of the isotopic signature of specific time intervals (depth interval, number of samples, mean and standard deviation). Column headers are explained in respective metadata comma-separated files. A full reference list is provided in the file “References_Database_Nd_Sr_isotopes_V2.rtf”. Finally, R code for mapping the data and running statistical analyses is also available for this dataset (Blanchet, 2018b, 2018a).

A number of zipped files containing datasets and supplementary information focusing on the catalytic activity of silver nanoparticles.

Supplementary information Figure 2 a) and b): Characterisation of Ag Ab NPs using extinction spectroscopy and dynamic light scattering. Dataset contains DSW files of Ag Ab NPs with different concentrations of Ab on the surface, obtained using a Cary UV-Vis instrument, and an Excel file with the extinction spectra plotted together and size information from the DLS.

Figure 3 a): Catalytic activity of silver nanoparticles assessed with extinction spectroscopy. Dataset contains DSW files obtained using a Cary 300 Bio UV-Vis spectrometer with Cary software. Also contains Excel files with normalised extinction spectra for each sample and a graph of all samples plotted together.

Figure 3 b): Catalytic activity of silver nanoparticles. This data set contains SPC files for each sample taken on a 638 nm Raman spectrometer using Ocean Optics software. An Excel spreadsheet is also included, which has the average spectrum for each sample and a graph containing all averaged spectra.

Figure 4: Catalytic activity of silver nanoparticles conjugated to antibodies. Data set includes SPC files for all samples taken on a 638 nm Raman spectrometer with Ocean Optics software and an Excel file containing average spectra for each sample. The Excel file also has a graph of the average SERRS spectrum for each sample plotted together.

Figure 5 a) and b): Analysis of SLISA.
Dataset contains SPC files for each concentration and its replicates. The SPC files were obtained using 638 nm laser excitation on an InVia Renishaw instrument. The dataset also contains an Excel file with averages for each concentration, the average spectra plotted against each other, and a limit-of-detection graph.

Context
Analyzing research trends is an interesting and important task, but many conferences do not publish their accepted-paper lists in a usable format. For that reason, I am sharing a dataset of recent [ACL](https://en.wikipedia.org/wiki/Association_for_Computational_Linguistics) accepted papers!

Content
This dataset includes ACL accepted papers (long & short) from 2016 to 2018.

* [ACL 2016 accepted papers](http://mirror.aclweb.org/acl2016/indexa779.html?article_id=68)
* [ACL 2017 accepted papers](https://acl2017.wordpress.com/2017/04/05/accepted-papers-and-demonstrations/)
* [ACL 2018 accepted papers](https://acl2018.org/programme/papers/)

If an [arXiv](https://arxiv.org/) version of a paper exists, its summary and URL are also included. The source code used to build the dataset is shared on [GitHub](https://github.com/icoxfog417/get_acl_papers).

Microsatellite dataset used in an ABC-RF study to retrace the invasion route of the Asian Longhorned Beetle. Specimens are sorted by population.

Decision trees are characterized by fast induction time and comprehensible classification rules. However, their classification accuracies are relatively low compared with black-box classifiers such as support vector machines. It is often possible to improve decision tree accuracies by combining them via boosting or bagging to form an ensemble of trees (i.e., forests). Unfortunately, ensemble approaches cause the decision trees to lose their comprehensibility and significantly lengthen their induction time. The invention of the alternating decision tree (ADTree) allows the incorporation of boosting within a single decision tree to retain comprehensibility. 
However, the existing ADTree is univariate in nature which limits its potential to further improve the accuracy and induction time. This thesis presents the multivariate alternating decision tree, whereby multivariate decision nodes are incorporated into the ADTree learning algorithm. It can be considered as a generalization of the existing univariate ADTree. Three different variants of multivariate ADTrees are presented in this thesis, namely the Fisher’s ADTree, Sparse ADTree and regularized LogitBoost ADTree (rLADTree). These algorithms were benchmarked against other existing univariate, multivariate and ensemble-based decision trees using real-world datasets from the University of California, Irvine Machine Learning Repository and University of Eastern Finland Spectral Database. It is shown that the Fisher’s ADTree is capable of improving the accuracy of multivariate decision trees through boosting. At the same time it remains to be significantly smaller than boosted multivariate decision trees. It is also shown that the Sparse ADTree is a non-parametric extension of the Sparse Linear Discriminant Analysis (SLDA). It is therefore able to linearly partition the data when it is beneficial to do so, or to grow a tree to improve the classification accuracy when necessary. The most notable multivariate ADTree variant is the regularized LADTree, which is characterized by having no statistically significant differences in all performance metrics and offering comprehensibility when compared with the univariate, unboosted decision trees like C4.5 and CART for general datasets. For datasets with highly correlated features, the regularized LADTree outperforms the existing decision trees in terms of accuracy and comprehensibility, making it a top choice among decision tree classifiers. Context If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again. 
Content Fields in the dataset: - Name: Name of cereal - mfr: Manufacturer of cereal - A = American Home Food Products; - G = General Mills - K = Kelloggs - N = Nabisco - P = Post - Q = Quaker Oats - R = Ralston Purina - type: - cold - hot - calories: calories per serving - protein: grams of protein - fat: grams of fat - sodium: milligrams of sodium - fiber: grams of dietary fiber - carbo: grams of complex carbohydrates - sugars: grams of sugars - potass: milligrams of potassium - vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended - shelf: display shelf (1, 2, or 3, counting from the floor) - weight: weight in ounces of one serving - cups: number of cups in one serving - rating: a rating of the cereals (Possibly from Consumer Reports?) Acknowledgements These datasets have been gathered and cleaned up by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. The original source can be found [here][1] This dataset has been converted to CSV Inspiration Eat too much sugary cereal? Ruin your appetite with this dataset! [1]: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets A simple data set to get started with data analysis using python pandas This zipped file contains all input files for simulations, all input files for the bat dataset, and output files from the analyses of simulated data and the bat data Newick tree file comprising 449 angiosperm taxa found in dataset. Tip labels are codes, identifiable to species by cross-referencing with species table. Files related to the ML analysis of the complete dataset. Dataset for:Micro and macroscale drivers of nutrient concentrations in streams in South, Central and North America PONE-D-16-20557 Dataset Each column of this dataset represents the histogram of each image of the image dataset. 
dataset for nipype tutorials at http://nipy.org/nipype/users/pipeline_tutorial.html NFL-Statistics-Scrape Here are the basic statistics, career statistics and game logs provided by the NFL on their website (http://www.nfl.com) for all players past and present. Summary The data was scraped using a Python code. The code can be located at Github: https://github.com/kendallgillies/NFL-Statistics-Scrape Explanation of Data 1. The first main group of statistics is the basic statistics provided for each player. This data is stored in the CSV file titled Basic_Stats.csv along with the player’s name and URL identifier. If available the data pulled for each player is as follows: 1. Number 2. Position 3. Current Team 4. Height 5. Weight 6. Age 7. Birthday 8. Birth Place 9. College Attended 10. High School Attended 11. High School Location 12. Experience 2. The second main group of statistics gathered for each player are their career statistics. While each player has a main position they play, they will have statistics in other areas; therefore, the career statistics are divided into statistics types. The statistics are then stored in CSV files based on statistic type along with the player name, URL identifier and position (if available). The following are the career statistics types and accompanying CSV file names: 1. Defensive Statistics – Career_Stats_Defensive.csv 2. Field Goal Kickers - Career_Stats_Field_Goal_Kickers.csv 3. Fumbles - Career_Stats_Fumbles.csv 4. Kick Return - Career_Stats_Kick_Return.csv 5. Kickoff - Career_Stats_Kickoff.csv 6. Offensive Line - Career_Stats_Offensive_Line.csv 7. Passing - Career_Stats_Passing.csv 8. Punt Return - Career_Stats_Punt_Return.csv 9. Punting - Career_Stats_Punting.csv 10. Receiving - Career_Stats_Receiving.csv 11. Rushing - Career_Stats_Rushing.csv 3. The final group of statistics is the game logs for each player. The game logs are stored by position and have the player name, URL identifier and position (if available). 
The following are the game log types and accompanying CSV file names: 1. Quarterback – Game_Logs_Quarterback.csv 2. Running back – Game_Logs_Runningback.csv 3. Wide Receiver and Tight End – Game_Logs_Wide_Receiver_and_Tight_End.csv 4. Offensive Line – Game_Logs_Offensive_Line.csv 5. Defensive Lineman – Game_Logs_Defensive_Lineman.csv 6. Kickers – Game_Logs_Kickers.csv 7. Punters – Game_Logs_Punters.csv Glossary While most of the abbreviations used by the NFL have been translated in the table headers in the data files, there are still a couple of abbreviations used. * FG: Field Goal * TD: Touchdown * Int: Interception MATLAB dataset. Full matrix capture data from the ultrasonic phased array inspection of welded 316L stainless steel plates containing a 6mm lack-of-fusion flaw at a 50 degree angle with respect to the x-axis. Further experimental details can be found in the attached metadata file. This data was collected under an RCNDE targeted project which involved collaboration between the Mathematics and Statistics Department at Strathclyde, the Centre of Ultrasonic Engineering at Strathclyde and 5 industrial partners (AMEC, NNL, Rolls Royce, Shell and Weidlinger). The dataset here will be made publicly accessible under EPSRC regulations. The RDF triples below define a vocabulary for describing processes run on linked data datasets. This vocabulary was originally intended as administrative and provenance data for RDF datasets produced at the University of Washington Libraries. As described on the original website: There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). 
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms. The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised perspective. The original dataset consisted of 92 x 112 images, while the version available here consists of 64x64 images. Credit to AT&T Laboratories Cambridge for the images.

This dataset contains images of ten different knots tied with two types of climbing rope. The ten knots are:

- The Alpine Butterfly Knot
- The Bowline Knot
- The Clove Hitch
- The Figure-8 Knot
- The Figure-8 Loop
- The Fisherman's Knot
- The Flemish Bend
- The Overhand Knot
- The Reef Knot
- The Slip Knot

Every knot was photographed in many different conditions. Each knot was photographed at four different z-axis rotations. Each knot was photographed in three different lighting conditions. Each knot was photographed at three different tensions. Each knot was photographed with two different backgrounds. Capturing each knot in these conditions with both rope types (4 x 3 x 3 x 2 x 2) resulted in 144 images per knot and 1440 images in total for the entire 10Knots dataset. This dataset was originally created to train a convolutional neural network implemented in Keras to classify images of these ten knots.

Context
This dataset contains a lot of historical sales data. It was extracted from a top Brazilian retailer and has many SKUs and many stores. The data was transformed to protect the identity of the retailer.

Content
[TBD]

Acknowledgements
This data would not be available without the full collaboration of our customers, who understand that sharing their core and strategic information has more advantages than possible hazards. 
They also support our continuous development of innovative ML systems across their value chain.

Inspiration
Every retail business in the world faces a fundamental question: how much inventory should I carry? On the one hand, too much inventory means working capital costs, operational costs and a complex operation. On the other hand, a lack of inventory leads to lost sales, unhappy customers and a damaged brand. Current inventory management models have many solutions for placing the correct order, but they all depend on a single unknown factor: the demand for the next periods. This is why short-term forecasting is so important in the retail and consumer goods industry. We encourage you to seek the best demand forecasting model for the next 2-3 weeks. This valuable insight can help many supply chain practitioners to correctly manage their inventory levels.

Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. ‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. 
We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra. Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered? Supplemental Data. Walker et al. (2017). Plant Cell 10.1105/tpc.16.00961. Supplemental Dataset 8. Gene family regulation across cell types and treatments. (A) Summary of gene family regulation: number DE/non-DE, in which timeseries they are regulated and if transcripts are DE in one or both cell types. (B) Lists of the genes in each family together with cluster numbers if differentially expressed in a timeseries. Context Things like Block chain, Bitcoin, Bitcoin cash, Ethereum, Ripple etc are constantly coming in the news articles I read. So I wanted to understand more about it and [this post][1] helped me get started. Once the basics are done, the data scientist inside me started raising questions like: 1. How many cryptocurrencies are there and what are their prices and valuations? 2. Why is there a sudden surge in the interest in recent days? For getting answers to all these questions (and if possible to predict the future prices ;)), I started collecting data from [coinmarketcap][2] about the cryptocurrencies. So what next? 
Now that we have the price data, I wanted to dig a little deeper into the factors affecting the price of coins. I started off with Bitcoin, and there are quite a few parameters which affect its price. Thanks to [Blockchain Info][3], I was able to get quite a few parameters once every two days. This will help in understanding the other factors related to the Bitcoin price and in making better predictions than the historical price alone would allow.

Content
The dataset has one csv file for each currency. Price history is available on a daily basis from April 28, 2013. This dataset has the historical price information of some of the top crypto currencies by market capitalization. The currencies included are:

- Bitcoin
- Ethereum
- Ripple
- Bitcoin cash
- Bitconnect
- Dash
- Ethereum Classic
- Iota
- Litecoin
- Monero
- Nem
- Neo
- Numeraire
- Stratis
- Waves

Each currency file has the following columns:

- Date : date of observation
- Open : Opening price on the given day
- High : Highest price on the given day
- Low : Lowest price on the given day
- Close : Closing price on the given day
- Volume : Volume of transactions on the given day
- Market Cap : Market capitalization in USD

**Bitcoin Dataset (bitcoin_dataset.csv) :** This dataset has the following features.

- Date : Date of observation
- btc_market_price : Average USD market price across major bitcoin exchanges.
- btc_total_bitcoins : The total number of bitcoins that have already been mined.
- btc_market_cap : The total USD value of bitcoin supply in circulation.
- btc_trade_volume : The total USD value of trading volume on major bitcoin exchanges.
- btc_blocks_size : The total size of all block headers and transactions.
- btc_avg_block_size : The average block size in MB.
- btc_n_orphaned_blocks : The total number of blocks mined but ultimately not attached to the main Bitcoin blockchain.
- btc_n_transactions_per_block : The average number of transactions per block.
- btc_median_confirmation_time : The median time for a transaction to be accepted into a mined block.
- btc_hash_rate : The estimated number of tera hashes per second the Bitcoin network is performing.
- btc_difficulty : A relative measure of how difficult it is to find a new block.
- btc_miners_revenue : Total value of coinbase block rewards and transaction fees paid to miners.
- btc_transaction_fees : The total value of all transaction fees paid to miners.
- btc_cost_per_transaction_percent : miners revenue as percentage of the transaction volume.
- btc_cost_per_transaction : miners revenue divided by the number of transactions.
- btc_n_unique_addresses : The total number of unique addresses used on the Bitcoin blockchain.
- btc_n_transactions : The number of daily confirmed Bitcoin transactions.
- btc_n_transactions_total : Total number of transactions.
- btc_n_transactions_excluding_popular : The total number of Bitcoin transactions, excluding the 100 most popular addresses.
- btc_n_transactions_excluding_chains_longer_than_100 : The total number of Bitcoin transactions per day excluding long transaction chains.
- btc_output_volume : The total value of all transaction outputs per day.
- btc_estimated_transaction_volume : The total estimated value of transactions on the Bitcoin blockchain.
- btc_estimated_transaction_volume_usd : The estimated transaction value in USD value.
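
Two of the fields above, `btc_cost_per_transaction` and `btc_cost_per_transaction_percent`, are defined as ratios of other columns, so they can be recomputed as a sanity check. The following is a minimal sketch under stated assumptions: it uses pandas, the frame holds made-up numbers rather than rows from bitcoin_dataset.csv, the helper name `recompute_cost_fields` is hypothetical, and the choice of `btc_estimated_transaction_volume_usd` as the "transaction volume" denominator is an assumption that the description does not confirm.

```python
import pandas as pd


def recompute_cost_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute the two cost columns from their documented definitions."""
    out = df.copy()
    # btc_cost_per_transaction: miners revenue divided by the number of transactions
    out["cost_per_transaction_check"] = (
        out["btc_miners_revenue"] / out["btc_n_transactions"]
    )
    # btc_cost_per_transaction_percent: miners revenue as a percentage of the
    # transaction volume (ASSUMPTION: the USD estimated volume is the denominator)
    out["cost_percent_check"] = (
        100 * out["btc_miners_revenue"] / out["btc_estimated_transaction_volume_usd"]
    )
    return out


# Illustrative values only; not taken from bitcoin_dataset.csv
sample = pd.DataFrame(
    {
        "btc_miners_revenue": [1000.0, 2000.0],
        "btc_n_transactions": [100, 400],
        "btc_estimated_transaction_volume_usd": [50000.0, 100000.0],
    }
)
checked = recompute_cost_fields(sample)
print(checked[["cost_per_transaction_check", "cost_percent_check"]])
```

On the real file, comparing the recomputed columns with the published ones would show whether the assumed denominator matches the one the data provider used.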
**Ethereum Dataset (ethereum_dataset.csv):** This dataset has the following features.

- Date(UTC) : Date of transaction
- UnixTimeStamp : unix timestamp
- eth_etherprice : price of ethereum
- eth_tx : number of transactions per day
- eth_address : Cumulative address growth
- eth_supply : Number of ethers in supply
- eth_marketcap : Market cap in USD
- eth_hashrate : hash rate in GH/s
- eth_difficulty : Difficulty level in TH
- eth_blocks : number of blocks per day
- eth_uncles : number of uncles per day
- eth_blocksize : average block size in bytes
- eth_blocktime : average block time in seconds
- eth_gasprice : Average gas price in Wei
- eth_gaslimit : Gas limit per day
- eth_gasused : total gas used per day
- eth_ethersupply : new ether supply per day
- eth_chaindatasize : chain data size in bytes
- eth_ens_register : Ethereum Name Service (ENS) registrations per day

Acknowledgements
This data is taken from [coinmarketcap][5] and it is [free][6] to use the data. The Bitcoin dataset is obtained from [Blockchain Info][7]. The Ethereum dataset is obtained from [Etherscan][8]. Cover Image : Photo by Thomas Malama on Unsplash

Inspiration
Some of the questions that could be explored with this dataset are:

1. How did the historical prices / market capitalizations of various currencies change over time?
2. Predicting the future price of the currencies
3. Which currencies are more volatile and which ones are more stable?
4. How do the price fluctuations of currencies correlate with each other?
5. Seasonal trend in the price fluctuations

The Bitcoin / Ethereum datasets could be used to look at the following:

1. Factors affecting the bitcoin / ether price.
2. Directional prediction of bitcoin / ether price (refer to [this paper][9] for more inspiration).
3. Actual bitcoin price prediction.
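
Questions 3 and 4 in the first list (relative volatility and correlation between currencies) are commonly approached through daily returns. The sketch below is one possible starting point, not the dataset author's method. It assumes pandas and relies only on the documented `Date` and `Close` columns; the two frames `coin_a` and `coin_b` are synthetic stand-ins for per-currency CSV files, and the helper name `daily_returns` is hypothetical.

```python
import pandas as pd


def daily_returns(prices: pd.DataFrame) -> pd.Series:
    """Day-over-day fractional change of Close, indexed by Date."""
    ordered = prices.sort_values("Date").set_index("Date")
    return ordered["Close"].pct_change().dropna()


dates = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"])

# Synthetic closing prices (illustrative only, not real market data)
coin_a = pd.DataFrame({"Date": dates, "Close": [100.0, 110.0, 99.0, 108.9]})
coin_b = pd.DataFrame({"Date": dates, "Close": [50.0, 51.0, 49.98, 50.9796]})

returns = pd.DataFrame(
    {"coin_a": daily_returns(coin_a), "coin_b": daily_returns(coin_b)}
)

volatility = returns.std()    # standard deviation of daily returns per currency
correlation = returns.corr()  # pairwise Pearson correlation of daily returns

print(volatility)
print(correlation)
```

With real files, each frame would come from reading one currency's CSV and parsing its `Date` column. Aligning the return series on the date index before calling `corr` matters because the currencies do not all start trading on the same day.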
[1]: https://www.linkedin.com/pulse/blockchain-absolute-beginners-mohit-mamoria [2]: https://coinmarketcap.com/ [3]: https://blockchain.info/ [4]: https://etherscan.io/charts [5]: https://coinmarketcap.com/ [6]: https://coinmarketcap.com/faq/ [7]: https://blockchain.info/ [8]: https://etherscan.io/charts [9]: http://cs229.stanford.edu/proj2014/Isaac%20Madan,%20Shaurya%20Saluja,%20Aojia%20Zhao,Automated%20Bitcoin%20Trading%20via%20Machine%20Learning%20Algorithms.pdf it is a dataset with defect4j, bugs-dot-jar and the extended dataset from ye[fse'14]. Nexus file for Dataset 2, 38-taxon dataset. Context Competitions like LUNA (http://luna16.grand-challenge.org) and the Kaggle Data Science Bowl 2017 (https://www.kaggle.com/c/data-science-bowl-2017) involve processing and trying to find lesions in CT images of the lungs. In order to find disease in these images well, it is important to first find the lungs well. This dataset is a collection of 2D and 3D images with manually segmented lungs. Challenge Come up with an algorithm for accurately segmenting lungs and measuring important clinical parameters (lung volume, PD, etc) Percentile Density (PD) The PD is the density (in Hounsfield units) the given percentile of pixels fall below in the image. The table includes 5 and 95% for reference. For smokers this value is often high indicating the build up of other things in the lungs. photos of furniture High-throughput experimental data are accumulating exponentially in public databases. Unfortunately, however, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either tackle batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. 
Consequently, there is a lack of research on the correction of batch effects with the presence of unknown subtypes. Here, we combine a location-and-scale adjustment model and model-based clustering into a novel hybrid one, the batch-effects-correction-with-unknown-subtypes model (BUS). BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying a linear-order computation complexity. We prove the identifiability of BUS and provide conditions for study designs under which batch effects can be corrected. BUS is evaluated by simulation studies and a real breast cancer dataset combined from three batches measured on two platforms. Results from the breast cancer dataset offer much better biological insights than existing methods. We implement BUS as a free Bioconductor package BUScorrect. Supplementary materials for this article are available online. We used different sensing techniques including time-lapse imagery, electric conductivity and stage measurements to generate a combined dataset of presence and absence of streamflow within a large number of nested sub-catchments in the Attert Catchment, Luxembourg. The first sites of observation were established in 2013 and successively extended to a total number of 182 in 2016 as part of the project “Catchments As Organized Systems” (CAOS, Zehe et al., 2014). Setup for time-lapse imagery measurements was inspired by Gilmore et al. (2013) while the setup for EC-sensor was proposed by Chapin et al. (2014). Temporal resolution ranged from 5 to 15 minutes intervals. Each single dataset was carefully processed and quality controlled before the time interval was homogenized to 30 minutes. 
The dataset provides valuable information on the dynamics of a meso-scale stream network in space and time. The Attert basin is located in the border region of Luxembourg and Belgium and covers an area of 247 km². The elevation of the catchment ranges from 245 m a.s.l. in Useldange to 549 m a.s.l. in the Ardennes. Climate conditions across the catchment are rather similar in terms of temperature and precipitation. Hydrological regimes are mainly driven by seasonal fluctuations in evapotranspiration causing flow to cease in intermittent reaches during dry periods. The catchment covers three predominant geologies: Slate, Marls and Sandstone. The dataset features data from catchments covering all geological characteristics from single geology to mixed geology. It can be used to test and evaluate hydrologic models, but also for the assessment of the intermittent stream ecosystem in the Attert basin.

The purple rapeseed leaves dataset contains a training set and a test set, both of which include RGB images cropped from a UAV orthoimage and the corresponding labels, each 256 × 256 pixels in size. The code for the U-Net model used in this experiment is also included in the folder.

Context
Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines.

Content
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus". The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Acknowledgements
Description taken from Wikipedia. I would like to thank Dr. Jason Brownlee, who has explained all the examples very nicely and clearly!

Copyright information: Taken from "Combining gene expression data from different generations of oligonucleotide arrays". BMC Bioinformatics 2004;5():159-159. Published online 25 Oct 2004. PMCID: PMC528726. Copyright © 2004 Hwang et al; licensee BioMed Central Ltd. The same RNA was hybridized on both HG-U95Av2 and HG-U133A arrays, for 14 samples. Three methods for matching the probes were considered, but the two datasets gave highly inconsistent results in cluster analysis and identification of differentially expressed genes. To improve the comparability in general, probe-level sequence information was exploited. All 25-mer probes were aligned to human genome sequences by BLAT and then filtered based on the length of their overlap with the probes on the other array. 
New expression indices were calculated using only the selected probes, and this results in higher reproducibility.

Bank Marketing

**Abstract:** The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

**Data Set Information:** The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Attribute Information:

Bank client data:

- Age (numeric)
- Job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
- Marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced' means divorced or widowed)
- Education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
- Default: has credit in default? (categorical: 'no', 'yes', 'unknown')
- Housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
- Loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

Related to the last contact of the current campaign:

- Contact: contact communication type (categorical: 'cellular','telephone')
- Month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- Day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known.
Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:

- Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- Previous: number of contacts performed before this campaign and for this client (numeric)
- Poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

Social and economic context attributes:

- Emp.var.rate: employment variation rate - quarterly indicator (numeric)
- Cons.price.idx: consumer price index - monthly indicator (numeric)
- Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- Euribor3m: euribor 3 month rate - daily indicator (numeric)
- Nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

- y - has the client subscribed to a term deposit? (binary: 'yes', 'no')

Analysis Steps:

- Attribute information analysis.
- Machine Learning (Logistic Regression, KNN, SVM, Decision Tree, Random Forest, Naive Bayes)
- Deep Learning (ANN)

Source:

- Dataset from : http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Normalized full metabolic dataset. Table S2. Seed weight in response to salinity for both seasons. Table S3. Putative QTLs for maturation percent. Table S4. Putative QTLs for RMC in SDF. Table S5. Putative QTLs for RMC in SDS. (XLSX 882 kb)

Context
This dataset was downloaded from INEP, a department of the Brazilian Education Ministry. It contains data from the applicants for the 2016 National High School Exam.

Content
Inside this dataset there are not only the exam results, but also the social and economic context of the applicants.

Acknowledgements
The original dataset is provided by INEP (http://portal.inep.gov.br/microdados). 
Inspiration The objective is to explore the dataset to achieve a better understanding of the social and economic context of the applicants in the exams results. Context MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of: * 100,000 ratings (1-5) from 943 users on 1682 movies. * Each user has rated at least 20 movies. * Simple demographic info for the users (age, gender, occupation, zip) The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file. Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. This dataset includes ANSI/IES TM-30-18 data for approximately 165,000 light source spectral power distributions. Context Most countries of the world define poverty as a lack of money. Yet poor people themselves consider their experience of poverty much more broadly. 
A person who is poor can suffer from multiple disadvantages at the same time – for example they may have poor health or malnutrition, a lack of clean water or electricity, poor quality of work or little schooling. Focusing on one factor alone, such as income, is not enough to capture the true reality of poverty. Multidimensional poverty measures can be used to create a more comprehensive picture. They reveal who is poor and how they are poor – the range of different disadvantages they experience. As well as providing a headline measure of poverty, multidimensional measures can be broken down to reveal the poverty level in different areas of a country, and among different sub-groups of people. Content Most recent MPI data harmonized for comparisons across time. OPHI researchers apply the AF method and related multidimensional measures to a range of different countries and contexts. Their analyses span a number of different topics, such as changes in multidimensional poverty over time, comparisons in rural and urban poverty, and inequality among the poor. For more information on OPHI’s research, see our [working paper series](http://www.ophi.org.uk/resources/ophi-working-papers/) and [research briefings](http://www.ophi.org.uk/resources/briefing-documents/). OPHI also calculates the Global Multidimensional Poverty Index [MPI](http://www.ophi.org.uk/multidimensional-poverty-index/), which has been published since 2010 in the United Nations Development Programme’s Human Development Report. The Global MPI is an internationally-comparable measure of acute poverty covering more than 100 developing countries. It is updated by OPHI twice a year and constructed using the AF method. The Alkire Foster (AF) method is a way of measuring multidimensional poverty developed by OPHI’s Sabina Alkire and James Foster. 
Building on the Foster-Greer-Thorbecke poverty measures, it involves counting the different types of deprivation that individuals experience at the same time, such as a lack of education or employment, or poor health or living standards. These deprivation profiles are analysed to identify who is poor, and then used to construct a multidimensional index of poverty (MPI). For free online video guides on how to use the AF method, see [OPHI’s online training portal](http://www.ophi.org.uk/teaching/online-training-portal/). To identify the poor, the AF method counts the overlapping or simultaneous deprivations that a person or household experiences in different indicators of poverty. The indicators may be equally weighted or take different weights. People are identified as multidimensionally poor if the weighted sum of their deprivations is greater than or equal to a poverty cut off – such as 20%, 30% or 50% of all deprivations. It is a flexible approach which can be tailored to a variety of situations by selecting different dimensions (e.g. education), indicators of poverty within each dimension (e.g. how many years schooling a person has) and poverty cut offs (e.g. a person with fewer than five years of education is considered deprived). The most common way of measuring poverty is to calculate the percentage of the population who are poor, known as the headcount ratio (H). Having identified who is poor, the AF method generates a unique class of poverty measures (Mα) that goes beyond the simple headcount ratio. Three measures in this class are of high importance: Adjusted headcount ratio (M0), otherwise known as the MPI: This measure reflects both the incidence of poverty (the percentage of the population who are poor) and the intensity of poverty (the percentage of deprivations suffered by each person or household on average). M0 is calculated by multiplying the incidence (H) by the intensity (A). M0 = H x A. 
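The identification and aggregation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not OPHI's implementation: the function name `alkire_foster_m0`, the 0/1 deprivation matrix layout (one row per person, one column per indicator) and the toy data are all assumptions made for the example.

```python
def alkire_foster_m0(deprivations, weights, cutoff):
    """Return (H, A, M0) for a 0/1 deprivation matrix (rows = people)."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalise weights so scores lie in [0, 1]
    # Weighted deprivation score of each person.
    scores = [sum(w * d for w, d in zip(norm, row)) for row in deprivations]
    # Identification: a person is poor if the score reaches the cut off.
    poor_scores = [s for s in scores if s >= cutoff]
    if not poor_scores:
        return 0.0, 0.0, 0.0
    H = len(poor_scores) / len(scores)       # incidence (headcount ratio)
    A = sum(poor_scores) / len(poor_scores)  # intensity among the poor
    return H, A, H * A                       # M0 = H x A


# Hypothetical toy data: 4 people, 4 equally weighted indicators.
toy = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
H, A, M0 = alkire_foster_m0(toy, weights=[1, 1, 1, 1], cutoff=0.5)
print(H, A, M0)
```

With a 50% cut off, two of the four people are identified as poor, and their average deprivation score gives the intensity that is multiplied by the headcount ratio.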
Find out about other ways the AF method is used in [research and policy](http://www.ophi.org.uk/research/multidimensional-poverty/research-applications/). Additional data [here](http://ophi.org.uk/multidimensional-poverty-index/global-mpi-2017/mpi-data/). This dataset contains the [Summer 2016 Subnational data from Table 6.3](http://ophi.org.uk/multidimensional-poverty-index/mpi-resources/2016) as it is the most recent dataset for MPI comparisons over time. Data Cleaning Notes The original format was significantly different and unusable in many ways. I converted all survey years (`year1` and `year2`) from a `Period` format that looked something like "2005/6 - 2009". Note that "2005/6" means the survey was conducted from sometime in 2005 through sometime in 2006. The `year2` part could follow a similar format (e.g. "2009/10"). For simplicity, I dropped the `/%s` portion of both `year1` and `year2`. This still maintains consistency in case either year column is used for a comparison statistic. The raw data file from OPHI has `Total Population` and `Number of Poor` in thousands. I converted these decimals to raw population numbers. For example, `3.142` becomes `3142`. The original file is in an Excel format that needed to be converted into a `csv` in order to upload to Kaggle. I decided to keep values to the `10^-6` decimal place. The statistical significance columns come from OPHI's test of significant changes. Directly from the Excel file: `Note, *** statistically significant at α=0.01, ** statistically significant at α=0.05, * statistically significant at α=0.10` Acknowledgements Alkire, S. and Robles, G. (2017). “Multidimensional Poverty Index Summer 2017: Brief methodological note and results.” OPHI Methodological Note 44, University of Oxford. Alkire, S. and Santos, M. E. (2010). “Acute multidimensional poverty: A new index for developing countries.” OPHI Working Papers 38, University of Oxford. Alkire, S., Jindra, C., Robles, G.
and Vaz, A. (2017). ‘Multidimensional Poverty Index – Summer 2017: brief methodological note and results’. OPHI MPI Methodological Notes No. 44, Oxford Poverty and Human Development Initiative, University of Oxford. [OPHI Kaggle's Page](https://www.kaggle.com/ophi/mpi) Inspiration Further evaluate OPHI's approach to comparing subnational regions for various years. Then, consider how much Kiva's microcredit impacted the subnational MPI change. This data was donated by researchers of the University of Wisconsin and includes measurements from digitized images of fine-needle aspirates of breast masses. You can find the dataset at https://github.com/dataspelunking/MLwR/blob/master/Machine%20Learning%20with%20R%20(2nd%20Ed.)/Chapter%2003/wisc_bc_data.csv. The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis and 30 are numeric-valued laboratory measurements. The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign. The other 30 numeric measurements comprise the mean, standard error and worst (i.e.
largest) value for 10 different characteristics of the digitized cell nuclei, which are as follows: - Radius - Texture - Perimeter - Area - Smoothness - Compactness - Concavity - Concave Points - Symmetry - Fractal dimension Content Pima Indian Diabetes Data Acknowledgements Jerry Kurata Context Image segmentation is a complicated problem that often cannot be performed in a fully automatic manner. We use this dataset as a way of testing and exploring methods to make such semi-automatic segmentation work better. Content 151 images with full segmentations and paint strokes (compiled by: http://www.robots.ox.ac.uk/~vgg/data/iseg/) Acknowledgements - Visual Geometry Group at Oxford for compiling the data - GrabCut Dataset from Microsoft - PASCAL Dataset - Alpha Matting Dataset Inspiration - How well do different techniques work at expanding the initial labels to a full segmentation? - Which techniques are quick enough to run in real time (the paint strokes are normally given and the user waits for feedback, so they can't be precomputed)? - Are any of these techniques easy to implement in JavaScript so they could be browser-based? Facial keypoints detection -> improving prediction Context Kaggle kernels have no internet connectivity; everything you use must be a dataset. Content A pair of African elephants. Acknowledgements Taken from here: https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/5.4-visualizing-what-convnets-learn.ipynb Inspiration It's basically a toy, just for testing and learning purposes. This dataset contains Version 2.3 of the Global Precipitation Climatology Project (GPCP) Monthly Analysis Product.
The data are monthly analyses defined on a global 2.5 degree by 2.5 degree longitude/latitude grid and cover the period January 1979 to (delayed) present. The objective of the BRFSS is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in the adult population. Factors assessed by the BRFSS include tobacco use, health care coverage, HIV/AIDS knowledge or prevention, physical activity, and fruit and vegetable consumption. Data are collected from a random sample of adults (one per household) through a telephone survey. The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world. Content - Each year contains a few hundred columns. Please see one of the [annual code books][1] for complete details. - These CSV files were converted from a SAS data format using pandas; there may be some data artifacts as a result. - If you like this data, you might also enjoy [the 2011-2015 batch][2]. Please note that those years use a different format. Acknowledgements This dataset was released by the CDC. You can find the original dataset, manuals, and [additional years of data here][3]. 
[1]: https://www.cdc.gov/brfss/annual_data/2001/pdf/codebook_01.pdf [2]: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system [3]: https://www.cdc.gov/brfss/annual_data/annual_data.htm The existing code-based platform GitHub gives scientists and students a useful tool for sharing data and for notifying the co-workers, tutors and supervisors involved in research about current updates. It connects collaborators so that they can share current results, release datasets and updates, and more. Using a standard command-line interface, GitHub allows registered users to push repositories to the site. The availability of both public and private repositories makes it possible to share current data updates with a target audience: e.g., unpublished research work only for co-authors or supervisors, or, vice versa, a successfully defended thesis that is open to the public. Fig. 1. Fragment of the text written using LaTeX and processed by Git. The current presentation reports my own experience of managing and organizing an MSc thesis project. Instead of the traditional and highly ineffective tool MS Word, I used the effective combination of LaTeX tools with GitHub for data. However, despite the evident usefulness and prospects of GitHub, its existing users are mostly programmer communities and IT specialists. Therefore, there is a need in academic centers and universities to strongly popularize and increase the use of GitHub for student works.
The case study is given on the graduate study: an MSc work successfully written and maintained using open source GitHub service at the University of Twente, Faculty of Geo-Information Science and Earth Observation (Netherlands) entitled “Seagrass monitoring and mapping along the coasts of Greece, Crete”. This dataset contains key characteristics about the data described in the Data Descriptor Longitudinal dataset of human-building interactions in U.S. offices. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format Dataset containing observations of sea turtles and related climatic data. The data is from Portal da Biodiversidade (https://portaldabiodiversidade.icmbio.gov.br), GBIF and Bio-Oracle (http://bio-oracle.org). The data from Portal da Biodiversidade contains observations of sea turtles in Brazil, collected by researchers from all over the country. The GBIF data contains observations from various countries. The geophysical, biotic and environmental data for surface marine realms was exported from Bio-Oracle marine data layers. The columns in the file mean: BO_calcite - Calcite (mol.m-3) BO_chlomax - Chlorophyll (mg.m-3) BO_chlomean - Chlorophyll (mg.m-3) BO_chlomin - Chlorophyll (mg.m-3) BO_chlorange - Chlorophyll (mg.m-3) BO_cloudmax - Cloud cover (%) BO_cloudmean - Cloud cover (%) BO_cloudmin - Cloud cover (%) BO_damax - Diffuse attenuation (m-1) BO_damean - Diffuse attenuation (m-1) BO_damin - Diffuse attenuation (m-1) BO_dissox - Dissolved molecular oxygen (mol.m-3) BO_nitrate - Nitrate (mol.m-3) BO_parmax - Photosynt. Avail. Radiation (E.m-2.day-1) BO_parmean - Photosynt. Avail.
Radiation (E.m-2.day-1) BO_ph - pH BO_phosphate - Phosphate (mol.m-3) BO_salinity - Salinity (PSS) BO_silicate - Silicate (mol.m-3) BO_sstmax - Temperature (ºC) BO_sstmean - Temperature (ºC) BO_sstmin - Temperature (ºC) BO_sstrange - Temperature (ºC) BO_bathymin - Bathymetry (m) BO_bathymax - Bathymetry (m) BO_bathymean - Bathymetry (m) More information: www.bio-oracle.org Supplemental Dataset Bibliographic dataset compiling all Scopus records relevant to Agent-based Complex Systems (ACS) science (i.e. the fields of agent-based and individual-based modelling). This dataset was post-processed and used to create a citation graph using the Diderot R package (https://cran.r-project.org/package=Diderot). Resolution: 512*512*273; unsigned short; 0.3 mm resolution. This dataset was obtained using the dental CBCT imaging system ZCB100 (Shenzhen ZhongKe TianYue Technology Co., Ltd.) at 110 kV and 10 mAs, with a 15.36-cm FOV. Reference in the dataset. Reconstructed slices of a 3D tomographic dataset after ring artifacts suppression using algorithm 6, algorithm 5, and algorithm 3 of our approaches. This dataset contains information on UK university lecture capture policies as of April 2018.
The dataset was used for the paper "Employee Surveillance: The Road to Surveillance is Paved with Good Intentions", by Lilian Edwards, Laura Martin and Tristan Henderson, accepted for presentation at the Amsterdam Privacy Conference, October 2018. ChinaCropPhen1km: A high-resolution crop phenological dataset for three staple crops in China during 2000-2015 based on LAI products. The data files are in tif format, and each file is named "crop type"+"doy"+"key phenological stages"+".tif". The crop type takes the values 1, 2, and 3, representing maize, wheat, and rice, respectively. The key phenological stage takes the values 1, 2, and 3, whose meaning differs between crops. For wheat, the key stages are, in order, green-up (emergence), heading and maturity date; for maize, they are the three-leaf (V3) stage, heading and maturity date; for rice, they are the transplanting stage, heading and maturity date. The data have a spatial resolution of 1 km. File may be viewed using ProSeq software at http://dps.plants.ox.ac.uk/sequencing/proseq.htm [Filatov DA (2009) Processing and population genetic analysis of multigenic datasets with ProSeq3 software. Bioinformatics 25: 3189-3190]. Sequence datasets used in phylogenetic analyses. Allele files produced by pyRAD are provided. See text for dataset nomenclature. Prescription dataset.
Context It is a well known fact that Millennials LOVE Avocado Toast. It's also a well known fact that all Millennials live in their parents' basements. Clearly, they aren't buying homes because they are buying too much Avocado Toast! But maybe there's hope... if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream. Content This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here's how the [Hass Avocado Board describes the data on their website][1]: > The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table. Some relevant columns in the dataset: - `Date` - The date of the observation - `AveragePrice` - the average price of a single avocado - `type` - conventional or organic - `year` - the year - `Region` - the city or region of the observation - `Total Volume` - Total number of avocados sold - `4046` - Total number of avocados with PLU 4046 sold - `4225` - Total number of avocados with PLU 4225 sold - `4770` - Total number of avocados with PLU 4770 sold Acknowledgements Many thanks to the Hass Avocado Board for sharing this data!! http://www.hassavocadoboard.com/retail/volume-and-price-data Inspiration In which cities can Millennials have their avocado toast AND buy a home? Was the Avocadopocalypse of 2017 real?
[1]: http://www.hassavocadoboard.com/retail/volume-and-price-data This dataset contains key characteristics about the data described in the Data Descriptor A dataset of cetacean occurrences in the Eastern North Atlantic. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format Context Each line in LiveStreaming records a sequence of browsed items of a user in ascending order of time. Each number represents a unique itemID. Context While implementing the paper https://arxiv.org/pdf/1511.05440.pdf, we realized that all the available datasets either have image data but are very big, or have videos from which frames have to be extracted manually. To fix that, we created this dataset. Content This dataset contains sequences of images extracted from the starting scene of the movie "The Hobbit: An Unexpected Journey". An implementation of the paper using this dataset can be found here: https://github.com/akshaybapat04/video_prediction Acknowledgements We used GOM Player to take snapshots of the video. https://github.com/akshaybapat04/video_prediction Inspiration Can you predict the next image in an image sequence? Copyright information: Taken from "Function-informed transcriptome analysis of renal tubule", Genome Biology 2004;5(9):R69-R69. Published online 26 Aug 2004. PMCID: PMC522876. Copyright © 2004 Wang et al.; licensee BioMed Central Ltd. Genes enriched in tubules are historically under-researched. The percentage of genes with explicit names (other than automatic CG annotations) is shown for the entire genome, and for the top 50, 100 and 200 genes (as judged by fold enrichment) from the tubule dataset.
Machine learning has emerged as a discipline that enables computers to assist humans in making sense of large and complex data sets. With the drop in the cost of sequencing technologies, large amounts of omics data are being generated and made accessible to researchers. Analysing these complex high-volume data is not trivial, and classical tools cannot exploit their full potential. Machine learning can thus be very useful in mining large omics datasets to uncover new insights that can advance the field of medicine and improve health care. The aim of this tutorial is to introduce participants to the machine learning (ML) taxonomy and common machine learning algorithms. The tutorial will cover the methods being used to analyse different omics data sets by providing a practical context through the use of basic but widely used R and Python libraries. The tutorial will comprise a number of hands-on exercises and challenges, where the participants will acquire a first understanding of the standard ML processes as well as practical skills in applying them to familiar problems and publicly available real-world data sets. Dataset S1. Assembly of S1 and S2 specimens separated into bins. Raw data for the Hay Lake pollen dataset obtained from the Neotoma Paleoecological Database. The objective of this work was to pre-process the Soil Landscapes of Canada (SLC) database to offer a country-level soils dataset in a format ready to be used in SWAT simulations.
A two-level screening process was used to identify critical information required by SWAT and to remove records with information that could not be calculated or estimated. Out of the 14,063 unique soils in the SLC, 11,838 soils with complete information were included in the dataset presented here. Soils with missing records for the required SWAT variables were removed from the analysis. These soils were compiled into a soils list provided as a reference ("incomplete" dataset). The ultimate Soccer database for data analysis and machine learning ------------------------------------------------------------------- **What you get:** - +25,000 matches - +10,000 players - 11 European Countries with their lead championship - Seasons 2008 to 2016 - Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates - Team line up with squad formation (X, Y coordinates) - Betting odds from up to 10 providers - Detailed match events (goal types, possession, corner, cross, fouls, cards etc...) for +10,000 matches **16th Oct 2016: New table containing teams' attributes from FIFA!* ---------- **Original Data Source:** You can easily find data about soccer matches but they are usually scattered across different websites. A thorough data collection and processing has been done to make your life easier. **I must insist that you do not make any commercial use of the data**. The data was sourced from: - [http://football-data.mx-api.enetscores.com/][1] : scores, lineup, team formation and events - [http://www.football-data.co.uk/][2] : betting odds. [Click here to understand the column naming system for betting odds:][3] - [http://sofifa.com/][4] : players and teams attributes from EA Sports FIFA games. *FIFA series and all FIFA assets property of EA Sports.* > When you have a look at the database, you will notice foreign keys for players and matches are the same as the original data sources.
I have called those foreign keys "api_id". ---------- **Improving the dataset:** You will notice that some players are missing from the lineup (NULL values). This is because I have not been able to source their attributes from FIFA. This will be fixed over time as the crawling algorithm is being improved. The dataset will also be expanded to include international games, national cups, Champions League and Europa League. Please ask me if you're after a specific tournament. > Please get in touch with me if you want to help improve this dataset. [CLICK HERE TO ACCESS THE PROJECT GITHUB][5] *Important note for people interested in using the crawlers:* since I first wrote the crawling scripts (in Python), it appears sofifa.com has changed its design and with it come new requirements for the scripts. The existing script to crawl players ('Player Spider') will not work until I've updated it. ---------- Exploring the data: Now that's the fun part, there is a lot you can do with this dataset. I will be adding visuals and insights to this overview page but please have a look at the kernels and give it a try yourself! Here are some ideas for you: **The Holy Grail...** ... is obviously to predict the outcome of the game. The bookies use 3 classes (Home Win, Draw, Away Win). They get it right about 53% of the time. This is also what I've achieved so far using my own SVM. Though it may sound high for such a random sport, you've got to know that the home team wins about 46% of the time. So the base case (constantly predicting Home Win) already has 46% accuracy. **Probabilities vs Odds** When running a multi-class classifier like SVM you could also output a probability estimate and compare it to the betting odds. Have a look at your variance vs odds and see for what games you had very different predictions.
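As a rough sketch of the comparison suggested above, decimal betting odds can be turned into implied probabilities by inverting them and rescaling so the three outcomes sum to one (the raw inverses sum to more than one because of the bookmaker's margin). The function names, the odds values and the model probabilities below are hypothetical, and the sketch assumes the odds are in decimal format.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into probabilities that sum to 1."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in raw]


def largest_disagreement(model_probs, bookie_probs):
    """Return (outcome index, absolute gap) where model and odds differ most."""
    gaps = [abs(m - b) for m, b in zip(model_probs, bookie_probs)]
    idx = max(range(len(gaps)), key=gaps.__getitem__)
    return idx, gaps[idx]


# Hypothetical odds for one match, ordered Home Win, Draw, Away Win.
bookie = implied_probabilities(2.0, 3.2, 4.0)
# Hypothetical classifier output for the same match.
model = [0.30, 0.30, 0.40]
outcome, gap = largest_disagreement(model, bookie)
print(bookie, outcome, gap)
```

Matches with a large gap are the ones where the classifier and the market disagree most, which makes them natural candidates for closer inspection.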
**Explore and visualize features** With access to players and teams attributes, team formations and in-game events you should be able to produce some interesting insights into [The Beautiful Game][6] . Who knows, Guardiola himself may hire one of you some day! [1]: http://football-data.mx-api.enetscores.com/ [2]: http://www.football-data.co.uk/ [3]: http://www.football-data.co.uk/notes.txt [4]: http://sofifa.com/ [5]: https://github.com/hugomathien/football-data-collection/tree/master/footballData [6]: https://en.wikipedia.org/wiki/The_Beautiful_Game Overwatch is a team-based multiplayer first-person shooter video game developed and published by Blizzard Entertainment. Overwatch puts players into two teams of six, with each player selecting one of several pre-defined hero characters with unique movement, attributes, and abilities; these heroes are divided into four classes: Offense, Defense, Tank and Support. Players on a team work together to secure and defend control points on a map and/or escort a payload across the map in a limited amount of time. Players gain cosmetic rewards that do not affect gameplay, such as character skins and victory poses, as they continue to play in matches. The game was launched with casual play, while Blizzard added competitive ranked play about a month after launch. Additionally, Blizzard has developed and added new characters, maps, and game modes post-release, while stating that all Overwatch updates will remain free, with the only additional cost to players being microtransactions to earn additional cosmetic rewards. ( Wikipedia - https://en.wikipedia.org/wiki/Overwatch_(video_game) ) ![enter image description here][1] [1]: http://images.pushsquare.com/news/2016/12/game_of_the_year_2016_3_-_overwatch/attachment/0/original.jpg Open clinical trial data provide a valuable opportunity for researchers worldwide to assess new hypotheses, validate published results, and collaborate for scientific advances in medical research. 
Here, we present a health dataset for the non-invasive detection of cardiovascular disease (CVD), containing 657 data records from 219 subjects. The dataset covers an age range of 20–89 years and records of diseases including hypertension and diabetes. Data acquisition was carried out under standard experimental conditions and specifications. This dataset can be used to study photoplethysmograph (PPG) signal quality evaluation and to explore the intrinsic relationship between the PPG waveform and cardiovascular disease, in order to discover and evaluate latent characteristic information contained in PPG signals. These data can also be used to study early and noninvasive screening of common CVD such as hypertension and of other related diseases such as diabetes. The expected sea level rise by the year 2100 will force the whole coastal system to adapt and the shoreline to retreat landward. Future scenarios, coupled with improving mining technologies, will favour an increased exploitation of sand deposits for nourishments, especially for urban beaches and sandy coasts with lowlands behind. The objective of this work is to provide useful tools to support planning actions in the management of sand deposits located on the continental shelf of western Sardinia (western Mediterranean Sea). The work has been realized through the integration of data and information collected during several projects. Available data consist of morpho-bathymetric data (multibeam) associated with morphoacoustic (backscatter) data, collected in the depth range -25 to -700 m. Extensive coverage of high-resolution seismic profiles (Chirp 3.5 kHz) has been acquired along the continental shelf. Surface sediment samples (Van Veen grab and box corer) and vibrocores have also been collected. These data allow mapping of the submerged sand deposits with the determination of their thickness and volumes, and their sedimentological characteristics.
Furthermore, it is possible to map the seabed geomorphological features of the continental shelf of western Sardinia. All the available data (doi:10.1594/PANGAEA.895430) have been integrated and organized in a geodatabase implemented through a GIS and the software suite Geoinformation Enabling Toolkit StarterKit ® (GET-IT), developed by researchers of the Italian National Research Council for the RITMARE project. GET-IT facilitates the creation of distributed nodes of an interoperable Spatial Data Infrastructure (SDI) and enables unskilled researchers from various scientific domains to create their own Open Geospatial Consortium (OGC) standard services for distributing geospatial data, observations and metadata of sensors and datasets. Data distribution through standard services follows the guidelines of the European Directive INSPIRE (DIRECTIVE 2007/2/EC); in particular, standard metadata describe each map level, containing identifiers such as data type, origin, property, quality, and processing steps, to foster data searching and quality assessment.

Estimates of selection on juvenile size traits, compiled by Njal Rollinson and Locke Rowe in September 2013. These data are described as the "J-S Database" in the main text. Both the data included in our formal selection analyses and the data omitted from formal analyses are included in this dataset.

Context Hey everyone out there! Wikipedia is a publicly available encyclopedia which can be modified by anyone. Some of these modifications are useful whereas some are not. This data set captures all the edits made to English Wikipedia by anyone across the globe. As there are two edits per second, the data I have collected covers just 20 minutes. Content I have revised the original data set, removed the duplicates and included only the relevant and useful columns. This data set has the following columns:
a) action : only the edit action is captured. Other actions may be Talk, etc.
b) change_size : the number of characters added or deleted. A positive size means characters were added; a negative size means characters were deleted.
c) geo_ip : This is null if the user is registered on Wikipedia; otherwise it is a JSON object containing city, latitude, country_name, region_name and longitude.
d) is_anonymous : This flag/boolean value (true/false) indicates whether the user is registered or unregistered (anonymous).
e) is_bot : This flag/boolean value (true/false) indicates whether the user is a bot (robot) or a human.
f) is_minor : This flag/boolean value (true/false) identifies whether the change made to the Wikipedia article was a minor or a major one.
g) page_title : This is the title of the Wikipedia article edited by the user.
h) url : This field has the URL or link which compares the Wikipedia article before and after the change.
i) user : If the user is unregistered, this field holds the IP address in either IPv4 or IPv6 format; if the user is registered, it contains the username used when registering on Wikipedia.
Acknowledgements I would like to thank hatnote.com, from which I could get this data. If you need the original data you may visit www.hatnote.com or connect directly to this WebSocket - ws://wikimon.hatnote.com/en/

Context Bitcoin is the longest running and most well known cryptocurrency, first released as open source in 2009 by the anonymous Satoshi Nakamoto. Bitcoin serves as a decentralized medium of digital exchange, with transactions verified and recorded in a public distributed ledger (the blockchain) without the need for a trusted record keeping authority or central intermediary. Transaction blocks contain a SHA-256 cryptographic hash of previous transaction blocks, and are thus "chained" together, serving as an immutable record of all transactions that have ever occurred. As with any currency/commodity on the market, bitcoin trading and financial instruments soon followed public adoption of bitcoin and continue to grow.
Included here is historical bitcoin market data at 1-min intervals for select bitcoin exchanges where trading takes place. Happy (data) mining! Content
- coincheckJPY_1-min_data_2014-10-31_to_2018-06-27.csv
- bitflyerJPY_1-min_data_2017-07-04_to_2018-06-27.csv
- coinbaseUSD_1-min_data_2014-12-01_to_2018-06-27.csv
- bitstampUSD_1-min_data_2012-01-01_to_2018-06-27.csv

CSV files for select bitcoin exchanges for the time period of Jan 2012 to July 2018, with minute to minute updates of OHLC (Open, High, Low, Close), Volume in BTC and indicated currency, and weighted bitcoin price. Timestamps are in Unix time. Timestamps without any trades or activity have their data fields forward filled from the last valid time period. If a timestamp is missing, or if there are jumps, this may be because the exchange (or its API) was down, the exchange (or its API) did not exist, or some other unforeseen technical error occurred in data reporting or gathering. All effort has been made to deduplicate entries and verify the contents are correct and complete to the best of my ability, but obviously trust at your own risk. Acknowledgements and Inspiration Bitcoin charts for the data. The various exchange APIs, for making it difficult or unintuitive enough to get OHLC and volume data at 1-min intervals that I set out on this data scraping project. Satoshi Nakamoto and the novel core concept of the blockchain, as well as its first execution via the bitcoin protocol. I'd also like to thank viewers like you! Can't wait to see what code or insights you all have to share. I am a lowly Ph.D. student who did this for fun in my meager spare time. If you find this data interesting and you can spare a coffee to fuel my science, send it my way and I'd be immensely grateful! 1kmWmcQa8qN9ZrdGfdkw8EHKBgugKBRcF

Please refer [here](https://github.com/jp2011/london-crime-data-retriever) for further information.

Context Anonymized data from profiles scraped on LinkedIn. Contains data from about 15000 profiles.
Profiles came from people predominantly located in Australia. Includes all their work history as well as analysis of their photo and name. Content Each row contains:
* Profile data
* Job data
* Name analysis (Race, Gender)
* Profile picture analysis (Age, Race, Gender, Attractiveness, Health, Emotionality)

Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?

Description Data from [Our World In Data][1] with the mortality rates for causes of death by country and region between 1990 and 201 Acknowledgements Thanks to Our World In Data for collecting this information.
[1]: https://ourworldindata.org/

Dataset for prediction of material elastic tensors. Note on citations: If you found this dataset useful and would like to cite it in your work, please be sure to cite its original sources below rather than or in addition to this page. Dataset described in: de Jong M, Chen W, Angsten T, Jain A, Notestine R, Gamst A, Sluiter M, Ande CK, van der Zwaag S, Plata JJ, Toher C, Curtarolo S, Ceder G, Persson KA, Asta M (2015) Charting the complete elastic properties of inorganic crystalline compounds. Scientific Data 2: 150009. https://doi.org/10.1038/sdata.2015.9 Data converted from json file available on Dryad (see references 3-4): de Jong M, Chen W, Angsten T, Jain A, Notestine R, Gamst A, Sluiter M, Ande CK, van der Zwaag S, Plata JJ, Toher C, Curtarolo S, Ceder G, Persson KA, Asta M (2015) Data from: Charting the complete elastic properties of inorganic crystalline compounds. Dryad Digital Repository. https://doi.org/10.5061/dryad.h505v

This dataset contains key characteristics about the data described in the Data Descriptor Temporary dense seismic network during the 2016 Central Italy seismic emergency for microzonation studies. Contents: 1.
human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format A multimodal dataset (visual, auditory, electric median nerve) recorded with a Neuromag Vectorview 306-channel MEG system. International Financial Statistics (IFS) is a standard source of international statistics on all aspects of international and domestic finance. It reports, for most countries of the world, current data needed in the analysis of problems of international payments and of inflation and deflation, i.e., data on exchange rates, international liquidity, international banking, money and banking, interest rates, prices, production, international transactions, government accounts, and national accounts. Last update in UNdata: 14 May 2010 If you need more current data, the IMF has made their current database available for [bulk download for personal use](http://data.imf.org/?sk=388DFA60-1D26-4ADE-B505-A05A558D9A42). Acknowledgements This dataset was kindly published by the United Nations on the UNData site. You can find [the original dataset here](http://data.un.org/Explorer.aspx). License [Per the UNData terms of use](http://data.un.org/Host.aspx?Content=UNdataUse): all data and metadata provided on UNdata’s website are available free of charge and may be copied freely, duplicated and further distributed provided that [UNdata](http://data.un.org/Explorer.aspx) is cited as the reference. Context Fashion-MNIST is a dataset of Zalando\'s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. The original MNIST dataset contains a lot of handwritten digits. 
Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." Zalando seeks to replace the original MNIST dataset. Content Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (listed below), and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image.
- To locate a pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix.
- For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top (31 = 1 * 28 + 3, so i = 1 and j = 3).

**Labels** Each training and test example is assigned to one of the following labels:
- 0 T-shirt/top
- 1 Trouser
- 2 Pullover
- 3 Dress
- 4 Coat
- 5 Sandal
- 6 Shirt
- 7 Sneaker
- 8 Bag
- 9 Ankle boot

TL;DR
- Each row is a separate image
- Column 1 is the class label.
- Remaining columns are pixel numbers (784 total).
- Each value is the darkness of the pixel (0 to 255)

Acknowledgements
- Original dataset was downloaded from [https://github.com/zalandoresearch/fashion-mnist][1]
- Dataset was converted to CSV with this script: [https://pjreddie.com/projects/mnist-in-csv/][2]

License The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
[1]: https://github.com/zalandoresearch/fashion-mnist
[2]: https://pjreddie.com/projects/mnist-in-csv/

Summary of CIBERSORT algorithm analysis with all datasets and PCD samples. Cell types are in column and samples in row. Values are ratios over 1.

The Global Energy Balance Archive (GEBA) is a database for the central storage of the worldwide measured energy fluxes at the Earth's surface, maintained at ETH Zurich (Switzerland). This paper documents the status of the GEBA version 2017 dataset, presents the new web interface and user access, and reviews the scientific impact that GEBA data had in various applications.
GEBA has continuously been expanded and updated and contains in its 2017 version around 500.000 monthly mean entries of various surface energy balance components measured at 2500 locations. The database contains observations from 15 surface energy flux components, with the most widely measured quantity available in GEBA being the shortwave radiation incident at the Earth\'s surface (global radiation). Many of the historic records extend over several decades. GEBA contains monthly data from a variety of sources, namely from the World Radiation Data Centre (WRDC) in St. Petersburg, from national weather services, from different research networks (BSRN, ARM, SURFRAD), from peer-reviewed publications, project and data reports, and from personal communications. Quality checks are applied to test for gross errors in the dataset. GEBA has played a key role in various research applications, such as in the quantification of the global energy balance, in the discussion of the anomalous atmospheric shortwave absorption, and in the detection of multi-decadal variations in global radiation, known as "global dimming" and "brightening". GEBA is further extensively used for the evaluation of climate models and satellite-derived surface flux products. On a more applied level, GEBA provides the basis for engineering applications in the context of solar power generation, water management, agricultural production and tourism. GEBA is publicly accessible through the internet via http://www.geba.ethz.ch. Mean body sizes, approximated from the natural logarithm of the lower first or second molar area, and the proposed evolutionary relationships between mammalian genera from the late Clarkforkian and earliest Wasatchian (Cf3 to Wa0) of the Bighorn and Clarks Fork Basins, Wyoming, USA. For details of the dataset see caption for electronic supplementary material, dataset S1. 
classical iris dataset

Content Big collection of quotes with their authors, category, tags and popularity (from 0 to 1)

Job Posts dataset The dataset consists of 19,000 job postings that were posted through the Armenian human resource portal CareerCenter. The data was extracted from the Yahoo! mailing group https://groups.yahoo.com/neo/groups/careercenter-am. This was the only online human resource portal in the early 2000s. A job posting usually has some structure, although some fields of the posting are not necessarily filled out by the client (poster). The data was cleaned by removing posts that were not job related or had no structure. The data consists of job posts from 2004-2015. Content
- jobpost – The original job post
- date – Date it was posted in the group
- Title – Job title
- Company – Employer
- AnnouncementCode – Announcement code (some internal code, is usually missing)
- Term – Full-Time, Part-time, etc.
- Eligibility – Eligibility of the candidates
- Audience – Who can apply?
- StartDate – Start date of work
- Duration – Duration of the employment
- Location – Employment location
- JobDescription – Job description
- JobRequirment – Job requirements
- RequiredQual – Required qualification
- Salary – Salary
- ApplicationP – Application procedure
- OpeningDate – Opening date of the job announcement
- Deadline – Deadline for the job announcement
- Notes – Additional notes
- AboutC – About the company
- Attach – Attachments
- Year – Year of the announcement (derived from the field date)
- Month – Month of the announcement (derived from the field date)
- IT – TRUE if the job is an IT job. This variable is created by a simple search of IT job titles within column “Title”

Acknowledgements The data collection and initial research was funded by the American University of Armenia’s research grant (2015). Inspiration The online job market is a good indicator of overall demand for labor in the local economy.
In addition, online job postings data are easier and quicker to collect, and they can be a richer source of information than more traditional job postings, such as those found in printed newspapers. The data can be used in the following ways:
- Understand the demand for certain professions, job titles, or industries
- Help universities with curriculum development
- Identify skills that are most frequently required by employers, and how the distribution of necessary skills changes over time
- Make recommendations to job seekers and employers

Past research We have used association rules mining and simple text mining techniques to analyze the data. Some results can be found here (https://www.slideshare.net/HabetMadoyan/it-skills-analysis-63686238).

Missing Persons India
==============
Taken from the PDF available at [National Crime Records Bureau](http://ncrb.nic.in/MissingUidb/20170821-Missing%20Person%20Report.pdf). Since the original was a PDF, this is a table extracted from it using scripts. Some level of noise is present in the data, partly due to the original source and partly due to the extraction scripts.

Content Nigerian dishes from the Yoruba ethnic group. It contains 6 classes of food (amala, eba, efo, ewedu, fufu and iyan).

This is a CSV file containing the data behind the Altmetric Top 100 for 2017. A second dataset pulls out the authors and institutions: https://figshare.com/articles/altmetric_top_100_authors_2017_csv/5683963

Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community.
What questions do you want to see answered?

Datasets used in this study

This work presents the Guadalfeo Monitoring Network in Sierra Nevada (Spain), a snow monitoring network in the Guadalfeo Experimental Catchment, a semiarid area in southern Europe representative of snow packs with highly variable dynamics on both the annual and seasonal scales, and significant topographic gradients. The network includes weather stations that cover the high mountain area in the catchment and time-lapse cameras to capture the variability of the ablation phases on different spatial scales. This dataset consists of snow cover maps from the time-lapse camera C1 of the Guadalfeo Monitoring Network, at 10 m x 10 m spatial resolution, for those days when Landsat satellites overpass the area (Pimentel et al., 2017).

Many women who are initially thought to have angina turn out to have normal coronary angiograms, that is, they are found not to have angina after all. A study was carried out to assess the feasibility of a preliminary screening test. For a large number of patients who were thought to have angina, information on a number of possible risk factors was collected and then their subsequent angina status was recorded. The data is available as an R data frame entitled angina and contains the following information:
- status: whether woman turns out to have angina (yes/no)
- age: age of a woman
- smoke: smoking status (1=current-, 2=ex-, 3=non-smoker)
- cig: current average number of cigarettes per day
- hyper: hypertension (1=absent, 2=mild, 3=moderate)
- angfam: family history of angina (yes/no)
- myofam: family history of myocardial infarction (yes/no)
- strokefam: family history of stroke (yes/no)
- diabetes: does woman have diabetes? (yes/no)

Missing values are coded as NA. The main aim of this study was to find out which, if any, of the health variables are associated with angina and whether some subset of them could be used to help predict the dependent variable, angina status.
The accompanying document on ‘Model selection through backward elimination’ is going to be useful for that purpose. More specifically, it would be helpful to be able to estimate the risk/probability that a woman with a particular combination of these health variables truly has angina. If such a scheme for estimating risks can be constructed, is it likely to be useful, i.e. is it good at predicting whether a woman has angina or not (since the treatment of angina is expensive)? In addition, it would be of interest to estimate the individual effects of important variables. For example, if smoking seems to be a risk factor, then what are the odds of a smoker having angina relative to a non-smoker? What about ex-smokers and light smokers?

Raw data for the Floating Island Lake pollen dataset obtained from the Neotoma Paleoecological Database.

FROM-GLC-Hierarchy (Yu et al., 2014) is a multi-resolution land cover dataset (30 m, 250 m, 500 m, 1 km, 5 km, 10 km, 25 km, 50 km, 100 km) designed to meet the resolution requirements of different applications. The 30 m base map was improved from FROM-GLC-agg with additional coarse resolution datasets (e.g., MCD12Q1 (Friedl et al., 2010), GlobCover2009 (Bontemps et al., 2010)) to reduce land cover type confusion. Around 1.1% of pixels were replaced by coarse resolution products. Validation-based assessments indicate that the accuracies of the land cover maps at 30 m, 250 m, 500 m and 1 km resolutions are 69.50%, 76.65%, 74.65%, and 73.47%, respectively. Further analysis of area-estimation biases for different land cover types at different resolutions suggests that maps at coarser than 5 km resolution contain at least 5% area estimation error for most land cover types. Proportion layers, which contain precise information on land cover percentage, are suggested for use when coarser resolution land cover data are required. Please refer to the classification system at http://data.ess.tsinghua.edu.cn/.
This is the raw data from a manuscript entitled "Flow velocity and nutrients affect CO2 emissions from agricultural drainage channels". Controlled field mesocosms were used to mimic agricultural drainage channels and the fate of the initial dissolved inorganic carbon. The dataset has a series of water parameters over time in each mesocosm, including water velocity, water depth, water temperature (WT), EC, dissolved oxygen (DO), dissolved inorganic carbon (DIC), nitrate, CO2 flux, and CH4 flux.

Context JeuxVideo.com is a French website that has specialized in video games since 1997. It is built as an information tool for players by a team of writers and offers news, files, video game tests and video presentations. Jeuxvideo.com is the most popular French site for video game news. The site's attendance record dates from E3 2013, on June 11, 2013, with a peak of 33 million hits on its pages. Content The dataset covers over 700 video games on JeuxVideo.com. Acknowledgements Data was scraped from jeuxvideo.com.

Context SPECint2006 Rate Results for Intel Xeon Scalable Processors Content Data collected on October 30, 2017 Acknowledgements
- [The Standard Performance Evaluation Corporation (SPEC)][1]
- [Intel ARK][2]
- Photo by Samuel Zeller on Unsplash

Inspiration Intel introduced new processor names: Platinum, Gold, Silver and Bronze. It would be nice to visualise the differences between them.
[1]: https://www.spec.org/
[2]: https://ark.intel.com/

Context This dataset was given to me to solve a task during my master's program. The idea was to separate the fish into three batches that are as unrelated as possible, using genetic algorithms. Content A matrix in which every row and every column represents a fish, so each cell gives the relation between the fish of that column and the fish of that row. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
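
For the fish relatedness matrix just described, a genetic algorithm needs a score to compare candidate splits. The sketch below is one possible reading of the task, not the author's method: it assumes the matrix is symmetric, that larger cell values mean more closely related fish, and that the goal is to keep related fish out of the same batch. The function name `within_batch_relatedness` and the toy numbers are hypothetical.

```python
from itertools import combinations


def within_batch_relatedness(relatedness, assignment):
    """Sum the relatedness of every pair of fish placed in the same batch.

    relatedness: square matrix (list of lists), one row/column per fish.
    assignment:  batch label for each fish, e.g. 0, 1 or 2.
    A lower total means the fish sharing a batch are less related.
    """
    total = 0.0
    # Visit each unordered pair of fish exactly once.
    for i, j in combinations(range(len(assignment)), 2):
        if assignment[i] == assignment[j]:
            total += relatedness[i][j]
    return total


# Toy 4-fish matrix with made-up values (not taken from the dataset).
toy = [
    [1.0, 0.5, 0.1, 0.0],
    [0.5, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.4],
    [0.0, 0.2, 0.4, 1.0],
]
related_together = within_batch_relatedness(toy, [0, 0, 1, 1])
related_apart = within_batch_relatedness(toy, [0, 1, 0, 1])
```

A genetic algorithm could treat each `assignment` list as a chromosome and minimise this score. If the intended goal is instead that the batches be unrelated to each other, the between-batch total is simply the sum over all pairs minus this value, so the same function still applies.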
Content More details about each file are in the individual file descriptions. Context This is a dataset hosted by the City of Seattle. The city has an open data platform found [here](https://data.seattle.gov/) and they update their information according the amount of data that is brought in. Explore the City of Seattle using Kaggle and all of the data sources available through the City of Seattle [organization page](https://www.kaggle.com/city-of-seattle)! * Update Frequency: This dataset is updated monthly. Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. This dataset is distributed under the following licenses: Open Data Commons Public Domain Dedication and License This dataset (Megapool) contains background subtracted frequencies of CD4 and CD8 T cells producing a combination of IFNγ, IL-2 and TNF in response to stimulation by Megapool for Mtb-infected but healthy individuals (Lindestam Arlehamn, 2016). See referenced paper for details. See Column names - Megapool.docx for meanings of column names in Megapool.xlsx. This is the raw dataset. This file contains the sampling data (date and location), the aphids analysed, the parasitism rate and the parasitoid detected with the molecular tool. This file was used to perform the statistical tests (with R) and to build figure 1. Loans data Dahurian larch (Larix gmelinii Rupr.) is the dominant species in northeast China, which situated in the southernmost part of the global boreal forest biome and undergoing the greatest climatically induced changes. 
Published studies (1965-2015) on tree aboveground growth of Larix gmelinii forests in northeast China were collected in this study and critically reviewed, and a comprehensive growth dataset was developed from 122 sites, which are distributed between 40.85° N and 53.47° N in latitude, between 118.20° E and 133.70° E in longitude, and between 130 m and 1260 m in altitude. The dataset is composed of 743 entries, including growth data (mean tree height, mean DBH, mean tree volume and/or stand volume) and the associated information, i.e., geographical location (latitude, longitude, altitude, aspect and slope), climate (mean annual temperature (MAT) and mean annual precipitation (MAP)), stand description (origin, stand age, stand density and canopy density), and sample regime (observing year, plot area and number). It provides quantitative references for plantation management practices and boreal forest growth prediction under future climate change.

This dataset shows maps of the sediment properties and physical environment of the seabed on the northwest European Continental Shelf. Mapped products are: mud, sand and gravel percentages; rock cover; whole-sediment, sand- and gravel-fraction median grain sizes; porosity and permeability; carbon and nitrogen content of sediments; mean and maximum depth-averaged tidal velocity and wave orbital-velocity; monthly natural disturbance rates. Data products are produced at a spatial resolution of 0.125 by 0.125 degrees. [Please note that a previous pre-peer review version of this dataset exists: http://dx.doi.org/10.15129/07bc686e-a354-40de-8c08-372ced7aad64]

Copyright information: Taken from "Unequal evolutionary conservation of human protein interactions in interologous networks" http://genomebiology.com/2007/8/5/R95 Genome Biology 2007;8(5):R95-R95. Published online 29 May 2007 PMCID:PMC1929159.
Co-expression of yeast \'high confidence\' protein interactions (solid lines) and random protein pairs (dotted lines) using two microarray datasets. This network is enriched in stable complexes, represented by a high mean correlation. Co-expression of the yeast \'kinome\' [31], which is enriched for transient interactions. This type of interaction shows co-expression that is highly similar to the random distribution (dotted lines). Distribution of clustering coefficients in stable and transient PPI networks. Complexes are represented by a high C(blue line), while the sparsely connected transient network is typified by a low C(green line). The properties of the human interaction network. The clustering coefficients indicate that this network is more sparsely connected, with few protein complexes. The co-expression profile is only slightly higher than the randomly generated distribution, suggesting the presence of many transient PPIs. Context So I was trying to use a VGG19 pretrained model with Keras but the Docker instance couldn\'t download the model file. There\'s an open ticket for this issue here: https://github.com/Kaggle/docker-python/issues/73 Content Just starting off with VGG16 and VGG19 for now. If this works, I\'ll upload some more. The weights for the full vgg16 and vgg19 files were too large to upload as a single files. I tried uploading them in parts but there wasn\'t enough room to extract them in the working directory. 
Here's an example of how to use the model files:

    from keras import applications  # import needed by the snippet below

    keras_models_dir = "../input/keras-models"

    # Build the architecture without weights, then load them from the dataset files.
    model = applications.VGG16(include_top=False, weights=None)
    model.load_weights('%s/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5' % keras_models_dir)

Here are some more examples of how to use it: https://www.kaggle.com/ekkus93/keras-models-as-datasets-test Acknowledgements I downloaded the files from here: https://github.com/fchollet/deep-learning-models Inspiration I just wanted to try out something with the Dogs vs Cats dataset and VGG19.

Resting-state fMRI (rsfMRI) data generates time courses with unpredictable hills and valleys. People with musical training may notice that, to some degree, they resemble the notes of a musical scale. Taking advantage of these similarities, and using only rsfMRI data as input, we use basic rules of music theory to transform the data into musical form. Our project is implemented in Python using the midiutil library. We used open rsfMRI data from the ABIDE dataset preprocessed by the Preprocessed Connectomes Project. We randomly chose 10 individual datasets preprocessed using the C-PAC pipeline with 4 different strategies. To reduce the data dimensionality, we used the CC200 atlas to downsample voxels to 200 regions-of-interest. A framework for generating music from fMRI data, based on music theory, was developed and implemented as a Python tool yielding several audio files. When listening to the results, we noticed that music differed across individual datasets. However, music generated from the same individual (4 preprocessing strategies) remained similar. Our results sound different from music obtained in a similar study using EEG and fMRI data.

About the Dataset This dataset contains the complete datacube collected from the Pavia University, Italy. It is an older dataset on which hyperspectral data measures can be implemented.
Datasets of four experiments testing whether subitizing in the periphery can be crowded by nearby flankers. Please note: Please start using ds633.0 to access RDA maintained ERA-5 data, see ERA5 Reanalysis (0.25 Degree Latitude-Longitude Grid) [https://rda.ucar.edu/datasets/ds633.0], RDA dataset ds633.0. This dataset is no longer being updated, and web access will be removed on October 1, 2019. After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time, though the first segment of data to be released will span the period 2010-2016. ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (18 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, e.g.
mean rates and accumulations, that are not available from the analyses. Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles. NCAR's Data Support Section (DSS) is performing and supplying a grid transformed version of ERA5, in which variables originally represented as spectral coefficients or archived on a reduced Gaussian grid are transformed to a regular 1280 longitude by 640 latitude N320 Gaussian grid. In addition, DSS is also computing horizontal winds (u-component, v-component) from spectral vorticity and divergence where these are available. Finally, the data is reprocessed into single parameter time series. Please note: As of November 2017, DSS is also producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5 for CISL RDA at NCAR. The netCDF-4/HDF5 version is the de facto RDA ERA5 online data format. The GRIB1 data format is only available via NCAR's High Performance Storage System (HPSS). We encourage users to evaluate the netCDF-4/HDF5 version for their work, and to use the currently existing GRIB1 files as a reference and basis of comparison. To ease this transition, there is a one-to-one correspondence between the netCDF-4/HDF5 and GRIB1 files, with as much GRIB1 metadata as possible incorporated into the attributes of the netCDF-4/HDF5 counterpart. Dataset file Context -------- Game of Thrones is a hit fantasy TV show based on the equally famous book series "A Song of Ice and Fire" by George R. R. Martin. The show is well known for its vastly complicated political landscape, large number of characters, and its frequent character deaths.
Content ------------ Of course, it goes without saying that this dataset contains spoilers ;) This dataset combines three sources of data, all of which are based on information from the book series. - Firstly, there is **battles.csv** which contains Chris Albon's "The War of the Five Kings" Dataset. It's a great collection of all of the battles in the series. - Secondly we have **character-deaths.csv** from Erin Pierce and Ben Kahle. This dataset was created as a part of their Bayesian Survival Analysis. - Finally we have a more comprehensive character dataset with **character-predictions.csv**. It includes their predictions on which character will die. Acknowledgements ------------ - Firstly, there is **battles.csv** which contains Chris Albon's "The War of the Five Kings" Dataset, which can be found here: https://github.com/chrisalbon/war_of_the_five_kings_dataset . It's a great collection of all of the battles in the series. - Secondly we have **character-deaths.csv** from Erin Pierce and Ben Kahle. This dataset was created as a part of their Bayesian Survival Analysis which can be found here: http://allendowney.blogspot.com/2015/03/bayesian-survival-analysis-for-game-of.html - Finally we have a more comprehensive character dataset with **character-predictions.csv**. This comes from the team at A Song of Ice and Data who scraped it from http://awoiaf.westeros.org/ . It also includes their predictions on which character will die, the methodology of which can be found here: https://got.show/machine-learning-algorithm-predicts-death-game-of-thrones Inspiration ------------ What insights about the complicated political landscape of this fantasy world can you find in this data? Iris data set head SNP datasets for each species and summary table of SNP information. Note that Lithobates sphenocephalus (Lsp) = Rana sphenocephala.
This dataset includes the data for our third analysis, which examined fecal glucocorticoids (fGC), fecal estrogens (fE), and fecal progestogens (fP) during PPA as a function of the number of months to sexual cycle resumption, or during the various phases of the sexual cycles (early or late follicular and luteal) as a function of the number of cycles to conception. For fGC and fE we had a total of 5,470 hormone samples collected between 2000 and 2014 for 138 females during 549 IBIs. Of these samples, 4,150 fecal samples were collected during PPA, and 1,320 were collected during cycling. Each female contributed fecal samples from an average of 4 IBIs (range: 1-11) with an average of 10 hormone samples per IBI (range: 1-55) and an average total number of 40 samples per female (range: 1-145). For fP, we had a smaller sample size of 4,095 hormone samples collected between 2003 and 2014 for 130 females during 456 IBIs; 2,922 were collected during PPA and 1,173 during cycling. Each female contributed samples from an average of 3.5 IBIs (range 1-8), with an average of 9 (range 1-47) samples per IBI, for an average total number of 32 (range 1-141) samples per female. dataset for test P-HDBF Modern dataset used for the R code "Indo-Pac_figs.R". This dataset contains key characteristics about the data described in the Data Descriptor De novo transcriptomes of 14 gammarid individuals for proteogenomic analysis of seven taxonomic groups. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format Knowledge about the coverage and characteristics of glaciers in High Mountain Asia is still incomplete and heterogeneous. However, several applications such as modelling of past or future glacier development, runoff or glacier volume, rely on the existence and accessibility of complete datasets. 
In particular, precise outlines of glacier extent are required to spatially constrain glacier-specific calculations such as length, area and volume changes or flow velocities. As a contribution to the Randolph Glacier Inventory (RGI) and the Global Land Ice Measurements from Space (GLIMS) glacier database, we have produced a homogeneous inventory of the Pamir and the Karakoram mountain ranges using 28 Landsat TM and ETM+ scenes acquired around the year 2000. We applied a standardized method of automated digital glacier mapping and manual correction using coherence images from ALOS-1 PALSAR-1 as an additional source of information; we then separated the glacier complexes into individual glaciers using drainage divides derived by watershed analysis from the ASTER GDEM2, and separately delineated all debris-covered areas. Assessment of uncertainties was performed for debris-covered and clean-ice glacier parts using the buffer method and independent multiple digitizing of three glaciers representing key challenges such as shadows and debris cover. Indeed, along with seasonal snow at high elevations, shadow and debris cover represent the largest uncertainties in our final dataset. In total, we mapped more than 27'800 glaciers >0.02 km² covering an area of 35'520 ±1948 km² and an elevation range from 2260 m to 8600 m. Regional median glacier elevations vary from 4150 m (Pamir Alai) to almost 5400 m (Karakoram), which is largely due to differences in temperature and precipitation. Supraglacial debris covers an area of 3587 ±662 km², i.e. 10% of the total glacierised area. Larger glaciers have a higher share in debris-covered area (up to >20%), making it an important factor to be considered in subsequent applications.
Emission intensity profile is estimated by averaging the signal over several pixel rows from the recorded images to get a complementary dataset that yields information similar to optical time of flight (OTOF) measurements, except that the current measurements provide a convolution of the times of flight of all the species that move within the plume for various energies in the SP scheme. While a double peak is visible for irradiation energies greater than or equal to 200 microjoules with larger plume length, 100 microjoules shows a single peak with low emission counts and low spatial expansion. Also, the emission count corresponding to fast peaks is always less than that of its slow counterpart in each case. To compare all the graphs, the emission count is normalized using the maximum emission count in all the data sets (which is the DP 100 case). Content Hotel reviews given by customers. Near-surface air temperatures were monitored from 2005 to 2010 in a mesoscale network of 230 sites in the foothills of the Rocky Mountains in southwestern Alberta, Canada. The monitoring network covers a range of elevations from 890 to 2880 m above sea level and an area of about 18 000 km², sampling a variety of topographic settings and surface environments with an average spatial density of one station per 78 km². This paper presents the multiyear temperature dataset from this study, with minimum, maximum, and mean daily temperature data. In this paper, we describe the quality control and processing methods used to clean and filter the data and assess its accuracy. Overall data coverage for the study period is 91 %. We introduce a weather-system-dependent gap-filling technique to estimate the missing 9 % of data. Monthly and seasonal distributions of minimum, maximum, and mean daily temperature lapse rates are shown for the region.
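A near-surface temperature lapse rate of the kind mentioned above is the change of air temperature with elevation. The following is a minimal sketch of one common way to estimate it, assuming an ordinary least-squares fit of temperature against station elevation; the description does not say which method the authors used, `lapse_rate` is a hypothetical helper name, and the station values are synthetic. The last line reproduces the station density quoted in the text from the stated area and site count.

```python
from statistics import mean


def lapse_rate(elev_m, temp_c):
    """Least-squares slope of temperature against elevation, in °C per km."""
    if len(elev_m) != len(temp_c) or len(elev_m) < 2:
        raise ValueError("need at least two paired observations")
    z_bar, t_bar = mean(elev_m), mean(temp_c)
    sxx = sum((z - z_bar) ** 2 for z in elev_m)
    if sxx == 0:
        raise ValueError("stations must span more than one elevation")
    sxy = sum((z - z_bar) * (t - t_bar) for z, t in zip(elev_m, temp_c))
    return 1000.0 * sxy / sxx  # convert °C per m to °C per km


# Synthetic stations spanning the elevation range given in the text
# (assumed values, not taken from the dataset).
elev = [890.0, 1500.0, 2200.0, 2880.0]
temp = [15.0 - 0.006 * (z - 890.0) for z in elev]  # built with a -6 °C/km gradient
rate = lapse_rate(elev, temp)

# Average area per station from the figures in the text: 18 000 km² and 230 sites.
area_per_station = 18000 / 230
```

In practice the fit would be repeated per day (or per month and season) for the minimum, maximum, and mean temperature series separately.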
json file The results of the membrane feeding experiments performed during the different infectivity surveys are included in this dataset. In addition to data on the feeding experiments, information on the age of study participants is included. Morphological data of the Lunterse beek dataset the file contains Standardized multilocus heterozygosity calculated at 37 putatively neutral markers as well as at 6 MHC-linked markers and fitness data of 147 male Alpine ibex of Gran Paradiso National Park (Italy). Please note that fitness data were only available for a subset of the genotyped individuals. The heterozygosity-fitness correlations were therefore based only on 147 out of the 247 individuals in file "genotypes_PNGP". The full dataset was however used for the calculations of diversity measures and also for the Standardized multilocus heterozygosity that was calculated for each individual as the ratio of its heterozygosity to the mean heterozygosity in the population of the loci at which the individual was genotyped (Coltman et al., 1999). Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered? Dataset including information on children's diet and household characteristics for over 43,000 households across 27 developing countries from the Demographic and Health Surveys program 2000 - 2013 (https://www.dhsprogram.com/data/available-datasets.cfm). The dataset also includes spatial variables such as distance to forest edge, road and city. 
This geodatabase was built to cover several geothermal targets developed by Flint Geothermal in 2012 during a search for high-temperature systems that could be exploited for electric power development. Several of the thermal springs have geochemistry and geothermometry values indicative of high-temperature systems. In addition, the explorationists discovered a very young Climax-style molybdenum porphyry system northeast of Rico, and drilling intersected thermal waters at depth. Datasets include: 1. Structural data collected by Flint Geothermal 2. Point information 3. Mines and prospects from the USGS MRDS dataset 4. Results of reconnaissance shallow (2 meter) temperature surveys 5. Air photo lineaments 6. Areas covered by travertine 7. Groundwater geochemistry 8. Land ownership in the Rico area 9. Georeferenced geologic map of the Rico Quadrangle, by Pratt et al. 10. Various 1:24,000 scale topographic maps During hot summer days and heat waves, bedrooms can become warm and uncomfortable, affecting sleep quality. This dataset presents bedroom air temperatures, bedroom window positions and important bedroom characteristics (floor, orientation and roof material) to investigate factors that promote cool bedrooms. The dataset presents air temperatures measured in 20 bedrooms of terraced houses in Amsterdam, the Netherlands, during an extremely hot summer week in 2016. The dataset also presents outdoor temperatures measured over the same time period. The dataset includes time series of the window position of the bedroom windows (open, half-open, closed) to investigate the effect of the window position on the bedroom temperature. The air temperature measurements were carried out with IButtons, recording air temperatures with a time interval of 10 minutes. The measurement accuracy of an IButton is 0.5 °C. In each bedroom, two IButtons were placed, often on bedside tables in the shade, and in bedrooms on the first floor.
In most cases this is the top floor of the houses, except for two bedrooms which were on the mezzanine floor. Bedrooms were orientated to the southeast or northwest. The roofing material was bitumen/EPDM, gravel or green. Two IButtons measured the outdoor temperature. These were placed outside in the shade: one in the front yard and one in the back yard of one of the terraced houses. All IButtons registered air temperatures between Tuesday, August 22nd, 2016 and Sunday August 28th, 2016. Each hour, the homeowners recorded whether the bedroom window was closed, half-open or open. This dataset includes annual urban extent dynamics (1985-2015) in the conterminous United States at a 30 m resolution. (1) The dataset is organized by state (49 in total) in the conterminous US. The location of US states can be found in the uploaded figure “US_State.jpg”. Full names and abbreviations of states are provided in the Excel file “US_StateList.xls”. (2) The data are provided in GeoTIFF format, i.e., the georeferencing information is embedded within the TIFF file. Each dataset was projected to the Albers Equal Area Conic projection, with a spatial resolution of 30 m. (3) The legend of the GeoTIFF files can be found in the figure “Legend.jpg”, and more detailed information about the urbanized year and the pixel value can be found in the file “Year_Code_Loopup.csv”. This dataset contains key characteristics about the data described in the Data Descriptor The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format 3. machine readable metadata file in ISA-Tab format (zipped folder) Context ------------- Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards.
When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly. A football game generates many more events, and it is both important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours as the number of questions that can be asked is huge. Content ------- This dataset is the result of a very tiresome effort of webscraping and integrating different data sources. The central element is the text commentary. All the events were derived by reverse engineering the text commentary, using regex. Using this, I was able to derive 11 types of events, as well as the main player and secondary player involved in those events and many other statistics. In case I've missed extracting some useful information, you are gladly invited to do so and share your findings. The dataset provides a granular view of 9,074 games, totaling 941,009 events from the biggest 5 European football (soccer) leagues: England, Spain, Germany, Italy, France from the 2011/2012 season to the 2016/2017 season as of 25.01.2017. There are games that have been played during these seasons for which I could not collect detailed data. Overall, over 90% of the played games during these seasons have event data. The dataset is organized in 3 files: - **events.csv** contains event data about each game. Text commentary was scraped from: bbc.com, espn.com and onefootball.com - **ginf.csv** - contains metadata and market odds about each game.
Odds were collected from oddsportal.com - **dictionary.txt** contains a dictionary with the textual description of each categorical variable coded with integers Past Research ------------- I have used this data to: - create predictive models for football games in order to bet on football outcomes. - make visualizations about upcoming games - build expected goals models and compare players Inspiration ----------- There are tons of interesting questions a sports enthusiast can answer with this dataset. For example: - What is the value of a shot? Or what is the probability of a shot being a goal given its location, shooter, league, assist method, gamestate, number of players on the pitch, time - known as expected goals (xG) models - When are teams more likely to score? - Which teams are the best or sloppiest at holding the lead? - Which teams or players make the best use of set pieces? - In which leagues is the referee more likely to give a card? - How do players compare when they shoot with their weak foot versus strong foot? Or which players are ambidextrous? - Identify different styles of plays (shooting from long range vs shooting from the box, crossing the ball vs passing the ball, use of headers) - Which teams have a bias for attacking on a particular flank? And many many more... Context This is a fictitious dataset created for the Data Analytics Bootcamp at ILTACON 2018. Content In this dataset, each legal case is a row. We have generated fake data for each case attribute. Acknowledgements Thanks to @Ventrisfox for generating the fake data. Data Set The labelled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labelled training set does not include any of the same movies as the 25,000 review test set.
In addition, there are another 50,000 IMDB reviews provided without any rating labels. File descriptions - **labeledTrainData -** The labelled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. - **testData -** The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. - **unlabeledTrainData -** An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. - **sampleSubmission -** A comma-delimited sample submission file in the correct format. Data fields - **id -** Unique ID of each review - **sentiment -** Sentiment of the review; 1 for positive reviews and 0 for negative reviews - **review -** Text of the review The dataset and the corresponding code (Matlab) can be used for recoloring images, thus helping the people with Color Vision Deficiencies (CVD) recognize and communicate color information. Please cite the following paper if you wish to use our dataset and code: Yulun Wang, Duo li, Menghan Hu, Liming Cai, Guangtao Zhai. Non-local Recoloring Algorithm for Color Vision Deficiencies with Naturalness and Detail Preserving (Unpublished paper, will be updated later). If you have any questions, you can send a request to: humenghan89@163.com Emotion expression is an essential part of human interaction. The same text can hold different meanings when expressed with different emotions. Thus understanding the text alone is not enough for getting the meaning of an utterance. Acted and natural corpora have been used to detect emotions from speech. Many speech databases for different languages including English, German, Chinese, Japanese, Russian, Italian, Swedish and Spanish exist for modeling emotion recognition. 
Since there is no reported reference of an available Arabic corpus, we decided to collect the first Arabic Natural Audio Dataset (ANAD) to recognize discrete emotions. Embedding an effective emotion detection feature in speech recognition system seems a promising solution for decreasing the obstacles faced by the deaf when communicating with the outside world. There exist several applications that allow the deaf to make and receive phone calls normally, as the hearing-impaired individual can type a message and the person on the other side hears the words spoken, and as they speak, the words are received as text by the deaf individual. However, missing the emotion part still makes these systems not hundred percent reliable. Having an effective speech to text and text to speech system installed in their everyday life starting from a very young age will hopefully replace the human ear. Such systems will aid deaf people to enroll in normal schools at very young age and will help them to adapt better in classrooms and with their classmates. It will help them experience a normal childhood and hence grow up to be able to integrate within the society without external help. Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to each video and select whether they perceive a happy, angry or surprised emotion. Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1 sec speech units forming our final corpus composed of 1384 records. Twenty five acoustic features, also known as low-level descriptors, were extracted. These features are: intensity, zero crossing rates, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (Fundamental frequency) and F0 envelope, probability of voicing and, LSP frequency 0-7. 
On every feature nineteen statistical functions were applied. The functions are: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, Linear Regression1, Linear Regression2, Linear RegressionA, Linear RegressionQ, standard deviation, kurtosis, skewness, quartiles 1, 2, 3 and inter-quartile ranges 1-2, 2-3, 1-3. The delta coefficient for every LLD is also computed as an estimate of the first derivative, hence leading to a total of 950 features. I would never have reached this far without the help of my supervisors. I warmly thank and appreciate Dr. Rached Zantout, Dr. Lama Hamandi, and Dr. Ziad Osman for their guidance, support and constant supervision. Presence-absence dataset of small mammals in 164 counties of the Hengduan Mountains Model parameters and derived summary statistics for simulated dataset The original form of this dataset is at this page [http://qwone.com/~jason/20Newsgroups/][1] The 20 Newsgroups data set is a collection of approximately 19K newsgroup documents. This is the third version, which has 18,828 documents. All files have been converted to txt format. [1]: http://qwone.com/~jason/20Newsgroups/ Summary of CIBERSORT algorithm analysis with all datasets and PCD samples. Cell types are in columns and samples in rows. Values are ratios over 1. Reconstructed slices of a 3D tomographic dataset without ring artifacts suppression. The datasets contain records of deceased people collected from ssdmf.info Dataset used to evaluate sample size to estimate genomic kinship. My dataset contains 10,000 images of Indian vehicle license plates. A, exon 2 diversity compared to the rest of the coding region (fused exons 1, 3, 4, 5 and 6) in the dataset including the entire coding region (49 sequences). Mean ± sem is shown. B, synonymous and non-synonymous diversity in the coding region in the dataset including the entire coding region.
In the short exon 5 (24 bp) half of the alleles have G instead of C at the nucleotide position 22, resulting in high apparent diversity for the whole exon. Mean ± sem is shown. C, sliding window analysis of non-synonymous, synonymous and complex substitutions in the -e2 in the dataset including the complete -e2. Complex stands for complex combinations of non-synonymous and synonymous substitutions in the same codon. The graph illustrates the contribution of these different components in , which is not equal to and (, as calculated here does not take into consideration the capability of the codon to mutate in synonymous and non-synonymous manner).Copyright information:Taken from "Sequence features of locus define putative basis for gene conversion and point mutations"http://www.biomedcentral.com/1471-2164/9/228BMC Genomics 2008;9():228-228.Published online 19 May 2008PMCID:PMC2408603. Data matrices and results from all RAxML analyses of concatenated datasets described in the study **SOURCE** Data taken from : **Boehringer Ingelheim** Predict a biological response of molecules from their chemical properties https://www.kaggle.com/c/bioresponse/data Context There's a story behind every dataset and here's your opportunity to share yours. Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too. Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research. Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered? Catchment areas of Lake Baikal catchment, projected in UTM Z48, WGS 84. The data were taken from Swiercz, S. (2004). GIS supported characterization of the Baikal region. Diploma Thesis, Free University of Berlin, Germany. 
The dataset has two components: * Lake Baikal catchment area * Catchment areas of the main tributary rivers to Lake Baikal (Selenga, Barguzin, Upper Angara) This data repository contains (1) a yearly global autotrophic respiration (RA) dataset from 1980 to 2012 with a spatial resolution of 0.5°; (2) the original field observations used to develop the Random Forest (RF) model; (3) the main R codes used to produce the RA database. Model description: The globally gridded RA database was developed by Random Forest (RF) with 449 field observations (see “dataset.csv” in this repository, updated from Bond-Lamberty and Thomson, 2018) using 11 global variables, including gridded temperature, precipitation, diurnal temperature range, potential evapotranspiration, Palmer Drought Severity Index, nitrogen deposition, downward shortwave radiation, soil carbon content, soil nitrogen density, soil water content, land cover. Dataset information: Dataset name: “Respiration_autotrophic_belowgroud_glob_1980_2012_yr_half_dgree.nc”, which means global belowground autotrophic respiration from 1980 to 2012 with a spatial resolution of 0.5° at a yearly step. Units: g C m⁻² yr⁻¹. Format: network Common Data Form (netCDF). Spatial coverage: 90S-90N, 180W-180E. The “dataset.csv” file contains the field observations from peer-reviewed publications, combined with the Global Soil Respiration Database (SRDB v4, Bond-Lamberty and Thomson, 2018), which is publicly available at https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1578. In addition, the database was further updated using observations collected from the China Knowledge Resource Integrated Database (www.cnki.net) up to November 2018 according to the criteria of SRDB.
This dataset is provided in “.csv” format. R codes: 10fold_CV_RA.txt: 10-fold CV for RA; Annual_variability_RA.txt: annual variability for global RA; CMP_RA.txt: comparing RF-RA and Hashimoto2015-RA using the CMP approach; Ra_DD_CC_plot.txt: plotting the comparison results from CMP; RA_MAT_MAP_anomaly.txt: plotting and modelling the relationship between temperature/precipitation anomalies and RA; RGB_plot.txt: deriving the RGB plot to detect the relative importance of temperature, precipitation and shortwave radiation. This dataset provides the results of warming incubation of Arctic soils from trough areas of a high-center polygon at the Barrow Environmental Observatory (BEO) in northern Alaska, United States. The organic-rich soil (8-20 cm below ground surface) and the mineral-rich soil (22-45 cm below surface) were separated, and the thawed and homogenized subsamples from each soil were incubated at -2 degrees C or 8 degrees C for 122 days under anoxic conditions (headspace filled with N2). The DOM extracted from the soil samples was analyzed by Fourier transform ion cyclotron resonance mass spectrometry coupled with electrospray ionization (ESI-FTICR-MS). Reported analytes include soil water content, dissolved organic carbon, total organic carbon, MS peaks' m/z and intensities, and elemental composition of identified molecular formulas. Genotype information for each sample is displayed on a single line. Labels for diploid loci are organized as column headers along the top. Dataset includes 369 individuals collected from 10 geographic locations. This file is formatted for input in GenAlEx v.6.502 Context I have developed an online judge for my university named [RUET OJ][1]. I am sharing the server log dataset of RUET OJ. Content This dataset has 16008 rows and 4 columns. Columns are IP, Time, URL, Response Status. Acknowledgements This dataset is too small for research. But I hope other people will also share larger web log datasets, as such datasets are rare here.
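As a rough illustration of working with the four columns listed above (IP, Time, URL, Response Status), here is a minimal sketch that tallies requests per response status. It assumes the log is a comma-separated file with a header row using exactly those column names, which the description does not confirm; the sample rows and the `status_counts` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the assumed layout; the real file's delimiter,
# header spelling, and timestamp format may differ.
SAMPLE = """IP,Time,URL,Response Status
10.0.0.1,2018-01-01 10:00:00,/problems,200
10.0.0.2,2018-01-01 10:00:05,/submit,302
10.0.0.1,2018-01-01 10:01:00,/missing,404
"""


def status_counts(handle):
    """Count log rows per value of the 'Response Status' column."""
    reader = csv.DictReader(handle)
    return Counter(row["Response Status"] for row in reader)


counts = status_counts(io.StringIO(SAMPLE))
for status, n in sorted(counts.items()):
    print(status, n)
```

To run it on the actual log, pass an open file handle instead of the in-memory sample.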
Inspiration This dataset will inspire other people to share their collected web log datasets. [1]: http://ruetoj.ml Excel file containing the full dataset of the paper "Sediment Respiration Pulses in Intermittent Rivers and Ephemeral Streams". The first sheet contains a description of the variables. The second sheet contains the data. These data were used together with the R code (Code S1 file) to generate the results presented in the paper. Context A list of jobs recommended to an individual. Content List of suggested job recommendations. Acknowledgements Thank you LinkedIn and IFFFT for helping to collect the dataset. Inspiration I wish to design an advanced version of a recommendation engine. About This Dataset ===== You can use this font file to generate Chinese characters. The generated images can be used to train a machine learning model to recognize text. Dataset is updating ===== Tell me if you have other font files or anything related to this topic. Copied from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en for use in Kaggle Kernels. Movie revenue depends on multiple factors such as cast, budget, film critic review, MPAA rating, release year, etc. Because of these multiple factors there is no analytical formula for predicting how much revenue a movie will generate. However, by analyzing revenues generated by previous movies, one can build a model which can help us predict the expected revenue for a movie. Such a prediction could be very useful for the movie studio producing the movie, so that it can decide on expenses like artist compensation, advertising, promotions, etc. accordingly. In addition, investors can predict an expected return on investment. This data has been modified from the RAW data by: The PM10, PM2.5 and PM1 mass concentrations were measured using an environmental dust monitor (Grimm EDM 180-MC, GRIMM Aerosol Technik GmbH & Co.
KG) at an interval of 5 min in Kunming. These converted datasets are combined and averaged over 1 hour, and then saved to this file. Dataset II used for the analysis of primer universality This dataset contains key characteristics about the data described in the Data Descriptor Time series of heat demand and heat pump efficiency for energy system modeling. Contents: 1. human readable metadata summary table in CSV format 2. machine readable metadata file in JSON format Input data for case studies. Context Mass Shootings in the United States of America (1966-2017) The US has witnessed 398 mass shootings in the last 50 years that resulted in 1,996 deaths and 2,488 injured. The latest and worst mass shooting, on October 2, 2017, killed 58 and injured 515 so far. The number of people injured in this attack is more than the number of people injured in all mass shootings of 2015 and 2016 combined. The average number of mass shootings per year is 7 for the last 50 years, claiming on average 39 lives and leaving 48 injured per year. Content Geography: United States of America Time period: 1966-2017 Unit of analysis: Mass Shooting Attack Dataset: The dataset contains detailed information on 398 mass shootings in the United States of America that killed 1,996 and injured 2,488 people. Variables: The dataset contains Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issue, Race, Gender, and Lat-Long information. Acknowledgements I’ve consulted several public datasets and web pages to compile this data. Some of the major data sources include [Wikipedia][1], [Mother Jones][2], [Stanford][3], [USA Today][4] and other web sources. Inspiration With a broken heart, I'd like to call the attention of my fellow Kagglers to use Machine Learning and Data Science to help me explore these ideas: • How many people got killed and injured per year?
• Visualize mass shootings on the U.S. map • Is there any correlation between the shooter and his/her race or gender? • Any correlation with calendar dates? Do we have more deadly days, weeks or months on average? • What cities and states are more prone to such attacks? • Can you find and combine any other external datasets to enrich the analysis, for example, gun ownership by state? • Any other pattern you see that can help in prediction, crowd safety or in-depth analysis of the event • How many shooters have some kind of mental health problem? Can we compare those shooters with the general population with the same condition? Mass Shootings Dataset Ver 3 This is the new version of the Mass Shootings Dataset. I've added eight new variables: 1. Incident Area (where the incident took place), 2. Open/Close Location (inside a building or open space), 3. Target (possible target audience or company), 4. Cause (Terrorism, Hate Crime, Fun (for no obvious reason), etc.), 5. Policeman Killed (how many on-duty officers got killed), 6. Age (age of the shooter), 7. Employed (Y/N), 8. Employed at (Employer Name). Age, Employed and Employed at (3 variables) contain shooter details. Mass Shootings Dataset Ver 4 Quite a few missing values have been added. Mass Shootings Dataset Ver 5 Three more recent mass shootings have been added, including the Texas Church shooting of November 5, 2017. I hope it will help create more visualizations and extract patterns. Keep Coding! [1]: https://en.wikipedia.org/wiki/Category:Mass_shootings_in_the_United_States_by_year [2]: http://www.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data/ [3]: https://library.stanford.edu/projects/mass-shootings-america [4]: http://www.gannett-cdn.com/GDContent/mass-killings/index.htmltitle This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of [vgchartz.com][1].
Fields include * Rank - Ranking of overall sales * Name - The game's name * Platform - Platform of the game's release (e.g. PC, PS4) * Year - Year of the game's release * Genre - Genre of the game * Publisher - Publisher of the game * NA_Sales - Sales in North America (in millions) * EU_Sales - Sales in Europe (in millions) * JP_Sales - Sales in Japan (in millions) * Other_Sales - Sales in the rest of the world (in millions) * Global_Sales - Total worldwide sales. The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup using Python. There are 16,598 records. 2 records were dropped due to incomplete information. [1]: http://www.vgchartz.com/ This dataset consists of 5547 breast histology images of size 50 x 50 x 3, curated from [Andrew Janowczyk's website][1] and used for a data science tutorial at [Epidemium][2]. The goal is to classify cancerous images (IDC: invasive ductal carcinoma) vs non-IDC images. [1]: http://www.andrewjanowczyk.com/use-case-6-invasive-ductal-carcinoma-idc-segmentation/ [2]: http://www.epidemium.cc/ Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (Amelie.Driemel@awi.de) to obtain an account to download these datasets. Newer data are available at: https://dataportals.pangaea.de/bsrn/ NOTE TO USERS: The best way to view the data is by clicking on "View dataset as HTML"; you will then be able to click on the year and station of interest to download the data files. The download format is a .tab file, which can be opened with any program that opens txt files. However, if you want to bulk change the file extension there are various ways to do so, e.g.
within the command window with the command: ren *.tab *.txt This is the dataset related to the article published in Sensors and Actuators B titled "Automatic de-noising of close-range hyperspectral images with a wavelength-specific shearlet-based image noise reduction method". The dataset comprises hyperspectral images of six different tea products acquired in the visible and near-infrared spectral range. Please find the complete description in the attached file "Description of the database.docx". Here we provide two ArcGIS map packages with georeferenced files on the spatial distribution of sponges and echinoderms in the wider Weddell Sea (Antarctica), which were created in the context of the development of a marine protected area (MPA) in the Weddell Sea. Sponges: The map of interpolated occurrence of sponges is based on quantitative abundance data (Gerdes 2014 a - o) and on semi-quantitative data obtained by W. Arntz (retired; formerly AWI) (see Teschke & Brey 2019a for presence / absence records of the latter dataset). The abundance data were classified to be merged with the semi-quantitative data and an inverse distance weighted method was performed on the united dataset. Areas with very common occurrence of sponges occurred on the shelf near Brunt Ice Shelf along Riiser - Larsen Ice Shelf to Ekstrøm Ice Shelf. Echinoderms: A cluster analysis with species x station datasets of asteroids (Teschke & Brey 2019b), ophiuroids (Teschke & Brey 2019c) and holothurians (Gutt et al. 2014) from the Antarctic Weddell Sea indicated a particular cold-water echinoderm fauna on the Filchner shelf. We approximated this potential habitat by bottom temperature ≤ -1°, based on seawater temperature data from the Finite Element Sea Ice - Ocean Model provided by R.
Timmermann (AWI). More information on the spatial analysis is given in working paper WG-EMM-16/03 submitted to the CCAMLR Working Group on Ecosystem Monitoring and Management (available at https://www.ccamlr.org/en/wg-emm-16). This repository provides the supplementary R code and data to reproduce the experiments in the following paper: "Highly accurate autonomic diagnosis of papillary thyroid carcinomas using a pathway-based personalized machine learning algorithms". These include: 1. The main method function R file 2. The main script R file 3. The datasets for the development/validation cohorts (R data file format) 4. The pathway information (R data file format) Context The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines. Content The dataset contains a set of 150 records under 5 attributes - Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species). Acknowledgements This dataset is free and is publicly available at the UCI Machine Learning Repository. This dataset contains files in OPJ format, which can be opened with the data processing software ORIGIN, and in WFM format, which is created by a Tektronix oscilloscope and can be opened with WAVESTAR FOR OSCILLOSCOPES.
This data is for the paper "Broadband Amplification of Low-Terahertz Signals Using Axis-Encircling Electrons in a Helically Corrugated Interaction Region" that has been published in Physical Review Letters. Context Interactive Hand Gesture. Here are color images as well as depth images of hand gestures grouped by their classes. Copyright: Author: Chengyin Liu; Email: destin369y at gmail.com; Year: 2015. Please acknowledge my name if you use this dataset, thank you. Content RGB-D hand gesture images taken by a depth camera, grouped by classes. Please refer to "class.txt". Used for hand gesture recognition evaluation. Acknowledgements This dataset is used for my hand gesture recognition research at National Taiwan University, in the Intelligent Robot Lab under the lead of Prof. Li-Chen Fu. For more details about our lab, please visit http://robotlab.csie.ntu.edu.tw/ I used scrapy to gather data on charities in the United States.
All data has been retrieved from https://www.charitynavigator.org/ Context The world marathon majors consist of six major city marathons (https://en.wikipedia.org/wiki/World_Marathon_Majors): Lists of all historic winners can be found via their individual Wikipedia pages: - Tokyo (https://en.wikipedia.org/wiki/Tokyo_Marathon) - Boston (https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon) - London (https://en.wikipedia.org/wiki/List_of_winners_of_the_London_Marathon) - Berlin (https://en.wikipedia.org/wiki/Berlin_Marathon) - Chicago (https://en.wikipedia.org/wiki/List_of_winners_of_the_Chicago_Marathon) - New York (https://en.wikipedia.org/wiki/List_of_winners_of_the_New_York_City_Marathon) Content Using Wikipedia and Pandas, I've scraped and compiled a list of winners of each race for both male and female runners and processed them into easy-to-use csv files. Code can be found on GitHub [here][1]. Acknowledgements All data scraped from Wikipedia, so thanks to all contributors. [1]: https://github.com/GJBroughton/World_Marathon_Majors Context A dataset for 1300 laptop models. Content 1. Company Name 2. Product Name 3. Laptop Type 4. Screen Inches 5. Screen Resolution 6. CPU Model 7. RAM Characteristics 8. Memory 9. GPU Characteristics 10. Operating System 11. Laptop's Weight 12. Laptop's Price Sokoto Coventry Fingerprint Dataset (SOCOFing) is a biometric fingerprint database designed for academic research purposes. SOCOFing is made up of 6,000 fingerprint images from 600 African subjects and contains unique attributes such as labels for gender, hand and finger name as well as synthetically altered versions with three different levels of alteration for obliteration, central rotation, and z-cut. For a complete formal description and usage policy please refer to the following paper: https://arxiv.org/abs/1807.10609 The original dataset. A cohort of 149 individuals (58 healthy children, 91 pediatric IBD patients).
This sample set was processed and analyzed in Amsterdam, the Netherlands, using an ABI PRISM 3130 Genetic Analyzer. The RData object has two attributes: data - holds the count data of OTU abundances. Rows are samples and columns are OTUs; labels - holds a corresponding label for each sample in data. Use of gamification strategy in Accounting Education in Brazil to improve students' skills. Dataset in Portuguese. A vast number of clinical disorders may involve changes in brain structure that are correlated with cognitive function and behavior (e.g., depression, schizophrenia, stroke, etc.). Reliably understanding the relationship between specific brain structures and relevant behaviors in worldwide clinical populations could dramatically improve healthcare decisions around the world. For instance, if a reliable relationship between brain structure after stroke and functional motor ability was established, brain imaging could be used to predict prognosis/recovery potential for individual patients. However, high heterogeneity in clinical populations in both individual neuroanatomy and behavioral outcomes makes it difficult to develop accurate models of these potentially subtle relationships. Large neuroimaging studies (n>10,000) would provide unprecedented power to successfully relate clinical neuroanatomy changes with behavioral measures. While these sample sizes might be difficult for any one individual to collect, the ENIGMA Center for Worldwide Medicine, Imaging, and Genomics has successfully pioneered meta- and mega-analytic methods to accomplish this task.
ENIGMA brings together a global alliance of over 500 international researchers from over 35 countries to pool together neuroimaging data on different disease states in hopes of discovering critical brain-behavior relationships. Individual investigators with relevant data run ENIGMA analysis protocols on their own data and send back an output folder containing the analysis results to be combined with data from other sites for a meta-analysis. In this way, large sample sizes can be acquired without the hassle of large-scale data transfers or actual neuroimaging data sharing. A test dataset is available on request; if interested, please email npnl@usc.edu. Weather data monitoring has been ongoing since late 2013 in a network of three sites located in the Campi Flegrei volcanic area, near Naples (Italy), in the framework of the MONICA (Innovative Monitoring of Coastal and Marine Environment) Project. The aim of this activity is to acquire time series to analyze the influence of meteorological factors on geomorphological coastal processes, such as cliff retreat, landslides and beach erosion. The uploaded dataset includes data (temperature, rain, wind, barometric pressure and relative humidity) acquired at the Denza automatic weather station (model DAVIS Vantage Pro2 wireless) during the period Jan. 2014 - Dec. 2018. Automatic data transfer from the weather station to the ISMAR-CNR processing center of Naples is performed by an internet LAN connection. Comparisons of this dataset to other publicly available datasets. Included in this file are: 1- SGP-biased genes that are also annotated as expressed in SGPs on http://wormbase.org , 2- hmc-biased genes that are also annotated as expressed in hmcs on wormbase.org , 3- SGP enriched genes [18] that are also detected in our study, 4- C. elegans transcription factors in the wTF2.0 dataset [54] that are SGP-biased in our dataset, and 5- expression results for C. elegans homologs of pluripotency factors.
(XLSX 152 kb) Cloud computing is an emerging technology. It processes huge amounts of data, so the scheduling mechanism plays a vital role in cloud computing. My protocol is therefore designed to minimize switching time and to improve resource utilization, server performance and throughput. The protocol schedules jobs in the cloud and addresses drawbacks of existing protocols. We assign each job a priority, which gives better performance, and aim to minimize waiting time and switching time. Best effort has been made to schedule jobs so as to overcome the drawbacks of existing protocols and to improve the efficiency and throughput of the server. MNIST data from http://neuralnetworksanddeeplearning.com This mango transcriptome assembly was derived from pooled leaf, stem, bud, root, floral and fruit tissue. Using normalized cDNA libraries, we generated comprehensive RNA-Seq datasets using the Illumina NextSeq 500 platform. A total of 82,198 mango unigenes were generated and functionally annotated using a combination of <i>de novo</i> transcriptome assembly, redundancy reduction and Basic Local Alignment Search Tool (BLAST) searches against the Universal Protein Resource UniProtKB/Swiss-Prot database. Natural Speech Dataset This dataset provides leaf traits, proximate composition, fatty acid profile, phenolic composition, and <i>in vitro</i> true digestibility of <i>Acer pseudoplatanus, Fraxinus excelsior, Salix caprea,</i> and <i>Sorbus aucuparia</i> foliage, from data collected in Trivero (Italy) in 2015. The goal of this project is to improve accessibility of open datasets by curating them. “NiData” aims to provide a common interface for documentation, downloads, and examples to all open neuroimaging datasets, making data usable for experts and non-experts alike. NiData is a Python package that provides a single interface for accessing data from a variety of open data sources.
The software framework makes it easy to add new data sources, simple to define and to provide access to multiple datasets from a single data source. Software dependencies are managed on a per-dataset basis, allowing downloads and examples to use any public packages without requiring installation of packages required by unused datasets. The interface also allows selective download of data (by subject or type) and caches files locally, allowing easy management of big datasets. We focused on exposing new methods for downloading data from the HCP, supporting access via Amazon S3 and HTTP/XNAT. We were able to provide a downloader that accepts login credentials and downloads files locally. We created an example that interacts with DIPY to produce diffusion imaging results on a single subject from the HCP. We also worked at collecting common data sources, as well as individual datasets stored at each data source, into NiData’s “data sources” wiki page. We incorporated downloads, documentation, and examples from the nilearn package and began discussion of making a more extensible object model. Voice Gender ---------------- Gender Recognition by Voice and Speech Analysis This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0hz-280hz ([human vocal range][2]). 
The Dataset The following acoustic properties of each voice are measured and included within the CSV: - **meanfreq**: mean frequency (in kHz) - **sd**: standard deviation of frequency - **median**: median frequency (in kHz) - **Q25**: first quartile (in kHz) - **Q75**: third quartile (in kHz) - **IQR**: interquartile range (in kHz) - **skew**: skewness (see note in specprop description) - **kurt**: kurtosis (see note in specprop description) - **sp.ent**: spectral entropy - **sfm**: spectral flatness - **mode**: mode frequency - **centroid**: frequency centroid (see specprop) - **peakf**: peak frequency (frequency with highest energy) - **meanfun**: average of fundamental frequency measured across acoustic signal - **minfun**: minimum fundamental frequency measured across acoustic signal - **maxfun**: maximum fundamental frequency measured across acoustic signal - **meandom**: average of dominant frequency measured across acoustic signal - **mindom**: minimum of dominant frequency measured across acoustic signal - **maxdom**: maximum of dominant frequency measured across acoustic signal - **dfrange**: range of dominant frequency measured across acoustic signal - **modindx**: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range - **label**: male or female Accuracy: Baseline (always predict male) 50% / 50%; Logistic Regression 97% / 98%; CART 96% / 97%; Random Forest 100% / 98%; SVM 100% / 99%; XGBoost 100% / 99%. Research Questions An original analysis of the data-set can be found in the following article: [Identifying the Gender of a Voice using Machine Learning][3] The best model achieves 99% accuracy on the test set. According to a CART model, it appears that looking at the mean fundamental frequency might be enough to accurately classify a voice.
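The single-feature idea can be sketched as a one-line threshold rule. The snippet below is a minimal illustration, not the CART model from the original analysis: it uses the 140 Hz cut-off quoted with the CART diagram in this description, assumes the mean fundamental frequency has already been converted to Hz (the unit of the `meanfun` column is not stated here), and runs on made-up values rather than rows from the dataset.

```python
def classify_by_meanfun(meanfun_hz, threshold_hz=140.0):
    """Label a voice using only its mean fundamental frequency (in Hz)."""
    return "male" if meanfun_hz < threshold_hz else "female"


def accuracy(samples, threshold_hz=140.0):
    """Fraction of (meanfun_hz, label) pairs that the threshold rule gets right."""
    hits = sum(
        classify_by_meanfun(freq, threshold_hz) == label for freq, label in samples
    )
    return hits / len(samples)


# Made-up values for illustration only; the 150 Hz "male" sample is a
# deliberately higher-pitched voice that the rule misclassifies.
toy = [
    (110.0, "male"),
    (125.0, "male"),
    (150.0, "male"),
    (190.0, "female"),
    (210.0, "female"),
]

print(accuracy(toy))
```

A rule this simple is easy to inspect, which is its appeal, but voices that sit near the threshold will be mislabelled, as the toy sample shows.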
However, some male voices use a higher frequency, even though their resonance differs from female voices, and may be incorrectly classified as female. To the human ear, there is apparently more than simple frequency that determines a voice's gender. Questions - What other features differ between male and female voices? - Can we find a difference in resonance between male and female voices? - Can we identify falsetto from regular voices? (separate data-set likely needed for this) - Are there other interesting features in the data? CART Diagram ![CART model][4] Mean fundamental frequency appears to be an indicator of voice gender, with a threshold of 140 Hz separating male from female classifications. References [The Harvard-Haskins Database of Regularly-Timed Speech](http://www.nsi.edu/~ani/download.html) [Telecommunications & Signal Processing Laboratory (TSP) Speech Database at McGill University](http://www-mmsp.ece.mcgill.ca/Documents../Downloads/TSPspeech/TSPspeech.pdf), [Home](http://www-mmsp.ece.mcgill.ca/Documents../Data/index.html) [VoxForge Speech Corpus](http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/8kHz_16bit/), [Home](http://www.voxforge.org) [Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University](http://festvox.org/cmu_arctic/) [1]: http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning [2]: https://en.wikipedia.org/wiki/Voice_frequencyFundamental_frequency [3]: http://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning/ [4]: http://i.imgur.com/Npr2U7O.png Context I want to create an app that could generate instrumentals of the songs that we listen to daily. Content I have generated wav files of different notes using GarageBand. I will use this data to classify musical notes.
Acknowledgements I drew on the Stanford paper on sheet music from audio files by Jan Dlabal and Richard Wedeen. Inspiration Can we even generate midi files of complicated melodies just using wav files of the song? This dataset was used to conduct our first analysis, which examined the duration of IBIs and their component phases. For this analysis we used 36 years of data (collected between 1977 and 2012) on reproductive states, demographic events, dominance rank, and rainfall for 160 wild-feeding females. Specifically, we had a total of 490 IBIs for 160 females that fit our analysis criteria (see Methods - Data Analysis), with each female contributing an average of 3 IBIs to the dataset (range: 1-10). Note that female identity and pregnancy identity have been anonymized and the IDs given are identical across tables. The DM-Authors dataset contains information about 4,906 researchers in the domain of data management. The dataset is a crawl of DBLP from October 2014. For each researcher, demographic attributes (gender, seniority, number of publications and publication rate) and activity attributes (list of venues and keywords that the researcher has contributed to) are provided. Context This data set is helpful for beginners in R or Python. It is a simple data set for analysis. Content A training institute (XXX) offers training programs in courses across the Mechanical, Computer and Electrical domains. Two types of courses are offered: ATP (Advance Training Programme; job oriented; targeted at Bachelor of Engineering students) and MTP (Modular Training Programme; not job oriented; targeted at those who want to update their skills). Training.csv is the enquirer list generated for the 2016-2017 batch. ATP runs in 3 batches per year (Jan, July and Sep). MTP runs throughout the year. Content This dataset is all of Hubway's ridership data and station information up to December 2017.
License Hubway's data license agreement can be found here: https://www.thehubway.com/data-license-agreement Overview -------- The World of Warcraft Avatar History Dataset is a collection of records that detail information about player characters in the game over time. It includes information about their character level, race, class, location, and social guild. The Kaggle version of this dataset includes only the information from 2008 (and the dataset in general only includes information from the 'Horde' faction of players in the game from a single game server). - Full Dataset Source and Information: [http://mmnet.iis.sinica.edu.tw/dl/wowah/][1] - Code used to clean the data: [https://github.com/myles-oneill/WoWAH-parser][2] Ideas for Using the Dataset --------------------------- From the perspective of game system designers, players' behavior is one of the most important factors they must consider when designing game systems. To gain a fundamental understanding of the game play behavior of online gamers, exploring users' game play time provides a good starting point. This is because the concept of game play time is applicable to all genres of games and it enables us to model the system workload as well as the impact of system and network QoS on users' behavior. It can even help us predict players' loyalty to specific games. Open Questions -------------- - Understand user gameplay behavior (game sessions, movement, leveling) - Understand user interactions (guilds) - Predict players unsubscribing from the game based on activity - What are the most popular zones in WoW, what level players tend to inhabit each? Wrath of the Lich King ---------------------- An expansion to World of Warcraft, "Wrath of the Lich King" (Wotlk) was released on November 13, 2008. It introduced new zones for players to go to, a new character class (the death knight), and a new level cap of 80 (up from 70 previously).
This event intersects nicely with the dataset and is probably interesting to investigate. Map --- This dataset doesn't include a shapefile (if you know of one that exists, let me know!) to show where the zones the dataset talks about are. Here is a list of zones and information from this version of the game, including their recommended levels: http://wowwiki.wikia.com/wiki/Zones_by_level_(original) . **Update (Version 3)**: [dmi3kno][3] has generously put together some supplementary zone information files which have now been included in this dataset. Some notes about the files: *Note that some zone names contain Chinese characters. Unicode names are preserved as a key to the original dataset. This addition makes it possible to understand properties of the zones a bit better - their relative location to each other, competitive properties, type of gameplay and, hopefully, their contribution to character leveling. Location coordinates contain some redundant (and possibly duplicate) records as they are collected from different sources. Working with uncleaned location coordinate data will allow users to demonstrate their data wrangling skills (both working with strings and spatial data).* [1]: http://mmnet.iis.sinica.edu.tw/dl/wowah/ [2]: https://github.com/myles-oneill/WoWAH-parser [3]: https://www.kaggle.com/dmi3kno This dataset includes global bias-corrected climate model output data from version 1 of NCAR's Community Earth System Model (CESM1) that participated in phase 5 of the Coupled Model Intercomparison Project (CMIP5), which supported the Intergovernmental Panel on Climate Change Fifth Assessment Report (IPCC AR5). The dataset contains all the variables needed for the initial and boundary conditions for simulations with the Weather Research and Forecasting model (WRF) or the Model for Prediction Across Scales (MPAS), provided in the Intermediate File Format specific to WRF and MPAS.
The data are interpolated to 26 pressure levels and are provided in files at six-hourly intervals. The variables have been bias-corrected using the European Centre for Medium-Range Weather Forecasts (ECMWF) Interim Reanalysis (ERA-Interim) fields for 1981-2005, following the method in Bruyere et al. (2014) [http://dx.doi.org/10.1007/s00382-013-2011-6]. Files are available for a 20th Century simulation (1951-2005) and three concomitant Representative Concentration Pathway (RCP) future scenarios (RCP4.5, RCP6.0 and RCP8.5) spanning 2006-2100. NOTE: There are no bias-corrected data for RCP2.6, due to corrupted data caused by a model bug in CESM. Note to Microsoft Windows users: The executable metgrid.exe, which is required to ingest this data into WPS/WRF, is not compatible with Windows and can only be run in a Linux environment. It is recommended, therefore, that this dataset be used in Linux environments only. These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. Context The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.
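To make the prediction task concrete, here is a minimal Python sketch that fits a one-variable least-squares line from an easy-to-obtain measurement to the ring count, then converts rings to an age. Two things are assumptions rather than part of this description: the "rings + 1.5 = age in years" conversion, which is the convention given in the UCI documentation for this dataset, and the toy numbers, which are invented and exactly linear so the fit is easy to check by hand.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def age_from_rings(rings, offset=1.5):
    """Convert a ring count to age in years (offset is the assumed UCI convention)."""
    return rings + offset


# Invented, exactly linear toy data: rings = 20 * length + 1.
lengths = [0.2, 0.3, 0.4, 0.5, 0.6]
rings = [5, 7, 9, 11, 13]

slope, intercept = fit_line(lengths, rings)
predicted_rings = slope * 0.45 + intercept
print(age_from_rings(predicted_rings))
```

Real measurements will not fall on a line, and a single predictor is unlikely to be enough, so this should be read as a starting point for a regression baseline rather than a model of the data.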
Original Dataset The original dataset can be accessed at [https://archive.ics.uci.edu/ml/datasets/abalone][1]. [1]: https://archive.ics.uci.edu/ml/datasets/abalone Lately, black carbon (BC) has received significant attention due to its climate-warming properties and adverse health effects. Nevertheless, long-term observations in urban areas are scarce, most likely because BC monitoring is not required by environmental legislation. This, however, handicaps the evaluation of air quality models which can be used to assess the effectiveness of policy measures which aim at reducing BC concentrations. Here, we present a new dataset of atmospheric BC measurements from Germany constructed from over six million measurements at over 170 stations. Data covering the period between 1994 and 2014 were collected from twelve German federal states and the federal Environment Agency (UBA), quality checked and harmonized into a database with comprehensive metadata. The final data in original time resolution are available for download (link will follow). Though assembled in a consistent way, the dataset is characterized by differences in (a) measurement methodologies for determining evolved carbon and optical absorption, (b) covered time periods, and (c) temporal resolutions that ranged from half-hourly to 6-daily measurements. Usage of this dataset thus requires a careful consideration of these differences. Our analysis focuses on 2009, the year with the largest data coverage obtained with one single methodology, as well as on the relative changes in long-term trends over ten years. Stations are grouped into the following categories: urban background, traffic, industrial, and rural. For 2009, we find that BC concentrations at traffic sites were at least twice as high as at urban background, industrial and rural sites. 
Weekly cycles are most prominent at traffic stations; however, the presence of differences in concentrations during the week and on weekends at other station types suggests that traffic plays an important role throughout the full network. Generally higher concentrations and weaker weekly cycles during the winter months point towards the influence of other sources such as domestic heating. Regarding the long-term trends, advanced statistical techniques allow us to account for instrumentation changes and to separate seasonal and long-term changes in our dataset. Analysis shows a downward trend in BC at nearly all locations and in all conditions, with a high level of confidence for the period of 2005-2014. In-depth analysis indicates that background BC is decreasing slowly, while the occurrences of high concentrations are decreasing more rapidly. In summary, legislation - both in Europe and locally - to reduce particulate emissions and indirectly BC appears to be working, based on this analysis. Human health and climate impacts are likely to be diminished because of the improvements in air quality. Context The real estate markets, like those in Sydney and Melbourne, present an interesting opportunity for data analysts to analyze and predict where property prices are heading. Prediction of property prices is becoming increasingly important and beneficial. Property prices are a good indicator of both the overall market condition and the economic health of a country. Considering the data provided, we are wrangling a large set of property sales records stored in an unknown format and with unknown data quality issues. Actual expenditures for operating funds (General, Special Revenue, Enterprise and Food Services Funds) per student. Student count is enrollment as of October 1. 
This is the APS dataset with estimated parameters. The data explosion driven by high-throughput and high-resolution technologies has led to rapid growth of reference resources and the accumulation of a considerably trustworthy knowledge base. Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. However, as datasets continue to grow, the time involved in this simple task may be prohibitive. In large-scale data mining, the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and to exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm, which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize the computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and large numbers of comparisons and rapidly filter out unproductive pairwise comparisons. We present two LINCS applications to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach. This is the amino acid dataset used to infer the phylogeny presented in Figure 1. It contains 15,549 positions and was assembled from 79 proteins encoded in the chloroplast genomes of 63 green algae. The data set is in Phylip format. List of the genes composing the merged matrix of all transcriptomic data from datasets and PCD samples (n=9939). Identifiers on the rise in Germany addresses the current development of identifiers in Germany's research landscape. The introduction of ORCID in Germany's universities and research organizations is meeting with a lot of interest and a quick uptake. 
The talk also illustrates the challenges for personal identifiers, taking German history especially into account. Within the talk I will present the latest results from a study on the usage and spread of ORCID in academic institutions in Germany in the course of the ORCID DE project. The presentation will also touch upon the discussion about the recently presented "Kerndatensatz Forschung" (Research Core Dataset) recommended by the German Council of Science and Humanities, which aims to gather coherent information about research activities, also using authority files such as ORCID, DOI and organization identifiers (http://www.forschungsinfo.de/kerndatensatz/en/index.php?home). The standard biomedical terminologies ICD-10, ICD-O, TNM, MeSH, NCIt, MedDRA, and SNOMED CT were used in a case study where two dimensions of cancer (anatomy and histology) had already been coded in a dataset using a custom terminology (ROCHE). Stata dataset of 2,118 Lobbying Disclosure Act reports from 574 organizations active on the 2014 Farm Bill. Data originally collected and coded by the Center for Responsive Politics. Includes name of organization, dollar amount of reported expenses, number of lobbyists, lobbyists with previous government experience (revolving door), description of issue, sector and industry of organization, and topic codes (created by author). Dataset used. Copyright information: Taken from "Transterm—extended search facilities and improved integration with other databases", Nucleic Acids Research 2005;34(Database issue):D37-D40. Published online 28 Dec 2005. PMCID: PMC1347521. © The Author 2006. Published by Oxford University Press. All rights reserved. Shown is a selection of the type of pre-processed data to view in progress, with the results of a pattern description search from a previous action in the low frame (see also ). The file contents for each type of data have been described previously (). 
These include redundant and non-redundant 3′- and 5′-flanks, CDS, initiation and termination contexts; consensuses and information content of the initiation and termination contexts; codon usage; list of entries making up the dataset; scientific and short names of the species; an overall summary file. Dataset for genetic relatedness estimates, using TrioML for degree of genetic relatedness between male and female Content **This dataset contains the following important columns:** 'start' - When the president got elected 'end' - When the term ended 'president' - Who was the president 'prior' - What was he before becoming president 'party' - The supporting party 'vice' - The Vice President during the term Context I am experimenting with various deep learning architectures on this dataset. This dataset can be used in classification, localization, segmentation, generative models and face detection/recognition/classification. Content The dataset has 52156 RGB images. The train and test sets are split for each of the 86 classes with a ratio of 0.8. The original images have a dimension of 1988x3056, reduced to 288x432 using OpenCV. Acknowledgements I downloaded the books from different webpages. In the future, I can add new images if needed. Inspiration Comic books have different images than standard images that are worked on. The characters, images, environments, colors, and more in this data set are much more challenging and confusing than the image data sets that have been worked on before. Besides, the results for the use of GANs are much bigger and more complicated than those that have achieved successful results. Content BMS - Building Management System is a data set which describes complaints (reactives) from people who work in the building. Process: Hospital staff report to facilities if, for example, an area (room) is too hot; an engineer then goes to check the area and marks the task as completed. I want to show the average time each month to complete this task. 
Under category of work there is ** Area Too Hot. I want to select that category of work and to show average time to complete that task and compare with other months. Building - Hospital in London (St Barts Hospital) associated dataset for Data Descriptor: Data Record 2, Raw Data.xlsx File descriptions ----------------- - train.csv - the training set - test.csv - the test set - data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here - sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms Data fields ----------- Here's a brief version of what you'll find in the data description file. - SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict. - MSSubClass: The building class - MSZoning: The general zoning classification - LotFrontage: Linear feet of street connected to property - LotArea: Lot size in square feet - Street: Type of road access - Alley: Type of alley access - LotShape: General shape of property - LandContour: Flatness of the property - Utilities: Type of utilities available - LotConfig: Lot configuration - LandSlope: Slope of property - Neighborhood: Physical locations within Ames city limits - Condition1: Proximity to main road or railroad - Condition2: Proximity to main road or railroad (if a second is present) - BldgType: Type of dwelling - HouseStyle: Style of dwelling - OverallQual: Overall material and finish quality - OverallCond: Overall condition rating - YearBuilt: Original construction date - YearRemodAdd: Remodel date - RoofStyle: Type of roof - RoofMatl: Roof material - Exterior1st: Exterior covering on house - Exterior2nd: Exterior covering on house (if more than one material) - MasVnrType: Masonry veneer type - MasVnrArea: Masonry veneer area in square feet - ExterQual: Exterior material quality - ExterCond: 
Present condition of the material on the exterior - Foundation: Type of foundation - BsmtQual: Height of the basement - BsmtCond: General condition of the basement - BsmtExposure: Walkout or garden level basement walls - BsmtFinType1: Quality of basement finished area - BsmtFinSF1: Type 1 finished square feet - BsmtFinType2: Quality of second finished area (if present) - BsmtFinSF2: Type 2 finished square feet - BsmtUnfSF: Unfinished square feet of basement area - TotalBsmtSF: Total square feet of basement area - Heating: Type of heating - HeatingQC: Heating quality and condition - CentralAir: Central air conditioning - Electrical: Electrical system - 1stFlrSF: First Floor square feet - 2ndFlrSF: Second floor square feet - LowQualFinSF: Low quality finished square feet (all floors) - GrLivArea: Above grade (ground) living area square feet - BsmtFullBath: Basement full bathrooms - BsmtHalfBath: Basement half bathrooms - FullBath: Full bathrooms above grade - HalfBath: Half baths above grade - BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms) - KitchenAbvGr: Kitchens above grade - KitchenQual: Kitchen quality - TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) - Functional: Home functionality rating - Fireplaces: Number of fireplaces - FireplaceQu: Fireplace quality - GarageType: Garage location - GarageYrBlt: Year garage was built - GarageFinish: Interior finish of the garage - GarageCars: Size of garage in car capacity - GarageArea: Size of garage in square feet - GarageQual: Garage quality - GarageCond: Garage condition - PavedDrive: Paved driveway - WoodDeckSF: Wood deck area in square feet - OpenPorchSF: Open porch area in square feet - EnclosedPorch: Enclosed porch area in square feet - 3SsnPorch: Three season porch area in square feet - ScreenPorch: Screen porch area in square feet - PoolArea: Pool area in square feet - PoolQC: Pool quality - Fence: Fence quality - MiscFeature: Miscellaneous feature not covered in other 
categories - MiscVal: $Value of miscellaneous feature - MoSold: Month Sold - YrSold: Year Sold - SaleType: Type of sale - SaleCondition: Condition of sale Acknowledgments --------------- Using data from: [House Prices: Advanced Regression Techniques][1] 2 attributes corrected from the description: KitchenAbvGr and BedroomAbvGr [1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques I crawled some of the posts from r/mexico. This dataset considers both the text of the websites submitted to the subreddit as well as the comments posted about them. Context Gowalla is a location-based social networking website where users share their locations by checking-in. Content Time and location information of check-ins made by users. Acknowledgements This data set is available from https://snap.stanford.edu/data/loc-gowalla.html E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011. This is a datapacket that contains the input files for all analyses performed in this study as described in the publication. This includes input files for maximum parsimony, maximum likelihood and Bayesian inference analyses of mtSSU; maximum likelihood and Bayesian inference input files for a partitioned ITS dataset; and the distance table of ITS Jukes-Cantor sequence distances calculated in PAUP. PRIMAP-crf is a processed version of data reported by countries to the United Nations Framework Convention on Climate Change (UNFCCC) in the Common Reporting Format (CRF). The processing has three key aspects: 1) Data from individual countries and years are combined into one file. 2) Data is re-organised to follow the IPCC 2006 hierarchical categorisation. 3) ‘Baskets’ of gases are calculated according to different global warming potential estimates from each of the three most recent IPCC reports. 
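The basket calculation in step 3 can be sketched as a weighted sum: each gas's mass is multiplied by its 100-year global warming potential (GWP) and the products are added to give CO2-equivalent emissions. This is a minimal illustration, not the PRIMAP emissions module. The choice of the three reports (SAR, AR4, AR5), the restriction to three gases, the function name `basket_co2eq` and the emission figures are all assumptions made for the example.

```python
# 100-year GWP factors for three gases, as published in the IPCC Second (SAR),
# Fourth (AR4) and Fifth (AR5) Assessment Reports. Which reports PRIMAP-crf
# actually uses is an assumption here; check the dataset documentation.
GWP_100 = {
    "SAR": {"CO2": 1, "CH4": 21, "N2O": 310},
    "AR4": {"CO2": 1, "CH4": 25, "N2O": 298},
    "AR5": {"CO2": 1, "CH4": 28, "N2O": 265},
}


def basket_co2eq(emissions_kt, gwp_set):
    """Aggregate single-gas emissions (kt of each gas) into kt CO2-equivalent."""
    factors = GWP_100[gwp_set]
    return sum(mass * factors[gas] for gas, mass in emissions_kt.items())


# Hypothetical emissions for one country and year, in kt of each gas.
example = {"CO2": 1000.0, "CH4": 10.0, "N2O": 1.0}

for report in GWP_100:
    print(report, basket_co2eq(example, report))
```

The same single-gas data yields a different basket total for each GWP set, which is why a dataset of this kind would provide the basket once per set of factors.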
All Annex I Parties to the United Nations Framework Convention on Climate Change (UNFCCC) are required to report domestic emissions on an annual basis in a 'Common Reporting Format' (CRF). In 2015, the CRF data reporting was updated to follow the more recent 2006 guidelines from the IPCC and the structure of the reporting tables was modified accordingly. However, the hierarchical categorisation of data in the IPCC 2006 guidelines is not readily extracted from the reporting tables. We present the PRIMAP-crf data as a re-constructed hierarchical dataset according to the IPCC 2006 guidelines. Furthermore, the data is organised in a series of tables containing all available countries and years for each individual greenhouse gas and category reported. In addition to single gases, the Kyoto basket of greenhouse gases (CO2, N2O, CH4, HFCs, PFCs, SF6, and NF3) is provided according to multiple global warming potentials. The dataset was produced using the PRIMAP emissions module. Key processing steps include: extracting data from submitted CRF Excel spreadsheets, mapping CRF categories to IPCC 2006 categories, constructing missing categories from available data, and aggregating single gases to gas baskets. The processed data is available under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Donald Trump Tweets Context Being a fan of board games, I wanted to see if there was any correlation between a game's rating and any particular quality. The first step was to collect this data. Content The data was collected in March of 2017 from the website https://boardgamegeek.com/, which has an API to retrieve game information (though sadly XML, not JSON). Acknowledgements Mainly I want to thank the people who run the board game geek website for maintaining such a great resource for those of us in the hobby. Inspiration I wish I had some better questions to ask of the data; perhaps somebody else can think of some good ways to get some insight out of this dataset. 
Information This dataset is a copy of the [Detecting Insults in Social Commentary Challenge][1] to run on a kernel. [1]: https://www.kaggle.com/c/detecting-insults-in-social-commentary Context The data was originally collected from here: https://www.kaggle.com/c/painter-by-numbers but only train_2.zip was available on new Kernels. So the following datasets can be combined to hold all the needed data in 1 new Kernel: * [painter-test](https://www.kaggle.com/mfekadu/painter-test/) * [painters-train-part-1](https://www.kaggle.com/mfekadu/painters-train-part-1/) * [painters-train-part-2](https://www.kaggle.com/mfekadu/painters-train-part-2/) * [painters-train-part-3](https://www.kaggle.com/mfekadu/painters-train-part-3/) This is a messy way to get around Kaggle's 20 GB dataset limit, but you can [**_just fork this notebook_**](https://www.kaggle.com/mfekadu/painter-by-numbers-combined-dataset) to quickly get started. Content test.zip Acknowledgements Thank you [Kiri Nichol](https://www.kaggle.com/smallyellowduck) for collecting the data for the competition. Inspiration Do the pixels that represent a Picasso painting uniquely identify him? Content More details about each file are in the individual file descriptions. Context This is a dataset from the [U.S. Census Bureau](http://www.census.gov/) hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found [here](https://fred.stlouisfed.org/) and they update their information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau [organization page](https://www.kaggle.com/census)! * Update Frequency: This dataset is updated daily. Acknowledgements This dataset is maintained using FRED's [API](https://research.stlouisfed.org/docs/api/fred/) and Kaggle's [API](https://github.com/Kaggle/kaggle-api). 
Context For movie viewers, movie posters are one of the first impressions used to get cues about the movie content and its genre. Humans can grasp cues like color, expressions on the faces of actors etc. to quickly determine the genre (horror, comedy, animation etc). It has been shown that color characteristics of an image like hues, saturation, brightness, contour etc. affect human emotions. A given situation arouses these emotions in humans. If humans are able to predict the genre of a movie by a single glance at its poster, then we can assume that the color characteristics, local texture based features and structural cues of posters possess some characteristics which could be utilized in machine learning algorithms to predict its genre. Content The movie posters are obtained from the IMDB website. The collected dataset contains IMDB Id, IMDB Link, Title, IMDB Score, Genre and link to download movie posters. Each movie poster belongs to at least one genre and can have at most 3 genre labels assigned to it. As the dataset also includes the IMDB score, it would be really interesting to see if the movie poster is related to the rating. Acknowledgements The IMDB Ids for movies were obtained from MovieLens. The IMDB Link, Title, IMDB Score, Genre and link to download movie posters were obtained from the IMDB website. Inspiration Does color play an important role in deciding the genre of the movie? Can raw image pixels contain enough information to predict the genre of a movie? Does the number of faces in the poster say anything about the movie genre? What is the most frequent color used in horror movies? Which features are important to predict the animated movie genre? If a movie belongs to more than one genre, can we predict them all? Can we use movie posters only to predict movie rating? 
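Because a poster can carry between one and three genre labels, predicting "them all" is a multi-label problem, and a common first step is to turn each Genre entry into a multi-hot vector. The sketch below shows that encoding only. It assumes the Genre field is a single string with labels separated by `|`, which the description does not state; the separator, the sample rows and the helper names are hypothetical.

```python
def build_vocabulary(genre_strings, sep="|"):
    """Map every distinct genre label to a column index (alphabetical order)."""
    genres = sorted(
        {g.strip() for s in genre_strings for g in s.split(sep) if g.strip()}
    )
    return {genre: index for index, genre in enumerate(genres)}


def multi_hot(genre_string, vocab, sep="|"):
    """Encode one Genre entry as a 0/1 vector with one slot per known genre."""
    vector = [0] * len(vocab)
    for genre in genre_string.split(sep):
        genre = genre.strip()
        if genre:
            vector[vocab[genre]] = 1
    return vector


# Hypothetical Genre values; real rows would come from the dataset's CSV.
rows = ["Horror", "Animation|Comedy", "Comedy|Horror|Thriller"]
vocab = build_vocabulary(rows)
targets = [multi_hot(row, vocab) for row in rows]
print(vocab)
print(targets)
```

A classifier trained against such vectors can then output one probability per genre instead of a single class.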
This dataset contains data presented in the figures of the paper "Semivolatile POA and parameterized total combustion SOA in CMAQv5.2: impacts on source strength and partitioning" published in Atmospheric Chemistry and Physics. It also links to the data archive of field observations. Gene matrix from dataset Context Observations of particles much smaller than us, and various understandings of those particles, have propelled mankind forward in ways once impossible to imagine. "The elements" are what we call the sequential patterns in which some of these particles manifest themselves. As a chemistry student and a coder, I wanted to do what came naturally to me and make my class a bit easier by coding/automating my way around some of the tedious work involved with calculations. Unfortunately, it seems that chemistry-related datasets have not yet been conveniently formatted into downloadable databases (as far as my research went). I decided that the elements would be a good place to start data collection, so I did that, and I'd like to see if this is useful to others as well. Other related data sets I'd like to bring together are a large number of standard entropies and enthalpies of various compounds, and many of the data sets from the *CRC Handbook of Chemistry and Physics*. I also think as many diagrams as possible should be documented in a way that can be manipulated and read via code. Content Included here are three data sets. Each data set I have included is in three different formats (CSV, JSON, Excel), for a total of nine files. Table of the Elements: - This is the primary data set. 
- 118 elements in sequential order - 72 features Reactivity Series: - 33 rows (in order of reactivity - most reactive at the top) - 3 features (symbol, name, ion) Electromotive Potentials: - 284 rows (in order from most negative potential to most positive) - 3 features (oxidant, reductant, potential) Acknowledgements All of the data was scraped from 120 pages on Wikipedia using scripts. The links to those scripts are available in the dataset descriptions. Extra If you are interested in trying the chemistry calculations code I made for completing some of my repetitive class work, it's publicly available on [my GitHub][1]. ([Chemistry Calculations Repository][1]) I plan to continue updating that as time goes on. [1]: https://github.com/jwaitze/Chemistry-Calculations Context The Wikipedia dump is a giant XML file and contains loads of not-so-useful content. I needed some English text for some unsupervised learning so I spent quite a bit of time extracting and cleaning up the text. Content Each line of the txt file is a 'sentence'. I put sentence in quotes because the content of these files hasn't been read all the way through for errors. Here is what I did: - Parsed out the opening text on non-disambiguation and non-table-of-contents pages. - Removed sentences requiring citations, because these were usually poorly formed. - Parsed each block of text into sentences using SpaCy. I then checked for bracket and quote correctness, filtering out sentences that didn't quite match up. - Removed sentences shorter than 3 letters and longer than 255 characters. This covers 97% of the data. - Removed duplicate sentences and, as a byproduct, sorted them alphabetically. 
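The last three cleaning steps above (bracket and quote checks, the length filter, and de-duplication with sorting) can be sketched in plain Python. This is not the author's script: the function names are hypothetical, the quote check only counts straight double quotes, and sentence splitting with SpaCy is assumed to have happened already.

```python
CLOSING_TO_OPENING = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(text):
    """Return True if every bracket is closed in the right order."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSING_TO_OPENING:
            if not stack or stack.pop() != CLOSING_TO_OPENING[ch]:
                return False
    return not stack


def quotes_balanced(text):
    """Simplified check: an even number of straight double quotes."""
    return text.count('"') % 2 == 0


def clean_sentences(sentences, min_len=3, max_len=255):
    """Drop too-short, too-long and unbalanced sentences, then dedupe and sort."""
    kept = set()  # a set removes duplicates
    for sentence in sentences:
        sentence = sentence.strip()
        if not (min_len <= len(sentence) <= max_len):
            continue
        if brackets_balanced(sentence) and quotes_balanced(sentence):
            kept.add(sentence)
    return sorted(kept)  # sorting is the alphabetical byproduct


sample = [
    "A cat sat.",
    "A cat sat.",
    "Hi",
    "Broken (bracket.",
    'He said "hi.',
    "Zebra (animal) runs.",
]
print(clean_sentences(sample))
```

Collecting the survivors in a set and sorting once at the end is what makes alphabetical order fall out of the de-duplication step, as the description notes.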
Context US Airline passenger satisfaction survey Content "Satisfaction:Airline satisfaction level(Satisfaction, neutral or dissatisfaction)" Age:The actual age of the passengers Gender:Gender of the passengers (Female, Male) "Type of Travel:Purpose of the flight of the passengers (Personal Travel, Business Travel)" "Class:Travel class in the plane of the passengers (Business, Eco, Eco Plus)" Customer Type:The customer type (Loyal customer, disloyal customer) Flight distance:The flight distance of this journey "Inflight wifi service:Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)" Ease of Online booking:Satisfaction level of online booking Inflight service:Satisfaction level of inflight service Online boarding:Satisfaction level of online boarding Inflight entertainment:Satisfaction level of inflight entertainment Food and drink:Satisfaction level of Food and drink Seat comfort:Satisfaction level of Seat comfort On-board service:Satisfaction level of On-board service Leg room service:Satisfaction level of Leg room service Departure/Arrival time convenient:Satisfaction level of Departure/Arrival time convenient Baggage handling:Satisfaction level of baggage handling Gate location:Satisfaction level of Gate location Cleanliness:Satisfaction level of Cleanliness Check-in service:Satisfaction level of Check-in service Departure Delay in Minutes:Minutes delayed when departure Arrival Delay in Minutes:Minutes delayed when Arrival Flight cancelled:Whether the Flight cancelled or not (Yes, No) Flight time in minutes:Minutes of Flight takes **Tatoeba Sentences Corpus** This data is directly from the Tatoeba project: https://tatoeba.org/ It is a large collection of sentences in multiple languages. Many of the sentences are contained with translations in multiple languages. It is a valuable resource for Machine Translation and many Natural Language Processing projects. There are 150 normal and 134 nodule thyroid CT images in the dataset.zip. 
The image format includes PNG and DICOM. Taxa partition in nexus format for use in SVDq analysis implemented in PAUP 4a150. To be used in conjunction with RAD dataset in nexus format. The dataset authors have created a terrestrial water budget data archive on a 1-degree grid. They identified 13332 global stations with complete records and created 30-year (1950-1979) climatological means for each month of the year for air temperature, precipitation, evaporation, soil moisture, and snow cover. These monthly climatological means were then interpolated to the 1-degree gridpoints. This data set contains both the gridpoint and individual station data. Content Official addresses assigned in the City of Los Angeles created and maintained by the Bureau of Engineering. Context This is a dataset hosted by the city of Los Angeles. The organization has an open data platform found [here](https://data.lacity.org) and they update their information according to the amount of data that is brought in. Explore Los Angeles's Data using Kaggle and all of the data sources available through the city of Los Angeles [organization page](https://www.kaggle.com/cityofLA)! * Update Frequency: This dataset is updated daily. Acknowledgements This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public. [Cover photo](https://unsplash.com/photos/1mPBkYvbu3w) by [Timothy Eberly](https://unsplash.com/@timothyeberly) on [Unsplash](https://unsplash.com/) _Unsplash Images are distributed under a unique [Unsplash License](https://unsplash.com/license)._ References for all studies mentioned in the three datasets for: Johnston, A.S.A & Sibly, R.M. The influence of soil communities on the temperature sensitivity of soil respiration. 
Consumer complaints are added to this public database after the company has responded to the complaint, confirming a commercial relationship with the consumer, or after they've had the complaint for 15 calendar days, whichever comes first. We don’t verify all the facts alleged in complaints, but we do give companies the opportunity to publicly respond to complaints by selecting responses from a pre-populated list. Company-level information should be considered in the context of company size and/or market share. 0 present/absent calls. The average frequency of values of empty probesets generated by the MAS 5.0 present/absent algorithm when τ = 0.015 (solid black line) and when τ = 0 (dotted line). The average was taken over the six samples. The percentage of central nucleotides in PM probes for empty probesets with values < 0.06, for all empty probesets (similar percentages are present in all probesets), and for empty probesets with values > 0.94 are shown. values generated with the Wilcoxon signed rank test for random empty probesets. The PM-MM probe-pairs from empty probesets with fewer than six alignment errors to any transcript in the GoldenSpike dataset were randomly re-assembled into probesets based on the central nucleotide (for example, only central T nucleotides in the PM probes). Symbols and lines are colored according to the central nucleotide. Copyright information: Taken from "Correcting for sequence biases in present/absent calls", http://genomebiology.com/2007/8/6/R125, Genome Biology 2007;8(6):R125-R125. Published online 26 Jun 2007. PMCID: PMC2394774. MovieLens 1M dataset enriched with IMDB on movie attributes. Cars Data has information about 3 brands/makes of cars, namely US, Japan and Europe. The target of the data set is to find the brand of a car using parameters such as horsepower, cubic inches, make year, etc. A decision tree can be used to create a predictive data model to predict the car brand. 
Dataset Context I love football and wanted to gather a dataset of football players along with their per-game performance from various sources. Content The csv file has the Fantasy Premier League data of all players who played in 3 seasons, and a detailed spreadsheet of each player is provided. Acknowledgements Thanks to TURD from tableau for some of the data. Inspiration We have all wondered whether it is possible to predict the future! Well, with the player data against each team and conditions, we get to check whether such prediction is truly possible! Context The **Convention on International Trade in Endangered Species of Wild Fauna and Flora**, or **CITES** for short, is an international treaty organization tasked with monitoring, reporting, and providing recommendations on the international species trade. CITES is a division of the IUCN, which is one of the principal international organizations focused on wildlife conservation at large. It is not a part of the UN (though its reports are read closely by the UN). CITES is one of the oldest conservation organizations in existence. Participation in CITES is voluntary, but almost every member nation in the UN (and, therefore, almost every country worldwide) participates. Countries participating in CITES are obligated to report on roughly 5000 animal species and 29000 plant species brought into or exported out of their countries, and to honor limitations placed on the international trade of these species. Protected species are organized into three appendixes. Appendix I species are those whose trade threatens them with extinction. Two particularly famous examples of Appendix I species are the black rhinoceros and the African elephant, whose extremely valuable tusks are an alluring target for poachers exporting ivory abroad. There are 1200 such species. Appendix II species are those not threatened with extinction, but whose trade is nevertheless detrimental. 
Most species in CITES, around 21000 of them, are in Appendix II. Finally, Appendix III animals are those submitted to CITES by member states as a control mechanism. There are about 170 such species, and their export or import requires permits from the submitting member state(s). This dataset records all <i>legal</i> species imports and exports carried out in 2016 (and a few records from 2017) and reported to CITES. Species not on the CITES lists are not included; nor is the significant, and highly illegal, ongoing black market trading activity. Content This dataset contains records on every international import or export conducted with species from the CITES lists in 2016. It contains columns identifying the species, the import and export countries, and the amount and characteristics of the goods being traded (which range from live animals to skins and cadavers). For further details on individual rows and columns refer to the metadata on the `/data` tab. A much more detailed description of each of the fields is available in the [original CITES documentation](https://trade.cites.org/cites_trade_guidelines/en-CITES_Trade_Database_Guide.pdf). Acknowledgements This dataset was originally aggregated by CITES and made available online through [this downloader tool](https://trade.cites.org/en/cites_trade/). The CITES downloader goes back to 1975; however, it is only possible to download fully international data two years at a time (or so) due to limitations in the number of rows allowed by the data exporter. If you would like data going further back, check out the downloader. Be warned, though: this data takes a long time to generate! This data is prepared for CITES by UNEP, a division of the UN, and hence likely covered by the [UN Data License](http://data.un.org/Host.aspx?Content=UNdataUse). Inspiration * What is the geospatial distribution of the international plant/animal trade? 
* How much export/import activity is there for well-known species, like rhinos, elephants, etcetera? * What percent of the trade is live, as opposed to animal products (ivory, skins, cadavers, etcetera)? Context The [Sentiment Polarity Dataset Version 2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/) is created by Bo Pang and Lillian Lee. This dataset is redistributed with NLTK with permission from the authors. This corpus is also used in the [**Document Classification** section of Chapter 6.1.3 of the NLTK book](http://www.nltk.org/book/ch06.html). Content This dataset contains 1000 positive and 1000 negative processed reviews. Citation Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In ACL. Bibtex: @InProceedings{Pang+Lee:04a, author = {Bo Pang and Lillian Lee}, title = {A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts}, booktitle = "Proceedings of the ACL", year = 2004 } Dataset connected with article: 'Increasing the use of conceptually-derived strategies in arithmetic: using inversion problems to promote the use of associativity shortcuts.' Abstract: Conceptual knowledge of key principles underlying arithmetic is an important precursor to understanding algebra and later success in mathematics. One such principle is associativity, which allows individuals to solve problems in different ways by decomposing and recombining subexpressions (e.g. ‘a + b – c’ = ‘b – c + a’). More than any other principle, children and adults alike have difficulty understanding it, and educators have called for this to change. We report three intervention studies that were conducted in university classrooms to investigate whether adults’ use of associativity could be improved. In all three studies, it was found that those who first solved inversion problems (e.g. 
‘a + b – b’) were more likely than controls to then use associativity on ‘a + b – c’ problems. We suggest that ‘a + b – b’ inversion problems may either direct spatial attention to the location of ‘b – c’ on associativity problems, or implicitly communicate the validity and efficiency of a right-to-left strategy. These findings may be helpful for those designing brief activities that aim to aid the understanding of arithmetic principles and algebra. Number of papers per category for ten key-entropy concepts. The concepts were selected according to their frequency of appearance in all abstracts in our dataset. Interactive plot: https://plot.ly/~larckov/1992.embed Raw data for the 16/1 pollen surface sample dataset obtained from the Neotoma Paleoecological Database. Introduction A car company has the data for all the cars that are present in the market. They are planning to introduce some new ones of their own, but first, they want to find out what the popularity of the new cars in the market would be, based on each car's attributes. We provide a dataset of cars with the attributes of each car and its popularity. Your task is to train a model that can predict the popularity of new cars based on the given attributes. Dataset You are given a training dataset, train.csv. The file is a comma-separated file with useful information for this task: train.csv contains the information about a car along with its popularity level. Each row describes one car, with attributes such as buying_price, maintenance_cost, number_of_doors, number_of_seats, etc. The definition of each attribute is as follows: buying_price: The buying_price denotes the buying price of the car, and it ranges from [1...4], where buying_price equal to 1 represents the lowest price while buying_price equal to 4 represents the highest price. 
maintenance_cost: The maintenance_cost denotes the maintenance cost of the car, and it ranges from [1...4], where maintenance_cost equal to 1 represents the lowest cost while maintenance_cost equal to 4 represents the highest cost. number_of_doors: The number_of_doors denotes the number of doors in the car, and it ranges from [2...5], where each value of number_of_doors represents the number of doors in the car. number_of_seats: The number_of_seats denotes the number of seats in the car, and it consists of [2, 4, 5], where each value of number_of_seats represents the number of seats in the car. luggage_boot_size: The luggage_boot_size denotes the luggage boot size, and it ranges from [1...3], where luggage_boot_size equal to 1 represents the smallest luggage boot size while luggage_boot_size equal to 3 represents the largest luggage boot size. safety_rating: The safety_rating denotes the safety rating of the car, and it ranges from [1...3], where safety_rating equal to 1 represents low safety while safety_rating equal to 3 represents high safety. popularity: The popularity denotes the popularity of the car, and it ranges from [1...4], where popularity equal to 1 represents an unacceptable car, popularity equal to 2 represents an acceptable car, popularity equal to 3 represents a good car, and popularity equal to 4 represents the best car. We also provide a test set of cars along with the above attributes excluding popularity, in test.csv. The goal is to predict the popularity of the car based on its attributes. Drosophila Melanogaster ----------------------- Drosophila Melanogaster, the common fruit fly, is a model organism which has been extensively used in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology. When it's not being used for scientific research, *D. melanogaster* is a common pest in homes, restaurants, and anywhere else that serves food. 
They are not to be confused with Tephritidae flies (also known as fruit flies). https://en.wikipedia.org/wiki/Drosophila_melanogaster About the Genome ---------------- This genome was first sequenced in 2000. It contains four pairs of chromosomes (2,3,4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA. ![D. melanogaster chromosomes][1] The genome is maintained and frequently updated at [FlyBase][2]. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.htmlfruitfly Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file. Bioinformatics -------------- Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give basic explanations of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (i.e. Wikipedia) on the following topics may be helpful to understand the nuances of the data provided here: [Genetics][3], [Genomics][4] ([Sequencing][5]/[Genome Assembly][6]), [Chromosomes][7], [DNA][8], [RNA][9] ([mRNA][10]/[miRNA][11]), [Genes][12], [Alleles][13], [Exons][14], [Introns][15], [Transcription][16], [Translation][17], [Peptides][18], [Proteins][19], [Gene Regulation][20], [Mutation][21], [Phylogenetics][22], and [SNPs][23]. Of course, if you've got some idea of the basics already - don't be afraid to jump right in! Learning Bioinformatics ----------------------- There are a lot of great resources for learning bioinformatics on the web. 
One cool site is [Rosalind][24] - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see [Myles' solutions here][25] if you get stuck). We have set up [Biopython][26] on Kaggle's docker image, which is a great library to help you with your analyses. Check out their [tutorial here][27]; we've also created [a python notebook with some of the tutorial applied to this dataset][28] as a reference. Files in this Dataset --------------------- <hr> **Drosophila Melanogaster Genome** - genome.fa The assembled genome itself is presented here in [FASTA format][29]. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. <hr> **Meta Information** There are 3 additional files with meta information about the genome. - meta-cpg-island-ext-unmasked.csv This file contains descriptive information about CpG Islands in the genome. https://en.wikipedia.org/wiki/CpG_site - meta-cytoband.csv This file describes the positions of cytogenetic bands on each chromosome. https://en.wikipedia.org/wiki/Cytogenetics - meta-simple-repeat.csv This file describes simple tandem repeats in the genome. https://en.wikipedia.org/wiki/Repeated_sequence_(DNA) https://en.wikipedia.org/wiki/Tandem_repeat <hr> **Drosophila Melanogaster mRNA Sequences** Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as a transcriptome. mRNA files included in this dataset give insight into the activity of genes in the organism. 
https://en.wikipedia.org/wiki/Messenger_RNA - mrna-genbank.fa This file includes all mRNA sequences from GenBank associated with Drosophila Melanogaster. http://www.ncbi.nlm.nih.gov/genbank/ - mrna-refseq.fa This file includes all mRNA sequences from RefSeq associated with Drosophila Melanogaster. http://www.ncbi.nlm.nih.gov/refseq/ <hr> **Gene Predictions** A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This dataset includes a number of different gene prediction systems applied to the drosophila melanogaster genome. https://en.wikipedia.org/wiki/Gene_prediction - genes-augustus.csv AUGUSTUS is a piece of software that predicts genes ab initio using Hidden Markov Models. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC441517/ - genes-genscan.csv GENSCAN is an older ab initio software for predicting genes. http://genes.mit.edu/GENSCANinfo.html - genes-ensembl.csv - ensembl-gtp.csv - ensembl-pep.csv - ensembl-source.csv - ensembl-to-gene-name.csv Ensembl provides gene annotation generated by their software Genebuild. This process combines automatic annotation alongside manual curation. http://uswest.ensembl.org/info/genome/genebuild/genome_annotation.html We have also included some supplementary files for these, including predicted protein peptide sequences for each predicted gene. - genes-refseq.csv - genes-xeno-refseq.csv - refseq-link.csv - refseq-summary.csv We have included two RefSeq gene predictions in this dataset. The first is based solely on information from the drosophila melanogaster genome. The second (genes-xeno-refseq.csv) uses genes from other organisms as a basis for predicting genes in drosophila melanogaster. RefSeq RNAs were aligned against the D. melanogaster genome using blat; those with an alignment of less than 15% were discarded. 
When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. We have also included supplementary files for these which include information about the genes that have been identified. http://www.ncbi.nlm.nih.gov/refseq/ <hr> What can you do with this data? ------------------------------- Genomic data is the foundation of bioinformatics, and there is an incredible array of things you can do with this data. A good place to start is to look at some of the meta supplementary files alongside the genomic sequence itself. We have a number of different gene prediction systems in the dataset: how do they compare to each other? How do they compare to the mRNA data? Working back from the refseq-summary.csv file, you can look at genes that code for particular proteins - can you find these genes in the genome? How much of the genome codes for the mRNAs found in our mRNA data? Of the mRNAs we have, how many map to the predicted genes and the predicted peptide sequence data? How much of the mRNA seems to be protein-coding vs how much looks like it is miRNA? Can you find pre-mRNA or splice variants within the mRNA data? Does meta information like cytogenetic bands or CpG sites correspond with splice variants or a lack of mRNA altogether? Those are just some of many ideas that could get you started. Looking for Feedback -------------------- This is the first genomic dataset on Kaggle and we are looking for feedback from our community about how interesting this dataset is to them, or if there are ways we could improve it to better suit analysis. Please post suggestions for supplementary data, future genomes we could host, bioinformatics packages we should include on scripts, and any other feedback on the dataset forum. 
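As one possible starting point for the questions above, the sketch below measures how much of each sequence in a FASTA file is written in lower case, which in genome.fa marks repeats. It uses only the standard library and a tiny in-memory toy sequence; the names `read_fasta`, `soft_masked_fraction`, `chrToy1`, and `chrToy2` are illustrative assumptions, not part of the dataset or of Biopython.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def soft_masked_fraction(sequence):
    """Return the fraction of bases written in lower case (masked repeats)."""
    if not sequence:
        return 0.0
    lower = sum(1 for base in sequence if base.islower())
    return lower / len(sequence)


# Toy input standing in for genome.fa (hypothetical sequence names and bases).
toy = io.StringIO(">chrToy1\nACGTacgtAC\nGT\n>chrToy2\nacgt\n")
fractions = {name: soft_masked_fraction(seq) for name, seq in read_fasta(toy)}
print(fractions)
```

To try it on the real file, replace the `io.StringIO` object with `open("genome.fa")`; for a whole genome this simple version holds each chromosome in memory, which is usually acceptable for a genome of this size but is a design shortcut rather than a recommendation.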
[1]: https://upload.wikimedia.org/wikipedia/commons/1/1d/Drosophila-chromosome-diagram.jpg [2]: http://flybase.org [3]: https://en.wikipedia.org/wiki/Genetics [4]: https://en.wikipedia.org/wiki/Genomics [5]: https://en.wikipedia.org/wiki/Sequencing [6]: https://en.wikipedia.org/wiki/Sequence_assembly [7]: https://en.wikipedia.org/wiki/Chromosome [8]: https://en.wikipedia.org/wiki/DNA [9]: https://en.wikipedia.org/wiki/RNA [10]: https://en.wikipedia.org/wiki/Messenger_RNA [11]: https://en.wikipedia.org/wiki/MicroRNA [12]: https://en.wikipedia.org/wiki/Gene [13]: https://en.wikipedia.org/wiki/Allele [14]: https://en.wikipedia.org/wiki/Exon [15]: https://en.wikipedia.org/wiki/Intron [16]: https://en.wikipedia.org/wiki/Transcription_(genetics) [17]: https://en.wikipedia.org/wiki/Translation_(biology) [18]: https://en.wikipedia.org/wiki/Peptide [19]: https://en.wikipedia.org/wiki/Protein [20]: https://en.wikipedia.org/wiki/Regulation_of_gene_expression [21]: https://en.wikipedia.org/wiki/Mutation [22]: https://en.wikipedia.org/wiki/Phylogenetics [23]: https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism [24]: http://rosalind.info/problems/list-view/ [25]: https://www.kaggle.com/mylesoneill/d/mylesoneill/drosophila-melanogaster-genome/rosalind-problem-solutions [26]: http://biopython.org [27]: http://biopython.org/DIST/docs/tutorial/Tutorial.html [28]: https://www.kaggle.com/mylesoneill/d/mylesoneill/drosophila-melanogaster-genome/getting-started-with-biopython [29]: https://en.wikipedia.org/wiki/FASTA_format Context Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor\'s Pragmatic Chaos team. This is the dataset that was used in that competition. 
Content

**This comes directly from the README:**

TRAINING DATASET FILE DESCRIPTION
================================================================================

The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

CustomerID,Rating,Date

- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.

MOVIES FILE DESCRIPTION
================================================================================

Movie information in "movie_titles.txt" is in the following format:

MovieID,YearOfRelease,Title

- MovieID does not correspond to actual Netflix movie ids or IMDB movie ids.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
- Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION
================================================================================

The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22

For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset based on the information in the training dataset.
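The training files and "qualifying.txt" share the same block layout: a movie id followed by a colon, then one comma-separated record per line. A minimal sketch of a reader for that layout follows; the function name `parse_movie_blocks` and the sample ids and dates are made up for illustration and do not come from the README.

```python
import io


def parse_movie_blocks(handle):
    """Group comma-separated records under the 'MovieID:' line preceding them.

    Returns a dict mapping each integer movie id to a list of tuples of
    strings, one tuple per record line, in file order.
    """
    blocks = {}
    current = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line such as "7:" opens the block for movie 7.
            current = int(line[:-1])
            blocks[current] = []
        else:
            if current is None:
                raise ValueError("record found before any 'MovieID:' line")
            blocks[current].append(tuple(line.split(",")))
    return blocks


# Hypothetical sample in the qualifying layout (customer id, rating date).
sample = io.StringIO("7:\n101,2005-01-02\n102,2005-03-04\n9:\n103,2005-05-06\n")
qualifying = parse_movie_blocks(sample)
print(qualifying)
```

Fields are kept as strings so the same function can read training records (CustomerID,Rating,Date) as well; converting ratings to integers or dates to `datetime.date` is left to the caller.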
The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

For example, if the qualifying dataset looked like:

111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07

then a prediction file should look something like:

111:
3.0
3.4
4.0
225:
1.0
2.0

which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

You must make predictions for all customers for all movies in the qualifying dataset.

THE PROBE DATASET FILE DESCRIPTION
================================================================================

To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22

Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faqprobe for that value.

Acknowledgements

The training data came in 17,000+ files.
In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt. The contest was originally hosted at http://netflixprize.com/index.html. The dataset was downloaded from [https://archive.org/download/nf_prize_dataset.tar][1] Inspiration This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos [here][2] [1]: https://archive.org/download/nf_prize_dataset.tar [2]: http://netflixprize.com/community/topic_1537.html Context http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 ![](https://i.imgur.com/jZqpV51.png) Figure S6. Illustrative Examples of Chest X-Rays in Patients with Pneumonia, Related to Figure 6 The normal chest X-ray (left panel) depicts clear lungs without any areas of abnormal opacification in the image. Bacterial pneumonia (middle) typically exhibits a focal lobar consolidation, in this case in the right upper lobe (white arrows), whereas viral pneumonia (right) manifests with a more diffuse “interstitial” pattern in both lungs. http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 Content The dataset is organized into 3 folders (train, test, val) and contains subfolders for each image category (Pneumonia/Normal). There are 5,863 X-Ray images (JPEG) and 2 categories (Pneumonia/Normal). Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years from Guangzhou Women and Children’s Medical Center, Guangzhou. All chest X-ray imaging was performed as part of patients’ routine clinical care. For the analysis of chest X-ray images, all chest radiographs were initially screened for quality control by removing all low-quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. In order to account for any grading errors, the evaluation set was also checked by a third expert. 
Acknowledgements Data: https://data.mendeley.com/datasets/rscbjbr9sj/2 License: [CC BY 4.0][1] Citation: http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5 ![enter image description here][2] Inspiration Automated methods to detect and classify human diseases from medical images. [1]: https://creativecommons.org/licenses/by/4.0/ [2]: https://i.imgur.com/8AUJkin.png Arabic Handwritten Digits Dataset Abstract In recent years, handwritten digit recognition has been an important area due to its applications in several fields. This work focuses on the recognition of handwritten Arabic digits, which faces several challenges, including the unlimited variation in human handwriting and the large public databases. The paper provides a deep learning technique that can be effectively applied to recognizing Arabic handwritten digits. LeNet-5, a Convolutional Neural Network (CNN), was trained and tested on the MADBase database (Arabic handwritten digit images), which contains 60000 training and 10000 testing images. A comparison is made amongst the results, and it is shown that the use of a CNN led to significant improvements across different machine-learning classification algorithms. Moreover, the CNN gives an average recognition accuracy of 99.15%. Context The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten digit recognition. In recent years, Arabic handwritten digit recognition has had to cope with different handwriting styles as well, making it important to find and work on a new and advanced solution for handwriting recognition. A deep learning system needs a large amount of data (images) to be able to make good decisions. 
Content The MADBase is modified Arabic handwritten digits database contains 60,000 training images, and 10,000 test images. MADBase were written by 700 writers. Each writer wrote each digit (from 0 -9) ten times. To ensure including different writing styles, the database was gathered from different institutions: Colleges of Engineering and Law, School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. MADBase is available for free and can be downloaded from (http://datacenter.aucegypt.edu/shazeem/) . Acknowledgements **CNN for Handwritten Arabic Digits Recognition Based on LeNet-5** http://link.springer.com/chapter/10.1007/978-3-319-48308-5_54 Ahmed El-Sawy, Hazem El-Bakry, **Mohamed Loey** Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016 Volume 533 of the series Advances in Intelligent Systems and Computing pp 566-575 Inspiration Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, dots number and position. Some characters have different shapes while written in the same position. For example the teh character has different shapes in isolated position. Arabic Handwritten Characters Dataset https://www.kaggle.com/mloey1/ahcd1 Benha University http://bu.edu.eg/staff/mloey https://mloey.github.io/ free online digital library that anyone can improve; Wikimedia project US non-profit organization online database project freely editable world geographic database free knowledge database project hosted by the Wikimedia Foundation and edited by volunteers federal list of historic sites in the United States place listed by the UNESCO as of special cultural or physical significance online music metadata database free database of the National Medical Library of the United States volunteer effort to digitize and archive books Online database for peptidases. 
organization providing information on French cinema online music database online database of Broadway theatre productions and their personnel inventory of the global conservation status of biological species international authority file for personal names, subject headings and corporate bodies Internet database of films, and movie professionals (actors, directors, screenwriters etc.) online database of taxa American website that collects review scores from both offline and online sources to give an average rating premier British dictionary of the English language International organization international authority file online database of burials online dictionary of medical eponyms annual list compiled and published by Fortune magazine collaborative project intended to create an encyclopedia documenting all living species known to science database arm of the US National Library of Medicine authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world Japanese digital library American review aggregation website for film and television database in the field of organic chemistry treaty An ontology for describing the function of genes and gene products. 
regional database of daily updated census information geographical database classification of membrane proteins including ion channels website that aggregates reviews of music albums, games, movies, TV shows, DVDs, and formerly, books online multilingual dictionary online database digital library, online database and large-scale digitization project for biodiversity literature online database with abstracts of medical articles, hosted by US National Library of Medicine social webradio collaborative compilation of information about the world's time zones multilingual open-content collaborative map controlled vocabulary for the purpose of indexing journal articles and books in the life sciences service from Google bibliographic database for economics global partnership of conservation organisations that strives to conserve birds Electronic index of zoological literature English website about anime, manga and Japanese culture website that tracks box office revenue repository of scholarly manuscripts that are free to read digital collection of European cultural heritage is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources web service providing access to resources of national libraries across Europe catalog of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship French digital library German bibliography project (17th century prints) company printed and online English dictionary International Architecture Database New Testament books art market company french video game website video games news and reviews website database of compact disc track listings many data series in political science research controlled vocabulary covering all areas of interest of the Food and Agriculture Organization of the United Nations (such as food, nutrition, agriculture, fisheries, forestry, environment etc.) 
published by FAO and edited by a community of experts the Swiss Myocardial Infarction Registry bibliographic database for marine science topics online database of DOS video games online clinical medical knowledge base computer database of medieval Latin abbreviations community-powered dictionary of slang terms digital library database for excellent scientists in Germany Archive of Amiga-related software and files bibliographic database stock photo licensing company. non-profit repository of high-quality, high-value media of endangered species library catalog hierarchical database that stores configuration settings and options on Microsoft Windows operating systems United States independent agency international digital library operated by UNESCO and the United States Library of Congress national library in Japan database of information about movie stars, movies and television shows Korean film database astronomical database online project collecting example sentences website and database about audio recordings collaborative site for sharing musical scores project for the creation of a virtual library of public domain music scores collection, text corpus of ancient Egyptian funerary spells written on coffins online resource online information service produced by the United States National Library of Medicine database collection of protected architectural creations in the United Kingdom ontology online dictionary online genealogy platform with web, mobile, and software products and services web-based database of marine species controlled vocabulary used for describing items of art, architecture, and material culture digital library theoretical and practical tool for information integration in the field of cultural heritage archaeological database vocabulary terms that can be used to describe web resources database of scientific plant names digital library European fingerprint database for identifying asylum seekers and irregular border-crossers. 
genealogy website astronomical database picture library in Dresden, Germany іndex of chemicals biomolecular database internet portal for art history research and teaching Shut down digital library social cataloging web application Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory commercial scientific social network database of toxicity information national Dutch documentation center of art history international open access database of protein and nucleic acid structures national heritage register of Australia library Mexico's principal government institution in charge of statistics and census data online resource for fossil animals, plants, and microorganisms central database maintained by the French Ministry of Culture web-based database for the academic genealogy of mathematicians geographical database available and accessible through various web services, operated by Unxos GmbH collaborative database of audio snippets, samples, recordings, bleeps global species database of fish species dictionary of biographies of Canadian people published in both English and French free pornographic video sharing website international union library catalog library Russian Classification on Objects of Administrative Division organization German image database of art and architecture online database of open access digital repositories throughout the world website providing baseball statistics Free online index to biographical reference works in the German language area electronic library of the University of Szeged online database of board games, game designers and game publishers worldwide DNA sequence database database and ontology of molecular entities focused on small chemical compounds Hungarian digital library database of protein sequence and functional information streaming media system biological database database of plant names UNESCO publication of endangered languages digital cultural archive 
initiative that publishes free electronic versions of books significant to the culture and history of the Nordic countries pornographic video sharing website website on metal bands Repository and publisher for data from earth system research (georeferenced) web service that provides a searchable database of translations for a number of language pairs library data set of American English in 1961 global database of shark attacks service on internet United States government designation for food additives Israel's principal government institution in charge of statistics and census data project about 16th-century authors and publishers, run by Italy's Istituto Centrale per il Catalogo Unico online architecture database charity assessment organization that evaluates charitable organizations in the United States former academic search engine audiobook library genealogy website database bioinformatics and cheminformatics database from the University of Alberta organization for baseball history website and weblabel database of language struct Internate database of the Swedish Film Institute German online database about actors, films, TV series, video and advertising productions Esperanto dictionary periodical literature digital library social music cataloging, rating and reviewing website knowledge base and artificial intelligence project website digital library of works mostly in Hungarian dictionary database of geographical objects music streaming and recommendation website Wiki-based lyric database online project for book data of the Internet Archive online subscription index of citations an electronic archive of German-language text corpora of written language with over 42 billion words german text corpus computer science bibliography website hosted in Germany public Internet library catalog in Germany curated list of peer-reviewed Open-Access journals Internet Database of Diplomatic Documents of Switzerland Virtual library ancient catalogue of the Library of Alexandria 
Daily fixing of renminbi rates pornographic video sharing site Swiss digital library for antique works bibliographic database German National Library of Science and Technology (TIB) document delivery service on-line database about opera governmental database used by European countries description of triangle related points bibliographic database gene sequence database website library catalog Processing Network database of famous people astronomical database Semantic Web ontology to describe relations between people central service unit of the MPG ontology for the domain of human anatomy German periodical about films subscription digital library catalog of incunabula Register of organizations managed by some countries for statistical purposes book, website, database online knowledge base (2007-2016) citizen science project virtual art gallery for European fine art created before 1900 digital library Digital library of Bibliothèque Nationale de France film and photography museum and archive in Rochester, New York, United States organization association promoting French cinema abroad thesaurus of geographic names by the Getty Research Institute inter-institutional terminology database of the European Union German bibliography project (16th century prints) chemical database aggregator of scientific data on biodiversity; data portal zoological reference book medical bibliographical database volunteer-run library of free content sheet music catalogue of grapevine genetic resources biographical reference work in Norwegian geospatial extension for the PostgreSQL Database Virtual specialist library for german language studies online database collecting taxonomic information on all living reptile species free resource offering access to experimental data characterizing antibody and T cell epitopes involved in infectious disease, allergy, autoimmunity, and transplant. 
International Bibliography for Theology and Religious Studies Astronomy objects catalog digital library largest assembly of data on the world's terrestrial and marine protected areas German bibliography project database management system developed by Stanford University database providing an authoritative source of bibliographic information dictionary of the Portuguese language Swedish national union catalogue Online database multi-volume series covering all bird species Free access Quebec digital library non-profit organisation in the USA international project to index all formal (scientific) names in the kingdom of Fungi Internet Archive free, online database and bioinformatics resource German-speaking social-media-driven movie community US Department of Education data collector and publisher library catalog on the web image database in the Netherlands online dictionary German computerized civil registry Polish language film database Dutch photograph press agency record book of the Stationers' Company of London database of biological pathways online bibliographic database later known as Web of Science database of chemical structures online database of biochemical reactions heritage register of Victoria, Australia Dutch database of species W3C recommendation designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or other structured controlled vocabularies database of American death records Internet project providing information about the diversity and phylogeny of life database of chemicals owned by the Royal Society of Chemistry; see P661 oldest electronic library in the Russian Internet segment catalog of all Swiss libraries controlled, relational vocabularies of terms for the domain of systems biology Russian website about cinematography EC OJ bulletin of public procurement tenders Berlin animal voice archive online database of energy-efficient appliances schema for describing posts and interactions on forums, 
message boards, blogs etc. website website about Dutch language and Dutch literature Database on private international law thesaurus of artists and people by the Getty Research Institute website documenting the known species of ants Online database of the Max Planck Institute for the History of Science Virtual specialist library website software tools for digital library collections online database for fungi database on tropical plants, mainly the ecozone Neotropica system used by the libraries of French universities and higher education establishments to identify, track and manage the documents in their possession The manually curated portion of the UniProt database of protein sequence information. movie recommendation website database Mixed martial arts website database of ZX Spectrum video games biological database; expands official version of the Enzyme Nomenclature system website database about animals in Europe social networking website for academics defunct American news photo agency database of scientific names for algae library resource portal in France amphibian database online database of animal natural history, distribution, classification, and conservation biology art historical database in Belgium French national database under the Ministry of Ecology Database for potato varieties interactive website accessing anatomical data US Department of Education online repository organization US American digital public library project of the Unicode Consortium to provide locale data in XML format for use in computer applications database of citations about engineering online bibliographic database created by Universidad de La Rioja, Spain General dictionary of Basque and its corpus digital library platform online database containing standardized peer-reviewed articles that describe specific heritable diseases US SEC computer system public site promoting maritime safety and quality genealogical organization and website Automated National File of Genetic Prints 
set of works about the flora of Australia French-language terminology database bibliographic database database run by a multi-organisation initiative database developed by the IUCN listing information about taxa which are deemed invasive in various countries and regions of the world digital library French open access repository indexing database non-profit organisation in the USA digital archive created by the New York Public Library open-access digital library from United States open wiki to catalog food, nutrition facts and ingredients scientific database 3D encyclopedia of proteins and other molecules media annotation website online interpretation specialist company digital library supported by National and University library of Slovenia federal list of historic sites of Canada Database for the model organism Saccharomyces cerevisiae project to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond an online Scots-English dictionary bibliographic database of nuclear science and technology French national database of all companies French botanical community U.S. government database The Arabidopsis Information Resource (TAIR) collects information and maintains a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant. single window portal to integrate the digital repositories of India, sponsored by NME-ICT, MHRD, Govt. 
of India research center Czech and Slovak web project providing a movie database taxonomic database biological database database containing information about non-coding RNA (ncRNA) families and other structured RNA elements Italian-language anime, manga, and Japanese drama database website chemical database online public access catalog of the Library of Congress pornographic website internet-based database of comic book information digital library digital library online library Belgian foundation astronomy database star catalog English-language anime and manga database website web archive of Portugal Russian website ontology for descriptive linguistics Russian and English online dictionary corpus of the Russian language that has been partially accessible through a query interface online since 2004 the process of determining correspondences between concepts speech audio files and text transcriptions website online dictionary Compact disc database Russian online library Current Research Information System in Norway (CRIStin) supplier of library and information data for all the Norwegian university and college libraries Norwegian movie portal database serves as the catalog and index for the collections of the United States National Agricultural Library bibliographic database for topics related to religion system that measures occasional harms from medications to ascertain whether the risk-benefit ratio is high enough online genomic database database of allele frequencies in the human genome annual designation by the U.S. National Trust for Historic Preservation of 11 sites amateur ornithological association, founded 1968 text corpus of American English genomic database digital library digital library of Modern Greek studies animal transcription factor database. 
catalogue of published genome size estimates for various animal species digital repository of marine science information and images organization research database online dictionary produced by Oxford University Press online database containing historical information on the performing arts in Australia online database about Australian literature computational comparative linguistics program A system designed to collect, analyze, and respond to voluntarily submitted aviation safety incident reports bibliographic database for life sciences Binary subcomplexes in proteins database multilingual semantic network and encyclopedic dictionary E books. identifier issued by the European and Mediterranean Plant Protection Organization (EPPO), to uniquely identify plants, pests and pathogens that are important to agriculture online reference resource Open-access digital library of Spanish-language texts database of telephone calls maintained by the United States National Security Agency database for protein and small molecule interactions biological database biological database database of biological reactions bibliographic database provides basic biographical information on all past and present United States federal court Article III judges Australian ornithological conservation organization bitter compounds digital library bibliographic database covering humanities literature published in English database of protein fragments database of insects and arthropods bibliographic database platform produced by CABI database on evolutionary relationships of protein domains database Database of somatic cancer mutations library archival institution organization catalog of cultural heritage in the State of California, United States digital library government biodiversity commission in Mexico biographical reference work online catalogue for books digital library of resources related to the history of agriculture in the United States is a more than 560-million-word corpus of American 
English online database of contemporary and historical documents relating to Irish history and culture one of the official Digital Object Identifier Registration Agencies of the International DOI Foundation international digital library project aimed at putting text and images of recovered cuneiform tablets online project to digitize phonograph cylinders UN livestock genetics programme RDF/OWL schema for describing software projects database for virologists Database of Interacting Proteins computer database file American medical research facility database produced by the National Institutes of Health (NIH), National Library of Medicine (NLM), and the NIH Office of Dietary Supplements (ODS) aggregator for New Zealand digitised content digital library of comic books preservation project Georgia's state-wide cultural heritage digitization initiative Database of Protein Disorder formal ontology of human disease digital library gene and protein interaction database for Drosophila melanogaster digital library online database of bird observations database digital theatre archive based at the University of East London, in London, England, put offline in 2018 a scientific database for the bacterium Escherichia coli K-12 MG1655 database of biological information database of published works Turkish social networking service is one of the biggest available parallel corpora involving the Arabic language scientific project set of documents of the proceedings of the European Parliament from 1996 to the present database of biomedical research Online database from the EBI on Nucleotides exoplanet and star catalogue funded by NASA image-archiving system formal ontology Brazilian film and TV social network online database of Western Australian flora multi-volume book and online database filamentous fungi database of administrative boundaries database High-resolution shoreline data set online bioinformatics database dictionary of location and spelling of geographical names in Australia 
human gene databases wiki-based collection of information related to human genes collection of interconnected applications and databases that biologists use as repositories and as tools biological database hosted by NCBI bibliographic database of scientific literature in the geosciences American psychologic bibliographic database Georgia's virtual library and an initiative of the Board of Regents of the University System of Georgia database of whole genome sequencing data of microorganisms Comprehensive annotation resource for human genes and transcripts Database at Stanford University that tracks 93 common mutations of HIV database query language database about hazardous chemical substances digital, user-generated archive of historical photos, videos, audio recordings and personal recollections Digital library database of human metabolites online database of proteins research database focused on computer science, electrical engineering, electronics, and allied fields integrated genomic resources of human cell lines for identification open database for high energy physics research online database on the history of science information about elements of a schema in a database management system linguistics database A no longer updated database covering information about the proteomes of humans, mice and other animals US national index of serious criminal histories online, open access reference work covering recognition, biology, distribution, impact, and management of invasive plants and animals virtual library providing access to academic literature for Iraqi universities and related institutions Functionally related proteins across PPI networks Guatemala's civil registry chemical database of bioactive molecules with drug-like properties Spanish electronic magazine biological database heritage institution Database British database of the Ministry of Labor and Pensions is a million-word collection of British English texts which was compiled in the 1970s Regional 
Cooperative Online Information System for Scholarly Journals from Latin America, the Caribbean, Spain and Portugal information system about researchers and institutions in Brazil online database that provides accurate names and info for prokaryotes according to ICNB on-line bibliographic database in medicine and health sciences astronomical catalogue standardised patent data corpus available for research purposes Online database Microsatellites database database digital library of primary sources about 19th-c America Database for putative Transcription Factor Binding Sites online database online library catalog of the University of California website displaying song lyrics biological database of microRNA sequences and annotations database on microRNAs and their targets mimotype database Database of comparative protein structure models, calculated by the modeling pipeline ModPipe online biological database free online biological database series of obituaries/biographies of fellows of the Royal College of Physicians collection, database of virtual musical scores representing the logical content of the standard classical repertory from 1690 to 1890 music database online music lyrics database Nucleic acid phylogenetic profiling epigenomics database methylation data derived from next-generation sequencing data biological database database for uniquely identifying all the points of access to public transport in the UK entity in UK database on all bridges and tunnels in the United States text corpus includes classics of Polish literature, daily and specialist press, conversation recordings, ephemera and Internet texts U.S. 
drivers database topographic data organization public domain geographic data collection human protein knowledge bioinformatics resource Digital library website of Catholic works digital library of New Zealand and Pacific Island texts and materials chess test suite Nynorsk dictionary freely accessible online library digital library and database database on corporate entities under share-alike, open licence online language and dictionary service that gives you access to digital dictionaries from your computer, tablet or mobile phone. DNA Replication Origin Database catalogue of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria database of orthologous genes across multiple species is a text corpus of 21st century English comprehensive database on proteins information system designed to support the biomedical research community’s work on bacterial infectious diseases Presaging Critical Residues in Protein interfaces - DataBase pictorial database of 3D structures in the Protein Data Bank Pathogen Host Interaction database organization digitizing the cultural heritage of the Punjab region of India and Pakistan database personal Disposable Income dictionary of the Sumerian language online directory of philosophy works phosphorylation site database Database of 3D structures of phosphorylation sites derived from Phospho.ELM public database for catalogs of gene phylogenies distributed data system that NASA uses to archive data collected by Solar System missions Plant Proteome Database database resource that links plant traits to genomics data. 
organization promoting the development of persistent and openly accessible digital taxonomic literature database of digitized books from the Early Modern period Database of experimentally verified glycosites and glycoproteins of the prokaryotes Database of protein repeats non-profit, non-partisan research organization in the US database of circular dichroism and synchrotron radiation biological database Database of pseudogenes annotations compiled from various sources Canadian national digital repository digital library quilt documentation project hub and resource center for materials relevant to quilt study and primary source research Database of resources for systems biology of DNA damage and repair database of RNA-binding proteins registry of open access policies database for DNA restriction enzymes bibliographic database and digital library of open access journals; funded by Universidad Autónoma del Estado de México Maintains photographic and digital data as well as mission documentation and cartographic data. Each facility's general holding contains images and maps of planets and their satellites taken by solar system exploration spacecraft. Open to the public. 
heritage register that listed natural and cultural heritage places in Australia that was closed in 2007 normalized dictionary for drugs and drug formulations from the National Library of Medicine, part of UMLS database on aging bibliographic database of open access journals website about the history of British film, television and social history global online database of information about marine life Online database ontology of sequence features used in biological sequence annotation website American music library of the Eastman School of Music, Rochester, NY open access eLibrary; part of Elsevier computational biology database database of artworks of musée du Louvre biological database database of artificially engineered genes Database in the United Kingdom Norwegian dictionary valuation for ecclesiastical taxation of English, Welsh, and Irish parish churches and prebends national collaborative project massive collection of digital media from before 2000 online database of metabolic pathways knowledgebase of protein termini online database of compounds toxic to human Longterm biobank study of 500,000 people ontology a database of scientific publications vaccine database online database of viral genomes and bioinformatics tools list of historic properties in the Commonwealth of Virginia, United States Machine-readable description of an RDF data set website cataloging sampling in music database of biological pathways on-line publisher of sheet music 2006-2013 Academic project focused on pre-20th-century English language women writers, their writings, and the reception of their work project in Kew Gardens covered bridge numbering system citizen science ornithology project knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken database of chemical compounds open access website, official ICZN taxonomic registry subscription-based software as a service (SaaS) company based in Vancouver, Washington online reference (authoritative) 
database related to theater performances in Poland multilingual online dictionary public service system to provide resources sharing among academic libraries in China Brazilian film site is a linguistic corpus of Latin texts from ancient Galicia database of Brazilian comics and artists Internet database of movie scripts Czech database for medicines association football (soccer) website Gene Disease Database website showcasing sign languages worldwide database indexing the audiovisual collections at the National Library of Sweden Scholarly and Academic Information Navigator, Citation Information by NII (National Institute of Informatics) online database of companies and start-ups Event log on Windows NT systems. Chinese website electronic journal platform run by the Japan Science and Technology Agency Database system by TDC Tedecy Software Engineering Japanese online cataloging system online dictionary library Japanese government biometric database Polish digital library Repository of data about research herbarium, most importantly standardized codes. 
government database administered by the Polish Police heritage institution the National Digital Archives are one of three central archives of the state archive network in Poland union database showing information on the holdings of Polish research and academic libraries digital library of Christian Greek and Latin texts digital newspaper collection, part of Digital materials of the National Library of Finland Finnish register of social welfare and healthcare professionals archives, part of Tampere University Database of Norway protected areas A registry for patients in Norway free online cancer encyclopedia system for recording and storing data about inhabitants Czech digital library Database, used in mobile radio systems collection of 7667 Pathway/Genome Databases (PGDBs) Terminological data base for Basque language data base of the academic production in Basque language library Genbank of different varieties of fruit trees and shrubs Danish film portal online statistics database from Statistics Denmark online taxonomic encyclopaedia general-purpose multilingual Esperanto dictionary for the Internet digital library of Hebrew religious Books & periodicals Index of Articles on Jewish Studies South Korean digital library and repository digital table of contents of Hungarian scientific and technical journals monumental fifty-volume series of primary sources for the study of Byzantine history online English-Irish dictionary stratigraphic database of the Netherlands browsable database to view authority headings for subject, name, title and name/title combinations Czech digital library set of written texts in electronic form in the Czech language biographical database of Chinese people lithostratigraphic database of Germany Online athletics database database of French museums curated classification and nomenclature for all of the organisms in the public sequence databases Dutch photo agency free and open-source multi-model NoSQL database developed by Apple digital map 
database operated by the United States Geological Survey database for Canadian geographic features database of RNAs non-commercial online mineralogical database mineral database mineral database mineral database database of members of the French Lower houses since 1789 maintained by the French National Assembly Archive of research findings on subjective appreciation of life. wiki which hosts texts and images that are in the public domain according to New Zealand copyright law. Similar to Wikisource and Wikimedia Commons website with information about viruses movie database for German movies journal a corpus management system in the program area Oral Corpora of the Institute for German Language German database on hazardous substances heritage register for objects of cultural heritage in Poland film database online repository of royalty-free music archive of texts and translations of art songs and choral works ontology integrated web resource focused on maintaining a comprehensive database of broadly neutralizing HIV-1 antibodies Johns Hopkins University civil registry of the Netherlands Basque digital library heritage institution pornographic video hosting service Document and reference the Belgian public official legal status of Belgian Enterprises or organisations from 1950´s internet database of visual novels Russian scientific digital library, the largest legal research and educational resource of the Russian segment of the Internet text corpus of Russian online texts, created in 2001-2012, allows to search with combining character patterns, morphological and syntactic features annual list of the top 100 companies in Ghana register of buildings protected against demolition, extension, or alteration without special permission data format for metadata about datasets Australian government biometric database corpus of medieval English literature sequence database of DNA barcoding non-profit academic service provider; the only complete listing of all medieval and 
early modern manuscripts of European polyphonic music initiative to coordinate the development of the community standards and formats for computational models in systems biology Open-source software citizen science project and website collection of over 30,000 titles of American popular music spanning from the late 18th century to the early 20th century digitised newspaper collection online database on architects collection of 29,000 pieces of American popular music spanning the years of 1780 to 1980 publications database of the U.S. National Technical Information Service US online database of music video information statutory list of heritage places in Queensland, Australia Australian government register research output digital online sharing platform database that collects the expression patterns of Drosophila melanogaster in embryogenesis former database of schools of public health and educational institutions organization not-for-profit open access digital library featuring research resources on public interest issues from Australia, New Zealand and international sources An ontology for human phenotypes in hereditary and non-hereditary diseases. 
data model for bibliographic description US-based media company integrative and comprehensive publicly available database and analysis resource to search, analyze, visualize, save and share data for influenza virus research database of handwritten digits website which tracks film box office revenue system for validating protein structures NMR spectroscopy database of carefully corrected or re-referenced chemical shifts public archive of tree ring data Database of transcription factor binding sites open-access biological database pharmacology database interactive and machine access to commonly used ontologies, controlled vocabularies, and other lists for bibliographic description full-text database of texts from Czech media database inter-governmental demographic data sharing system in the United States predictor film digital library bibliographic database of the world's lesser-known languages, maintained at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany. organization collection of information about Australian Aboriginal and Torres Strait Islander languages Spanish database of journal articles website; online music database of hymns and hymn authors Post-1970 terrorist incident database by the University of Maryland, College Park open database Databases of ribosomal RNAs online biological database protein structure validation web server archive of biodiversity-related scientific papers library online library database aggregator; hosted by the National Library of Australia official wiki gathering mapping guidelines and explaining recommended tag usage for the OpenStreetMap project tools that make open science practice easier national register that contains basic information about Finnish citizens and foreign citizens residing permanently in Finland multi-institutional repository, maintained by National Library of Finland online artist database online service that provides access to materials from Finnish museums, libraries and archives 
database of scientific information in Poland database of wild California plants Dutch open data website Online database of pipe organs in the United Kingdom online publication of aircraft accident details sports statistics website Database about Ice hockey community contributed taxonomic checklist of all vascular plants of Canada, Saint Pierre and Miquelon, and Greenland database Germplasm Resources Information Network Database about Car Racing Turkish Cinema Archive Database digital library Netherlands database of grasses Digital library of works of J. S. Bach and his family online genealogy database shared library cataloguing network in Australia authority file for persons, organisations, works, topics, and geographic places, of the French National Library A guide produced by NIOSH about hazardous chemicals database on the relationships between human variations and phenotypes online database of video games Chinese-language anime, manga, and games database website website about conifers cooperative repository of open access Catalan journals Swedish taxonomy database open access repository run by CERN Gene database dataset of shipyards Medieval imagery dataset of ships Chemical toxicity database by the U.S.A. 
Environmental Protection Agency database of the birds of the world database of mathematical software free, collaborative database about films species-metabolite relationship database database with lipid structures A database by the Department of Veterans Affairs maintaining FDA approved drug concepts and their interactions database not-for-profit digital archive website on European royal families web-based database for the academic genealogy Dutch database international research consortium database of American birth, death records website on Victorian MPs since 1851 directory digital library with topics related to Qatar online database of the Austrian Parliament database managed by the INHA from France search engine for scientific articles and books Canadiana website portal of papyrological and epigraphical resources Argentina's civil registry A European Bioinformatics Institute web resource to search and visualize Biomedical ontologies website on basketball online database of gene-disease relationships a research data repository online database about female authors web-based database for the academic genealogy Archive center of the University of Pittsburgh a project indexing and formatting FDA data, and making it accessible to the public online project designed to be a "smart" search service for journal articles website about Greek mythology knowledge database led by James Burke index of bibliographic information on academic journals in the humanities and social sciences the automatically annotated portion of UniProtKB which is unreviewed German database and website digital library Croatian database of scientific papers Cultural institution chemical information database from University of California, Irvine release of 11.5 million documents created by the Panamanian corporate service provider Mossack Fonseca catalogue contains the holdings information of the German National Library starting from 1913 Public knowledge base providing Research Resource IDentifiers 
(RRIDs) neuroimaging database catalogue of scientific names of New Zealand biota Database for nutrients database of hepatics database collecting three-dimensional structures of natural metabolites database database of researcher impact by Frontiers collection of pollen and spores information in the Australasian region database of recognised astronomical names on planets or their satellites database database of mass spectra Database of CYP reactions database for chemical compounds in patents, operated by the European Bioinformatics Institute (EBI) Database of chemical entities Database of NMR spectra Repository for metabolomics data. Database of bio-active chemical entities of nano size Nucleotide Database of the National Center for Biotechnology Information Wiki-database of plant lncRNAs Pathosystems Resource Integration Center database of CYP metabolism freely available web resource of analytical technology services and products used in biomedical research, listing expertise and molecular resource capabilities available at research centres and biotech companies Protein Database of the National Center for Biotechnology Information database of biomedical nanotechnology research Repository for nanosafety data. Database of genome-scale metabolic networks. 
Database of chemical and biological interactions Collection of toxicogenomics data sets WORLDWIDE CROCODILIAN ATTACK DATABASE metabolomics database Rett Syndrome Variation Database curated and comprehensive summary of L1HS insertion polymorphisms identified in healthy or pathological human samples and published in peer-reviewed journals DNA methylome programming database that integrates the genome-wide single-base nucleotide methylomes of gametes and early embryos in different model organisms is a corpus of Russian internet texts that has been accessible on request through an online query interface since 2013 online Japanese-English dictionary on-line knowledge resource on cell lines biographical database of cultural industries in the Dutch and Flemish Golden Ages biological database with maps of signaling and metabolic pathways Database of metabolic fluxes Database of apicomplexan metabolic pathways Database of metabolomics data biological database open-access database storing curated, non-redundant transcription factor (TF) binding profiles hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism free database of commercially-available compounds for virtual screening German webportal for communication, media, and film studies Dutch photographers' archive and image bank institutional repository shared by Leeds, Sheffield and York Universities image dataset digital Medieval Latin library developed by the University of Zurich, Institute for Greek and Latin Philology global demographic product created by the United States Census Bureau comprehensive database for the fission yeast Schizosaccharomyces pombe, providing structural and functional annotation, literature curation and access to large-scale data sets scientific database for bacteria Indonesian music website; online music database; social music cataloging site DIGITAALARHIIV: digital collection of 
Estonia database; index of all those who worked in the English and Welsh book trades up to 1851 German database about philology U.S. Department of Education database about all public schools, districts, and state education agencies in the United States virtual authority list for ancient people through Linked Data collection website and database on English poetry 1579-1930 digital archive of books, pamphlets, and periodical essays illustrating the causes and controversies that preoccupied Byron and his contemporaries database database on mtDNA data integrated with longevity records academic project of University College London Catalogue of Life in Taiwan chemical database of the European Union for substances used as cosmetic ingredients online database public domain bibliographic database Web application interface for viewing and editing microbial genetic data in Wikidata database of biological enzymes crowdsourced street-level photo database prosopographical database of church musicians in France bibliographic database United States Army Corps of Engineers inventory of dams in the United States national taxonomic reference of France National-level gene storage biobank and data repository chemical database website for distributing open data website of World Chess Federation with Elo ratings of chess players chemical database database for drug discovery biological database art in urban space of the city of Zurich, Switzerland proceeding series American seismological website and database public domain spectral chemical database trilingual ontology consisting mainly of general concepts online database about waterfalls an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer online database with information about digital art and artists database with drug information database of educational and research organizations database of protein interaction data biological pathway database online service provided by United 
Nations Statistics Division (UNSD) of the Department of Economic and Social Affairs (DESA) Ensembl database for accessing genome-scale data from plants. database of monasteries, convents and collegiate churches in the Holy Roman Empire, created by the Germania Sacra research project database with clinical trial data database about cycling database of libraries in the British Isles to 1850 deaths in California Gene Ontology (GO) annotations database artwork by Antoni Muntadas tool for importing data sets into Wikidata online flora on panarctic region an interdisciplinary team dedicated to annotating gene function related to human fetal development database for information on protein localization, interaction, functional assays and expression represents GO annotations created in 2001 for NCBI and extracted into UniProtKB-GOA from EntrezGene A systems biology approach to dissect cilia function and its disruption in human genetic disease online database of the legislative information of the United States Congress Brazilian music website online bibliographic database support assertions about things (such as scientific conclusions, gene annotations, or other statements of fact) that result from scientific research Spanish online library database of cultural heritage objects serving the Wiki Loves Monuments project website about ice hockey international prospective register of systematic review protocols Online archive of genetic and phenotypic human data botanical online database bibliographic database of the Mathematical Reviews journal NCATS: Global Ingredient Archival System (GINAS) database maintained by government of Victoria, Australia, containing official names of geographic features within the state taxonomic wiki cloud storage server compatible with Amazon S3 is a group of clay tablets from Iron Age Syria web service allowing authors to register and claim authorship of their works community database resource for the laboratory use of zebrafish scientific 
database dictionary of the Breton language a collection of 570k human-written English sentence pairs, supporting the task of natural language inference (NLI) database maintained by the Parliament of Finland online resource that helps users discover biographical and historical information about persons, families, and organizations that created or are documented in historical resources (primary source documents) and their connections to one another international database of classical philologists hosted by the Aristarchus project data model used by GeoNames database GO annotation database registry for metadata schemas and application profiles facility in Stockholm, Sweden other organization in Munich, Germany dataset virtual library Psychology online preprint service Elizabeth Hawley's climbing statistics Online library lobid-organisations is a directory of approximately 30,000 memory institutions (libraries, archives and museums) in Germany, Austria, and Switzerland. The North Rhine-Westphalian Library Service Centre's (hbz) union catalogue as Linked Open Data. The hbz union catalogue records approximately 20 million bibliographic titles plus holding information. It contains cooperatively created title and holdings information from libraries in North Rhine-Westphalia and Rhineland-Palatinate. an OBO Library ontology for environmental systems, components, and processes dataset A language database containing pronunciations. It is operated by, and available to employees within, the Swedish quango media. art website/database dataset developed by Mikolov et al. 
2013 collection database owned by the Smithsonian Institution question-answering dataset question-answering dataset question-answering dataset question-answering dataset database of Requests for Comments publications database of medieval and Renaissance manuscripts in the British Library registry of civil aircraft registration marks in Canada database of airline histories free-content digital library of Jewish texts compilation of academic papers about wikis biological database and online resource for integrating genotype and phenotype data database of file format identification patterns compiled by Gary Kessler dataset by Pang and Lee register of Australian women and their organisations dataset by Hu & Liu from KDD'04 dataset by Julian McAuley from article published in 2015 dataset by Socher et al. from 2013 dataset by Andrew Maas et al. from 2011 with 2 times 25,000 movie reviews dataset by Wiebe et al. from 2005 database of saints, first names and feasts architectural heritage database in Portugal Variant Annotation as a Service reference dataset for knowledge graph algorithms benchmark dataset for speech recognition dataset bibliographic database of electronic PhD, MD and DProf theses provided by the British Library music website focused on cover versions knowledgebase for lipid biology ontology to annotate experiments in the field of the life sciences database which contains information on tissue and biofluid expression of extracellular RNAs multilingual corpus open archive of the social sciences civil registry of births, marriages, and deaths in the state of Victoria, Australia civil registry for Queensland, Australia documents leak related to offshore investment bibliographic database of the ACM online database of Arthurian texts, images, and scholarship at the Robbins Library, University of Rochester, New York describes the elements used in the data export of Semantic MediaWiki database of botanical journals registry that provides and maintains identifiers 
for genetic variants classification for visual arts, encoding system for visual elements in artworks dataset for question-answering bibliography of medieval literature word similarity dataset created by Felix Hill dataset for word similarity OpenStreetMap-based dataset, made available under the Open Database License open initiative whose aim is to enrich the Web of Data with Spanish geospatial data voice dataset by  Mozilla database of early modern correspondence suite of OWL 2 DL ontology modules for describing aspects of semantic publishing and referencing 2018 version of ontology for describing entities that are or may be published ontology that enables characterization of the nature or type of citations ontology meant to define bibliographic records, bibliographic references, and their compilation into bibliographic collections and bibliographic lists online database of the world's tallest buildings image dataset image dataset open archive of Swedish National Heritage Board online database of open access mandates open access platform for digitized journals in Switzerland image database with 2,429 faces size 19x19 in the training set database of the burial grounds of the Commonwealth War Graves Commission Brazilian national heritage register for cultural assets of artistic value research database in consciousness research and neuroscience data set consists of 20000 messages taken from 20 newsgroups Internet database of movie theaters Internet database of American bridges based on the National Bridge Inventory is a dataset containing a collection of English paragraphs with over 3 billion words linked data server designed for ontologies database of pesticides evaluated by the Joint Meeting on Pesticide Residues database of plants used as agricultural crops, including their ideal growing conditions biological database online database on amphibian declines, natural history, conservation, and taxonomy database website database dataset for situation recognition 
library online dictionary of the Norwegian language directory online database of water resources in California, United States of America digital repository database for rare and/or genetic diseases online repository of Closure packages online database of the world's plants free library of 19th century medical texts Netherlands multi-volume historical dictionary of English slang repository for data from nuclear magnetic resonance spectroscopy on biomolecules database of projects funded by German Research Foundation database of World War II memorials in the Netherlands website that aggregates reviews of music albums Central Library of NTUA information system for Estonian museums biological database database of digitized images from the New York Public Library's collections Dutch online architecture database open data platform of the Biblioteca Nacional de España dataset large dataset with labeled videos database of the Hungarian Parliament data archive for technical sciences cancer registry for the state of Missouri, USA database by SpringerNature database with biographical details of Fellows of the Royal College of Surgeons website of the U.S. Government Publishing Office offering access to U.S. government documents that will replace FDsys taxonomic database Mexican Public Cultural Sector database digital library of the Biblioteca Europea di Informazione e Cultura Foundation, Milan English dictionary hosted on merriam-webster.com website Track and Field Results Reporting System website database for NCAA U.S. 
collegiate track and cross country database dataset dataset the OER World Map is a collaboratively maintained database documenting the growing number of actors and activities in the field of open education worldwide Russian online information system online database of contemporary music Harvard University's open-access digital repository of research The Sol Genomics Network (SGN) is a database and website dedicated to the genomic information of the nightshade family, which includes species such as tomato, potato, pepper, petunia and eggplant. online database of manga, anime, games and media art dataset online Finnish dictionary bibliographic database run by Harvard University Library website devoted to silent films database of historical women from book genre collective biographies gene expression database online database data repository ontology ontology Ontology for enabling interoperability of epidemic models and public health application software. ontology ontology ontology ontology An ontology for biodiversity data ontology ontology ontology ontology ontology ontology ontology ontology ontology ontology Ontology to describe cell lines an ontology for cell types ontology Ontology of small molecular entities of biological interest Ontology for descriptors used in chemoinformatics database ontology ontology ontology ontology ontology ontology Ontology of concepts and relations relevant to evolutionary comparative analysis ontology ontology ontology ontology ontology ontology An ontology for types of evidence used to support scientific claims. ontology An ontology of emotions, moods, and other kinds of feelings. ontology ontology ontology ontology An ontology of phenotypes of fission yeast. ontology ontology ontology An ontology for foodborne pathogens and associated outbreaks. ontology ontology ontology ontology ontology A set of ontologies related to infectious diseases. An ontology of information entities. 
ontology An ontology for gene-gene interactions ontology ontology ontology An ontology for concepts related to malaria ontology An ontology for mental diseases An ontology for concepts related to mental functioning ontology ontology ontology ontology ontology An ontology of mouse pathology phenotypes. An ontology for organic reactions in organic synthesis. ontology ontology ontology ontology ontology An ontology for RNA function ontology An ontology for concepts related to biobanks An ontology which describes biological processes, cellular components and molecular functions in living organisms ontology ontology ontology An ontology for general aspects of medicine, with a focus on cancer. ontology ontology ontology ontology ontology ontology ontology ontology An ontology to describe biomedical statistics ontology ontology ontology ontology An ontology for adverse events of vaccines. Ontology Based Data Access refers to a range of semantic techniques, algorithms and systems developed to facilitate access to various types of data sources. ontology ontology ontology ontology ontology An ontology for describing groups of interacting organisms. ontology An ontology for the anatomy of sponges (Porifera) ontology ontology ontology ontology ontology ontology ontology An ontology for the provenance of scientific claims and supporting evidence. ontology ontology ontology An ontology to describe software applications with a focus on bioinformatics tools. ontology ontology ontology ontology ontology ontology ontology ontology An ontology of clinical informed consents An ontology for concepts related to vaccines and vaccination ontology Ontology of the anatomy of the African clawed frog (Xenopus laevis). ontology Bibliographic database of mostly German-language works on zoology. 
Research Dataset dataset word analogy dataset thematically and genre-balanced Polish language corpus with over 70 million words Chemical Property Database online search system for the United States Patent and Trademark Office's database of registered trademarks free international research database for tertiary education Norwegian road data bank open access institutional repository that provides access to the scholarly, educational, and creative works of the US-based University of Maine community prosopographical directory of French learned societies' membership terminology database 2010 conceptual model and ontology for describing entities that are published US data center for the global PDB archive European data center for the global PDB archive number database Loeb Classical Library volume The Automated Weather Data Network (AWDN) gathers weather data for partners in agriculture and related fields chemistry indexing and abstracting service A Taiwan Linked Open Data Platform (data.odw.tw) built by the Institute of Information Science, Academia Sinica, Taiwan international disaster database, located at the Centre for Research on the Epidemiology of Disasters (CRED), Université catholique de Louvain, Brussels, Belgium library catalogue shared between several universities in Southern Italy biological database online Maltese lexicon database of material phase diagrams database about soybean genetics database about legume traits database about peanut genetics database about the genetics of corn (Zea mays) database for comparative plant biology online database of Sega video games specializes in the archiving, cataloging, and distributing of scientific data sets relevant to asteroids, comets and interplanetary dust national database of coronial information on every death reported to a coroner in Australia and New Zealand question-answering dataset repository https://publikationsserver.tu-braunschweig.de/ repository http://www.edshare.soton.ac.uk/ repository 
http://libres.uncg.edu/ir/ repository http://ddd.uab.cat/ repository http://digitalcommons.wayne.edu/ repository http://www.freidok.uni-freiburg.de/ system for automated soil mapping based on global soil profile and covariate data data set statistics database of INSEE authorised whole-of-government website for Commonwealth of Australia legislation and related documents Fire Effects Information System online database online database of mosquitoes digital atlas of coral reefs web mapping service of species' distribution database of clinical features seen in mitochondrial diseases online database full-text database for works published by Springer former central database of United Kingdom citizens Finding aid search interface Computer database of antiquities in Jordan database compiled and managed by Annita Lucchesi Electronic Flora of South Australia materials database covering historic and contemporary materials used in the production and conservation of art, architecture, and archaeology by Museum of Fine Arts, Boston, Massachusetts online database of names and descriptions of geologic units online archive of geoscience data in the United States a project at Indiana University Bloomington to advance women's art made in Europe (and later, in the US) during the 15th-19th centuries FDA database 4TU.ResearchData provides an archive for researchers around the world for long-term access and curation of research datasets, with a focus on data from science, engineering and technology. USGS Numbered Series pharmaceutical literature database database containing citations of historic medical literature serial published 1880 - 1961 bibliographic database Free global database of active blockchain businesses The European Criminal Records Information System is an EU-System for Criminal Records. 
heritage institution Butterflies of India online database Indian online database on moths Odonata of India online database Reptiles of India online database Birds of India online database Moths of North America online database an online publication on recent pollen Plantarium online database Paleozoic ammonoid online database system Catalogue of the Lepidoptera of Belgium online database Leeds Robotic Commands is a dataset of real-world RGB-D scenes of a robot manipulating different objects together with natural language descriptions of these actions. online database digital library USGS Database taxonomy of fossil plants database bibliographic database database database of students of Litchfield Law School and Litchfield Female Academy Biological database biological database biological database online taxonomic database on Psylloidea large-scale (1000 hours) corpus of read English speech corpus was made from audio talks and their transcriptions available on the TED website website, online entertainment collectors database crowd Sourced Emotional Multimodal Actors Dataset database of publicly funded research in the UK biographical reference for British women writers ontology ontology ontology ontology RDF representation of the Microsoft Academic Graph American theater news website database for Latin and Ancient Greek dictionaries set of email addresses and passwords online database of fleurons automatically derived from scanned public domain texts thesaurus database of the Consortium of European Research Libraries online database of Commodore Amiga video games biographical reference work online database online database on scale insects European database to search and record orphan works online resource on the magic lantern, an early slide projector invented in the 17th century dataset based on the MuseumFinland project Kalevala as semantic web metadata about Finnish fiction literature created by the Finnish Public Libraries botanical index heterogeneous graph 
containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study gazetteer of 630k person and 42k org names that provides spelling variants and EMM news about the entity (200k news per day!) The ontology provides a vocabulary for expressing facts about topological (ordering) relations among instants and intervals, together with information about durations, and about temporal position including date-time information. dataset portal to digital archives of Japanese culture and history website related to records of Comédie-Française theatre troupe, 1680-1791 The National Library of Medicine's web site for consumer information about genetic conditions and the genes responsible for those conditions. linguistic ontology, tree of the meaning of the Arabic terms Italian database 4th edition of Annie Besant's English translation with parallel Sanskrit text online bibliographic database of archaeology US National Library of Medicine's digital archive of scientists, physicians, and others who have advanced science taxonomic database for Antarctic marine species database of digital collection of Laval University library reference dataset for knowledge graph algorithms database regarding the tradition of Greek texts before the 16th century database management system for culture collections in the world a reference knowledge graph (ontology) to interoperate data and for machine learning bibliographic database of work-level records digital library of Latin prose texts of Late Antiquity (2nd-6th century AD) digital library of ancient Latin texts Norway's nationwide civil registry Ugo and Olga Levi Foundation institutional repository and database. 
A curated database of gene-disease panels biological database the myschool.edu.au site, a government source of compiled data database of stratigraphic units in Australia database of biological specimen records online database of music video information online database map and database of Australian First Languages Statistics database hosted by the International Labour Organization online directory of libraries located globally website by the Australian Government about orphanages, children's Homes, and other institutions website and database of flora in South Australia digital library of Latin poetry database connecting pathogens to phenotypes electronic library scientific database of beach measurements in New South Wales, Australia online store and database focused on electroacoustic music linked data platform database of Australian patent information managed by IP Australia database of food compositions operated by Food Standards Australia New Zealand database of unique codes for all New South Wales offences and Commonwealth offences dealt with in New South Wales JUSTfind - the online public access catalog of the university library Giessen A social network of and about cinema database of algal taxa in Australia database of scientific names for species database of Indian academicians and scientists a semantically annotated English corpus website dataset dataset dataset dataset dataset dataset dataset dataset dataset dataset dataset medical database database of common names for natural history collections dataset dataset dataset health research database database of French theatre of the seventeenth and eighteenth centuries project of the British Library digital repository for scholarly materials produced by members of the University of North Carolina at Chapel Hill community repository http://amsdottorato.unibo.it/ repository http://memory.loc.gov/ammem/index.html repository http://rruff.geo.arizona.edu/AMS/amcsd.php repository http://repository.alt.ac.uk/ 
repository http://agritrop.cirad.fr/ repository http://ageconsearch.umn.edu/ repository http://ahero.uwc.ac.za/ repository http://preprints.acmac.uoc.gr/ repository http://gtcni.openrepository.com/gtcni/ repository https://aperto.unito.it/ repository http://archiviomarini.sp.unipi.it/ digitize and liberate all public domain sheet music motorsport results and statistics database theses database Canadian movie and television news website and online database portal published by the ISSN International Centre, containing ISSNs assigned to serial publications An ontology for the diverse roles behind a scientific research article. Norwegian portal for collecting business information GBIF node in Finland IAEA database of nuclear power plants worldwide abandonware database dataset dataset register of historic places in Washington, D.C. online database of the Museum of Modern Art annual directory of library publishers digital repository of doctoral theses from the Consejo de Universidades photographic archive at the University of Chicago register of heritage-listed places in New Zealand Online database of movie theaters, distributors, movies and screenings in the Netherlands An Indonesian academic repository database of films, movie, actors, directors, etc., in Taiwan digital library at the University of Patras database of born-digital projects and resources for gender studies site with detailed bibliographic info about an identity clinical trials database database CORNELL NEWSROOM dataset database online flora for plants a database of scholarly works, developed and maintained by MDPI (Q6715186). The name Scilit uses components of the words “scientific” and “literature”. 
digital repository for archaeological research speech database website dataset of bibliographic metadata dictionary of the Aramaic language bibliographic and full text database of agricultural information biological database online database database of animal ageing and longevity AI for object recognition in images and videos chemical database of natural products The LiverTox database in NCBI Bookshelf website on amoeboid organisms cybersecurity ontology online database of video games database maintained by the Ministry of Culture of Brazil database of archives, libraries and museums of Regione Toscana A publicly-accessible online system designed to facilitate the development, validation, curation and distribution of large-scale, evidence-based datasets for use in diagnostic variant filtering. An ontology of disease symptoms online map, marked-up texts, and descriptive gazetteer and encyclopedia of people, places, topics, and terms relating to London pre-1700 online database of video games online database of video game music genealogical database system of registration of basic vital records such as birth, marriage and death regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA regional digital library in the United States, and service hub for DPLA botanical database knowledgebase provided by the WHO Culture 
digital repository digital archive of art and literature of antebellum New York biological dataset question-answering dataset question-answering dataset database multiple monolingual text corpora digital library of Latin poetry database of union members question-answering dataset question answering dataset the first annotated corpus of Russian language texts, developed since 1998, includes complete morphological and syntactic tags, includes disambiguated text biological database question-answering dataset Mexican newspaper database genome database genome database online database of Attic inscriptions Historic England database of archaeological, architectural and maritime sites digital library of South Dakota university collections database of ancient authors and texts national clinical trials registry lyrics website collaborative volunteer-run effort to track the COVID-19 outbreak in the United States mobility report produced by Google online database regarding Scholasticism database of Spanish hospitals maintained by the ministry of health Online directory of biobanks database for historians using Wikibase Pornographic film production company and distribution website pornographic distribution platform dataset for composition language and visual recognition scientific bibliographic database medical database of systematic reviews free online platform for language lovers and an online community Land Information New Zealand (LINZ) database for official place names/coordinates database jointly compiled by the French School at Athens and the British School at Athens database compiled by the German Archaeological Institute website about Dutch language and Dutch literature by the National Library of the Netherlands Lexical database of Basque database by the Koninklijke Bibliotheek on the history of printed books a list of publicly known cybersecurity vulnerabilities database of Ecuadorian species dataset of images dataset of scenes collection of over 100 up-to-date
datasets relevant to California counties Information database about captive and wild elephants compendium of locales, maintained by the Center for Land Use Interpretation (Q5059738) online taxonomic database repository of digital items and collections searchable repository of full text publications and citations by LSE staff online archive of PhD theses for the London School of Economics and Political Science Knowledge Graph containing historical photographs and metadata of Stuttgart State Theatres source of accessibility information in the UK computer vision dataset distributor of documentary films in North America Alexander Turnbull Library's catalogue for unpublished collections database ontology Taxonomic database on Cephalopoda biological database about glycans and glycoproteins ZivaHub is the University of Cape Town's institutional open access data repository. It houses scholarly outputs of the University of Cape Town. Ziva is a Shona word meaning "to know". digital single entry point service for all UNESCO resources An electronic archive for digital resource materials in the fields of minority health and health disparities research and policy. a digital repository in the USGS Knowledge base used by Google which lets individuals create their profile on its search engine publicly accessible database of vertebrate biodiversity data from natural history collections around the world a research data repository in Taiwan union catalog operated by Jisc Historical financial database by Refinitiv large small ComputerApplications_MISCELLANEOUS DATE/TIME File format File name File size Uniform resource locator/link to file 80109 Pattern Recognition and Data Mining 170203 Knowledge Representation and Machine Learning 110906 Sensory Systems 79901 Agricultural Hydrology (Drainage Flooding Irrigation Quality etc.) 
thermal Imaging Molecular Biology 80699 Information Systems not elsewhere classified signaling Rattus norvegicus 59999 Environmental Sciences not elsewhere classified acetylome post-translational modification Ecology 69999 Biological Sciences not elsewhere classified host-pathogen interactions Cancer Science Policy acetyltransferase Toxoplasma gondii magnesium aluminium ate complex X-ray crystallography Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Platycodon grandiflorum Africa Asia Pliocene Canarina canariensis Ostrowskia magnifica Cyclocodon lancifolium Pleistocene Canarina abyssinica climate-driven extinction continental islands vicariance nested phylogenetic dating 110309 Infectious Diseases Canarina eminii Canary Islands Miocene Cell Biology long-distance dispersal Pharmacology Bayesian biogeography Paleoecology Uncategorised Uncategorized small machine learning &gt; classification analysis &gt; image processing featured technology and applied sciences &gt; computing &gt; computer science data type &gt; image data medium machine learning &gt; deep learning ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION medium featured natural and physical sciences &gt; nature &gt; animals human activities natural and physical sciences &gt; biology &gt; ecology natural and physical sciences &gt; nature &gt; plants small featured GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) general reference &gt; research tools and topics &gt; books medium 15 Geothermal Energy geothermal Colorado Routt County Routt Hot Springs Strawberry Park Hot Springs reconnaissance shallow temperature survey air photo lineaments groundwater geochemistry geology geologic map topographic map geothermometry map small Time Series Prediction Statistics Computational Biology 80301 Bioinformatics Software 60102 Bioinformatics Ecology Cancer Pleistocene community disassembly Inorganic Chemistry Biotechnology Neuroscience Developmental Biology Rancholabrean Holocene Plant Biology functional diversity extinction North America megafauna Biochemistry 60506 Virology Mammalia machine learning &gt; classification featured data type &gt; image data medium machine learning &gt; deep learning problem type &gt; multiclass classification culture and arts &gt; games and toys ComputingMilieux_COMPUTERSANDEDUCATION GeneralLiterature_MISCELLANEOUS Molecular Biology Cell Biology FIS distribution gametic phase disequilibrium 29999 Physical Sciences not elsewhere classified Markov chains Cyclical parthenogenesis de Finetti diagrams Biophysics Immunology individual-based simulations Health Care Molecular Biology 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Pharmacology Plant Biology Immunology endosymbiotic gene transfer Microbiology Medicine eukaryotic phylogeny tree of life Computational Biology Space Science Genetics Evolutionary Biology sampling strategy phylogenetics microbial diversity Uncategorised Uncategorized 54 Environmental Sciences ngee ngee-arctic barrow alaska Radiocarbon in CO2 Radiocarbon in soil CO2 production carbon mineralization soil organic matter nitrogen concentration soil organic matter carbon concentration soil organic matter geochemistry small natural and physical sciences &gt; nature &gt; plants natural and physical sciences &gt; biology ComputerApplications_COMPUTERSINOTHERSYSTEMS technology 
and applied sciences &gt; agriculture medium InformationSystems_INFORMATIONINTERFACESANDPRESENTATION(e.g. HCI) ComputingMethodologies_PATTERNRECOGNITION analysis &gt; nlp medium featured technology and applied sciences &gt; computing &gt; internet technology and applied sciences &gt; computing &gt; internet &gt; twitter society and social sciences &gt; society &gt; politics geography and places &gt; asia &gt; russia society and social sciences &gt; social sciences &gt; international relations 60102 Bioinformatics Supplementary materials Geophysics Treatment Fugacity of carbon dioxide (water) at sea surface temperature (wet air) Carbonate ion Mass Sample ID Ammonium Type pH standard error Calculated using seacarb after Nisumaa et al. (2010) Uniform resource locator/link to reference Nitrate and Nitrite Alkalinity total Salinity Carbon inorganic dissolved Temperature water Potentiometric Carbonate system computation flag Carbon dioxide Registration number of species Bicarbonate ion Aragonite saturation state Phosphate Chlorophyll c Potentiometric titration Calcite saturation state Chlorophyll a South Pacific Partial pressure of carbon dioxide (water) at sea surface temperature (wet air) Experiment duration Species small ComputingMethodologies_DOCUMENTANDTEXTPROCESSING Computer Science::Digital Libraries Biomarkers Quantitative Biology::Genomics Statistics::Applications Statistics::Methodology Computer Science::Computer Vision and Pattern Recognition Astrophysics::Galaxy Astrophysics Computer Engineering 130309 Learning Sciences 150201 Finance small natural and physical sciences &gt; physical sciences &gt; space Astrophysics::Earth and Planetary Astrophysics Physics::Space Physics small featured GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) health and fitness &gt; self care &gt; exercise &gt; weight training featured medium technology and applied sciences &gt; computing &gt; internet culture and arts &gt; culture and humanities &gt; food and drink machine learning &gt; classification society and social sciences &gt; society &gt; business society and social sciences &gt; social sciences &gt; linguistics small featured data science terrorism Worldwide forecasting models conflict geography and places &gt; cities File name File size Uniform resource locator/link to file File format Pygoscelis adeliae Aptenodytes forsteri Antarctica Thalassoica antarctica Functional Ecology @ AWI (AWI_FuncEco) Development of a CCAMLR Marine Protected Area in the Antarctic Weddell Sea (WSMPA) File content Weddell Sea Marine Protected Area (MPA) Neuroscience 170205 Neurocognitive Patterns and Neural Networks small ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION technology and applied sciences &gt; transport &gt; cycling health and fitness &gt; self care &gt; exercise &gt; sports health and fitness &gt; self care &gt; exercise ComputerSystemsOrganization_MISCELLANEOUS small 90399 Biomedical Engineering not elsewhere classified 90302 Biomechanical Engineering locomotion analyses Biological Engineering 110903 Central Nervous System 110999 Neurosciences not elsewhere classified 90399 Biomedical Engineering not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy Microbiology Space Science 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences not elsewhere classified Data Format medium featured Astrophysics::Galaxy Astrophysics natural and physical sciences &gt; physical sciences &gt; space natural and physical sciences &gt; physical sciences &gt; astronomy natural and physical sciences &gt; physical sciences &gt; physics natural and physical sciences &gt; nature data type &gt; image data featured skin and connective tissue 
diseases large problem type &gt; binary classification technology and applied sciences &gt; medicine machine learning small society and social sciences &gt; society &gt; finance society and social sciences &gt; society &gt; money 80699 Information Systems not elsewhere classified 110309 Infectious Diseases Inorganic Chemistry Neuroscience Developmental Biology Computational Biology Evolutionary Biology Genetics Molecular Biology Physiology Marine Biology 80107 Natural Language Processing stance classification stance detection fake news rustance featured medium large society and social sciences &gt; society &gt; crime data type &gt; bigquery geography and places &gt; north america &gt; united states society and social sciences &gt; society &gt; crime &gt; violence Neuroscience 80599 Distributed Computing not elsewhere classified 89999 Information and Computing Sciences not elsewhere classified 80107 Natural Language Processing Distributed Computing 80705 Informetrics 10401 Applied Statistics Mathematics::Category Theory small 80699 Information Systems not elsewhere classified 110309 Infectious Diseases Inorganic Chemistry Biotechnology Biophysics Immunology Genetics Molecular Biology Exome capture small population structure Pseudotsuga menziesii positive selection Southwestern Germany 14 SOLAR ENERGY NREL energy data low income low and moderate income lmi pv rooftop technical potential solar for all photovoltaic solar tract prediction USA LiDAR residential 2011-2015 demographic data cost-benefit analysis Cancer Medicine small machine learning &gt; classification geography and places &gt; asia &gt; india people and self &gt; personal life &gt; entertainment ComputingMilieux_MISCELLANEOUS featured medium ComputingMilieux_MISCELLANEOUS culture and arts &gt; culture and humanities &gt; popular culture culture and arts &gt; performing arts &gt; film Molecular Biology 59999 Environmental Sciences not elsewhere classified Microbiology Computational Biology Evolutionary 
Biology sexual selection social behaviour fluids and secretions integumentary system inclusive fitness aggression Drosophila melanogaster sexual conflict kin selection parasitic diseases virus diseases 60102 Bioinformatics Applied Computer Science 80106 Image Processing Artificial Intelligence and Image Processing 80199 Artificial Intelligence and Image Processing not elsewhere classified 170299 Cognitive Science not elsewhere classified 170205 Neurocognitive Patterns and Neural Networks 179999 Psychology and Cognitive Sciences not elsewhere classified small featured medium health and fitness &gt; self care &gt; exercise &gt; sports &gt; horse racing data type &gt; image data ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION analysis &gt; nlp large analysis &gt; image processing Paleoecology 100509 Video Communications 100599 Communications Technologies not elsewhere classified small ComputingMilieux_MISCELLANEOUS Computational Biology 60102 Bioinformatics Plant Biology Data_FILES 60702 Plant Cell and Molecular Biology Applied Computer Science 80106 Image Processing Artificial Intelligence and Image Processing 80199 Artificial Intelligence and Image Processing not elsewhere classified 69999 Biological Sciences not elsewhere classified Pharmacology Developmental Biology Biochemistry Microbiology Genetics Evolutionary Biology Molecular Biology Physiology Sceloporus occidentalis Sceloporus zosteromus Sceloporus cowlesi Sceloporus hunsakeri Sceloporus torquatus Sceloporus smithi Sceloporus bicanthalis Sceloporus adleri Sceloporus woodi Sceloporus graciosus Sceloporus gadoviae Sceloporus horridus Sceloporus tristichus Comparative Biology Sceloporus magister Sceloporus ochoterenae animal structures Genomics/Proteomics Sceloporus malachiticus Sceloporus variabilis Sceloporus taeniocnemis Sceloporus edwardtaylori Sceloporus grammicus Sceloporus jalapae Sceloporus orcutti Reptiles Sceloporus palaciosi Sceloporus spinosus Sceloporus siniferus Sceloporus angustus
Sceloporus utiformis Sceloporus formosus Sceloporus carinatus Sceloporus clarkii Sceloporus mucronatus Sceloporus olivaceus Sceloporus exsul Gene Structure and Function Sceloporus scalaris Sceloporus licki small featured natural and physical sciences &gt; nature &gt; animals initiatives &gt; socrata mathematics and logic &gt; statistics &gt; time series Ecology 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Plant Biology North America Genetics Evolutionary Biology Inorganic Chemistry phylogenetic acoustic communication Aves Caribbean acoustical environment habitat use foraging ecology Europe novel environments sensory ecology File name File size Uniform resource locator/link to file File format File content Analytical method ORDINAL NUMBER Comment small Data_FILES small 170299 Cognitive Science not elsewhere classified Geology Ecology 69999 Biological Sciences not elsewhere classified Plant Biology Evolutionary Biology USA Europe Population Genetics - Empirical Adaptation Australia Natural Selection and Contemporary Evolution Agriculture Raphanus raphanistrum Ecological Genetics 69999 Biological Sciences not elsewhere classified Plant Biology Biochemistry Genetics 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences not elsewhere classified Dolichonyx oryzivorus landscape buffer scale of effect kernel landscape context Passerculus sandwichensis spatial scale Pterostichus Melanarius distance decay Canada landscape management landscape structure Habitat model landscape extent Pain Questionnaire Pain education 110399 Clinical Sciences not elsewhere classified Neurophysiology Measurement Computer Engineering Data_MISCELLANEOUS 91303 Autonomous Vehicles Ecology 50202 Conservation and Biodiversity 50211 Wildlife and Habitat Management 80604 Database Management Benchmarking InformationSystems_DATABASEMANAGEMENT SPARQL Log Analysis Triple Stores InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL small initiatives &gt; 
socrata small small GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) ComputingMethodologies_PATTERNRECOGNITION ComputingMilieux_MISCELLANEOUS Data_GENERAL Health Care 60301 Animal Systematics and Taxonomy 54 Environmental Sciences ngee ngee-arctic respiration methane production methane oxidation soil incubation Snow thickness Snow cover fraction DATE/TIME Digital camera CC640 80106 Image Processing 80103 Computer Graphics ComputingMethodologies_PATTERNRECOGNITION Bioinformatics Data_MISCELLANEOUS small medium ComputingMethodologies_PATTERNRECOGNITION society and social sciences &gt; society &gt; business Computer Science::Machine Learning Computer Science::Computer Vision and Pattern Recognition Computer Science::Sound Statistics::Machine Learning Computer Science::Neural and Evolutionary Computation data type &gt; image data medium featured Computational Physics 20599 Optical Physics not elsewhere classified algorithms &gt; neural networks relevance assessment 80704 Information Retrieval and Web Search Neuroscience Imaging Genetics Ecology 69999 Biological Sciences not elsewhere classified Plant Biology Genetics Evolutionary Biology 80699 Information Systems not elsewhere classified Deep-sea Gulf of California Xenoturbellida Phylogenomics Monterey Canyon Paleoecology 111706 Epidemiology small ComputingMilieux_COMPUTERSANDEDUCATION small GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) small ComputingMethodologies_PATTERNRECOGNITION machine learning &gt; classification InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS ComputingMilieux_COMPUTERSANDSOCIETY data type &gt; text data analysis &gt; text mining ComputingMilieux_LEGALASPECTSOFCOMPUTING small featured society and social sciences &gt; social sciences &gt; sociology people and self &gt; personal life &gt; love 69999 Biological Sciences not elsewhere classified Cancer Science Policy 110309 Infectious Diseases Pharmacology Evolutionary Biology 80699 Information Systems not elsewhere classified Physiology Marine Biology Extracorporeal perfusion free flap transplantation rat model microsurgery membrane oxygenator ECMO tissue perfusion 39999 Chemical Sciences not elsewhere classified small 40402 Geodynamics InformationSystems_DATABASEMANAGEMENT InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ComputingMethodologies_DOCUMENTANDTEXTPROCESSING EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; PRECIPITATION RATE small analysis &gt; image processing featured data type &gt; image data technology and applied sciences &gt; computing &gt; internet technology and applied sciences &gt; computing &gt; computer security small GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) ComputingMethodologies_GENERAL Ecology 69999 Biological Sciences not elsewhere classified Science Policy Evolutionary Biology ComputingMethodologies_PATTERNRECOGNITION 80699 Information Systems not elsewhere classified Data_MISCELLANEOUS flower colour polymorphism Mediterranean area Iris pumila pollen limitation phenotypic selection Iris lutescens East Europe 69999 Biological Sciences not elsewhere classified Science Policy Neuroscience Plant Biology Immunology Genetics 80699 Information Systems not elsewhere classified Quantitative genetics and Mendelian inheritance quantitative genetics salmonid Subject area: Genomics and gene mapping qPCR quantitative trait loci medium medium featured GeneralLiterature_MISCELLANEOUS culture and arts &gt; arts and entertainment &gt; humor medium featured GeneralLiterature_MISCELLANEOUS culture and arts &gt; arts and entertainment &gt; humor small 110201 Cardiology (incl. Cardiovascular Diseases) Uncategorized 69999 Biological Sciences not elsewhere classified Science Policy Genetics ComputingMethodologies_PATTERNRECOGNITION 80699 Information Systems not elsewhere classified Data_MISCELLANEOUS analyses example dataset Triassic Permian small small ComputingMilieux_COMPUTERSANDEDUCATION GeneralLiterature_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS featured medium society and social sciences &gt; society &gt; finance health care economics and organizations natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS Ecology population density mammals reptiles abundance birds amphibians ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION Computer Engineering small featured medium analysis &gt; nlp data type &gt; text data analysis &gt; text mining audience &gt; beginner 
analysis &gt; data visualization small machine learning &gt; classification society and social sciences &gt; society &gt; finance featured small society and social sciences &gt; social sciences &gt; sociology society and social sciences &gt; social sciences &gt; linguistics featured data type &gt; image data medium machine learning &gt; deep learning ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION problem type &gt; multiclass classification people and self &gt; personal life &gt; clothing culture and arts &gt; visual arts &gt; photography Uncategorized small GeneralLiterature_MISCELLANEOUS ComputingMilieux_LEGALASPECTSOFCOMPUTING ComputingMilieux_THECOMPUTINGPROFESSION small medium Ecology 50301 Carbon Sequestration Science Environmental Science Molecular Biology Ecology 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Pharmacology Inorganic Chemistry Biochemistry Genetics Biodiversity Natural resources policy Sociology Synergy and trade-off French Alps Multi-scale assessment Ecosystem service association Biophysical assessment Landscape heterogeneity Library and Information Studies 90399 Biomedical Engineering not elsewhere classified 80704 Information Retrieval and Web Search query formulations Molecular Biology Evolutionary Biology Asteraceae 60408 Genomics polymorphisms 60309 Phylogeny and Comparative Analysis sunflower family duplication probabilistic models dysploidy polyploidy ancestral chromosome number 60409 Molecular Evolution small GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) ComputingMilieux_THECOMPUTINGPROFESSION Environmental Science 80704 Information Retrieval and Web Search w3c 80505 Web Technologies (excl. 
Web Search) datasets descriptions Linked Data small medium featured ComputingMethodologies_PATTERNRECOGNITION large small Uncategorised Uncategorized small featured machine learning technology and applied sciences &gt; computing &gt; computer security society and social sciences &gt; society &gt; crime general reference &gt; reference works &gt; web sites Data_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION 80612 Interorganisational Information Systems and Web Services 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Developmental Biology Genetics Molecular Biology 39999 Chemical Sciences not elsewhere classified Biochemistry Climate Science 50204 Environmental Impact Assessment 50206 Environmental Monitoring Hydrology 40107 Meteorology Soil Science general reference &gt; research tools and topics &gt; books small analysis &gt; nlp culture and arts &gt; culture and humanities &gt; languages machine learning &gt; recommender systems Ecology 69999 Biological Sciences not elsewhere classified Cancer 110309 Infectious Diseases Immunology Microbiology Genetics 80699 Information Systems not elsewhere classified Physiology Marine Biology Mixed-species bird flocks Mutualisms South Asia Anthropogenic disturbance Human-modified ecosystems Biodiversity loss Species interaction networks Anthropocene Fungi data 30101 Analytical Spectrometry Spectroscopy data small people and self &gt; personal life &gt; entertainment ComputingMilieux_MISCELLANEOUS culture and arts &gt; performing arts &gt; film 80505 Web Technologies (excl. 
Web Search) QA Applied Computer Science Semantic Web Question Answering 80607 Information Engineering and Theory Solar System Solar Physics Planets and Exoplanets 80104 Computer Vision Neuroscience Molecular Biology 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Pliocene Pleistocene 110309 Infectious Diseases Miocene Cell Biology Biochemistry 60506 Virology Genetics Evolutionary Biology Poecilia sphenops Costa Rica Oligocene Poecilia mexicana limantouri taxonomy hybridization species trees Poecilia cryptic species Guatemala coalescent Panama Poecilia orri Poecilia sulphuraria Poecilia catemaconis Poecilia mexicana incomplete lineage sorting Poecilia butleri El Salvador Nicaragua freshwater fishes conservation Mexico Central America Poecilia gillii Poecilia hondurensis general mixed Yule-coalescent (GMYC) Honduras non-adaptive radiations Poecilia mexicana mexicana Bayesian species delimitation featured medium machine learning &gt; deep learning analysis &gt; nlp technology and applied sciences &gt; computing &gt; internet &gt; twitter society and social sciences &gt; social sciences &gt; linguistics large algorithms &gt; neural networks culture and arts &gt; culture and humanities &gt; languages 80106 Image Processing 80103 Computer Graphics 10401 Applied Statistics ComputingMilieux_THECOMPUTINGPROFESSION 80309 Software Engineering File size Uniform resource locator/link to file File name small featured ComputingMethodologies_PATTERNRECOGNITION society and social sciences &gt; society &gt; business machine learning &gt; regression analysis Computer Engineering small featured society and social sciences &gt; society &gt; business algorithms &gt; decision tree people and self &gt; personal life &gt; employment Applied Computer Science Mathematics::Analysis of PDEs Mathematics::Numerical Analysis anomaly detection framework time series analysis technique Machine Learning Techniques featured large health 
and fitness &gt; self care &gt; positive psychology &gt; mental health humanities natural and physical sciences &gt; biology &gt; health sciences &gt; public health medium medium ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION health and fitness &gt; self care &gt; exercise &gt; sports technology and applied sciences &gt; computing &gt; computer science technology and applied sciences &gt; computing &gt; computer engineering 69999 Biological Sciences not elsewhere classified Cancer Science Policy 110309 Infectious Diseases Biotechnology Plant Biology Biochemistry 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences not elsewhere classified 39999 Chemical Sciences not elsewhere classified 9 configurations Supplementary Data 7 Supplementary Data 7. Dataset analysis script TNT small medium small ComputingMethodologies_PATTERNRECOGNITION medium 91299 Materials Engineering not elsewhere classified 30307 Theory and Design of Materials 30306 Synthesis of Materials Neuroscience Imaging 10401 Applied Statistics ComputingMilieux_THECOMPUTINGPROFESSION 80309 Software Engineering featured medium ComputingMethodologies_DOCUMENTANDTEXTPROCESSING society and social sciences &gt; social sciences &gt; linguistics InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL culture and arts &gt; culture and humanities &gt; languages technology and applied sciences &gt; computing &gt; artificial intelligence ComputingMethodologies_ARTIFICIALINTELLIGENCE 50202 Conservation and Biodiversity species abundance distributions small Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Cell Biology Biochemistry Biophysics Microbiology Genetics Evolutionary Biology 39999 Chemical Sciences not elsewhere classified analyses sequence length species identification efficiency content DNA barcoding gaps GC Dataset Artificial Intelligence and Image Processing Climate Cloudbase cloud 
base Neuroscience Data_MISCELLANEOUS Imaging Ecology 69999 Biological Sciences not elsewhere classified Neuroscience Genetics Evolutionary Biology 80699 Information Systems not elsewhere classified Adaptation Population Genetics - Empirical Insects Conservation Genetics Speciation Uncategorised Uncategorized Molecular Biology Ecology North America Biochemistry Mammalia Microbiology Evolutionary Biology 19999 Mathematical Sciences not elsewhere classified Price equation Species selection Palaeocene Eocene Macroevolution Body size Palaeocene/Eocene boundary Wyoming Bighorn Basin Clarks Fork Basin Neuroscience 170205 Neurocognitive Patterns and Neural Networks 179999 Psychology and Cognitive Sciences not elsewhere classified Developmental and Educational Psychology adult ages Naturalistic Stimuli 170299 Cognitive Science not elsewhere classified Teenagers 170102 Developmental Psychology and Ageing Development Characteristics 170112 Sensory Processes Perception and Performance Movies Human Brain Activation fMRI image analysis approach Behavioral Neuroscience Neuroscience and Physiological Psychology humanities IUGR dataset intrauterine growth restriction low birth weight preterm birth preterm labor premature rupture of membranes prenatal care 140301 Cross-Sectional Analysis Climate WAGHC observational data ocean Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Biochemistry Microbiology Computational Biology Genetics Evolutionary Biology 19999 Mathematical Sciences not elsewhere classified Gene Structure and Function Population Genetics - Empirical Conservation Genetics Host Parasite Interactions Hybridization Quebec Salvelinus fontinalis Parasitology ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION 80106 Image Processing Earth Observation Satellites &gt; LANDSAT &gt; LANDSAT-8 EARTH SCIENCE &gt; CRYOSPHERE &gt; GLACIERS/ICE SHEETS &gt; GLACIERS Landsat 8
Debris-covered glaciers Remote sensing EARTH SCIENCE SERVICES &gt; MODELS &gt; CRYOSPHERE MODELS Earth Observation Satellites &gt; Sentinel GMES &gt; SENTINEL-2 Sentinel-2A/B Cell Biology FIS distribution gametic phase disequilibrium 29999 Physical Sciences not elsewhere classified Markov chains Cyclical parthenogenesis de Finetti diagrams Immunology individual-based simulations Genetics 19999 Mathematical Sciences not elsewhere classified education behavioral disciplines and activities 80301 Bioinformatics Software 80107 Natural Language Processing 80607 Information Engineering and Theory Computation Theory and Mathematics small featured ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS data type &gt; tabular data featured culture and arts &gt; culture and humanities &gt; food and drink small digestive oral and skin physiology small GeneralLiterature_MISCELLANEOUS Paleoecology gridded data EARTH SCIENCE &gt; AGRICULTURE &gt; AGRICULTURAL PLANT SCIENCE &gt; CROPPING SYSTEMS soil management plowing ploughing tillage EARTH SCIENCE &gt; AGRICULTURE &gt; SOILS Conservation Agriculture small mathematics and logic &gt; statistics &gt; time series audience &gt; beginner analysis &gt; time series analysis machine learning &gt; forecasting small GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) technology and applied sciences &gt; computing &gt; internet ComputingMilieux_MISCELLANEOUS Database featured data type &gt; image data medium problem type &gt; multiclass classification Image Processing culture and arts &gt; culture and humanities &gt; food and drink Data Analysis Fresh Fruits Convolutional Neural Network 179999 Psychology and Cognitive Sciences not elsewhere classified education Applied Psychology Helmholtz-Verbund Regionale Klimaänderungen = Helmholtz Climate Initiative (Regional Climate Change) (REKLIM) Paleo Modelling (PalMod) Ecology Cell Biology Biochemistry Microbiology assembly motif functional effect groups theoretical ecology clustering community modelling combinatorics dictionaries encyclopedias glossaries) GeneralLiterature_REFERENCE(e.g. small small featured medium geography and places &gt; asia &gt; india health and fitness &gt; self care &gt; positive psychology &gt; mental health natural and physical sciences &gt; nature &gt; death society and social sciences &gt; society &gt; health medium health and fitness &gt; self care &gt; exercise &gt; sports &gt; horse racing Horse Racing Statistics small featured medium culture and arts &gt; visual arts culture and arts &gt; arts and entertainment &gt; museums culture and arts &gt; arts and entertainment culture and arts &gt; culture and humanities DATE/TIME Randolph Glacier Inventory 6.0 glacier ID Flag Elevation of event Snow density uncertainty Longitude of event Event label Number Snow water equivalent DEPTH ice/snow Latitude of event Density snow Snow depth 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer 110309 Infectious Diseases Biochemistry Immunology 80699 Information Systems not elsewhere classified 39999 Chemical Sciences not elsewhere classified Parasitology Alveolata Polychromophilus Plasmodium Haemoproteus Parahaemoproteus Malaria Leucocytozoon Apicomplexa Haemosporida File 
name File size Uniform resource locator/link to file File format Molecular Biology 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy Inorganic Chemistry Developmental Biology Plant Biology 60506 Virology Medicine Computational Biology Evolutionary Biology 12 days MS column.unigene.fasta files day.unigene.fasta transcriptome assembly 7 days 4 days fb.flower bud.Unigene.fa equestri L 5.root L 6.stem Uncategorized journal Uncategorised Uncategorized Diseases 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Developmental Biology 60506 Virology Microbiology Genetics Evolutionary Biology Molecular Biology Biochemistry small featured society and social sciences &gt; social sciences &gt; linguistics Computational Biology Neuroscience Diseases Uncategorised Uncategorized Hydrology Soil Science 60102 Bioinformatics Cancer otorhinolaryngologic diseases small education small Data Format Computational Biology Bioinformatics 60408 Genomics 60405 Gene Expression (incl. Microarray and other genome-wide approaches) 30307 Theory and Design of Materials 30306 Synthesis of Materials 91299 Materials Engineering not elsewhere classified Process Design Conventional Power Plants 80107 Natural Language Processing Cell Biology small featured analysis &gt; nlp technology and applied sciences &gt; computing &gt; internet technology and applied sciences &gt; computing &gt; internet &gt; twitter society and social sciences &gt; social sciences &gt; linguistics InformationSystems_MISCELLANEOUS society and social sciences &gt; social sciences small featured GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) ComputingMilieux_MISCELLANEOUS ComputingMilieux_COMPUTERSANDSOCIETY ComputingMilieux_LEGALASPECTSOFCOMPUTING ComputingMilieux_THECOMPUTINGPROFESSION society and social sciences &gt; society &gt; crime featured ComputingMilieux_COMPUTERSANDEDUCATION large society and social sciences &gt; society &gt; education medium ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare ComputingMethodologies_PATTERNRECOGNITION small health and fitness &gt; self care &gt; exercise &gt; sports &gt; basketball small ComputingMethodologies_PATTERNRECOGNITION technology and applied sciences &gt; computing &gt; internet society and social sciences &gt; social sciences &gt; linguistics Data_MISCELLANEOUS Inertial measurement unit multimodal sensor input 90602 Control Systems Robotics and Automation grasps depth imaging RGB images 90305 Rehabilitation Engineering Grasping Activities featured small technology and applied sciences &gt; computing &gt; programming TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS ComputingMilieux_PERSONALCOMPUTING Software_SOFTWAREENGINEERING 80301 Bioinformatics Software Applied Computer Science Computer Software 80105 Expert Systems 80702 Health Informatics 80108 Neural Evolutionary and Fuzzy Computation 80109 Pattern Recognition and Data Mining featured medium society and social sciences &gt; society &gt; finance medium machine learning &gt; deep learning natural and physical sciences &gt; nature &gt; animals Muinchille Drung Cootehill Drong 80705 Informetrics chi index h-index global surface temperature land surface air temperature Sea surface temperature EARTH SCIENCE &gt; TERRESTRIAL HYDROSPHERE &gt; SNOW/ICE &gt; SNOW DEPTH EARTH SCIENCE &gt; LAND SURFACE &gt; TOPOGRAPHY &gt; SURFACE ROUGHNESS EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; PRECIPITATION AMOUNT EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC 
WINDS &gt; WIND DYNAMICS &gt; VERTICAL WIND VELOCITY/SPEED EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; UPPER LEVEL WINDS &gt; U/V WIND COMPONENTS EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; SURFACE WINDS &gt; U/V WIND COMPONENTS EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WATER VAPOR &gt; WATER VAPOR INDICATORS &gt; HUMIDITY &gt; SPECIFIC HUMIDITY EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WATER VAPOR &gt; WATER VAPOR INDICATORS &gt; HUMIDITY &gt; RELATIVE HUMIDITY EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC TEMPERATURE &gt; UPPER AIR TEMPERATURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC TEMPERATURE &gt; SURFACE TEMPERATURE &gt; AIR TEMPERATURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC PRESSURE &gt; HYDROSTATIC PRESSURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ALTITUDE &gt; GEOPOTENTIAL HEIGHT EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC RADIATION &gt; RADIATIVE FLUX EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC PRESSURE &gt; SEA LEVEL PRESSURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC TEMPERATURE &gt; SURFACE TEMPERATURE &gt; MAXIMUM/MINIMUM TEMPERATURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WATER VAPOR &gt; WATER VAPOR PROCESSES &gt; SUBLIMATION EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; PRECIPITATION PROFILES &gt; LATENT HEAT FLUX EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; WIND DYNAMICS &gt; WIND SHEAR &gt; VERTICAL WIND SHEAR EARTH SCIENCE &gt; ATMOSPHERE &gt; CLOUDS &gt; CLOUD PROPERTIES &gt; CLOUD BASE HEIGHT EARTH SCIENCE &gt; ATMOSPHERE &gt; CLOUDS &gt; CLOUD PROPERTIES &gt; CLOUD FRACTION EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; ACCUMULATIVE CONVECTIVE PRECIPITATION EARTH SCIENCE &gt; ATMOSPHERE &gt; CLOUDS &gt; CLOUD MICROPHYSICS &gt; CLOUD PRECIPITABLE WATER EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WATER VAPOR &gt; WATER VAPOR PROCESSES &gt; EVAPOTRANSPIRATION EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; SOLID PRECIPITATION &gt; ICE PELLETS 
EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; LIQUID PRECIPITATION &gt; RAIN &gt; FREEZING RAIN EARTH SCIENCE &gt; CLIMATE INDICATORS &gt; CRYOSPHERIC INDICATORS &gt; ICE DEPTH/THICKNESS EARTH SCIENCE &gt; CLIMATE INDICATORS &gt; CRYOSPHERIC INDICATORS &gt; SNOW COVER EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; SOLID PRECIPITATION &gt; SNOW EARTH SCIENCE &gt; ATMOSPHERE &gt; WEATHER EVENTS &gt; Stability/Severe Weather Indices &gt; CONVECTIVE AVAILABLE POTENTIAL ENERGY (CAPE) EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; LIQUID PRECIPITATION &gt; RAIN EARTH SCIENCE &gt; TERRESTRIAL HYDROSPHERE &gt; SURFACE WATER &gt; SURFACE WATER PROCESSES/MEASUREMENTS &gt; RUNOFF EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; PRECIPITATION RATE EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; SNOW WATER EQUIVALENT EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC PRESSURE &gt; PLANETARY BOUNDARY LAYER HEIGHT EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC TEMPERATURE &gt; SURFACE TEMPERATURE &gt; POTENTIAL TEMPERATURE EARTH SCIENCE &gt; CRYOSPHERE &gt; SEA ICE &gt; HEAT FLUX EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC CHEMISTRY &gt; OXYGEN COMPOUNDS &gt; OZONE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; WIND DYNAMICS &gt; VORTICITY &gt; POTENTIAL VORTICITY EARTH SCIENCE &gt; BIOSPHERE &gt; VEGETATION &gt; VEGETATION COVER Paleoecology small featured society and social sciences &gt; social sciences &gt; linguistics Sociology culture and arts &gt; culture and humanities &gt; languages Politics Science of education Education The Netherlands Behavioural sciences Socio-cultural sciences society and social sciences &gt; social sciences &gt; demographics geography and places &gt; world Health sciences Psychology Demography Temporal coverage: 2012 December Society and social systems Social attitudes and values Leisure and recreation studies Health and well-being Demography and population Leisure recreation and culture Housing and 
household Social behavior Religion Environment Social sciences analysis &gt; survey analysis Ecology Quantitative Biology::Populations and Evolution medium dictionaries encyclopedias glossaries) ComputingMethodologies_PATTERNRECOGNITION GeneralLiterature_REFERENCE(e.g. ComputerSystemsOrganization_SPECIAL-PURPOSEANDAPPLICATION-BASEDSYSTEMS featured medium society and social sciences &gt; society &gt; crime geography and places &gt; north america &gt; united states society and social sciences &gt; society &gt; crime &gt; violence mathematics and logic &gt; statistics &gt; time series natural and physical sciences &gt; earth sciences &gt; geography society and social sciences &gt; society &gt; crime &gt; illegal drugs Data_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION Data Format Yelp 80505 Web Technologies (excl. Web Search) Computer Software Data Format 80404 Markup Languages 80306 Open Software 80602 Computer-Human Interaction Library and Information Studies Crystallography Neuroscience Imaging 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Space Science Genetics Evolutionary Biology 80699 Information Systems not elsewhere classified energy limitation fungi density compensation species interaction Neotropics EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC TEMPERATURE &gt; UPPER AIR TEMPERATURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ALTITUDE &gt; GEOPOTENTIAL HEIGHT EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; UPPER LEVEL WINDS EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WINDS &gt; SURFACE WINDS EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC WATER VAPOR &gt; WATER VAPOR INDICATORS &gt; DEW POINT TEMPERATURE EARTH SCIENCE &gt; ATMOSPHERE &gt; ATMOSPHERIC PRESSURE &gt; ATMOSPHERIC PRESSURE MEASUREMENTS EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Pressure &gt; Sea Level Pressure EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Pressure &gt; Surface Pressure EARTH SCIENCE &gt; Oceans &gt; 
Ocean Pressure &gt; Sea Level Pressure Dependability Reliability 80501 Distributed and Grid Systems Anomaly detection Logs Log analysis comets File format File name File size Uniform resource locator/link to file Arctic Ocean Dynamic Ocean Topography Geostrophic Currents Ocean Modeling Principal Component Analysis Satellite altimetry Variations in ocean currents sea ice concentration and sea surface temperature along the North-East coast of Greenland (NEG-OCEAN) Ecology Cell Biology Developmental Biology 60506 Virology Medicine Evolutionary Biology Marine Biology Adaptation Speciation Silene dioica present time Switzerland Silene latifolia Hematology reproductive barrier Economics accounting network banks graphml balancesheets communities Ecology 69999 Biological Sciences not elsewhere classified Science Policy 110309 Infectious Diseases 19999 Mathematical Sciences not elsewhere classified generation time pre-breeding condition Mus domesticus overdominance t haplotype intergenerational costs intermittent breeding Perisoreus infaustus reproductive costs life-history intragenomic conflict Palearctic intragenerational costs t frequency paradox "BIOLOGICAL CLASSIFICATION" "ANIMALS/INVERTEBRATES" "MOLLUSKS" Biomineralization Shell Matrix Proteins SMPs Shell formation featured culture and arts &gt; culture and humanities &gt; food and drink large Cancer Paleoecology Diet and health Medical science and disease ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS 90302 Biomechanical Engineering Innervation zone Image-based clustering Graph-Cut segmentation electromyography Environmental Science small GeneralLiterature_MISCELLANEOUS small small featured natural and physical sciences &gt; physical sciences &gt; space Astrophysics::Earth and Planetary Astrophysics Physics::Space Physics natural and physical sciences &gt; physical sciences &gt; astronomy Physics::Geophysics Astrophysics::Solar and Stellar Astrophysics 80104 Computer Vision 170299 Cognitive Science not 
elsewhere classified Genetics Data_FILES 60205 Marine and Estuarine Ecology (incl. Marine Ichthyology) Biological Techniques Zoology Paleoecology medium featured ComputingMethodologies_PATTERNRECOGNITION technology and applied sciences &gt; computing &gt; internet society and social sciences &gt; social sciences &gt; linguistics small small technology and applied sciences &gt; medicine natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare &gt; surgery ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION 90602 Control Systems Robotics and Automation 80104 Computer Vision Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer 110309 Infectious Diseases Evolutionary Biology gene flow asymmetric introgression ABC demographic history newt Lissotriton Since Miocene Central Europe Molecular Biology 80101 Adaptive Agents and Intelligent Robotics 69999 Biological Sciences not elsewhere classified Cancer 110309 Infectious Diseases Cell Biology Plant Biology Biochemistry 60506 Virology Medicine Computational Biology Genetics Hematology Root architecture and plasticity Sinorhizobium meliloti Rhizobia responses Nitrogen responses Cell identity Plant responses to the environment Root cell types Arabidopsis thaliana Fluorescence Activated Cell Sorting Molecular Biology 80699 Information Systems not elsewhere classified 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Cancer Science Policy Developmental Biology Immunology Computational Biology Macroevolution Hematology Directional Evolution Mollusca geomorph Geometric Morphometrics Pectinidae 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Cancer Science Policy Evolutionary Biology Physiology humanities education Galliformes Body mass Anatidae urologic and male genital diseases Birds 
Anseriformes Galloanserae Herbivory Diet small featured human activities health care economics and organizations natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare equipment and supplies health services administration population characteristics small GeneralLiterature_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION society and social sciences &gt; society &gt; business health and fitness &gt; self care &gt; exercise &gt; sports ComputingMilieux_PERSONALCOMPUTING featured society and social sciences &gt; society &gt; politics society and social sciences &gt; social sciences &gt; linguistics small technology and applied sciences &gt; computing &gt; internet medium small audience &gt; beginner analysis &gt; data cleaning algorithms &gt; linear regression people and self &gt; personal life &gt; housing society and social sciences &gt; society &gt; real estate featured small geography and places &gt; asia &gt; india natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare natural and physical sciences &gt; biology &gt; health sciences medium large Data_FILES general reference &gt; research tools and topics &gt; databases Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION 110999 Neurosciences not elsewhere classified Microbiology 110303 Clinical Microbiology 110307 Gastroenterology and Hepatology 110307 Gastroenterology and Hepatology File format File size Uniform resource locator/link to file File name small medium GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) ComputingMethodologies_PATTERNRECOGNITION ComputingMilieux_MISCELLANEOUS InformationSystems_DATABASEMANAGEMENT Data_MISCELLANEOUS Hydrology 80110 Simulation and Modelling 40105 Climatology (excl. 
Climate Change Processes) 90509 Water Resources Engineering 40608 Surfacewater Hydrology 40604 Natural Hazards Climate Science small ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION medium medium featured GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) general reference &gt; research tools and topics &gt; knowledge general reference &gt; reference works &gt; encyclopedias Hydrology 80110 Simulation and Modelling 40105 Climatology (excl. Climate Change Processes) 40608 Surfacewater Hydrology Oceanography 90905 Photogrammetry and Remote Sensing 40104 Climate Change Processes 90902 Geodesy ComputingMethodologies_PATTERNRECOGNITION Neuroscience Computer Science Linguistics 120507 Urban Analysis and Development Computational Biology Data_FILES Biological Techniques small GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) small analysis &gt; nlp ComputingMethodologies_DOCUMENTANDTEXTPROCESSING data type &gt; text data featured medium ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION society and social sciences &gt; society &gt; business EARTH SCIENCE &gt; LAND SURFACE &gt; EROSION/SEDIMENTATION &gt; EROSION EARTH SCIENCE &gt; PALEOCLIMATE &gt; PALEOCLIMATE RECONSTRUCTIONS &gt; DROUGHT/PRECIPITATION RECONSTRUCTION EARTH SCIENCE &gt; SOLID EARTH &gt; GEOCHEMISTRY &gt; GEOCHEMICAL PROPERTIES &gt; ISOTOPE RATIOS Mapping tool EARTH SCIENCE &gt; PALEOCLIMATE &gt; OCEAN/LAKE RECORDS &gt; ISOTOPES EARTH SCIENCE &gt; LAND SURFACE &gt; EROSION/SEDIMENTATION &gt; SEDIMENT TRANSPORT Neodymium radioisotopes EARTH SCIENCE SERVICES &gt; DATA ANALYSIS AND VISUALIZATION &gt; STATISTICAL APPLICATIONS EARTH SCIENCE &gt; PALEOCLIMATE &gt; LAND RECORDS &gt; SEDIMENTS Strontium radioisotopes nanozyme assay silver nanoparticles catalytic activity SERRS small GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) society and social sciences &gt; social sciences &gt; linguistics 60302 Biogeography and Phylogeography Uncategorized featured culture and arts &gt; culture and humanities &gt; food and drink small digestive oral and skin physiology food and beverages small Physics::Instrumentation and Detectors Nuclear Experiment Computer Science::Mathematical Software 79901 Agricultural Hydrology (Drainage Flooding Irrigation Quality etc.) Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer 110309 Infectious Diseases Inorganic Chemistry Neuroscience Developmental Biology Plant Biology 60506 Virology Marine Biology Data_FILES Ecological Genetics Hydrology Conservation Genetics 140201 Agricultural Economics Population Genetics - Theoretical Agricultural economics 70104 Agricultural Spatial Analysis and Modelling Climate Change Impact Population Ecology 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Plant Biology functional diversity USA alpha diversity Illinois temporal dynamics beta diversity fire management ecological restoration Midwest USA environmental filtering Chicago community assembly 60309 Phylogeny and Comparative Analysis Data_FILES 50206 Environmental Monitoring small Computer Engineering r language ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ComputingMethodologies_PATTERNRECOGNITION Computer Science::Computer Vision and Pattern Recognition Astrophysics::Galaxy Astrophysics Computer Engineering 130309 Learning Sciences Data_MISCELLANEOUS Computer Science::Multimedia dictionaries encyclopedias glossaries) Neuroscience Data_MISCELLANEOUS GeneralLiterature_REFERENCE(e.g. 
featured medium small health and fitness &gt; self care &gt; exercise &gt; sports &gt; american football InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL InformationSystems_GENERAL small data type &gt; image data Computer Science::Computer Vision and Pattern Recognition large Mathematics::Geometric Topology small ComputingMilieux_GENERAL 170299 Cognitive Science not elsewhere classified small Cancer 110309 Infectious Diseases Cell Biology Developmental Biology Plant Biology 60506 Virology Immunology Microbiology Computational Biology Genetics 39999 Chemical Sciences not elsewhere classified Hematology Root architecture and plasticity Sinorhizobium meliloti Rhizobia responses Nitrogen responses Cell identity Plant responses to the environment Root cell types Arabidopsis thaliana Fluorescence Activated Cell Sorting small society and social sciences &gt; society &gt; finance philosophy and thinking &gt; philosophy &gt; history 80309 Software Engineering Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Genetics Evolutionary Biology Palaeoptera Bayesian phylogenetics BEAST Pterygota Metapterygota Chiastomyaria featured data type &gt; image data medium natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare respiratory system respiratory tract diseases large 69999 Biological Sciences not elsewhere classified Cancer Science Policy Pharmacology Inorganic Chemistry Biotechnology Medicine Genetics 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences not elsewhere classified EARTH SCIENCE &gt; TERRESTRIAL HYDROSPHERE &gt; SURFACE WATER &gt; DISCHARGE/FLOW EARTH SCIENCE &gt; TERRESTRIAL HYDROSPHERE &gt; SURFACE WATER &gt; RIVERS/STREAMS In Situ/Laboratory Instruments &gt; Conductivity Sensors &gt; CONDUCTIVITY METERS In Situ/Laboratory Instruments &gt; Gauges &gt; STREAM GAUGES Catchments as Organised Systems EARTH SCIENCE &gt; 
TERRESTRIAL HYDROSPHERE &gt; SURFACE WATER &gt; HYDROPATTERN CAOS EARTH SCIENCE &gt; TERRESTRIAL HYDROSPHERE &gt; SURFACE WATER &gt; DRAINAGE In Situ/Laboratory Instruments &gt; Photon/Optical Detectors &gt; Cameras &gt; CAMERA 70103 Agricultural Production Systems Simulation small Artificial Intelligence and Image Processing audience &gt; beginner analysis &gt; data visualization weka Big Data Uncategorised Uncategorized small problem type &gt; multiclass classification machine learning &gt; classification problem type &gt; binary classification algorithms &gt; xgboost machine learning &gt; model comparison machine learning &gt; feature engineering society and social sciences &gt; society &gt; finance &gt; banking algorithms &gt; svm algorithms &gt; logistic regression 59999 Environmental Sciences not elsewhere classified Ecology Pharmacology Inorganic Chemistry Biochemistry 60506 Virology Immunology Medicine Computational Biology Space Science Genetics Evolutionary Biology fungi food and beverages genetic processes animal diseases featured ComputingMilieux_COMPUTERSANDEDUCATION large society and social sciences &gt; society &gt; education small medium 120104 Architectural Science and Technology (incl. 
Acoustics Lighting Structure and Ecologically Sustainable Design) small featured society and social sciences &gt; society &gt; finance mathematics and logic &gt; statistics &gt; time series natural and physical sciences &gt; biology &gt; health sciences &gt; public health &gt; healthcare natural and physical sciences &gt; biology &gt; health sciences &gt; public health society and social sciences &gt; social sciences &gt; demographics society and social sciences &gt; society &gt; government general reference &gt; research tools and topics &gt; government agencies small small small ComputingMethodologies_PATTERNRECOGNITION medium ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION culture and arts &gt; visual arts &gt; photography technology and applied sciences &gt; computing &gt; human-computer interaction technology and applied sciences &gt; electronics &gt; digital media medium stomatognathic diseases small EARTH SCIENCE &gt; ATMOSPHERE &gt; PRECIPITATION &gt; PRECIPITATION AMOUNT featured large health and fitness &gt; self care &gt; positive psychology &gt; mental health humanities natural and physical sciences &gt; biology &gt; health sciences &gt; public health Data Format 80499 Data Format not elsewhere classified 80403 Data Structures 160511 Research Science and Technology Policy 100504 Data Communications 91305 Energy Generation Conversion and Storage Engineering 160808 Sociology and Social Studies of Science and Technology 80703 Human Information Behaviour 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer Biochemistry 19999 Mathematical Sciences not elsewhere classified Dataset 110309 Infectious Diseases Supplemental 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) Biotechnology Biochemistry Genetics 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences not elsewhere classified reductionism mechanistic simulation bibliographic network science integration systems science individual-based model 80106 Image Processing 80103 Computer Graphics Plant Biology ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS bioenergy crop plantings 70304 Crop and Pasture Biomass and Bioproducts biomass C 70306 Crop and Pasture Nutrition small ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION 80106 Image Processing Fourier Optics Image and Signal Processing Law 130306 Educational Technology and Computing data protection privacy 150306 Industrial Relations employee surveillance lecture capture small Environmental Science 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Cell Biology 60506 Virology Genetics Evolutionary Biology 80699 Information Systems not elsewhere classified kinship inbreeding temperate rainforests Conservation genetics and biodiversity South America conservation genetics habitat fragmentation Molecular Biology 80699 Information Systems not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Inorganic Chemistry Computational Biology Genetics Evolutionary Biology ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS coalescence RAD-Seq speciation Genotyping-by-sequencing Western Mediterranean concatenation Linaria radiation Quaternary Iberian Peninsula phylogeny 111706 Epidemiology 119999 Medical and Health Sciences not elsewhere classified Health Care small featured culture and arts &gt; culture and humanities &gt; food and drink Ecology 60205 Marine and Estuarine Ecology (incl. 
Marine Ichthyology) 50202 Conservation and Biodiversity medium medium ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION data type &gt; image data problem type &gt; future prediction Biological Sciences 170203 Knowledge Representation and Machine Learning 60102 Bioinformatics 80109 Pattern Recognition and Data Mining Evolutionary Biology Paleoecology small featured medium ComputingMilieux_PERSONALCOMPUTING health and fitness &gt; self care &gt; exercise &gt; sports &gt; association football geography and places &gt; europe Brain small ComputingMilieux_PERSONALCOMPUTING culture and arts &gt; games and toys &gt; video games Biomarkers 110201 Cardiology (incl. Cardiovascular Diseases) Ecology Cancer Science Policy Pharmacology Biotechnology Plant Biology Evolutionary Biology 19999 Mathematical Sciences not elsewhere classified Marine Biology Inorganic Chemistry body size microevolution Selection small featured featured medium society and social sciences &gt; society &gt; finance philosophy and thinking &gt; philosophy &gt; history medium society and social sciences &gt; society &gt; crime small small 91299 Materials Engineering not elsewhere classified 30307 Theory and Design of Materials 30306 Synthesis of Materials 40604 Natural Hazards Solid Earth Sciences 40407 Seismology and Seismic Exploration Neuroscience behavioral disciplines and activities nervous system genetic structures psychological phenomena and processes small featured society and social sciences &gt; social sciences &gt; economics trade balance of payments exchange rates interest rates government expenditures monetary reserves international economics government revenues financial policy featured data type &gt; image data medium problem type &gt; multiclass classification small people and self &gt; personal life &gt; clothing problem type &gt; object identification Cancer Molecular Biology Ecology 110309 Infectious Diseases North America Biochemistry Mammalia Microbiology Computational Biology 
Evolutionary Biology Price equation Species selection Palaeocene Eocene Macroevolution Body size Palaeocene/Eocene boundary Wyoming Bighorn Basin Clarks Fork Basin small Plant Biology Computer Science::Computer Vision and Pattern Recognition Computer Science::Neural and Evolutionary Computation fungi urologic and male genital diseases Physics::Accelerator Physics urogenital system cardiovascular diseases test datasets female genital diseases and pregnancy complications small general reference &gt; research tools and topics &gt; books society and social sciences &gt; social sciences &gt; linguistics culture and arts &gt; culture and humanities &gt; languages medium featured society and social sciences &gt; social sciences &gt; linguistics ComputingMilieux_THECOMPUTINGPROFESSION people and self &gt; personal life &gt; employment small ComputingMethodologies_DOCUMENTANDTEXTPROCESSING geography and places &gt; asia &gt; india society and social sciences &gt; society &gt; crime small 80799 Library and Information Studies not elsewhere classified small 110309 Infectious Diseases 60506 Virology Medicine Reptiles Population Genetics - Empirical Speciation virology/viral replication and gene regulation exon capture Top End Northern Australia Carlia gracilis 111714 Mental Health virology/viruses and cancer Carlia amax SNP Virology virology/persistence and latency virology/effects of virus infection on host gene expression Kimberley snow cover maps small cardiovascular diseases Paleoecology 120504 Land Use and Environmental Planning Developmental Biology Environmental Science Cell Biology small culture and arts &gt; games and toys ComputingMilieux_PERSONALCOMPUTING culture and arts &gt; games and toys &gt; video games small ComputerSystemsOrganization_PROCESSORARCHITECTURES small GeneralLiterature_MISCELLANEOUS health and fitness &gt; self care &gt; exercise &gt; sports &gt; fishing small ComputerApplications_COMPUTERSINOTHERSYSTEMS initiatives &gt; socrata 110704 Cellular 
Immunology 110203 Respiratory Diseases 80699 Information Systems not elsewhere classified 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Cancer Science Policy 110309 Infectious Diseases Pharmacology Developmental Biology Plant Biology Biochemistry 60506 Virology Immunology Genetics Evolutionary Biology 19999 Mathematical Sciences not elsewhere classified Molecular Biology Agriculture Insects Quaternary 111714 Mental Health Biotechnology reaction norm forensic probe design Brittany frugivory 48°36´N seed dispersal Cytomegalovirus serology forest restoration release Version 3.0 Western France multicellularity Community Ecology Raw Data tuberculosis understory fires Myxococcus xanthus Aphididae phenotypic plasticity version Aphidiinae http Brazil Zone Atelier Armorique natural regeneration Drosophila genome gene sequences CVD Epidemiology Foodwebs HCMV Amazonia physical forces in development population genetics Cardiovascular disease 1°32´W DNA Barcoding TB non-CODIS STRs txt Tapirus terrestris Agilent Chinese Kyrgyz Uganda small sediments disturbance porosity permeability Uncategorised Uncategorized large medium Neuroscience Imaging medium technology and applied sciences > computing > internet EARTH SCIENCE > LAND SURFACE > TOPOGRAPHY > SURFACE ROUGHNESS EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VERTICAL WIND VELOCITY/SPEED EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > UPPER AIR TEMPERATURE EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > AIR TEMPERATURE EARTH SCIENCE > ATMOSPHERE > ALTITUDE > GEOPOTENTIAL HEIGHT EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > SEA LEVEL PRESSURE EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > SOLID PRECIPITATION > SNOW EARTH SCIENCE > BIOSPHERE > VEGETATION 
> VEGETATION COVER EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > SURFACE WINDS EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > DEW POINT TEMPERATURE EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SURFACE WATER > SURFACE WATER PROCESSES/MEASUREMENTS > RUNOFF EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW/ICE TEMPERATURE EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW MELT EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DEPTH EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DENSITY EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > ALBEDO EARTH SCIENCE > OCEANS > SEA ICE > ICE EXTENT EARTH SCIENCE > OCEANS > OCEAN TEMPERATURE > SEA SURFACE TEMPERATURE EARTH SCIENCE > LAND SURFACE > TOPOGRAPHY > TERRAIN ELEVATION EARTH SCIENCE > LAND SURFACE > SURFACE THERMAL PROPERTIES > SKIN TEMPERATURE EARTH SCIENCE > LAND SURFACE > SURFACE RADIATIVE PROPERTIES > ALBEDO EARTH SCIENCE > LAND SURFACE > SOILS > SOIL TEMPERATURE EARTH SCIENCE > LAND SURFACE > SOILS > SOIL MOISTURE/WATER CONTENT EARTH SCIENCE > LAND SURFACE > SOILS > SOIL CLASSIFICATION EARTH SCIENCE > CRYOSPHERE > SNOW/ICE > SNOW/ICE TEMPERATURE EARTH SCIENCE > BIOSPHERE > VEGETATION > VEGETATION SPECIES EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD FREQUENCY EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD MICROPHYSICS > CLOUD LIQUID WATER/ICE EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > WIND STRESS EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > VORTICITY EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > CONVERGENCE EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS > WIND DYNAMICS > CONVECTION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WINDS 
> UPPER LEVEL WINDS EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > EVAPORATION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > WATER VAPOR EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > TOTAL PRECIPITABLE WATER EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR INDICATORS > HUMIDITY EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > SHORTWAVE RADIATION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > OUTGOING LONGWAVE RADIATION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > LONGWAVE RADIATION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > INCOMING SOLAR RADIATION EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC RADIATION > HEAT FLUX EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > SURFACE PRESSURE EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC PRESSURE > GRAVITY WAVE EARTH SCIENCE > ATMOSPHERE > ALTITUDE > PLANETARY BOUNDARY LAYER HEIGHT EARTH SCIENCE > ATMOSPHERE > AIR QUALITY > TROPOSPHERIC OZONE Molecular Biology 80699 Information Systems not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy ComputingMethodologies_PATTERNRECOGNITION Data_FILES Data_MISCELLANEOUS Iminium reactive intermediates Abemaciclib Side effects Reactive metabolites featured small society and social sciences > society > war culture and arts > arts and entertainment > literature people and self > people > social groups small fungi urologic and male genital diseases 130306 Educational Technology and Computing urogenital system cardiovascular diseases female genital diseases and pregnancy complications 69999 Biological Sciences not elsewhere classified Pliocene Pleistocene Genetics Evolutionary Biology 19999 Mathematical Sciences not elsewhere classified integumentary system parasitic 
diseases amphibians fungi Southeastern U.S. hybrid enrichment Lithobates sphenocephalus discordance Hyla squirella Hyla cinerea mitogenome Rana sphenocephala frogs Anaxyrus terrestris barrier testing phylogenomics 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Cancer Medicine 39999 Chemical Sciences not elsewhere classified Sociology sexual cycling Amboseli Park Kenya Papio cynocephalus 1977-2014 steroid hormones postpartum amenorrhea gestation P. anubis body fat 80403 Data Structures datasetsR 69999 Biological Sciences not elsewhere classified Evolutionary Biology Coral Triangle Tropical shallow-marine biodiversity Western Pacific Indo-Australian Archipelago Temporal diversity dynamics Cenozoic Latitudinal diversity gradients Biodiversity hotspot Ostracoda Ecology 60205 Marine and Estuarine Ecology (incl. Marine Ichthyology) 20299 Atomic Molecular Nuclear Particle and Plasma Physics not elsewhere classified medium GeneralLiterature_MISCELLANEOUS ComputingMilieux_MISCELLANEOUS people and self > personal life > hotels GeneralLiterature_INTRODUCTORYANDSURVEY InformationSystems_GENERAL Temperature DATE/TIME ELEVATION HEIGHT above ground air daily minimum Method comment LATITUDE Description daily mean daily maximum Station label LONGITUDE small 60102 Bioinformatics Data_FILES API 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy data study participants membrane experiment assay information infectivity surveys dataset Cell Biology 29999 Physical Sciences not elsewhere classified 80699 Information Systems not elsewhere classified speciation 111714 Mental Health Stream Restoration species delimitation Fluvial Geomorphology United States Morphodynamics Physical Geography Dynastes 59999 Environmental Sciences not elsewhere classified Ecology Developmental Biology Plant Biology Immunology Evolutionary Biology Inorganic Chemistry 
Hematology Alps Heterozygosity-fitness correlation MHC Capra ibex bottleneck Alpine ibex Infectious kerato-conjunctivitis medium audience > beginner Environmental Science 50204 Environmental Impact Assessment 50202 Conservation and Biodiversity Forest 50209 Natural Resource Management Food Security Health Impact evaluation Dietary Diversity 15 Geothermal Energy geothermal Colorado reconnaissance shallow temperature survey air photo lineaments groundwater geology geologic map geothermometry map Rico Geodatabase Dolores County San Miguel County Geochemistry structural point information mines and prospects travertine land ownership rico quadrangle topographic 12 Built Environment and Design air temperatures Earth and Environmental Sciences Heat waves 10 Technology window position Bedroom Environmental Science Genetics Data_FILES 60405 Gene Expression (incl. Microarray and other genome-wide approaches) 60205 Marine and Estuarine Ecology (incl. Marine Ichthyology) Biological Techniques Zoology featured medium ComputingMilieux_PERSONALCOMPUTING health and fitness > self care > exercise > sports > association football small ComputingMilieux_COMPUTERSANDSOCIETY ComputingMilieux_LEGALASPECTSOFCOMPUTING small medium analysis > nlp 80106 Image Processing Artificial Intelligence and Image Processing 80602 Computer-Human Interaction 80104 Computer Vision 80504 Ubiquitous Computing featured medium Emotion Perception Arabic Language Speech Analysis Molecular Biology Cancer Cell Biology human activities Neuroscience Biochemistry Genetics Evolutionary Biology parasitic diseases distance decay fungi animal diseases Quaternary dispersal processes the Hengduan Mountains Scandentia geographic isolation Soricomorpha reproductive and urinary physiology species turnover Erinaceomorpha Lagomorpha and Rodentia halving distance 69999 Biological Sciences not elsewhere classified Medicine Genetics 80699 Information Systems not elsewhere classified 19999 Mathematical Sciences 
not elsewhere classified Speciation Geography of Speciation Phylogenetic Comparative Methods Global Approximate Bayesian Computation medium GeneralLiterature_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION data type > text data Cancer 80106 Image Processing Fourier Optics Image and Signal Processing Computer Engineering 60408 Genomics humanities kinship behavior and behavior mechanisms single nucleotide polymorphism social sciences sampling medium geography and places > asia > india technology and applied sciences > transport > vehicles mathematics and logic > mathematics > numbers Uncategorised Uncategorized 80699 Information Systems not elsewhere classified 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy 60506 Virology Immunology Evolutionary Biology phylogenomics Caenorhabditis Strigamia Oikopleura Lottia Ixodes 2 billion years Tetranychus Hydra Capitella Ciona Danaus Acyrthosiphon Gasterosteus Monosiga phylogenetic conflict Saccoglossus Apis Branchiostoma Gallus Anolis Trichoplax Brugia Pinctada Rhodnius Bombus Fugu Tribolium Mnemiopsis Amphimedon Homo Daphnia Strongylocentrotus Xenopus Latimeria Nematostella locus selection Salpingoeca long-branch attraction Drosophila Earth Acropora small small Ecology 50301 Carbon Sequestration Science Climate Science 40104 Climate Change Processes 54 Environmental Sciences ngee ngee-arctic barrow alaska soil characteristics elements organic carbon organic matter Molecular Biology 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Neuroscience Genetics Evolutionary Biology Dataset 369 individuals formatted Label sample Microsatellite genotype data Genotype information GenAlEx location column headers .6.502 diploid loci small ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS Ecology Environmental Science Soil Science 49999 Earth Sciences not elsewhere 
classified 50102 Ecosystem Function Limnology small GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) medium featured ComputerApplications_COMPUTERSINOTHERSYSTEMS society and social sciences > social sciences > linguistics general reference > research tools and topics > writing small small ComputerApplications_COMPUTERSINOTHERSYSTEMS ComputingMilieux_MISCELLANEOUS 40101 Atmospheric Aerosols 69999 Biological Sciences not elsewhere classified Biochemistry Microbiology analysis universality Dataset II Dataset II primer Conversion and Storage Engineering 90607 Power and Energy Systems Engineering (excl. Renewable Power) 91305 Energy Generation medium small featured geography and places > north america > united states society and social sciences > society > crime > violence society and social sciences > society > crime society and social sciences > society > crime > terrorism small featured ComputingMilieux_PERSONALCOMPUTING culture and arts > games and toys > video games medium skin and connective tissue diseases Uniform resource locator/link to file Comment Longitude of event Event label Latitude of event Station label Elevation of event Comment of event Baseline Surface Radiation Network (BSRN) WCRP/GEWEX hyperspectral VNIR Tea MODLIFE File name File size Uniform resource locator/link to file File format Antarctica Functional Ecology @ AWI (AWI_FuncEco) Development of a CCAMLR Marine Protected Area in the Antarctic Weddell Sea (WSMPA) File content Weddell Sea Marine Protected Area (MPA) Porifera Echinodermata Database outer membrane Halanaerobiales Source Code Negativicutes evolution Firmicutes small problem type > multiclass classification machine learning > classification large technology and applied sciences > computing > human-computer interaction small GeneralLiterature_REFERENCE(e.g. 
dictionaries encyclopedias glossaries) society and social sciences > society > business society and social sciences > society > finance geography and places > north america > united states ComputingMilieux_MISCELLANEOUS society and social sciences > society > organizations technology and applied sciences > computing > companies small GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) GeneralLiterature_MISCELLANEOUS health and fitness > self care > exercise > sports Data_GENERAL health and fitness > self care > exercise > running small technology and applied sciences > computing > internet society and social sciences > society > business ComputingMilieux_MISCELLANEOUS analysis > data visualization analysis > data cleaning mathematics and logic > statistics > categorical data machine learning > classification analysis > image processing featured data type > image data medium machine learning > deep learning mathematics and logic > statistics > categorical data Microbiology 110307 Gastroenterology and Hepatology 110303 Clinical Microbiology ComputingMilieux_COMPUTERSANDEDUCATION ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ComputingMilieux_LEGALASPECTSOFCOMPUTING ComputingMethodologies_ARTIFICIALINTELLIGENCE 130306 Educational Technology and Computing GeneralLiterature_INTRODUCTORYANDSURVEY Education 130103 Higher Education Neuroscience Imaging Temperature DATE/TIME air coastal climate Mediterranean weather conditions Humidity relative Wind speed Wind direction description gust Pressure atmospheric Precipitation Thermometer Hygrometer Anemometer Barometer Pluviometer 69999 Biological Sciences not elsewhere classified 110309 Infectious Diseases Developmental Biology Immunology Computational Biology Genetics Molecular Biology small medium 60702 Plant Cell and Molecular Biology food and beverages natural sciences 60703 Plant Developmental and Reproductive 
Biology Ecology 69999 Biological Sciences not elsewhere classified Biochemistry Computational Biology ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS cortical entrainment semantic processing EEG cocktail party selective attention natural speech multisensory integration 70204 Animal Nutrition 70108 Sustainable Agricultural Development 70501 Agroforestry Proximate Composition fatty acids 70199 Agriculture Land and Farm Management not elsewhere classified 70107 Farming Systems Research Phenolic Compounds Digestibility chemical composition profiles Leaf traits Neuroscience Imaging featured small society and social sciences > social sciences > linguistics people and self > self > gender medium InformationSystems_INFORMATIONINTERFACESANDPRESENTATION(e.g. HCI) ComputingMethodologies_PATTERNRECOGNITION 59999 Environmental Sciences not elsewhere classified 69999 Biological Sciences not elsewhere classified Science Policy Cell Biology Microbiology Inorganic Chemistry Sociology 111714 Mental Health Amboseli Park Kenya Papio cynocephalus 1977-2014 steroid hormones postpartum amenorrhea gestation P. 
anubis body fat sexual cycling ComputingMilieux_THECOMPUTINGPROFESSION Computer Science Social Web small ComputingMilieux_COMPUTERSANDEDUCATION medium medium featured culture and arts > games and toys ComputingMilieux_PERSONALCOMPUTING culture and arts > games and toys > video games EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Surface Pressure EARTH SCIENCE > Atmosphere > Altitude > Geopotential Height EARTH SCIENCE > Land Surface > Land Temperature > Skin Temperature EARTH SCIENCE > Oceans > Ocean Temperature > Sea Surface Temperature EARTH SCIENCE > Oceans > Sea Ice > Sea Ice Concentration EARTH SCIENCE > Hydrosphere > Snow/Ice > Snow Water Equivalent EARTH SCIENCE > Land Surface > Soils > Soil Temperature EARTH SCIENCE > Land Surface > Soils > Soil Moisture/Water Content EARTH SCIENCE > Atmosphere > Atmospheric Pressure > Sea Level Pressure EARTH SCIENCE > Atmosphere > Atmospheric Winds > Upper Level Winds EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds EARTH SCIENCE > Atmosphere > Atmospheric Winds > Boundary Layer Winds EARTH SCIENCE > Atmosphere > Atmospheric Temperature > Air Temperature EARTH SCIENCE > Atmosphere > Atmospheric Water Vapor > Humidity featured medium ComputingMilieux_COMPUTERSANDEDUCATION GeneralLiterature_MISCELLANEOUS InformationSystems_INFORMATIONSYSTEMSAPPLICATIONS society and social sciences > society > finance small natural and physical sciences > biology machine learning > classification problem type > regression small people and self > personal life > housing problem type > future prediction small ComputingMilieux_COMPUTERSANDEDUCATION ComputingMilieux_LEGALASPECTSOFCOMPUTING ComputingMilieux_THECOMPUTINGPROFESSION ComputingMethodologies_PATTERNRECOGNITION Data_MISCELLANEOUS Physics::Accelerator Physics 
ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS Computer Science::Performance APS Citation data Physics::Fluid Dynamics Computer Science::Networking and Internet Architecture Applied Physics MathematicsofComputing_DISCRETEMATHEMATICS Computer Software 80109 Pattern Recognition and Data Mining 80403 Data Structures 80301 Bioinformatics Software 69999 Biological Sciences not elsewhere classified Inorganic Chemistry Plant Biology Biochemistry Genetics Evolutionary Biology 80699 Information Systems not elsewhere classified Phylogenomics Plastid genome Trebouxiophyceae Chlorophyceae Ulvophyceae Prasinophyceae Chlorellales Chlorophyta Pedinophyceae Cancer otorhinolaryngologic diseases 89999 Information and Computing Sciences not elsewhere classified 80612 Interorganisational Information Systems and Web Services Library and Information Studies 80799 Library and Information Studies not elsewhere classified 80706 Librarianship Cancer 160699 Political Science not elsewhere classified Immunology Data_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION Uncategorised Uncategorized Ecology 69999 Biological Sciences not elsewhere classified Science Policy Neuroscience Genetics cryptic female choice ovarian fluid genetic heterozygosity sperm competition embryo survival small ComputingMilieux_LEGALASPECTSOFCOMPUTING ComputingMilieux_THECOMPUTINGPROFESSION ComputingMilieux_GENERAL large ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION culture and arts > visual arts > comics culture and arts > visual arts > animation small Cell Biology Developmental Biology RNAi muscle screen Drosophila melanogaster small medium InformationSystems_MISCELLANEOUS medium 69999 Biological Sciences not elsewhere classified Science Policy Inorganic Chemistry Neuroscience North America ComputingMethodologies_PATTERNRECOGNITION 80699 Information Systems not elsewhere classified Europe Anthropocene taxonomy Leptogium biogeography Leptogium saturninum lichen revision 
Statistics::Computation macrolichen cyanolichen Mallotium Collemataceae UNFCCC emissions data small small featured ComputingMilieux_PERSONALCOMPUTING culture and arts > games and toys > board games small large small featured GeneralLiterature_MISCELLANEOUS ComputingMilieux_COMPUTERSANDSOCIETY ComputingMilieux_THECOMPUTINGPROFESSION society and social sciences > social sciences > economics medium featured GeneralLiterature_REFERENCE(e.g. dictionaries encyclopedias glossaries) ComputingMilieux_MISCELLANEOUS culture and arts > performing arts > film culture and arts > visual arts non-spherical sand particle CARES organic aerosol 40601 Geomorphology and Regolith and Landscape Evolution rotation CalNex Atmospheric Sciences volatility 40302 Extraterrestrial Geology drag forces Geology 40607 Surface Processes particulate matter CMAQ Aging SOAS Diseases small featured natural and physical sciences > physical sciences > physics ORDINAL NUMBER Symbol Melting point Boiling point natural and physical sciences > physical sciences > chemistry Name Atomic weight medium analysis > nlp ComputingMethodologies_DOCUMENTANDTEXTPROCESSING society and social sciences > social sciences > linguistics analysis > text mining TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES medium ComputerApplications_COMPUTERSINOTHERSYSTEMS problem type > customer value medium ComputingMethodologies_DOCUMENTANDTEXTPROCESSING society and social sciences > social sciences > linguistics InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL culture and arts > culture and humanities > languages ComputingMethodologies_ARTIFICIALINTELLIGENCE human activities 90399 Biomedical Engineering not elsewhere classified parasitic diseases health services administration population characteristics endocrine system diseases Cancer Pharmacology Inorganic Chemistry 19999 Mathematical Sciences not elsewhere classified Hybridization Hematology 110309 Infectious 
Diseases RADseq Quercus EARTH SCIENCE > TERRESTRIAL HYDROSPHERE > SNOW/ICE > SNOW DEPTH EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE > SURFACE TEMPERATURE > AIR TEMPERATURE EARTH SCIENCE > LAND SURFACE > SOILS > SOIL MOISTURE/WATER CONTENT EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC WATER VAPOR > WATER VAPOR PROCESSES > EVAPORATION medium ComputingMilieux_MISCELLANEOUS initiatives > socrata 59999 Environmental Sciences not elsewhere classified Ecology 69999 Biological Sciences not elsewhere classified Science Policy Immunology Temperate forest Soil biota Population abundance Boreal forest Soil respiration Metabolic rate Temperate grassland Tundra Soil community Temperature sensitivity Biome Individual mass Tropical forest medium Financial Uncategorized Computer Science small Medicine Data_MISCELLANEOUS ComputingMethodologies_PATTERNRECOGNITION small ComputingMilieux_PERSONALCOMPUTING featured natural and physical sciences > nature > plants society and social sciences > social sciences > international relations small natural and physical sciences > nature > animals natural and physical sciences > nature > environment small medium 170299 Cognitive Science not elsewhere classified 10401 Applied Statistics Paleoecology small medium featured natural and physical sciences > biology technology and applied sciences > medicine education featured large culture and arts > performing arts > film technology and applied sciences > computing > artificial intelligence featured natural and physical sciences > biology large technology and applied sciences > medicine data type > image data featured medium ComputingMethodologies_PATTERNRECOGNITION ComputingMethodologies_DOCUMENTANDTEXTPROCESSING general reference > research tools and topics > writing mathematics and logic > mathematics public domain 
peptidase dictionary of the English language authority control burial grave cemetery genealogy medicine molecular function gene product biological process gene cellular component Membrane transport protein botany daylight saving time IANA time zone zoology manga anime open access in France architecture video game video game video game genealogy genealogy art design photography astronomy genealogy Earth sciences Canadians baseball board game medicine biology heavy metal shark attack food charitable organization genealogy medicine lyrics triangle center celebrity human anatomy art rare disease geographic location place name biology biodiversity informatics score astronomical catalog protected area geography history ornithology chemistry death chemistry UNIDROIT private international law United Nations Convention on Contracts for the International Sale of Goods Jewish people mixed martial arts ZX Spectrum chemical reaction enzymes nomenclature classification system enzyme activity chemistry zoology environmental protection maritime transport genealogy invasive species open access in France food food labeling regulations recycling codes nutrition facts label Saccharomyces cerevisiae biology Arabidopsis thaliana film television series Caenorhabditis elegans chemistry CAS Registry Number manga anime manhua manhwa Korean animation donghua algorithmics algorithm data structure archaeology Australian literature biographical article United States federal judge bitterness nursing plant virus death Drosophila melanogaster extrasolar planet spectroscopy human genome protein comics telephony open educational resource protein two-component regulatory system Punjab heritage Punjabi culture Sumerian open access policy open-access repository open access open access in Portugal open access in Latin America open-access journal open access in Uruguay open access in Argentina open access in Chile open access in Spain open access in Mexico open access in Brazil open access in Peru aging 
open access in Latin America open-access journal open access in Brazil geochemistry taxation gene product biological network metabolite biological pathway covered bridge citizen science bird chemistry Theatre of Poland comics gene disease sign language Jewish studies bibliography Czech track and field long non-coding RNA mineral member of the French National Assembly member of parliament medicine health care listed building in the United Kingdom data set Welsh newspapers biographical article musician music video script typeface Arabic numeral transcription factor Y chromosome haplotype protein Michaelis constant drug history of medicine Christian hymn media of Australia history of Australia OpenStreetMap Australian rules football Australian Football League phenotype genotype racing automobile driver auto racing Poaceae genealogy hazardous substances industrial safety personal protective equipment anime Gymnospermae nobility academic genealogy given name parent date of birth Social Security number Member of the Victorian Legislative Assembly Member of the Victorian Legislative Council open access monograph history basketball genealogy Greek mythology cultural landscape monument shell corporation tax noncompliance mass spectrum chemistry Crocodile attack Russian cell line biotechnology cell biology media studies communication studies film studies cell biology Schizosaccharomyces pombe genetics bookselling philology state school school district state education agency Lord Byron Lucas Cranach the Elder botany nomenclature classification system open data public art earthquake seismology database spectroscopy new media art installation art educational institution research institute medicine library history death censorship cultural heritage botany Danio rerio member of the Parliament of Finland Auvergne library museum archives airline Wikipedia wiki biography sentiment analysis review sentiment analysis Internet Movie Database sentiment analysis film criticism sentiment 
analysis speech segmentation chat room Extracellular RNA shell corporation tax noncompliance King Arthur ontology Semantic Web Semantic MediaWiki semantic similarity open access in France open access policy open access in Portugal open access in Spain open access in the United States of America open access in Norway open access in the United Kingdom open access in Japan open access in Switzerland open access in Australia open access in Austria open access in New Zealand open access in Sweden open access in Hungary open access in India open access in Luxembourg open access in Germany open access in China open access in Belgium open access in the Netherlands open access in Finland open access in Italy open access in Ireland open access in Denmark open access in Canada open access in South Africa altered state of consciousness macromolecular complex death North Carolina violence plant slang open educational resource cancer Majorana fermion physics nanotechnology biology number open-access repository open access open data cultural heritage metadata GLAM Semantic processing data quality provenance Taiwan linked data LODLAM Comprehensive Knowledge Archive Network Linked Open Data Maltese soil immunology coral reef academic publishing Disappeared indigenous women violence against women artist medication Baxter robotics artificial intelligence medieval studies women's studies medieval studies women's studies Litchfield Law School Litchfield Female Academy women writers theater biography children's writer Coccoidea orphan work magic lantern National Museum of Finland Lahti City Museum Kalevala book botany business record Comédie-Française gene rare disease genome genetic disease DNA cluster of differentiation orphanage residential child care community pathogen Ukrainian studies scholarly communication open-access publisher library publishing digital humanities project gender studies kidney urinary bladder citation bibliographic metadata author disambiguation Aramaic 
microorganism protein kinase computer security symptom London genealogy systematic review history of books Proboscidea vascular plant cultural heritage theater gender sexual orientation sex health equity dedup_wf_001::2d13dd919b0ec4519c4a0967c4c7cd47 dedup_wf_001::28e209b61a52482a0ae1cb9f5959c792 datacite____::b28d97a3796c731d7942540a524e838c datacite____::99ab1f72bc7dd695c7a7c3cbe61a71c9 dedup_wf_001::9a3b12eae9d47e02ef64d44e7d810b45 dedup_wf_001::91e7741fa646208b4957a0787bdff276 datacite____::f9fbd711271a819edbc9fc9f4f40d649 dedup_wf_001::fd2e1d7f642d76c31b4bcaad920964f2 dedup_wf_001::9405e729b5f6f1170f1e8c9fae04449c dedup_wf_001::6e1064560bceeef0fb808280313b5ca0 dedup_wf_001::565b4bb4c813ca7af0852174ce8036f4 r38d07aef7b7::039c7319b67bb87c9b7f62111caf65d1 r38d07aef7b7::17c276c8e723eb46aef576537e9d56d0 r38d07aef7b7::26da9d37357b01ee4fe35ce3fc969b1e r38d07aef7b7::a61eded670f6e29acff242cae3b82a96 datacite____::d79fd236a9a91905c5bb199526073143 dedup_wf_001::daff15865daf824544b3a939d1f2bdc3 dedup_wf_001::0c7d40cfc50b5509570a6ebe7162c94c datacite____::01bf8e9bb3c67b1432fc474bb0a3dc80 dedup_wf_001::0eb8d0c496ea727752cafc3c607d3072 dedup_wf_001::fa8c1cb0271969daab5d9a0f0c2592e2 r38d07aef7b7::4457906f472a0a4e966a17de3054a8bc dedup_wf_001::b77b84aec681a41428579a44347402d3 datacite____::ba2372c047e9151f5566dc17c768e13f dedup_wf_001::874be2e7cb10e5a88fb39039785ce274 dedup_wf_001::07db729092a1bf924ac83f935a954255 datacite____::0c5738984e63350071535fd8e73b35a5 r38d07aef7b7::124461dcd3571e6674ec4e0e140cc298 r38d07aef7b7::fd9f2aa91ceacfb305f86f2f76bfd494 dedup_wf_001::2f4cd0a689df7a6613b9ff4e84b34df6 dedup_wf_001::3c74f6d5c43355752f342f3e30bddf86 datacite____::152a0eb3563671c2ec7b2e6b84bd6d0b dedup_wf_001::dd88b51c7ec34b0e50f7c49ed4164bb6 r38d07aef7b7::4ccc3735e387537e61269a976a33e412 dedup_wf_001::80cea7c26dcb03eb4b39e61b1effd8d1 datacite____::5b7018ad8caa4e45bac84223d93897c7 datacite____::66734f6fbfece8fe1dcdf2515628be52 r38d07aef7b7::3bcad4e7af821b33b29f7078b90ab75a 
r38d07aef7b7::bddcda5d65fcfdec9de3838794a77265 dedup_wf_001::6f4922f45568161a8cdf4ad2299f6d23 dedup_wf_001::1a7f5cee6c09c0031cd5783d79740e14 dedup_wf_001::de50dd7c2f79237408a53ae6086551e2 dedup_wf_001::b6a05becd449b2d9d9d95010954c9308 r38d07aef7b7::133b5f08ade8b354bfd42b98c629ef05 r38d07aef7b7::13d429db192fbc7b5cabf9b936cf78e1 dedup_wf_001::7387f1ed39c0198734cd774f398e4398 dedup_wf_001::29c117378bda70200aa09a0baae05afe datacite____::f4a9cdb6299031f87b702d50e93431d0 dedup_wf_001::261732af0fe337df28be726375463dce datacite____::6093c3a026868d7dbcb976b2900976fc r38d07aef7b7::abd815286ba1007abfbb8415b83ae2cf r38d07aef7b7::e261489ab942429a6600c1c4121ac14d dedup_wf_001::174f8f613332b27e9e8a5138adb7e920 datacite____::dcb90243f3119e47795e6dde40a1c44f dedup_wf_001::44b0e8fa282c03644a16def023c48cdd dedup_wf_001::462b7359bda3d8ed2873c091c2f3b367 dedup_wf_001::3a078109c16feb217c4a4b697d044990 datacite____::14063ac79d0d4dccffb95930463da296 datacite____::560451e078bd4a00adf84fcbf2d475b1 datacite____::379350260e640997e4a79c00b37d69aa dedup_wf_001::53c16d65d012198a587f8745bad50014 dedup_wf_001::c0ff1c505fd116e5a8464fc4068554f3 datacite____::5af31a4dc554de0fd5bcf320a33e1494 datacite____::005f157c6f2ebd438eac5a55540457fb dedup_wf_001::962e0272f808572f42c896e12720f625 r38d07aef7b7::1f10c3650a3aa5912dccc5789fd515e8 r38d07aef7b7::be1bc7997695495f756312886f566110 dedup_wf_001::8cde4f55f710f9f07236662ad05f7f05 datacite____::62cb9ba8515a239a73f5f21f98e10a0a datacite____::0070899a302319800e37c660b21fe1e2 datacite____::13afe469cfa9d9840d9fb145ff1cd702 dedup_wf_001::18997733ec258a9fcaf239cc55d53363 r38d07aef7b7::754da7dc2ed681cb2084a83124fc63cf datacite____::d9f9300f8c65d87bcadc4f061fe67a73 dedup_wf_001::9ec1e0f696fb8327e37674f9b67fec35 r38d07aef7b7::f1b8b7b3ceb65c188dcdc0851634cadf dedup_wf_001::503a356966a9db3f68c9ca050d3d77fb dedup_wf_001::413827d57b940b4b9f0d23012330d573 dedup_wf_001::4423231251806e094b61c5afeba7a535 dedup_wf_001::60bc551ca678c042256508c5a0f46689 
dedup_wf_001::4f9e8bd6c0b2752cc4eb8115ee61c923 dedup_wf_001::c4b7a093d0c8baf772bf67cedc999d2c r38d07aef7b7::4b4c6c207e1e59c5af70b3b4c7b46c5a r38d07aef7b7::d542599794c1cf067d90638b5d3911f3 dedup_wf_001::d5bc8b494bd1879cf590498995206e14 dedup_wf_001::dcdeb3bdb79cff4f5225298409e438c3 dedup_wf_001::f88b71913a966f761ded194e27330ead dedup_wf_001::17759a641175195d47a241f627dd7003 dedup_wf_001::1055f539f5e622ab14d46487c3daf73f dedup_wf_001::4bd4226f0f3144fdc5647d65e5a8d873 datacite____::858ac0b14c81cdc8db1005a67ca816a0 dedup_wf_001::1d082b72d1f60e0582fc0ffe412aaac4 dedup_wf_001::365aa6ebdc3dbf28e7b9ea1c1b4d2908 r38d07aef7b7::5a90c7cf26f2109e4db8466c251911be r38d07aef7b7::e5c6f944080958c264936693c43f8aaa datacite____::4848ada5d4aae379ae89924371316479 datacite____::191647d9e0a75d7d0c797541df62e300 datacite____::55eecfe9e90ff06bcf1245658a4aed77 dedup_wf_001::f09e534a6bfc6b05de696f1fe27634c8 datacite____::694b608a1bbf54d3bf03f33478f62f0a datacite____::3426c36e6b0668ea520b848862aaa343 dedup_wf_001::446ac7480f6cd015d176f8b3d28a03b5 dedup_wf_001::544791a2847e5e9324cc4747a27f7237 dedup_wf_001::3e161c33e87de157ab48186b6420e768 datacite____::798018b75283b200ce7052d73be3b7b5 dedup_wf_001::7f2dc9ca702c66e1dd36b63fdd0d2dae dedup_wf_001::b65825e7b2c0e9d8e051d4a3b97ef088 dedup_wf_001::6fc1e19f936b4766aaf858f978dd9b0c dedup_wf_001::8637141cb688de20443ba785b28e3ab0 r38d07aef7b7::243f6a5292350cc163601aac9ad3e854 r38d07aef7b7::52dbb0686f8bd0c0c757acf716e28ec0 r38d07aef7b7::8c41eebf5a1f5867cbe38cf59b37c1bf datacite____::66fcb8d2a85690d69b9e29571a09362e dedup_wf_001::1c383cd30b7c298ab50293adfecb7b18 dedup_wf_001::1707acd183aaa7bc989a6ac92fabb2c8 r38d07aef7b7::8db1625bead0f643f7f7913edc2a8434 datacite____::fbdbc51c1cae07b3f75f086256686c7d datacite____::1f36f259997b5795202a0bebd2292007 dedup_wf_001::66dd89ef074e7ac9d4c1de6775991b0c dedup_wf_001::6e0e24295e8a86282cb559b860416812 r38d07aef7b7::aebf7782a3d445f43cf30ee2c0d84dee dedup_wf_001::9264177717e350795dc4687789512c34 
dedup_wf_001::95375ffaf5a308b340e5eba805715568 r38d07aef7b7::07b2ee9f02d5e6e8894377afb4feed32 r38d07aef7b7::186a157b2992e7daed3677ce8e9fe40f r38d07aef7b7::a431d70133ef6cf688bc4f6093922b48 r38d07aef7b7::ec0f40c389aeef789ce03eb814facc6c datacite____::9ea315721d9089632922f8ca28c25849 dedup_wf_001::6cdb2c0acda55360ac8e3e33fc39bbd3 dedup_wf_001::b9e1be0891c6a16e9644a57b798ac8a0 r38d07aef7b7::41d626e181cd445e3cac18440a448424 r38d07aef7b7::5607fe8879e4fd269e88387e8cb30b7e r38d07aef7b7::8f53295a73878494e9bc8dd6c3c7104f dedup_wf_001::a368f8f84bce73d071a34722eb55f03f datacite____::95843e4e22c345e8ee61f7ba834c70b3 dedup_wf_001::0181dbcc3606f670bbe50f984967f358 dedup_wf_001::063b7d7ae9cd5ea74e1f879c52a91917 r38d07aef7b7::298923c8190045e91288b430794814c4 r38d07aef7b7::987b75e2727ae55289abd70d3f5864e6 dedup_wf_001::03c65c6be9c8b37f09759c662325f152 r38d07aef7b7::e58cc5ca94270acaceed13bc82dfedf7 dedup_wf_001::20aee3a5f4643755a79ee5f6a73050ac dedup_wf_001::47cf1cacc977063fd3ab8c1681d344c4 dedup_wf_001::f39650b66c31ee5d8d33a7ac2b5977ad dedup_wf_001::338bda4610126bf5b01eb64f01c39b5e dedup_wf_001::44ca77772ddd0d2200fd5e95cfb37ae2 dedup_wf_001::4e38a56a966fd2c2b4fc978ce57d20ee r38d07aef7b7::cda72177eba360ff16b7f836e2754370 dedup_wf_001::04cf31999d95c51dc6b3eb0770c9b520 dedup_wf_001::cc2ec0e9790ea02622e0c9bea8822804 dedup_wf_001::09eeef09f2210c5176693da3b918d36c r38d07aef7b7::1cbaa4e5609fb6517f54f0ab0c205ada r38d07aef7b7::536a76f94cf7535158f66cfbd4b113b6 r38d07aef7b7::9430142689f1e3004253e1d85c9aef57 r38d07aef7b7::e84401ad27c4cfb9815776eb9432ff17 dedup_wf_001::c9a37ed9f5261e7b116c7cf0065c0794 dedup_wf_001::ba9ff9ccf2c1a2b138a20c5c2fc6501a datacite____::eaf73b8884b8bfec62dcb523b454e2be datacite____::dc875eac206cbf1660d30888f29db383 datacite____::c0acc1df4fad82cc743e2cbcac528e9a r38d07aef7b7::d79b5c2f0375f87503706a142964d7d5 dedup_wf_001::6e2715c4ee2dd9a6eabb9279d6684699 dedup_wf_001::5f5c048868bce9c55853e587a4ced9d0 r38d07aef7b7::531db99cb00833bcd414459069dc7387 
dedup_wf_001::1643511b5df8dc9bb2bc4b5d712370ca datacite____::3b28b73795eb7af88aba69cbe8314005 dedup_wf_001::264f2e01237058d1ae12f4f56ced8347 dedup_wf_001::96feb27374dde404eec29783f9c5b504 dedup_wf_001::0829cab14fd3f2444652a9cf2b779732 datacite____::f00a6893d85d8f44f6f4eec2dac8d4f7 dedup_wf_001::5d5f388fcd32bfec29ea54c9fdb1e578 datacite____::2e76c94162df19df9b7db19bd211904b dedup_wf_001::5d02327b6b75f0167a87557b655ba440 dedup_wf_001::34fde688ef533f60a90e13e3238fc23e dedup_wf_001::26408ffa703a72e8ac0117e74ad46f33 dedup_wf_001::197de90defc3b543859d0c1ad3c2c77e r38d07aef7b7::14ea0d5b0cf49525d1866cb1e95ada5d r38d07aef7b7::361440528766bbaaaa1901845cf4152b r38d07aef7b7::cceb1161867ab91def7fac026ead455c dedup_wf_001::db8361c8ead35eb0376d50cba21777e4 r38d07aef7b7::78ccad7da4c2fc2646d1848e965794c5 r38d07aef7b7::0f9ef8cb70bb4135133a24a464ad55e1 r38d07aef7b7::137ffea9336f8b47a66439fc34e981ee r38d07aef7b7::cfd66e741860718ddecf1f6eabd05fc6 datacite____::949f6a3dacb58c49d5efe348d9a07f3e datacite____::bdaf5bbdff5355a78584bb805686c2f7 dedup_wf_001::1d2aee2a6f5c58a1561041f50cb27981 r38d07aef7b7::9af76329c78e28c977ab1bcd1c3fe9b8 dedup_wf_001::56207854195a80308778147e1f7c7728 r38d07aef7b7::50177f8a9ab8866cb77c77ae1e47c5fa dedup_wf_001::ad38ae802c1dba5c60a3ea8e4f5b1e08 dedup_wf_001::74929e31e2071052d67719bd92af6aba datacite____::27cf64b29ad0dacffe8d397c221224f6 datacite____::aa970012fe3b849f5afe19c372ecf2e6 dedup_wf_001::29d6d05568a6fa882f3ff0b5c9ece960 dedup_wf_001::ee6f4043bba9021436c19cea29eb9b08 dedup_wf_001::3f845e7b5d2ed2f3ff23ff4f96da779b dedup_wf_001::417229acc3bc53ab3c260f42a5780caa dedup_wf_001::9db4f120e04f921a331a70f1e5d41ef3 datacite____::5cf7bbf9c1c71c410649210e79685402 dedup_wf_001::3d19b555f4a3f063b0cf7660af9fbf79 dedup_wf_001::27213d4c9e43e44257a1bafb26d1cfa5 dedup_wf_001::f874adbde59aa80586975c7e15fd5232 dedup_wf_001::88df1d18a85e41643042895f6c9c336a datacite____::d63344ce48568515fb7b9c64efdaf68e r38d07aef7b7::4e4dfebee38dd25062b6888505bcca50 
r38d07aef7b7::6547884cea64550284728eb26b0947ef r38d07aef7b7::6de4bfe9504589a457d6e92fae4f9613 datacite____::05b5e8e5bc3b197e0719c06104414f2d dedup_wf_001::8fbbac76293ae93d906ca0f49ab85b48 dedup_wf_001::543a84894716c6c0ed6cabe60aac9945 dedup_wf_001::430108f92cc52a0e62cdec1b8df60297 dedup_wf_001::3b944683a96ec43a374835f2f2691d81 datacite____::8e8d6f148d9e7b46c2b8af9d47546a08 dedup_wf_001::cc086bcd836b5672ca48377fb58139f6 dedup_wf_001::3ee4f8cffd4de02743d2dae81ed6e82b datacite____::66a1c3eeeac3ddaca069542a9800a106 r38d07aef7b7::a60b48c9d56949d618129c45511b5cad dedup_wf_001::02e2259af55fb6a382ca7aa9f13319b3 dedup_wf_001::37967876fe062fb38ae244af3f793d71 dedup_wf_001::9675f0018d103531e073a3da2945df41 dedup_wf_001::983a6e7a05d3fb4d27f85e86ef2a71c0 dedup_wf_001::ffcc0381a555d393b99f4efeecbe186f dedup_wf_001::e49c24db85af66780b897252c6a01866 dedup_wf_001::3d684d83ae41e66000194029e38a36d8 dedup_wf_001::6c2be392910c44524d26880c6ae35b48 dedup_wf_001::6d17ca3223407ef8f72e53cf865fed4d dedup_wf_001::13c1e9d1de2f0482c2243be60006a19a datacite____::75569e6cfc6d139792da0782049a42b2 dedup_wf_001::7d2b92b6726c241134dae6cd3fb8c182 dedup_wf_001::18def9f1e15f5c8cb3f88ce32d0388f4 dedup_wf_001::75822634893f55d9a2005bed4c612f1c datacite____::f6b893e834147be4b637a0d7443ff9c7 datacite____::1d3c862887c18173a8fe037be9bdabee datacite____::ea1bee9747f7279685714f5be8421c14 r38d07aef7b7::df877f3865752637daa540ea9cbc474f r38d07aef7b7::eb2538078fc0e47beef6c4bd5188c471 datacite____::efc3049f89651aaa1edc882e65e203fd datacite____::7b86fb6d10a17a0fe17722b8011cd71c datacite____::4d0f219162004041ff0733e9a65cb007 datacite____::be1b6cf4ac2d7984282389904d727955 dedup_wf_001::75db916b377e3165bf35abb397a7cfac dedup_wf_001::4fc28826fe37fa34663f221c2bab937b r38d07aef7b7::1e5afae270de728fd14f20133233d33a r38d07aef7b7::2fc02e925955d516a04e54a633f05608 r38d07aef7b7::471c75ee6643a10934502bdafee198fb r38d07aef7b7::8e68c3c7bf14ad0bcaba52babfa470bd r38d07aef7b7::9ed017d7372360c256add7a8fe35a0a6 
r38d07aef7b7::ecaeafc0832340d4da28bd0370c03094 dedup_wf_001::5490e4f6955c4b2a8f2763bbe336ddf5 r38d07aef7b7::cfecdb276f634854f3ef915e2e980c31 datacite____::d48abd8596cb55452f8c07aedb7c3b23 dedup_wf_001::8964b6fed0c6a0ec12da82220a30acbc dedup_wf_001::23700fe6da220c7f55d656d0683d0738 dedup_wf_001::61971e77205538fca2c4881aef59479f datacite____::6c53efd0ebf25715dc9e6fb3d6d687b5 datacite____::6bf835676954787dca207ebb300273d5 dedup_wf_001::146e8732caf21f3894e93086140512a3 dedup_wf_001::13ef8ed30b64b2b14db548c261c8c883 dedup_wf_001::58268887d013c9f1673fdad95024bbe6 dedup_wf_001::9198c75f61dcf56b585072b7688505e4 dedup_wf_001::8ae7733f9bc11275e8d0a0fdabe5be0a dedup_wf_001::e2c37302cbfdb63fb34daac0c6dbbd79 dedup_wf_001::b7d26731ec66915057fa34708db628bf datacite____::2fa2238a2b3be2be73178f81a1efce8e datacite____::4ea26001033ecb866ac1511a8046a39d datacite____::e985b68394e01237631e34a732768c7e datacite____::0cb90d01629c53cecee9432e990729a6 dedup_wf_001::8ad0f4cc8e986e45bf8f1b2b4e7d8600 datacite____::47ecfe3a0c954e4073b426c5fd3230fd datacite____::7288e3745d519df376d0fbb61bad4db4 dedup_wf_001::242b9b4c4271fe4d56d4039dbf41572d datacite____::7d8ec8ffe3244c8bea4af57bbb37e938 datacite____::d6fc7b2b7ad0f3903417a455a0ace9c6 dedup_wf_001::b093e2197a49e076c37add66d52102cf dedup_wf_001::3e4aa8e05e65178e26aed32291ca1d7c dedup_wf_001::e627b271e4f233a60a896e9ffb3a175b datacite____::eb6f3ae189d959021b29a498eadca680 r38d07aef7b7::ac2d43ef3f26cc74de242202e822ecb0 datacite____::9f1e3177c8f870b5feb19ddad2da6762 dedup_wf_001::1de2543a72800e2e61fd582da3a752f6 dedup_wf_001::033983e84a082f5ad40b077d3e6edfbd dedup_wf_001::289c30baffae946d976d4d7777bd44c7 dedup_wf_001::5e461fd134d7a62ccb1c7f413a3028d8 r38d07aef7b7::19b5f0dd9d71b2003189f2d35a7c89d1 r38d07aef7b7::98afdcc1ebd85daa0f1749c5e56b9d8c r38d07aef7b7::a8ecbabae151abacba7dbde04f761c37 datacite____::f0108c21bd777755b87c787f598a84c2 dedup_wf_001::b22890f15a401a1b855ceedf98b785fe dedup_wf_001::14652db923ec15adc21191dfbfb70b2e 
dedup_wf_001::648a984e9adcde2fa868a0f8bd36ab6d r38d07aef7b7::54f5f4071faca32ad5285fef87b78646 r38d07aef7b7::7c1192a2afb55fdee2a326ef8de8a3a0 r38d07aef7b7::f5e083092550d2f93898e9829e677e39 dedup_wf_001::4bcc9b7fafa01106a128cfad3cebcf16 dedup_wf_001::59ad2d62dd9fb49c02416cd78ddf0beb datacite____::994637be6a83574e514d98326f6e69c2 datacite____::a3c6fb4b7dd5f6dbdd7cc479a194503e dedup_wf_001::d88dae8a9ffdc1e2290ddf1c3658c0a8 dedup_wf_001::466ad710a1f835f157d6f375efb4434b dedup_wf_001::1ffe4b75670b34433401171b486b787c dedup_wf_001::02500a38cc6522f4d58ec42959990abe r38d07aef7b7::a97da629b098b75c294dffdc3e463904 r38d07aef7b7::b2f83c409ce63012229fb9cd465bdcfe r38d07aef7b7::53e232bcc4a6386499454667194addd1 r38d07aef7b7::621fbd17da27241c58015eabe4164a52 r38d07aef7b7::416849da96fb73bee793e2bf65ae43ac r38d07aef7b7::74db120f0a8e5646ef5a30154e9f6deb r38d07aef7b7::976f3d77e359f934970e7287f2318116 r38d07aef7b7::c4c65c2e1f678ba44aa520651fee3941 datacite____::ee23dc27e33651a38376e8b66fe93553 datacite____::175b852967276c0c1d7d92885ec98523 datacite____::408651b0fe3138755e69ff86c6040f70 datacite____::3b8cb72745e9715e6db6e13202d762d7 dedup_wf_001::6b18886bc278247582704943f5c66eb9 dedup_wf_001::732ecd34115f8996fb9fa4ae613f89ba r38d07aef7b7::0d0fd7c6e093f7b804fa0150b875b868 r38d07aef7b7::ea9c07b5d0be2b8ee4631ee110f97fb4 r38d07aef7b7::f21e255f89e0f258accbe4e984eef486 dedup_wf_001::1e52b1356d89b333af2e7cc9414894c5 dedup_wf_001::705227e50c215ba45f4f457882c90607 datacite____::51b1a85f5cbfe2cea1449a654f857999 datacite____::62c2aafd182bd92e51b897bfec9a8732 dedup_wf_001::3d064e0a5c576bd7f01f50a434e59c81 datacite____::be28173838ccc613b589e35e30d9cd1d r38d07aef7b7::bcbe3365e6ac95ea2c0343a2395834dd r38d07aef7b7::e103d1ed1d6c41b0f098ff377dde2966 r38d07aef7b7::e2c61965b5e23b47b77d7c51611b6d7f dedup_wf_001::016ad0c411c1a571ecc34b1addd78c4e datacite____::0b4c9e61bd83b8248dde2e8108a1d2a6 r38d07aef7b7::68af34529bb4ab8575457d9e16801849 dedup_wf_001::8b2091796967c6935a7bdfde87fe7604 
datacite____::cd7e06ba3f1c0dc47b7bb7c1a6deb2c9 r38d07aef7b7::05a5cf06982ba7892ed2a6d38fe832d6 r38d07aef7b7::13e5ebb0fa112fe1b31a1067962d74a7 dedup_wf_001::28aea90b6334886746b6bcd368670b2b dedup_wf_001::ce98a19af0a0f0c2b1b402f9ca0706e7 datacite____::76549c6c590b06a61dd1e893c5ae9637 datacite____::5c0f10a6bcea9e4003232056a5cc71f7 dedup_wf_001::3c983381665c92b6082a37cd7f7752b0 datacite____::61e5dfd8d039cf15772ab5255db65443 datacite____::869834d55887ca42da79370b0904b8e4 dedup_wf_001::79a49b3e3762632813f9e35f4ba53d6c datacite____::75c17f22977302fcbefb8e6104a68932 datacite____::cd4572f28d7940bad945fb63f4a460b7 dedup_wf_001::9843a745d90a5a55cb0039aadeea32c0 r38d07aef7b7::295029833128d5e7b4f965599342d793 r38d07aef7b7::ecd62de20ea67e1c2d933d311b08178a dedup_wf_001::348554ac0daf5675fa4e45f9b6c71006 dedup_wf_001::21cf1e2c7605ae77ececeed18a7e2c96 dedup_wf_001::9e46f1febae5c757b00c5d088ce85825 dedup_wf_001::68c694de94e6c110f42e587e8e48d852 dedup_wf_001::4fbeacd6aeb8b09cbb453009783148f9 dedup_wf_001::50032b5dc686cd7a73d7b71794ac21e6 r38d07aef7b7::36a1694bce9815b7e38a9dad05ad42e0 r38d07aef7b7::dd939412d661b27a92e611a89e977f0a dedup_wf_001::0c4313664052e64896fc623e29bccb87 dedup_wf_001::560421d76f9da92cfca0742842f2d1ed datacite____::326991d241a3450480a0cafe60cb69da dedup_wf_001::0111809bba9c15ce87713981bd8201e0 dedup_wf_001::144375bda25220a19494870a020d60ef dedup_wf_001::04112144566d77d75f935b26101dd71d datacite____::d070a076bb93165ccf3f8f1c767bb855 r38d07aef7b7::f84d465177e84bb4e756a8319443cdcb dedup_wf_001::87ec2f451208df97228105657edb717f dedup_wf_001::4ec405bc6270f314eadb139c14099a4b dedup_wf_001::93d6db6c098728cb60cdbdd2567150ab r38d07aef7b7::58ca56b8d08b89f6972767847e087c72 r38d07aef7b7::61fb56acb88a66651048b4b2086d5b5a r38d07aef7b7::8e987cf1b2f1f6ffa6a43066798b4b7f r38d07aef7b7::b8a6550662b363eb34145965d64d0cfb r38d07aef7b7::a8abb4bb284b5b27aa7cb790dc20f80b r38d07aef7b7::018dbfb5fec8d864714ede49cef50343 datacite____::db81fa74a252ff64106cd90db5b4003c 
r38d07aef7b7::ca0daec69b5adc880fb464895726dbdf dedup_wf_001::081af284f65bf4aa9e0c90c0ab2137bf datacite____::02da3d8c6b76dfc6113deb31be56046d datacite____::17ffffe91dc151c24d308225e955b579 dedup_wf_001::2173f0f840665f35051c72fff8137ef2 dedup_wf_001::4e6f583a702f7d015ca5439403429535 datacite____::a15ab1619aba7071dd86ff49cd00974e dedup_wf_001::29f08fe3ea82326fc67685c9a8cb9909 dedup_wf_001::08040837089cdf46631a10aca5258e16 dedup_wf_001::32c8c34d2691ea14db86416811a29726 dedup_wf_001::43c183b75768df57ede4d6b5361e9311 dedup_wf_001::7f9f1c8d90c069f16dc638b529ba03ba dedup_wf_001::87fea017dd924d0be1eb71951e50148f dedup_wf_001::e7b125dae1dc6d70a6b4c46e42800ae4 dedup_wf_001::e8cd13ac4246ea90c1dbb5aed941c8ed dedup_wf_001::9654f1faed5e2011c2ce76eecfb76325 dedup_wf_001::102b91e75544875f2a482fe6f9fe18b6 dedup_wf_001::a5f7b3e3fdf4559cebd36f8fc57adf16 r38d07aef7b7::65c57c59b1c396fb0bee33d21a7fe822 r38d07aef7b7::e69cf84ed41fbe71985972c027190b49 dedup_wf_001::c470366225ec05683c4306d5eb22762c datacite____::eb40f91c891e0f366e4905091ac254c6 datacite____::1c0a8ef7fbd0c4476e06f6274487072b dedup_wf_001::166171ebacbd5235960b5d8faf4437f7 dedup_wf_001::ad36ca1e07a86a645286e275594b9e43 dedup_wf_001::03afdbd66e7929b125f8597834fa83a4 r38d07aef7b7::704afe073992cbe4813cae2f7715336f dedup_wf_001::bee7df60a52591ffcb299458de260512 dedup_wf_001::55f57b40ca8a55340f65920258e213fe dedup_wf_001::9f823045742f913d36ae4d9a9f0d75ad dedup_wf_001::34b6467a7039bd0e8aa5f6983aa303b0 dedup_wf_001::654897e28023b9d57f35dc4424b892fb r38d07aef7b7::46d09c503b30980ffc325cc243e1c0f5 r38d07aef7b7::63ceea56ae1563b4477506246829b386 r38d07aef7b7::ceff40a7fe4e78d7c988ed83759f7d91 datacite____::710404f715306843504c991c3460bc0e dedup_wf_001::06b6c25e06b0a7fbb9bfa0aeccaf34b1 datacite____::004922adb51caeec5c859eb58568fbd7 dedup_wf_001::bd5c5e1c04111451ed8b63079ea181e7 dedup_wf_001::90aef91f0d9e7c3be322bd7bae41617d datacite____::d3214ebb0ff294f713a2a9cac8695f6b dedup_wf_001::b69c16d99962a7bb5c3d97761ba9727c 
dedup_wf_001::30547fb3599ed51e4f075ba9c753c8c3 dedup_wf_001::944fd3693dacd39280bf4c651c2a149e dedup_wf_001::951b9abf25fef051c0486f49a1d44ee5 r38d07aef7b7::2b38c2df6a49b97f706ec9148ce48d86 r38d07aef7b7::4a11654ad1e1e48352252859ff3032a0 r38d07aef7b7::5bd7ddf87f22021a5f5d682ce5f93ad6 datacite____::48dfb0cf4e383428f5dc2a6763d51782 dedup_wf_001::51d3a6f35b8dfe611ff24214c8ef79d1 dedup_wf_001::f939c4bc508ae7ffe02769f71e883a3f datacite____::4946a918e5703257dade63f00c21e0fc r38d07aef7b7::4d771504ddcd28037b4199740df767e6 dedup_wf_001::539d6630cf6e6141a0cde76b7d24adb5 dedup_wf_001::59fcb3d5af85bc0ac8ae2de7fcec84ee dedup_wf_001::7bca78fcdba29dd12d74ba20a5afb058 r38d07aef7b7::803dddd7ea91e91ff16610f6c8009355 r38d07aef7b7::9308b0d6e5898366a4a986bc33f3d3e7 r38d07aef7b7::bd1354624fbae3b2149878941c60df99 r38d07aef7b7::ebcd0fb1c44b0ed07842254daec4c3cc datacite____::a07a322113e29c41fb87367cdeca13b5 dedup_wf_001::0405f859a58c10eb6d646b9f31c569a5 dedup_wf_001::0e87de457be3fffbc57408200d762452 dedup_wf_001::617b469f86bdb00a998cba5afdb2f8ce datacite____::ae0b73e683137bb5f1eb97c2ca92d6f4 dedup_wf_001::2e1968be4ca3bf9143cc09de21bb7015 dedup_wf_001::411ae1bf081d1674ca6091f8c59a266f datacite____::9af7f9365530f0e0f90f59efc383d0b0 r38d07aef7b7::2aa9c1afdc1323b9c19b35a4a09b989b datacite____::3655732d9d09342fa1de8b0bd5f92614 datacite____::3b27b78d0996506638ba6f5463516bb2 dedup_wf_001::f8e6db8fcc7e428005cf296ea2d7e8eb dedup_wf_001::026947ba375f344b921f7c825ec784c1 dedup_wf_001::22e90e50fd9480d9f0e6142740238040 dedup_wf_001::5fec4bb494d06947cad115993fc92794 dedup_wf_001::18e39d0ce826acbaba5877d4eaa3857d dedup_wf_001::8e62985e4b2d75cb6e07e5cd2a71006a dedup_wf_001::4c5ae0c7f73c189ea164f56f7aee8878 dedup_wf_001::c80d9d47dd687f273bf375d81e6aa393 dedup_wf_001::a714695bf9f489d1bad29c5aff5acadc datacite____::886e4c4c825678a89a97c0a2f139affe r38d07aef7b7::1b9812b99fe2672af746cefda86be5f9 dedup_wf_001::ef0178ca5c3bfb727a260fe1f2802595 dedup_wf_001::96445d7c7343f3aac48e05721bd4d5a4 
dedup_wf_001::925c923915874d0820ac71d1a05ce30b dedup_wf_001::0a9c4bbc33e3c6cb716683c1fdcfa388 dedup_wf_001::fdabfa58ee681c90b867cdacc5cf0efc dedup_wf_001::0e087ec55dcbe7b2d7992d6b69b519fb dedup_wf_001::692672a7737fd732204b9233245c22ac datacite____::e23d1462f95afef4e05be178ab20ccd9 dedup_wf_001::1cb03f96b2b4f9015c5abe7f331eb12f dedup_wf_001::eb28cc74968e585054a4d577ea04d80b dedup_wf_001::44e9fb1fc36212c86eff5e797d123d7a r38d07aef7b7::19bc916108fc6938f52cb96f7e087941 r38d07aef7b7::f5701b023d76d7b269d43e06c4a879bd dedup_wf_001::1cecc7a77928ca8133fa24680a88d2f9 dedup_wf_001::444eb4f45f208a9a73e3739f4797ca83 dedup_wf_001::285ab34d956801aa940bb44874d1b54d dedup_wf_001::9e3b42157b99a2359b1a90ba92835702 dedup_wf_001::e908c567ab9a95320696d05c1e7cdc58 r38d07aef7b7::4191d9903a4cf9f293dbbbff63f119c4 datacite____::21886d4c09205efd26436813f063a80b dedup_wf_001::764f113b98a4cb6509c0a1b76c25d000 dedup_wf_001::246fb7bbab5a23a9f0bc79b9853433bc dedup_wf_001::ad4fcd9edae2b298b9b53e4f105db1ce r38d07aef7b7::a3a92e719349dda06de72dac3448e149 dedup_wf_001::3503afb89f7576c2518b2194382c9a23 dedup_wf_001::069203996808cec2bcf9e33b4908f9f4 r38d07aef7b7::cc9b3c69b56df284846bf2432f1cba90 dedup_wf_001::392eb1b988bc2beaacc2b67cbcf9a58d datacite____::97f37d29f1bf06a33595ac89d151c7d6 dedup_wf_001::9a508fc488295a9c3869445ebe0c284b datacite____::35d174fb24b75fb44e7868f1ddaeec7f dedup_wf_001::536ae40810ea1b3ff1237dd4c6e23712 r38d07aef7b7::8e60cfb63ef8bedd98f6868c6accf1c2 datacite____::b66fb9de12cab6a729b9bd4290945707 r38d07aef7b7::0307fec2cef6aec340b8426490977ef0 r38d07aef7b7::a1afc58c6ca9540d057299ec3016d726 r38d07aef7b7::bd9e928c0f0fba89b5c8254bef1f9937 r38d07aef7b7::ec8ce6abb3e952a85b8551ba726a1227 datacite____::6a03aadb1fbf114b543aadcfc9c8d788 dedup_wf_001::e78a21bbab530f289b3691ed174d6924 dedup_wf_001::20a52a3b4dfef95631fec59da40d36c7 dedup_wf_001::5f9af18db2421d982fdfe3cb5458f90c dedup_wf_001::c55d22f5c88cc6f04c0bb2e0025dd70b r38d07aef7b7::46ba9f2a6976570b0353203ec4474217 
r38d07aef7b7::f4573fc71c731d5c362f0d7860945b88 dedup_wf_001::c90523b43fa2691f243148f6edd965d6 datacite____::89aa6aa644c692fb2368b9078e4cfe15 dedup_wf_001::b58ca2e39e94f994f0d8eaad788de687 dedup_wf_001::1f64e2558ab55f2618b7c651332bf101 dedup_wf_001::6d4bcfa605eacb74a48e2a0a871be965 datacite____::f15019f80cda9351d959741140dd7f42 r38d07aef7b7::097e26b2ffb0339458b55da17425a71f r38d07aef7b7::60e2126ffb2e2246df6c57b3797f2b48 r38d07aef7b7::4704d0a8754f7cd9619e6a8fab4c1021 r38d07aef7b7::c23da4fc9c3c0a2322caf4fa66762d78 r38d07aef7b7::fff23c80b2468e9402716e56f083ebc8 datacite____::2d218a0ab2e02e726f5e3278760b4fed datacite____::44203d4bcf1b2a1ca77e2dd7e30b8f0e datacite____::e2bfbbd0d9f7d21713560afa1261392d datacite____::2df2fd359b8c784d4839b8b3d709c474 datacite____::a68d51b3b7b4ee0d7488f3b39cc73ec5 dedup_wf_001::33ceb07bf4eeb3da587e268d663aba1a dedup_wf_001::06b5a77135669a22b90a089424743b9f dedup_wf_001::1c10e23137bab2e90f24434c803301dc dedup_wf_001::16d2db4f9c9f181c83c5ce5271e06429 dedup_wf_001::16ec018b232ff81024ae554497ea68c3 datacite____::66c1315f7576edda79dcce364f1ffa78 r38d07aef7b7::02522a2b2726fb0a03bb19f2d8d9524d r38d07aef7b7::f542eae1949358e25d8bfeefe5b199f1 dedup_wf_001::d743be2e035291bcec7abde6ade28cd3 dedup_wf_001::0045efa3a3907101711325a75e00db21 r38d07aef7b7::995665640dc319973d3173a74a03860c r38d07aef7b7::9ca90593821a015f234e9a8195ae5582 r38d07aef7b7::a1d0c6e83f027327d8461063f4ac58a6 datacite____::343eb140e814a0515e112d89ad1a676d dedup_wf_001::958adb57686c2fdec5796398de5f317a dedup_wf_001::524f141e189d2a00968c3d48cadd4159 dedup_wf_001::0cbb516d7d00fa1273df1bc2a768a055 dedup_wf_001::4837ba5cd49c7f03caaa423049e66daf r38d07aef7b7::9188905e74c28e489b44e954ec0b9bca dedup_wf_001::28aaad2dc33d001931d6c1fc41d62d20 dedup_wf_001::008f0dc9bada25da32d1b92f19e552fb dedup_wf_001::bcf5c3115d32f7bd9e840e10725aed81 datacite____::c49ffce94b82f8faac50d14f1307079f dedup_wf_001::5265a86a57edd3bd8507590fa1a08513 datacite____::781474be5f3c71eb6f29cf8877d6a622 
datacite____::8d2484a1828e506d7d0c2f8eae9885e7 dedup_wf_001::5ecf9fff829b9cafb060d99880acf7d9 dedup_wf_001::4d73d1729595f9b09fd5d216b95a13da dedup_wf_001::315e873e9b08aa9238a2e74ef9f731a9 r38d07aef7b7::10a5ab2db37feedfdeaab192ead4ac0e r38d07aef7b7::999028872cfff7ae8ee330a33cbd3874 r38d07aef7b7::fc1dc4549df0335d7f506edb5d66af16 dedup_wf_001::09b0d2d491b0e7943dc4bd180135dcb0 r38d07aef7b7::468cbac056133a996283cca7e2976336 r38d07aef7b7::49ca03822497d26a3943d5084ed59130 r38d07aef7b7::50a074e6a8da4662ae0a29edde722179 dedup_wf_001::e17705877fd904fdef65cc892e4e5a6b dedup_wf_001::58ecc182286e496e4702f13edc491e47 dedup_wf_001::6c5bbaa133d30e1df4ac9ee7d0612f20 r38d07aef7b7::4f16c818875d9fcb6867c7bdc89be7eb r38d07aef7b7::5d4ae76f053f8f2516ad12961ef7fe97 r38d07aef7b7::72a36e8158ffceef8dc28aae2880f440 r38d07aef7b7::b2abed343c4faf5d13c585bbf2429538 r38d07aef7b7::cf1f78fe923afe05f7597da2be7a3da8 dedup_wf_001::4a94b9bd5b76175b0571e7e9a0336c3f dedup_wf_001::b84e97b375927518383947b0df7b84e6 dedup_wf_001::0c76a80b0d066e623aaad081aea57c8a r38d07aef7b7::73c4fa58d428d52c2b12e11f3b28e8f5 r38d07aef7b7::9bf521d44fb17b1abc961ebc08c28de0 r38d07aef7b7::f804d21145597e42851fa736e221da3f dedup_wf_001::3dff03d96f8f6d9d22d64c26a08e71f0 dedup_wf_001::6f10b5c4a55511c343a652b3fbb61b29 datacite____::65ce3001e1461d7981931bbf3c896c69 r38d07aef7b7::1e4bb9409318b6d3303cb415a351674f dedup_wf_001::3326718c02f10324bec8f90abe36005f dedup_wf_001::2e0912ea453bee6ca8861f354fb1a491 datacite____::a0d22241c4b116f8b8bf843efde82ce5 datacite____::a1d6e092567d8ed9bea4d16b49cb465a dedup_wf_001::1c1d4df596d01da60385f0bb17a4a9e0 datacite____::84917c8ad5f0284aa465223546e07348 r38d07aef7b7::48f7d3043bc03e6c48a6f0ebc0f258a8 r38d07aef7b7::641e167f974d1dd076c0886d17271975 dedup_wf_001::49ef08ad6e7f26d7f200e1b2b9e6e4ac dedup_wf_001::0a3013c44b99f3a47d5c9462bf6b31bf datacite____::5dc48b7ac45c2422192fea9a9b3dc525 datacite____::fcb29b7a9a5d3b53a7a71557457ebdd9 r38d07aef7b7::354680832fcea7e2b7057a5ac2c489f8 
r38d07aef7b7::9a1158154dfa42caddbd0694a4e9bdc8 r38d07aef7b7::1c54985e4f95b7819ca0357c0cb9a09f r38d07aef7b7::4a4b16d454ca9f9075c129f6a0384d3d r38d07aef7b7::95151403b0db4f75bfd8da0b393af853